[comp.text.sgml] Record boundaries in SGML

garyp@csg.waterloo.edu (Gary Pianosi) (11/22/90)

I am having some problems importing and exporting SGML documents using
SoftQuad Author/Editor v1.1.  For various reasons, I need to transfer
SGML documents between a Mac (where they are edited using Author/Editor),
and a PC (where they are manually edited to contain "rigorous markup").
As a result, I need the marked-up text to be both readable and correct.

I am able to import hand-edited files into Author/Editor without error,
but when I try to validate the document, I get the error message:

        "Validation Error:  Text not allowed here"

wherever there is a new line between an end tag and a start tag.
(I can delete these 'characters' to get rid of the error message, but
there are too many of them.)  Strangely, no error is reported for new
lines between adjacent start tags or end tags, even though text is not
valid between the tags.  I should mention that the hand-edited files
can be validated using MARK-IT and the canonical output can be
read into Author/Editor without any problems.

On the other side, when I try to export a document from Author/Editor,
the resulting text file has new-lines only within text and after some
start tags.  Aside from being difficult to read, some lines exceed 500
characters making these files next to impossible to edit on the PC.

Finally, my questions ...

    -   What is the significance of record boundaries in SGML?

    -   Is the Author/Editor's treatment of newlines between end
        and start tags incorrect?  Or is the SGML standard open
        to interpretation?

    -   Is it possible to get Author/Editor to export a readable SGML
        document?  Or, do I have to run it through some other SGML parser
        to make it readable?

Below I have included a short SGML document that I believe is correctly
and rigorously marked up in accordance with the AAP Article DTD. It
produces the aforementioned error when imported into Author/Editor. Can
someone tell me whether this is a valid AAP Article document?

Any comments would be appreciated. Thanks.
---------------------------------------------------------------------
<ARTICLE>
<FM>
<TIG>
<ATL>
A Trivial Article
</ATL>
</TIG>
<AU><FNM>G.M.</FNM><SNM>Pianosi</SNM></AU>
<AU><FNM>P.D.</FNM><SNM>Jones</SNM>
<AFF>
<ONM>University of Waterloo</ONM>
<ODV>Computer Systems Group</ODV>
<PC>N2L 3G1</PC>
<CTY>Waterloo</CTY>
<CNY>Canada</CNY>
</AFF>
</AU>
</FM>
<BDY>
<SEC>
<ST>Introduction</ST>
<P>This paper...</P>
</SEC>
</BDY>
</ARTICLE>
---------------------------------------------------------------------
--
Internet:  garyp@csg.UWaterloo.CA            Bitnet:  garyp@watcsg.BITNET
Computer Systems Group, University of Waterloo, Waterloo, Ontario, Canada

enag@ifi.uio.no (Erik Naggum) (11/29/90)

In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:

   I am able to import hand-edited files into Author/Editor without error,
   but when I try to validate the document, I get the error message:

	   "Validation Error:  Text not allowed here"

   wherever there is a new line between an end tag and a start tag.
   (I can delete these 'characters' to get rid of the error message, but
   there are too many of them.)  Strangely, no error is reported for new
   lines between adjacent start tags or end tags, even though text is not
   valid between the tags.  I should mention that the hand-edited files
   can be validated using MARK-IT and the canonical output can be
   read into Author/Editor without any problems.

I've replied to Gary directly (to <garyp@csg.uwaterloo.ca>; the
!@#$%^&* mailer didn't recognize itself as "csg.waterloo.edu"), quoting
section 7.6.1 from SGML (with the amendment applied), but I won't post
that here, for petty copyright reasons.

The problem can be reduced, I think, to the problem of the treatment
of Record End in this contrived example:

	<!element foo (bar+)>
	<!element bar (#PCDATA)>

     1	`<foo>'
     2	`<bar>'
     3	`caninus'
     4	`</bar>'
     5	`<bar>'
     6	`felinus'
     7	`</bar>'
     8	`</foo>'

(where ` signifies Record Start, ' Record End for clarity, line
numbers for reference, only)

According to section 7.6.1, this will be interpreted at the outer
(foo) level as:

     1	`<foo>'
     2	`<bar>...</bar>'
     5	`<bar>...</bar>'
     8	`</foo>'

Now, the RE in line 1 is clearly the first RE in the content of foo,
and the RE in line 5 is clearly the last RE in the content of foo.
According to said section, these are to be ignored.

The problem is the RE in line 2, and the question boils down to this:

	Is this RE recognized as /content/ or as /markup/?

I believe I understand this to be markup, and thus that it should be
ignored.  It seems that Gary's problems stem from some decision
amounting to viewing this as content, in which the RE would imply the
start of a bar element, in which a new bar element is illegal (see
amended note to section 11.2.4), or in which data content is not
valid.
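The REs under discussion can be located mechanically.  What follows is
a minimal sketch in Python, not an implementation of section 7.6.1: it
only tracks which element is open at the end of each record of the
contrived example.  The names RECORDS, TAG and record_ends_in are
illustrative and do not come from any SGML tool.  Record numbers refer
to the full eight-line listing, so records 4 and 7 correspond to lines
2 and 5 of the reduced listing.

```python
import re

# The contrived instance, one string per record; each record ends with an RE.
RECORDS = ["<foo>", "<bar>", "caninus", "</bar>",
           "<bar>", "felinus", "</bar>", "</foo>"]

# Matches simple start-tags and end-tags without attributes.
TAG = re.compile(r"<(/?)(\w+)>")


def record_ends_in(element, records):
    """Return the 1-based numbers of records whose RE occurs while
    `element` is the innermost open element."""
    stack, hits = [], []
    for number, record in enumerate(records, start=1):
        for closing, name in TAG.findall(record):
            if closing:
                stack.pop()
            else:
                stack.append(name)
        # The RE comes after everything else in the record.
        if stack and stack[-1] == element:
            hits.append(number)
    return hits


print(record_ends_in("foo", RECORDS))  # [1, 4, 7]
```

Of the three REs found directly in foo, the first and last are the ones
said above to be ignored; the middle one (record 4) is the RE between
</bar> and <bar> that the question is about.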

What am I missing here?  (I'm sure it's something.)

I've read the spec several times, but won't claim that I understand
and remember every single thing, due to the high number of references
and other spaghetti-coding style writing.

--
[Erik Naggum]	Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
		Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions.	Wail: +47-2-836-863
--

jmd@dlogics.COM (Jens M. Dill) (12/05/90)

In article <ENAG.90Nov29012001@hild.ifi.uio.no>, (Erik Naggum) writes:
> In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:
> 
> >  I am able to import hand-edited files into Author/Editor without error,
> >  but when I try to validate the document, I get the error message:
> >
> >	   "Validation Error:  Text not allowed here"
> >
> >  wherever there is a new line between an end tag and a start tag.
> >  ...
>
> The problem can be reduced, I think, to the problem of the treatment
> of Record End in this contrived example:
> 
> 	<!element foo (bar+)>
> 	<!element bar (#PCDATA)>
> 
>      1	`<foo>'
>      2	`<bar>'
>      3	`caninus'
>      4	`</bar>'
>      5	`<bar>'
>      6	`felinus'
>      7	`</bar>'
>      8	`</foo>'
> 
> (where ` signifies Record Start, ' Record End for clarity, line
> numbers for reference, only)
> 
> According to section 7.6.1, this will be interpreted at the outer
> (foo) level as:
> 
>      1	`<foo>'
>      2	`<bar>...</bar>'
>      5	`<bar>...</bar>'
>      8	`</foo>'
> 
> Now, the RE in line 1 is clearly the first RE in the content of foo,
> and the RE in line 5 is clearly the last RE in the content of foo.
> According to said section, these are to be ignored.
> 
> The problem is the RE in line 2, and the question boils down to this:
> 
> 	Is this RE recognized as /content/ or as /markup/?
> 
> I believe I understand this to be markup, and thus that it should be
> ignored.  It seems that Gary's problems stem from some decision
> amounting to viewing this as content, in which the RE would imply the
> start of a bar element, in which a new bar element is illegal (see
> amended note to section 11.2.4), or in which data content is not
> valid.
> 
> What am I missing here?  (I'm sure it's something.)
> 

What you are missing is a very obscure note added to section 11.2.4 by 
Amendment 1:

    NOTE -- It is recommended that "#PCDATA" be used only when data characters
    are to be permitted anywhere in the content of the element; that is, in a
    _content model_ where it is the sole token, or where _or_ is the only
    connector used in any _model group._

    This recommendation is made because separator characters, which are
    recognized as separators in _element content_, are treated as data in
    _mixed content._ ...

The note just about says it all, but it seriously understates the gravity
of the problem, and both the example and the sample solution provided are
laughably simplistic.

The core problem is that Gary has defined an element with "mixed content"
(the content model contains both "#PCDATA" and GI's of sub-elements), and
has done so in such a way that somewhere in the content model, you come 
across the situation where sub-element A ends and the only legal things
that can follow sub-element A are other elements (#PCDATA is not legal at
this point in the model).  Now, IF the instance is set up so that

   -- Element A has an explicit end-tag
   -- There is a separator character (space, tab, record end)
      between the end-tag and the start-tag of the next sub-element

then the parser attempts to read the separator as data, discovers it cannot
match data at this point in the current element, and starts trying to infer
omitted start-tags or end-tags that would get it to a point where #PCDATA
would be acceptable.  At this point the parser gets so confused that the
eventual error message has no chance of bearing any relation at all to anything
involved in the original problem.

The example given in the note quoted above, (x, #PCDATA), is, in my opinion,
oversimplified because if this were the whole content model for an element,
TWO SUCCESSIVE record ends would be required (before the start-tag of "x")
to trigger the problem (as Erik points out, the first, since it follows the 
start-tag of the containing element, is attributable to markup, and therefore
ignored).  Some more illuminating examples:

    (x?, y, #PCDATA)    trouble at </X>&#RE;<Y>
    (#PCDATA | (x,y))   ditto
    (#PCDATA | x+)      trouble at </X>&#RE;<X>
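These three cases can be checked with a small Python sketch.  It rests
on loud assumptions: each content model is translated by hand into a
regular expression over a one-letter alphabet (x and y stand for whole
sub-elements, d for one data character), and the inference of omitted
tags described above is not modelled at all.  MODELS, CASES and accepts
are illustrative names.

```python
import re

# Hand-written regex equivalents of the three content models.
# #PCDATA matches zero or more data characters, hence d*.
MODELS = {
    "(x?, y, #PCDATA)": r"x?yd*",
    "(#PCDATA | (x,y))": r"d*|xy",
    "(#PCDATA | x+)": r"d*|x+",
}


def accepts(model, content):
    """True if the whole content string satisfies the model's regex."""
    return re.fullmatch(MODELS[model], content) is not None


# (model, content with the RE ignored, content with the RE read as data)
CASES = [
    ("(x?, y, #PCDATA)", "xy", "xdy"),
    ("(#PCDATA | (x,y))", "xy", "xdy"),
    ("(#PCDATA | x+)", "xx", "xdx"),
]

for model, re_ignored, re_as_data in CASES:
    print(model, accepts(model, re_ignored), accepts(model, re_as_data))
```

Each line prints True followed by False: the content is acceptable when
the RE is dropped and unacceptable when the RE counts as a data
character, which is the situation at the points marked "trouble" above.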

The solution proposed is to "replace 'PCDATA' with the GI of an element whose
content is '#PCDATA' and both of whose tags can be omitted."  This is effective
only in the first example above, because in the others, the #PCDATA reference
is not contextually required and therefore the start-tag of its replacement GI
cannot be omitted in practice.

My experience with other solutions is that they are equally weak.  The only
GOOD solution I know of is to become aware of the problem and avoid writing
a DTD that could cause it.

In my opinion, this is an area where SGML is seriously flawed, for the
following reasons:

1.  The problem is subtle and hard to predict.  It relies on a designation
    of "mixed content", which in turn relies on the presence or absence of
    a single "#PCDATA" in what may be a very complex content model constructed
    with heavy reliance on parameter entities.  I have seen cases where the
    problem was missed for weeks because there were two very similarly
    defined elements, one of which ignored record-ends after end-tags and
    one of which choked on them.  It is also a non-trivial problem to
    study a content model and determine if #PCDATA is, in fact, permitted
    between any two sub-elements.

2.  The problem is one that does not manifest itself until exactly the right
    combination of circumstances is encountered.  This means that a very large
    collection of instances could be built against a DTD before one of them
    demonstrated the flaw.  This means a lot of recoding of instances unless
    we can repair the DTD in such a way that existing instances will still
    parse.

3.  There is no good general way to fix the problem in an existing DTD without
    either requiring changes in the tag structure of existing documents or
    loosening the DTD so that it accepts #PCDATA in places where it formerly
    did not.
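As a rough screening aid for point 1, the literal wording of the
Amendment 1 note can be turned into a check.  This is only a sketch in
Python under stated assumptions: parameter entities are taken to be
already expanded, the model is inspected as plain text rather than
parsed, and follows_note is an illustrative name.

```python
import re


def follows_note(model):
    """True if a content model meets the literal wording of the note:
    #PCDATA is absent, or is the sole token, or the model uses no
    connector other than "or" (|)."""
    if "#PCDATA" not in model:
        return True  # element content; the note does not apply
    tokens = re.findall(r"#PCDATA|[A-Za-z][\w.-]*", model)
    if tokens == ["#PCDATA"]:
        return True  # #PCDATA is the sole token
    # The "seq" (,) and "and" (&) connectors are the ones the note excludes.
    return "," not in model and "&" not in model


for model in ["(#PCDATA)", "(x?, y, #PCDATA)",
              "(#PCDATA | (x,y))", "(#PCDATA | x+)"]:
    print(model, follows_note(model))
```

The second and third models are flagged.  The last one, (#PCDATA | x+),
passes this literal reading even though it is listed above as a source
of trouble, so a check of this kind is at best a first filter.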

I know this issue has been a source of debate in the standards committee.
I may well have missed some important points.  But, to me, the note added
by amendment 1 has the look and feel of a compromise solution that resulted
from a failure to comprehend the full impact of the problem on a user.
I would (personally) urge the committee to take another look at the problem.

The opinions stated herein are my own; they are not to be interpreted as an
official opinion from Datalogics, Inc.

*=====* TIME CANNOT BE WASTED *=====*       -- Jens M. Dill
 \ But it can be used for purposes /           jmd@dlogics.com
  \ other than what was intended. /
   *=============================*

enag@ifi.uio.no (Erik Naggum) (12/06/90)

My most sincere thanks to both Jens M. Dill and Rodney Boyd for
helping me understand this difficult problem.

--
[Erik Naggum]	Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
		Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions.	Wail: +47-2-836-863