garyp@csg.waterloo.edu (Gary Pianosi) (11/22/90)
I am having some problems importing and exporting SGML documents using
SoftQuad Author/Editor v1.1. For various reasons, I need to transfer
SGML documents between a Mac (where they are edited using
Author/Editor) and a PC (where they are manually edited to contain
"rigorous markup"). As a result, I need the marked-up text to be both
readable and correct.

I am able to import hand-edited files into Author/Editor without error,
but when I try to validate the document, I get the error message:

    "Validation Error: Text not allowed here"

wherever there is a new line between an end tag and a start tag.
(I can delete these 'characters' to get rid of the error message, but
there are too many of them.) Strangely, no error is reported for new
lines between adjacent start tags or end tags, even though text is not
valid between the tags. I should mention that the hand-edited files
can be validated using MARK-IT and the canonical output can be
read into Author/Editor without any problems.

On the other side, when I try to export a document from Author/Editor,
the resulting text file has new lines only within text and after some
start tags. Aside from being difficult to read, some lines exceed 500
characters, making these files next to impossible to edit on the PC.

Finally, my questions ...

- What is the significance of record boundaries in SGML?

- Is Author/Editor's treatment of new lines between end and start
  tags incorrect? Or is the SGML standard open to interpretation?

- Is it possible to get Author/Editor to export a readable SGML
  document? Or do I have to run it through some other SGML parser
  to make it readable?

Below I have included a short SGML document that I believe is correctly
and rigorously marked up in accordance with the AAP Article DTD. It
produces the aforementioned error when imported into Author/Editor.
Can someone tell me whether this is a valid AAP Article document?

Any comments would be appreciated. Thanks.
---------------------------------------------------------------------
<ARTICLE>
<FM>
<TIG>
<ATL> A Trivial Article </ATL>
</TIG>
<AU><FNM>G.M.</FNM><SNM>Pianosi</SNM></AU>
<AU><FNM>P.D.</FNM><SNM>Jones</SNM>
<AFF>
<ONM>University of Waterloo</ONM>
<ODV>Computer Systems Group</ODV>
<PC>N2L 3G1</PC>
<CTY>Waterloo</CTY>
<CNY>Canada</CNY>
</AFF>
</AU>
</FM>
<BDY>
<SEC>
<ST>Introduction</ST>
<P>This paper...</P>
</SEC>
</BDY>
</ARTICLE>
---------------------------------------------------------------------
--
--
Internet: garyp@csg.UWaterloo.CA    Bitnet: garyp@watcsg.BITNET
Computer Systems Group, University of Waterloo, Waterloo, Ontario, Canada
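
[Editorial sketch] The workaround described above, deleting by hand every new line that sits between an end tag and a start tag, could in principle be scripted. The following Python is a minimal sketch under stated assumptions, not anything supplied with Author/Editor or MARK-IT: the function name `strip_record_ends_between_tags` and the regular expression are hypothetical, and the pattern assumes fully written-out tags with no comments, marked sections, or omitted or short tags in the way. It removes line breaks only where an end tag is directly followed by a start tag and leaves all other line breaks alone.

```python
import re

# Hypothetical pattern: an end tag, then whitespace containing at least one
# line break, then (as a lookahead) the opening of a start tag.
# "</" after the line break is deliberately NOT matched, so a line break
# between two end tags is left in place.
_END_THEN_START = re.compile(
    r"(</[A-Za-z][A-Za-z0-9.-]*\s*>)"   # group 1: the end tag itself
    r"[ \t]*(?:\r?\n[ \t]*)+"           # one or more line breaks with padding
    r"(?=<[A-Za-z])"                    # next thing is a start tag
)


def strip_record_ends_between_tags(text: str) -> str:
    """Delete line breaks that fall between an end tag and a start tag."""
    return _END_THEN_START.sub(r"\1", text)


sample = "<SEC>\n<ST>Introduction</ST>\n<P>This paper...</P>\n</SEC>"
print(strip_record_ends_between_tags(sample))
```

The obvious trade-off is that this makes lines longer, which works against the requirement that the files stay readable and editable on the PC.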
enag@ifi.uio.no (Erik Naggum) (11/29/90)
In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:
> I am able to import hand-edited files into Author/Editor without error,
> but when I try to validate the document, I get the error message:
>
>     "Validation Error: Text not allowed here"
>
> wherever there is a new line between an end tag and a start tag.
> (I can delete these 'characters' to get rid of the error message, but
> there are too many of them.) Strangely, no error is reported for new
> lines between adjacent start tags or end tags, even though text is not
> valid between the tags. I should mention that the hand-edited files
> can be validated using MARK-IT and the canonical output can be
> read into Author/Editor without any problems.

I've replied to Gary directly (to <garyp@csg.uwaterloo.ca>; the
!@#$%^&* mailer didn't recognize itself as "csg.waterloo.edu"), quoting
section 7.6.1 from SGML (with the amendment applied), but won't post
that here, for petty copyright reasons.
The problem can be reduced, I think, to the problem of the treatment
of Record End in this contrived example:
<!element foo (bar+)>
<!element bar (#PCDATA)>
1 `<foo>'
2 `<bar>'
3 `caninus'
4 `</bar>'
5 `<bar>'
6 `felinus'
7 `</bar>'
8 `</foo>'
(where ` signifies Record Start, ' Record End for clarity, line
numbers for reference, only)
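
[Editorial sketch] The record notation used in the example can be reproduced mechanically. The following Python is only an illustration of the notation, with a hypothetical function name `mark_records`; it assumes each line of the input text file corresponds to one record and says nothing about how any particular SGML system maps files to records.

```python
def mark_records(text: str) -> list:
    """Wrap each record (line) in ` for Record Start and ' for Record End."""
    return ["`" + record + "'" for record in text.splitlines()]


for number, record in enumerate(mark_records("<foo>\n<bar>\ncaninus\n</bar>"), start=1):
    print(number, record)
```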
According to section 7.6.1, this will be interpreted at the outer
(foo) level as:
1 `<foo>'
2 `<bar>...</bar>'
5 `<bar>...</bar>'
8 `</foo>'
Now, the RE in line 1 is clearly the first RE in the content of foo,
and the RE in line 5 is clearly the last RE in the content of foo.
According to said section, these are to be ignored.
The problem is the RE in line 2, and the question boils down to this:
Is this RE recognized as /content/ or as /markup/?
I believe I understand this to be markup, and thus that it should be
ignored. It seems that Gary's problems stem from some decision
amounting to viewing this as content, in which the RE would imply the
start of a bar element, in which a new bar element is illegal (see
amended note to section 11.2.4), or in which data content is not
valid.
What am I missing here? (I'm sure it's something.)
I've read the spec several times, but won't claim that I understand
and remember every single thing, due to the high number of references
and other spaghetti-coding style writing.
--
[Erik Naggum] Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions. Wail: +47-2-836-863
--
jmd@dlogics.COM (Jens M. Dill) (12/05/90)
In article <ENAG.90Nov29012001@hild.ifi.uio.no>, (Erik Naggum) writes:
> In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:
>
> > I am able to import hand-edited files into Author/Editor without error,
> > but when I try to validate the document, I get the error message:
> >
> >     "Validation Error: Text not allowed here"
> >
> > wherever there is a new line between an end tag and a start tag.
> > ...
>
> The problem can be reduced, I think, to the problem of the treatment
> of Record End in this contrived example:
>
> <!element foo (bar+)>
> <!element bar (#PCDATA)>
>
> 1 `<foo>'
> 2 `<bar>'
> 3 `caninus'
> 4 `</bar>'
> 5 `<bar>'
> 6 `felinus'
> 7 `</bar>'
> 8 `</foo>'
>
> (where ` signifies Record Start, ' Record End for clarity, line
> numbers for reference, only)
>
> According to section 7.6.1, this will be interpreted at the outer
> (foo) level as:
>
> 1 `<foo>'
> 2 `<bar>...</bar>'
> 5 `<bar>...</bar>'
> 8 `</foo>'
>
> Now, the RE in line 1 is clearly the first RE in the content of foo,
> and the RE in line 5 is clearly the last RE in the content of foo.
> According to said section, these are to be ignored.
>
> The problem is the RE in line 2, and the question boils down to this:
>
> Is this RE recognized as /content/ or as /markup/?
>
> I believe I understand this to be markup, and thus that it should be
> ignored. It seems that Gary's problems stem from some decision
> amounting to viewing this as content, in which the RE would imply the
> start of a bar element, in which a new bar element is illegal (see
> amended note to section 11.2.4), or in which data content is not
> valid.
>
> What am I missing here? (I'm sure it's something.)
>

What you are missing is a very obscure note added to section 11.2.4 by
Amendment 1:

    NOTE -- It is recommended that "#PCDATA" be used only when data
    characters are to be permitted anywhere in the content of the
    element; that is, in a _content model_ where it is the sole token,
    or where _or_ is the only connector used in any _model group._

    This recommendation is made because separator characters, which
    are recognized as separators in _element content_, are treated as
    data in _mixed content._ ...

The note just about says it all, but it seriously understates the
gravity of the problem, and both the example and the sample solution
provided are laughably simplistic.

The core problem is that Gary has defined an element with "mixed
content" (the content model contains both "#PCDATA" and GIs of
sub-elements), and has done so in such a way that somewhere in the
content model, you come across the situation where sub-element A ends
and the only legal things that can follow sub-element A are other
elements (#PCDATA is not legal at this point in the model).

Now, IF the instance is set up so that

  -- Element A has an explicit end-tag
  -- There is a separator character (space, tab, record end) between
     the end-tag and the start-tag of the next sub-element

then the parser attempts to read the separator as data, discovers it
cannot match data at this point in the current element, and starts
trying to infer omitted start-tags or end-tags that would get it to a
point where #PCDATA would be acceptable. At this point the parser gets
so confused that the eventual error message has no chance of bearing
any relation at all to anything involved in the original problem.
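
[Editorial sketch] The distinction drawn above, separators in element content versus data in mixed content, can be reduced to a very rough first test. The Python below is a sketch under stated assumptions, not a parser: the function names are hypothetical, and content models are assumed to be plain strings with all parameter entities already expanded. It only reports whether a model is mixed at all; it does not answer the harder question of whether #PCDATA is actually permitted at a given point between two sub-elements.

```python
def is_mixed_content(content_model: str) -> bool:
    """True if the (already expanded) content model mentions #PCDATA."""
    return "#PCDATA" in content_model.upper()


def separator_treatment(content_model: str) -> str:
    """How a space, tab, or record end between sub-elements is treated."""
    return "data" if is_mixed_content(content_model) else "separator"


# "(bar+)" is the foo model from the contrived example earlier in the
# thread; "(x?, y, #PCDATA)" is an illustrative mixed-content model.
for model in ("(bar+)", "(x?, y, #PCDATA)"):
    print(model, "->", separator_treatment(model))
```

A model flagged as "data" is only a candidate for the problem; whether a given instance trips over it still depends on where the separator characters fall.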

The example given in the note quoted above, (x, #PCDATA), is, in my
opinion, oversimplified because if this were the whole content model
for an element, TWO SUCCESSIVE record ends would be required (before
the start-tag of "x") to trigger the problem. (As Erik points out, the
first, since it follows the start-tag of the containing element, is
attributable to markup, and therefore ignored.) Some more illuminating
examples:

    (x?, y, #PCDATA)      trouble at </X>&#RE;<Y>
    (#PCDATA | (x,y))     ditto
    (#PCDATA | x+)        trouble at </X>&#RE;<X>

The solution proposed is to "replace '#PCDATA' with the GI of an
element whose content is '#PCDATA' and both of whose tags can be
omitted." This is effective only in the first example above, because
in the others, the #PCDATA reference is not contextually required and
therefore the start-tag of its replacement GI cannot be omitted in
practice.

My experience with other solutions is that they are equally weak. The
only GOOD solution I know of is to become aware of the problem and
avoid writing a DTD that could cause it.

In my opinion, this is an area where SGML is seriously flawed, for the
following reasons:

1. The problem is subtle and hard to predict. It relies on a
   designation of "mixed content", which in turn relies on the
   presence or absence of a single "#PCDATA" in what may be a very
   complex content model constructed with heavy reliance on parameter
   entities. I have seen cases where the problem was missed for weeks
   because there were two very similarly defined elements, one of
   which ignored record-ends after end-tags and one of which choked
   on them. It is also a non-trivial problem to study a content model
   and determine if #PCDATA is, in fact, permitted between any two
   sub-elements.

2. The problem is one that does not manifest itself until exactly the
   right combination of circumstances is encountered. This means that
   a very large collection of instances could be built against a DTD
   before one of them demonstrated the flaw. This means a lot of
   recoding of instances unless we can repair the DTD in such a way
   that existing instances will still parse.

3. There is no good general way to fix the problem in an existing DTD
   without either requiring changes in the tag structure of existing
   documents or loosening the DTD so that it accepts #PCDATA in places
   where it formerly did not.

I know this issue has been a source of debate in the standards
committee. I may well have missed some important points. But, to me,
the note added by Amendment 1 has the look and feel of a compromise
solution that resulted from a failure to comprehend the full impact of
the problem on a user. I would (personally) urge the committee to take
another look at the problem.

The opinions stated herein are my own; they are not to be interpreted
as an official opinion from Datalogics, Inc.

                   *=====* TIME CANNOT BE WASTED *=====*
-- Jens M. Dill     \ But it can be used for purposes /
   jmd@dlogics.com   \ other than what was intended. /
                      *=============================*
enag@ifi.uio.no (Erik Naggum) (12/06/90)
My most sincere thanks to both Jens M. Dill and Rodney Boyd for
helping me understand this difficult problem.
--
[Erik Naggum] Snail: Naggum Software / BOX 1570 VIKA / 0118 OSLO / NORWAY
Mail: <erik@naggum.uu.no>, <enag@ifi.uio.no>
My opinions. Wail: +47-2-836-863