[comp.text.sgml] Looking for on-line DTD's and/or SGML document files

jpl@bat (Jeff Lankford) (04/03/91)

I want some files to feed to SGML parsers;
does anyone have any on-line DTD's and/or SGML document files?
I'd like to use them as regression tests of SGML parsers,
so the actual content is unimportant, but the content models
should stress the parsers in some fashion (and the files should be
complete, i.e. include any necessary external files, etc.).
I poked around the NIST anonymous FTP machine,
but didn't find anything useful.
Does anyone know if SGML sources for MIL STD or SPEC documents are
available on-line anywhere (a pretty revolutionary idea, huh?)?

jpl

enag@ifi.uio.no (Erik Naggum) (04/03/91)

Jeff,

The CALS specifications (and DTDs) are available from some FTP server
somewhere.  I'll try to find out where it was that I found them.  (I
subsequently got yelled at for wasting disk space, so I may now only
be lucky enough to have them on a tape somewhere.)

In the meantime, I'd like to suggest paying careful attention to the
parameter separator (ps) and entity references, which in my experience
are mishandled by simple-minded parsers, and are somewhat hard to get
right unless you grok a few things about how the standard presumes
that things work.  It does tell you about them, briefly, in F.1.1.1
Entities...

For instance, given

    <!ENTITY foo "a"		>
    <!ENTITY bar "(b|c)"	>
    <!ENTITY zot "(d)"		>

This declaration is syntactically legal

    <!ELEMENT%foo%bar+%zot	>

but restricted due to these short paragraphs in 10.1.1 Parameter
Separator (page:line numbers refer to The SGML Handbook).

372:15	A required /ps/ that is adjacent to a delimiter or another
	/ps/ can be omitted if no ambiguity would be created thereby.

372:18	A /ps/ must begin with an /s/ if omitting it would create an
	ambiguity.

Recall the syntax production for the parameter separator:

	[65] ps = s | Ee | parameter entity reference | comment

The parsed result is the same as if the declaration had been:

    <!ELEMENT a (b|c) +(d)>

but is really

    <!ELEMENT%fooa_%bar(b|c)_+%zot(d)_ >
             ----^ ----^^^^^  ----^^^

where underlined parts are treated as separators, ^'ed parts are the
text of the entity referenced, and _ indicates the Entity end signal.
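
To make the annotation concrete, here is a minimal Python sketch of
that reading order.  It assumes foo, bar and zot are parameter
entities holding the texts given above; the names ENTITIES, REF,
events and annotate are made up for illustration, and the reference
syntax is simplified (a "%", a name, and an optional ";"), so this is
not how any particular parser does it.

```python
import re

# Entity texts from the example, assumed to be parameter entities.
# (Hypothetical table; a real parser builds it from the declarations.)
ENTITIES = {"foo": "a", "bar": "(b|c)", "zot": "(d)"}

# Simplified reference: "%" + name + optional ";" (reference close).
REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def events(text, entities):
    """Yield ("data", chars), ("ref", name) and ("Ee",) in reading order."""
    pos = 0
    for m in REF.finditer(text):
        if m.start() > pos:
            yield ("data", text[pos:m.start()])
        name = m.group(1)
        yield ("ref", name)
        # The parser is now fed from the entity until the entity ends.
        yield from events(entities[name], entities)
        yield ("Ee",)
        pos = m.end()
    if pos < len(text):
        yield ("data", text[pos:])


def annotate(text, entities):
    """Render the event stream in the notation used above ("_" is Ee)."""
    parts = []
    for ev in events(text, entities):
        if ev[0] == "ref":
            parts.append("%" + ev[1])
        elif ev[0] == "data":
            parts.append(ev[1])
        else:
            parts.append("_")
    return "".join(parts)


print(annotate("<!ELEMENT%foo%bar+%zot >", ENTITIES))
# prints: <!ELEMENT%fooa_%bar(b|c)_+%zot(d)_ >
```

The point of the event stream is that each Ee stays visible to the
caller, so the parser can tell where an entity's text stops.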

Note that the parameter entity reference is a separator in and of
itself, but that an /s/ is required in the above element declaration
because of [372:18].  It is not always easy to determine when
something would constitute "an ambiguity", since this is primarily
intended for the human reader, not the computer, which can handle
these cases perfectly well.

Some implementations get this wrong, since they treat entity
references as textual replacements, not as calls to the entity
manager, which feeds the parser from the entity until it ends.  (Such
an implementation would get "<!ELEMENTa(b|c)+(d) >", and would not be
fully able to check for entity ends, unless it had "quirks" in it
solely for this purpose, which some smaller parsers actually have.)
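
The difference can be sketched in Python, under heavy simplification.
substitute() is the macro-style expansion; closes_inside_entity()
stands in for an entity manager that knows how many entities are
open, and flags a ">" seen while one still is.  Both function names,
the entity x, and the check itself (it ignores literals and comments)
are illustrative assumptions, not taken from any real parser.

```python
import re

# Simplified reference: "%" + name + optional ";" (reference close).
REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def substitute(text, entities):
    """Macro-style expansion: the entity boundaries are thrown away."""
    return REF.sub(lambda m: substitute(entities[m.group(1)], entities), text)


def closes_inside_entity(text, entities, depth=0):
    """Return True if a '>' is read while an entity is still open."""
    pos = 0
    for m in REF.finditer(text):
        if depth > 0 and ">" in text[pos:m.start()]:
            return True
        # Descend into the referenced entity, one level deeper.
        if closes_inside_entity(entities[m.group(1)], entities, depth + 1):
            return True
        pos = m.end()
    return depth > 0 and ">" in text[pos:]


good = {"foo": "a", "bar": "(b|c)", "zot": "(d)"}
bad = {"x": "a (b)> <!ELEMENT c (d)"}  # hypothetical troublemaker

print(substitute("<!ELEMENT%foo%bar+%zot >", good))
# prints: <!ELEMENTa(b|c)+(d) >
print(substitute("<!ELEMENT %x;>", bad))
# prints: <!ELEMENT a (b)> <!ELEMENT c (d)>
print(closes_inside_entity("<!ELEMENT %x;>", bad))
# prints: True
```

After substitution the second case looks like two ordinary
declarations; only the depth-aware reader can notice that the first
one was closed from inside the entity.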

After studying the spec and Goldfarb's excellent book, I've come to
the conclusion that something needs to be said about the data flow
model in an SGML parser.  It's not intuitively evident from the spec
itself, and has posed some problems.  The effect is primarily on the
conceptual model, but this will invariably have a major effect on the
implementation techniques employed, and thus on the result, and will
introduce subtle bugs that are difficult to remove.

Well, all of this may not apply to you, but you might still find it
useful.  I know for sure that I spent a lot of time "getting" the idea
of why entity references aren't allowed everywhere, the way macro
calls are in, e.g., C, and that they are actually handled by the parser.

In short, entities are not as straightforward as they might seem.
(Or is it only me?)

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

obusek@dtoa3.dt.navy.mil (Obusek) (04/04/91)

In article <ENAG.91Apr3141917@holmenkollen.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Jeff,
>
>The CALS specifications (and DTDs) are available from some FTP server
>somewhere.  I'll try to find out where it was that I found them.  (I

They are available from the CALS bulletin board operated by NIST,
internet address:

CALSBBS.CME.NIST.GOV or 129.6.32.173

Brenda O'Busek
obusek@dtrc.dt.navy.mil

enag@ifi.uio.no (Erik Naggum) (04/04/91)

In article <ENAG.91Apr3141917@holmenkollen.ifi.uio.no>, I wrote:

   In the meantime, I'd like to suggest paying careful attention to the
   parameter separator (ps) and entity references, which in my experience
   are mishandled by simple-minded parsers

Apparently, it's also mis-handled by simple-minded posters.  I
provided you all with a handsome collection of syntax violations.

    <!ENTITY foo "a"		>
    <!ENTITY bar "(b|c)"	>
    <!ENTITY zot "(d)"	>

should be

    <!ENTITY % foo "a"		>
    <!ENTITY % bar "(b|c)"	>
    <!ENTITY % zot "+(d)"	>

and

    <!ELEMENT%foo%bar+%zot	>

should be

    <!ELEMENT%foo%bar%zot	>

or canonically

    <!ELEMENT%foo;%bar;%zot;	>

Sorry about this.  Luckily, it wasn't incredibly important to my
point.  (But still incredibly annoying:  *I* made a mistake!  Awful.)

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>