[comp.text] An SGML Overview

dns@sq.uucp (David Slocombe) (07/29/88)

Bill Tuthill's article voicing objections to Standard Generalized Markup
Language (SGML), an ISO (and ANSI) standard for documents, gives me an
excellent opportunity to set a lot of issues straight.

Bill's scathing comments on SGML appear to be based largely on Charles
Goldfarb's June 1981 paper in "SIGPLAN Notices".  The language was
refined between 1981 and 1986 when it became a standard (ISO 8879), and
Dr. Goldfarb and everyone else associated with SGML have learned a
great deal since 1981 about how not to be misinterpreted.  The standard
has been amended once since 1986, and a second amendment is currently
being balloted.

I should state up front that we at SoftQuad Inc. have a strong interest
in the success of SGML as a standard -- we sell three products so far
supporting the use of SGML, and more are on the way.  The reason
we committed ourselves to the support and promotion of SGML was that
our combined experience in the publishing industry convinced us that
SGML was a long-awaited solution to real problems, a solution whose
time had come.

* * * * * * * * * * * * * *

1. The SGML philosophy:

The following might form a manifesto for the SGML approach:

	- Documents are Data with Structure:
		A document is data in the same sense that an accounting
		file is data.  The data in a document is *structured*,
		in the same sense as an accounting file has records and
		fields.  Document structure is typically hierarchical.

	- Logical Content is Independent of Presentation:
		The visual presentation of the structured data in a
		document may take many forms depending on the
		application.  The document's logical structure and text
		should be distinguished from any visual-style-specific
		information so that the document may be readily
		formatted according to a style that is most suited to
		the application of the moment, and can be loaded
		conveniently into databases or otherwise processed.

	- Document Classes Need to be Defined:
		For the sake of both database applications and graphic arts
		applications, it is desirable to define classes of
		document-structures and the syntax which encodes them,
		and to assign documents to the appropriate class (or to
		create them in the first place so that they belong to a
		specified class).  The definition of a class of
		document-structures and the corresponding syntax
		amounts to a formal grammar, which the SGML world calls
		a Document Type Definition (or DTD).

	- The DTD Goes With the Document:
		When a document is "interchanged", it is considered to
		consist of the appropriate DTD (or a reference to it)
		followed by the document itself (the "document-instance"),
		which must have a structure (and be encoded) according
		to the rules found in the DTD.

	- Document Access is Through a Parser:
		An "SGML Parser" is an application program which reads the
		DTD and then parses the document-instance according to
		the rules found in the DTD.  Such a parser will
		"validate" the document and report an error if the
		document does not actually parse correctly according to
		the DTD.  This parser can also be used as the front-end
		to a database-loading or text-formatting program or
		tools like Writer's Workbench.
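
The last two points can be made concrete with a toy.  What follows is a
minimal sketch in Python, not an SGML parser: the "DTD" is a hypothetical
table mapping each element name to a regular expression over the names
of its children, and the document-instance is assumed to be already
parsed into nested (name, children) tuples.  TOY_DTD, validate and the
memo structure are all invented for illustration.

```python
import re

# Hypothetical toy "DTD": element name -> regular expression over the
# space-separated names of its children.  "#PCDATA" stands for text.
TOY_DTD = {
    "memo": r"to from body",
    "to": r"#PCDATA",
    "from": r"#PCDATA",
    "body": r"para( para)*",
    "para": r"#PCDATA",
}


def validate(node, dtd=TOY_DTD):
    """Return a list of error messages; an empty list means the tree is valid."""
    name, children = node
    if name not in dtd:
        return [f"undeclared element <{name}>"]
    # Describe the children as a string of names, with text shown as #PCDATA.
    seen = " ".join("#PCDATA" if isinstance(c, str) else c[0] for c in children)
    errors = []
    if not re.fullmatch(dtd[name], seen):
        errors.append(f"<{name}> contains [{seen}], expected {dtd[name]}")
    # Check every child element against its own rule as well.
    for child in children:
        if not isinstance(child, str):
            errors.extend(validate(child, dtd))
    return errors


good = ("memo", [("to", ["Bill"]), ("from", ["David"]),
                 ("body", [("para", ["SGML is not troff."])])])
bad = ("memo", [("from", ["David"]), ("body", [])])
```

Here validate(good) finds nothing to complain about, while validate(bad)
reports both the missing <to> and the empty <body>.  A real parser works
from the DTD's own declaration syntax rather than from a Python table.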

* * * * * * * * * * * * * *

2. SGML is NOT a text formatting language:

Because I am a troff-hack and it is easy for me to think of suitable
examples from the troff language, and because I can reasonably assume
that most readers of this group have at least a passing acquaintance
with troff, I will be making comparisons to troff in what follows. But
please don't let that lead you into thinking that SGML is a way of
specifying the way a document is to *look* on some medium.  It isn't.

In fact, people new to SGML have to fight continually against the
tendency to express document structures in terms of specific *appearance*
instead of *information content*.  A good example would be a title in
a bibliographic reference.  It is likely to be typeset in italic
type, so people have a tendency at first to set up an element called <italic>
and use that for tagging this title (as well as emphasized text).
But anyone who wants to search the document for a title reference
is not interested in other things that might be in italic type, and does
not care about the typeface actually used for titles.  The element-type
and its form of presentation are simply not in the same universe.
The element is more appropriately called <title>.
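
A small sketch of why this matters, in Python.  Well-formed XML and the
standard xml.etree module are used only as a convenient stand-in for an
SGML instance and parser; the element names, the sample text and the two
style tables are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in document: tags name what the text *is*, not how it looks.
DOC = """<article>
<para>This point is <emph>crucial</emph>.</para>
<bibref><author>A. N. Author</author>, <title>A Sample Book</title>, 1988.</bibref>
</article>"""

# Hypothetical style tables: presentation is chosen per application
# and lives outside the document.
PRINT_STYLE = {"title": "italic", "emph": "italic"}
TERMINAL_STYLE = {"title": "underline", "emph": "bold"}


def find_titles(xml_text):
    """Return the text of every <title> element, whatever typeface it gets."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("title")]


def styled_as(xml_text, style, face):
    """Return the text of every element that a given style sets in `face`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if style.get(el.tag) == face]
```

Searching by element finds only the title.  Searching by typeface, which
is all an <italic> tag would allow, also drags in the emphasized word,
and the answer changes as soon as the style does.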

(Minor point:  there *is* an "escape hatch" in SGML: it is called
the "processing instruction", which has -- in the reference concrete
syntax -- the form <? ... > where the three dots can be any garbage.
An SGML parser will not attempt to parse this, but will pass it on
to the application unchanged.  This mechanism is sometimes used to pass
formatter-specific stuff through to troff or whatever, but it can be
misused.)
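
The pass-through behaviour can be sketched with Python's standard
html.parser module.  HTMLParser is not an SGML parser, but it does
recognize the same <? ... > form and hands the contents, unparsed, to a
handle_pi callback.  The PassThrough class, split_instructions and the
sample input are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PassThrough(HTMLParser):
    """Collect ordinary text and processing instructions separately."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.instructions = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_pi(self, data):
        # The parser does not look inside the instruction; it just hands it on.
        self.instructions.append(data)


def split_instructions(source):
    """Return (document text, list of processing instructions)."""
    parser = PassThrough()
    parser.feed(source)
    parser.close()
    return "".join(parser.text), parser.instructions


SAMPLE = "<para>First page.<?troff .bp>Second page.</para>"
```

With SAMPLE, the instruction "troff .bp" comes back untouched for a
formatter to act on, which is also why a document full of such
instructions is tied to that one formatter.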

* * * * * * * * * * * * * * *

3. SGML allows flexibility without compromising portability:

This is a critical point.  Although SGML *allows* (but does not
require, of course!) variations in document structure and encoding that
are simply mind-boggling, this flexibility does *NOT* reduce the
portability of an SGML document.  This is because an SGML document
logically begins with its Document Type Declaration which must define
the allowable structural elements and other SGML syntax for the
following document-instance.  The DTD shows how to interpret the
information contained in the document in an abstract manner independent
of specific (concrete) coding.  With the SGML parser, the document
can always be reduced to only the *information* which its author intended
it to contain, and then re-encoded in an alternative manner.

There is a default "reference concrete syntax" which any SGML parser
must assume on startup, and if the document itself uses a variant
concrete syntax then the DTD must define that (using the reference
concrete syntax to do so, of course).

This is analogous to troff's ".ec", ".cc" and ".c2" requests for changing
the escape character and the two control characters.  Note that troff also
starts off with default values for these characters so that it can interpret
any troff document.

In SGML the comparable main defaults would be "<" and "&".  However, in
SGML there are many, many other characters or strings which are magical
following "<" or "&" and which have defaults that can be redefined.
Even the character-sets used for text, for non-text (perhaps binary)
data, and for SGML tags are definable.  Finally, the meaning of record
boundaries (e.g. the newline) is subject to fine control.
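
As a rough illustration of redefinable delimiters, here is a toy
tokenizer in Python whose tag delimiters are parameters.  The field
names follow the standard's delimiter roles (STAGO, ETAGO, TAGC), but
the Delimiters class, the tokenize function and the square-bracket
variant are hypothetical, and only a sliver of the real concrete syntax
is modelled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiters:
    stago: str = "<"    # start-tag open
    etago: str = "</"   # end-tag open
    tagc: str = ">"     # tag close


REFERENCE = Delimiters()                                  # the defaults
VARIANT = Delimiters(stago="[", etago="[/", tagc="]")    # a redefinition


def tokenize(text, d=REFERENCE):
    """Split text into ("start", name), ("end", name) and ("data", text) tokens."""
    # Try the end-tag opener first, since it may begin with the start-tag opener.
    tag = re.compile(
        "({}|{})([A-Za-z][A-Za-z0-9.-]*){}".format(
            re.escape(d.etago), re.escape(d.stago), re.escape(d.tagc)))
    tokens, pos = [], 0
    for m in tag.finditer(text):
        if m.start() > pos:
            tokens.append(("data", text[pos:m.start()]))
        kind = "end" if m.group(1) == d.etago else "start"
        tokens.append((kind, m.group(2)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens
```

tokenize("<p>Hi</p>") and tokenize("[p]Hi[/p]", VARIANT) yield the same
token list: the encoding differs, the information does not.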

But there are more interesting methods for changing the form in which a
document is encoded.  SGML has "markup-minimization", "shortrefs" and
"datatags", which provide means to make documents with little or no
visible structural coding acceptable to an SGML parser.  People have
used Sobemap's parser, for example, to parse telex messages containing
bank transactions and to churn out database updates *plus* troff
instructions to print paper records of the transactions.  The telex
messages were in natural-language French.  (This kind of thing relies
on and takes advantage of the fact that in most application-areas
information is already structured and that structure is already encoded
in the document even if it is not obvious to anyone who is not intimate
with that application.  Writing an SGML DTD for a document encoded in
such a way, which requires familiarity with the implicit structure and
its encoding, makes that structure apparent and accessible to anyone
with an SGML parser.)
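
To give the flavour of this, here is a deliberately tiny sketch in
Python.  It is not the shortref or datatag machinery of the standard,
and it has nothing to do with the telex application mentioned above.  It
only shows implicit structure (a tab between fields, a newline between
records) being made explicit.  The record layout and the element names
are invented.

```python
def expand(raw):
    """Turn tab-separated 'payee<TAB>amount' lines into explicit markup."""
    out = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # blank lines carry no information in this toy format
        payee, amount = line.split("\t")
        out.append(
            f"<entry><payee>{payee}</payee><amount>{amount}</amount></entry>")
    return "\n".join(out)


RAW = "Smith\t120.50\nJones\t75.00\n"
```

In SGML proper the mapping from "tab" and "newline" to tags would be
declared in the DTD, so the parser itself performs the expansion.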

This flexibility is just fine as far as portability is concerned because
all the variations from plain-vanilla must be defined in the DTD, which
in turn is written in a language guaranteed to be understood by the
parser.

I expect that, in practice, most documents in the future will be coded
in something approaching plain-vanilla, at least once they get "into
the system".  From a data-processing point of view that is usually what
you want.  But two factors forced the SGML Committee to provide all the
flexibility:

	- Somewhere, somewhen, humans originate the documents and they
	don't want to work any harder than necessary.  I'm sure there
	were Production Managers of typesetting departments on the
	Committee who complained, "But that means N percent more
	keystrokes!  Keystrokes cost MONEY!"  When I worked for a
	newspaper I heard this all the time.

	- There are billions of documents already out there (ask the
	DOD) that have to be converted to SGML.  If the current form of
	those documents *can* actually be parsed unambiguously without
	change, then there is at least a good chance an SGML DTD can be
	written to specify that parsing.

Often the first thing a site should do with a document encoded with all
sorts of weird datatags and shortrefs and variant concrete syntaxes is
to parse it (preceded, as it must be, by its DTD) with a full-blown
SGML parser like Sobemap's, in order to turn it into a straightforward
unfancy document with a much simpler DTD.  After that, only a very
simple parser will be needed to process and re-process this document.
Absolutely no information is lost by this translation:  you are just
getting rid of various forms of data-compression or alternative notation.

* * * * * * * * * * * * * * * * *

4. SGML handles graphics, tables, and equations:

Graphics is easy:  ISO already has standards for graphics -- more than
one!  So all the SGML Committee had to do was make sure that those
other standards could be used within SGML.  Basically they "punted the
problem" rather than get into a turf war with other ISO and ANSI
committees.  The SGML standard effectively #includes the existing
(and future) graphics standards.

Tables and equations are simply particular substructures of documents.
So SGML itself does not *need* to have any special facilities for these
kinds of material.

I have seen a number of different sub-DTDs to handle tables or
equations.  One of them unabashedly specified something close to eqn's
style of input, partly to illustrate the power of SGML's data-compression
tools mentioned in the previous section:  shortrefs and datatags and
markup-minimization.

Right now, people are experimenting to see just how far they can push
SGML's flexibility -- "Look Ma, no hands!"  But simple approaches will
probably be best in the end.

The AAP Electronic Manuscript Project has a booklet on each of "tabular
matter" and "mathematical formulas" describing *their* particular sub-DTDs
for tables and equations respectively. (AAP is Assoc. of American Publishers.)

* * * * * * * * * * * * * * * * *

5. SGML has an elegant symbolic reference mechanism:

In troff you define a string with ".ds x ..." and reference it with "\*x".
For a macro you say ".de XX", and reference it with ".XX".
Special symbols for which there is no single ASCII character are referenced
with \(xx . If you want to #include a file you do ".so filename".
If you want to get some information from the system or another program
you use ".sy commandline".

In SGML there is a single mechanism to do all these things:  the Entity.
You refer to an Entity in the same way no matter which kind of thing it is.
It is when you define the Entity that it takes on the character of a string,
or special symbol, or macro, or file-to-be-included, or system-generated data.

Note that this is of particular significance for portability when the writer
is using many uncommon symbols.  At one site a symbol may be on a font.
At another site it has to be constructed, as in eqnchar, out of other symbols.
At yet another site it must be slurped in by the typesetting program in the
form of a bitmap.  The writer cannot know which technique will be needed
at some arbitrary time in the future.  With Entities he doesn't need to know.
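
A minimal sketch of the idea in Python, assuming references of the form
&name; and a per-site table of definitions.  The resolve function, both
site tables and the bitmap placeholder are hypothetical; a real system
would take the definitions from entity declarations in the DTD.

```python
import re


def resolve(text, entities):
    """Replace each &name; with whatever its definition yields."""
    def lookup(match):
        definition = entities[match.group(1)]  # KeyError if undeclared
        # A definition is either a literal string or a callable that
        # fetches the replacement (a file read, a system query, ...).
        return definition() if callable(definition) else definition
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", lookup, text)


# Site A has the symbol on a font and uses a troff special character.
SITE_A = {"alpha": r"\(*a", "vendor": "SoftQuad Inc."}

# Site B has to pull the symbol in some other way (placeholder text here).
SITE_B = {"alpha": lambda: "[bitmap: alpha]", "vendor": "SoftQuad Inc."}

SOURCE = "The angle &alpha; is small."
```

SOURCE is the same at both sites.  Only the table handed to resolve
changes, which is the whole point.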

* * * * * * * * * * * * * * * *

To sum up:  a good standard -- one that is really *needed* -- is one that,
by imposing a discipline, creates freedom to do things that couldn't be
done reliably (or at all) before.  The SGML standard means that people
can write or buy SGML parsing technology and then can be confident that
their systems can understand documents written to meet the SGML standard
no matter where (or when) they were generated.  This is the first critical
condition to be met if you are going to be able to format these documents
or put them into databases.

There is room for many SGML Document Type Definitions (some small, some
very very large).  However, if your organization wants to guarantee that
a document created in one department is processible without change in
another department, you probably want to create, centrally, one fairly
large and complex DTD which is adequate to the needs of all parts of the
organization.  Then some groups within the organization may use subsets
of this DTD, but you know that documents created with one subset can be
combined with documents created with another subset and that they will
be supported by your typesetting system (assuming you have figured out
how to format every aspect of the jumbo DTD).  This is equivalent to
the data definitions of a corporate management information system, with
"views" which are subsets of the overall data definition set.
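
The subset relationship can be checked mechanically.  The sketch below,
in Python, reduces a DTD to "which children may each element contain"
(ignoring order and repetition), so it is a simplification and not how
an SGML system compares DTDs.  MASTER, the two views and is_view_of are
invented for illustration.

```python
# Hypothetical corporate DTD: element -> set of permitted child elements.
MASTER = {
    "report": {"title", "section"},
    "section": {"title", "para", "table", "equation"},
    "table": {"row"},
}

# A departmental subset that only ever writes plain sections.
MEMO_VIEW = {"report": {"title", "section"}, "section": {"para"}}

# A subset that has drifted: <sidebar> is unknown to the master DTD.
ROGUE_VIEW = {"section": {"para", "sidebar"}}


def is_view_of(view, master):
    """True if every element and child allowed by view is allowed by master."""
    return all(name in master and allowed <= master[name]
               for name, allowed in view.items())
```

Under this simplified model, anything acceptable to MEMO_VIEW is also
acceptable to MASTER, whereas ROGUE_VIEW can produce documents the
central typesetting system has never heard of.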

SGML is working for people now:  last year, McGraw-Hill took a huge
collection of SGML-encoded articles and produced a 16-volume Encyclopedia
of Science and Technology.  Simultaneously, they published a CD-ROM
containing the same materials and also made the textual component
available on-line via the Westlaw database.  The same source files
were used to create all three editions.  This project could not
have taken place without an SGML approach.

The main problem with SGML right now is that the Standard is difficult
to understand (aren't all standards like this?), and books explaining
SGML are still "in preparation".  Time will take care of this.  In the
meantime the AAP Electronic Manuscript Project books are the best that
are available.  Also, the Graphic Communications Association runs courses on
SGML.

UPCOMING CONFERENCES (where you'd hear lots about SGML):
	TechDoc XII, August 23 to 26, San Diego, CA
	SGML for the Desktop, November 16 to 18, Boston, MA

--------------------------------------------------------------------
David Slocombe				uucp: {utai,utzoo}!sq!dns
Vice-Pres., R&D
SoftQuad Inc.				Internet: dns@sq.com
720 Spadina Avenue
Toronto, Ontario, Canada M5S 2T9
(416) 963-8337