[comp.text] What SGML is and isn't

tbray@watsol.waterloo.edu (Tim Bray) (07/14/89)

Some SGML talk churning around the newsgroup again.  This is a good thing, as
people who are in the text-by-computer trade need to be thinking about the
issues that SGML raises.

Here at the New OED project, we have grappled with the structuring and
management of many different kinds of text, the most interesting perhaps being
the 550-Mb, highly structured dictionary itself.  We have used embedded markup,
and have built (& are now selling) a variety of software tools which are
designed for such data resources.  We describe the markup and our software as
`SGML-like', and indeed it looks like SGML, and some of the things our tools do
are reminiscent of the services provided by the SGML-oids.

But we are not, nor will we be, fully SGML-compliant.  The SGML concept is
based on descriptive markup, a wonderful idea [see Coombs et al in Nov. '87
CACM] and one which is necessary for any serious computer text processing.  But
the SGML standard itself is horribly flawed and permits some things which are
unhelpful and even dangerous.  The details are too lengthy and sordid to go
into here, but I can talk in detail on request.

It also has one serious design flaw which I shall discuss briefly here: that
for any document to be SGML-compliant, there must exist a Document Type
Definition (DTD), which is a formal grammar prescribing the syntax and
structure of the document.  Sounds fine.  But there is a large class of
existing documents
(including dictionaries, other reference works, legislation, technical
documentation) for which it is either impossible or prohibitively difficult to
write a DTD, simply because the number of inconsistencies is so great (even if
proportionally speaking their frequency of occurrence is low).  The OED is an
example of such a document.  I know lots of others.
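
To make the point concrete, here is a minimal sketch of how a prescriptive
content model treats a single irregular entry.  It is not real SGML and not a
real DTD: the element names (hw, pron, sense, author) and the content model are
invented for illustration, and the model is written as a regular expression
over child-element names rather than in DTD syntax.

```python
import re

# Hypothetical content model for a dictionary entry: one headword,
# an optional pronunciation, then one or more senses.
# Expressed as a regex over space-terminated child-element names.
ENTRY_MODEL = re.compile(r"hw (pron )?(sense )+")

def conforms(children):
    """Return True if the sequence of child element names matches the model."""
    flattened = "".join(name + " " for name in children)
    return ENTRY_MODEL.fullmatch(flattened) is not None

regular = ["hw", "pron", "sense", "sense"]
irregular = ["hw", "pron", "author", "sense"]  # one unexpected element

print(conforms(regular))    # True
print(conforms(irregular))  # False
```

One stray element is enough to make the whole entry non-conforming, however
well structured the rest of it is.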

So are we to discard such documents just because we can't write DTD's for them?
Don't be silly.  They are structured, and in fact generally well structured.
What is required is software tools that use that structure to get work done,
without letting a requirement for a prescriptive grammar get in the way.
(This is what we try to do.)
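
As a rough illustration of that idea (a sketch only, not the New OED software),
a tool can use embedded markup directly without consulting any grammar.  The
tag names and the sample entry below are hypothetical, and the sketch assumes
elements of the same name are not nested inside one another.

```python
import re

def find_elements(text, tag):
    """Return the content of every <tag>...</tag> span, wherever it occurs.

    No grammar is consulted: the element may appear anywhere in the text.
    Assumes same-named elements are not nested.
    """
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(text)

# Hypothetical entry containing an element no predefined grammar allowed for.
entry = ("<hw>set</hw><pron>sEt</pron>"
         "<author>Shakespeare</author><sense>to put</sense>")

print(find_elements(entry, "author"))  # ['Shakespeare']
print(find_elements(entry, "sense"))   # ['to put']
```

The search works on whatever structure is actually present, so an unexpected
element is simply more data rather than an error.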

There is also the philosophical issue that arises when the editor of the OED
comes to me and says: "I want to put an author's name here because this is a
special case in the English language."  Do I say:

 1. "You can't.  You have to live by the grammar we predefined" or
 2. "OK, we'll fix the grammar for this one special case".

Yecch and yecch.  Especially given the fact that a person such as an OED editor
usually has a pretty good grasp of what he or she is doing and I, a computer
weenie, feel pretty uncomfortable telling him or her how to structure a
dictionary.

So what about SGML?  As several others have mentioned, its most important
potential role is as a truly portable interchange medium.  Something that is
desperately needed.  And it may succeed there.  But I still worry about the low
quality of its design getting in the way.
Cheers, Tim Bray, New OED Project, U of Waterloo (tbray@watsol.waterloo.edu)

dougcc@csv.viccol.edu.au (Douglas Miller) (07/18/89)

In article <15110@watdragon.waterloo.edu>,
tbray@watsol.waterloo.edu (Tim Bray) writes:
> Some SGML talk churning around the newsgroup again.  This is a good thing, as
> people who are in the text-by-computer trade need to be thinking about the
> issues that SGML raises.

Yup, especially for people like me who weren't on the net when it was churning
last time.

> But
> the SGML standard itself is horribly flawed and permits some things which are
> unhelpful and even dangerous.  The details are too lengthy and sordid to go
> into here, but I can talk in detail on request.

Yes please, more detail!