[comp.text] SGML defended

tbray@watsol.waterloo.edu (Tim Bray) (07/26/88)

In article <61024@sun.uucp> tut%cairo@Sun.COM (Bill "Bill" Tuthill) writes:
> I'm moving a discussion of SGML started in comp.text.desktop into this
> newsgroup, because I think the issues are larger than a desktop.
and so on.

I have to disagree with nearly every line of Bill Tuthill's contribution.
There are real problems with SGML, but they are not the ones he
identifies.  I think the problem is that he considers SGML strictly as a
typesetting system, which is really beside the point.  Detailed
discussion follows, but the important points are:

1. If any on-line use for a document other than printing it out
(Hypertext, information retrieval, on-line documentation) is
contemplated, structural rather than typographical markup is a
necessity.  The arguments for this are many and are overpowering in
their force.  Rather than run through them, I refer everyone to the
excellent article `Markup Systems and the Future of Scholarly Text
Processing', by Coombs, Renear, and DeRose in the Nov. '87 CACM.

2. The SGML standard is a crock.  I have not read it, but this is the
unanimous consensus of everyone I know who has tried to work with it.
The basic SGML syntax and concepts, however, are sound.  I think the
logical conclusion should be: let's not let the failure of the standards
drafters deter us from using this basically good idea.

Now, to address Mr. Tuthill's points:

>Instead, SGML should be compared to decent
>procedural languages such as troff and TeX.  There are good reasons why
>troff and TeX macro packages were invented: well-designed macros provide
>writers with a descriptive layer ...
No, SGML shouldn't be compared to these things.  SGML and the
typesetting packages exist to solve different problems.  When you want
to typeset your SGML document, you should translate it into troff or TeX
or PostScript or something that's good at that job.  SGML exists to
prevent typographical nits from getting in the way of structural
document design decisions.  See the CACM article.

>SGML is no panacea for portability.  Being a metalanguage, SGML does not
>provide one syntax, but only a method for describing different syntaxes.
>On p. 68 Goldfarb states, "SGML allows variant concrete syntaxes."  This
>is tantamount to saying it isn't really standard.  It would probably be
>as difficult to translate between variant syntaxes as to translate between
>troff and Interleaf or Frame.
The great virtue of SGML is that it is very easy for computers to parse
and is probably the most flexible form in which it is possible to store
text.  Our practical experience on the New OED project is that the first
thing to do with input text is to do away with all the typesetting
gibberish and get some approximation of SGML tags in there.  You don't
have to worry too much about getting them right; once the basic
structure is there, it's remarkably easy to transform the text into the
right setup, once you figure out what that should be.

>SGML was born obsolete.  Graphics are missing from the specification, as
>are provisions for tables and equations.  
It is certainly possible in SGML to make a reference to an
externally-stored graphic.  Then at typesetting time, you copy in the
appropriate PostScript/pic/rasterfile or whatever.  SGML does indeed
allow the specification of tables and equations, in a
typography-independent way that lends itself to a variety of
information-retrieval applications.  Try to make automatic sense out of
tbl or eqn source!  On the other hand, it's easy to translate SGML
structures *into* tbl or eqn or whatever.

>SGML:
> This added information, called <q>markup</q>, serves two purposes:
> <ol>
> <li>Separating the logical elements of the document; and
> <li>Specifying the processing functions to be performed on those elements.
> </ol>
> This figure represents divine document intervention.
--------
>troff
> This added information, called \*Qmarkup\*U, serves two purposes:
> .NP
> Separating the logical elements of the document; and
> .NP
> Specifying the processing functions to be performed on those elements.
> .LP
> This figure represents divine document intervention.

Which of these, do you think, lends itself better to online IR
applications?  Which is more easily automatically translated to the
other?  Both answers are obvious.
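
As a rough illustration of the SGML-to-troff direction, here is a
minimal Python sketch that turns the SGML fragment quoted above into
the troff fragment quoted above.  The tag-to-macro mapping is read off
the two samples; the function name and the line-by-line approach are
assumptions made for this example, not a real SGML parser.

```python
# Inline tag replacements, taken from the two quoted samples.
INLINE = {"<q>": r"\*Q", "</q>": r"\*U"}

def sgml_to_troff(text):
    """Convert the small SGML fragment above to its troff equivalent."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for tag, replacement in INLINE.items():
            line = line.replace(tag, replacement)
        if line == "<ol>":
            continue                  # list start needs no request here
        if line == "</ol>":
            out.append(".LP")         # back to an ordinary paragraph
            continue
        if line.startswith("<li>"):
            out.append(".NP")         # numbered paragraph
            line = line[len("<li>"):]
        out.append(line)
    return "\n".join(out)

sample = """This added information, called <q>markup</q>, serves two purposes:
<ol>
<li>Separating the logical elements of the document; and
<li>Specifying the processing functions to be performed on those elements.
</ol>
This figure represents divine document intervention."""

print(sgml_to_troff(sample))
```

Going the other way would mean guessing what each troff request was
meant to signify, which is the harder problem.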

>In the concrete syntax described, the
>ASCII characters < > & % ; appear to be reserved symbols, but Goldfarb
>offers no method for printing these characters literally.  
'<': &lt;
'>': &gt;
'&': &amp;
etc...  
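
A minimal Python sketch of that escaping step, assuming the document's
entity declarations define lt, gt, and amp (SGML does not predefine
them).  The function name is made up for this example.

```python
# Reserved characters and the entity references that stand in for them.
ENTITIES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

def escape_reserved(text):
    """Replace each reserved character with its entity reference.

    Working one character at a time means the '&' that starts an
    inserted reference is never escaped a second time.
    """
    return "".join(ENTITIES.get(ch, ch) for ch in text)

print(escape_reserved("if a < b & b > c"))
```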

>SGML documents are supposed to be rigorous, but
>rigorous means inflexible.  
A good point, and one of the big problems with the SGML standard.  ISO
SGML requires that one prepare what amounts to a *prescriptive* grammar
for your document.  This may be appropriate for airplane checkout
manuals (maybe), but most document creators, when you get right down to
it, know what they're doing pretty well and don't need a grammar getting
in their way.  Also there is the (common) problem of wanting to mark up
an existing body of text (for example the Oxford English Dictionary)
which just ain't gonna always follow the rules.  Does this mean one
gives up the descriptive power of structural markup?

Hey, I like troff/TeX and so on for doing typesetting.  But typesetting
is just one of many things that can be done with an electronic document.
If you want enough flexibility to do some of those other things, don't
limit yourself to typographical markup.
Cheers,
Tim Bray, New Oxford English Dictionary Project

romwa@gpu.utcs.toronto.edu (Mark Dornfeld) (07/27/88)

In article <7986@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>In article <61024@sun.uucp> tut%cairo@Sun.COM (Bill "Bill" Tuthill) writes:
Tim has accurately argued the points that Bill brought up.
SGML is NOT a typesetting system; Bill really missed that one.
We (Royal Ontario Museum) are trying to standardize on SGML
for many of our editorial projects.  By standardizing on a
markup language, we can write filters to troff, TeX,
Pagemaker, Ventura and also ask for bids from typesetters who
can read SGML.  This multiplies our options tremendously.

>2. The SGML standard is a crock.  I have not read it, but this is the
No, it isn't, well, at least not completely.  It's usable and the
standard is flexible as long as the DTD (Document Type Definition)
is complete.

>Hey, I like troff/TeX and so on for doing typesetting.  But typesetting
>is just one of many things that can be done with an electronic document.
>If you want enough flexibility to do some of those other things, don't
>limit yourself to typographical markup.
There's the key right there.

Mark T. Dornfeld
Royal Ontario Museum
100 Queens Park
Toronto, Ontario, CANADA
M5S 2C6

mark@utgpu!rom      - or -     romwa@utgpu

tbray@watsol.waterloo.edu (Tim Bray) (07/28/88)

In article <61454@sun.uucp>, sears@sun.uucp (Daniel Sears) writes:
>SGML provides rules for describing tag sets.  So let's create two very simple
>tag sets that we will assume are SGML conforming.  The first has two tags and
>the second has three.
>...
>it is possible to translate a document from the first tag set to the second.
>But it is not possible to translate a document from the second tag set to
>the first because there isn't an equivalent tag for <mark3>.  What SGML
>tries to guarantee is a way of describing the different tag sets, but it
>does not guarantee that the tag sets will be rich enough to hold all the
>objects that other tag sets may contain.

Uh, am I missing something?  The second has more structural information
than the first.  Clearly you can't translate without 

1. Losing information, or
2. Passing the extra info through; maybe the text processor for the first
   example can be made to understand it.
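
The two options can be sketched in a few lines of Python.  The quoted
post names only <mark3>; the names <mark1> and <mark2>, the function
name, and the regular expression are assumptions made for this
example.

```python
import re

# Hypothetical smaller tag set; only <mark3> appears in the quoted post.
FIRST_TAG_SET = {"mark1", "mark2"}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def to_first_tag_set(text, passthrough=False):
    """Translate a document down to FIRST_TAG_SET.

    passthrough=False strips unknown tags (option 1: information lost).
    passthrough=True leaves them in place (option 2: the receiving
    processor is assumed to cope with tags it does not know).
    """
    def fix(match):
        name = match.group(2).lower()
        if name in FIRST_TAG_SET or passthrough:
            return match.group(0)
        return ""                     # drop the tag, keep the enclosed text
    return TAG.sub(fix, text)

doc = "<mark1>a</mark1><mark3>b</mark3>"
print(to_first_tag_set(doc))
print(to_first_tag_set(doc, passthrough=True))
```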

If you want complete interchangeability of documents, you need a general
agreement on all possible structural components of all documents.  This
is clearly impossible.  Not that we shouldn't try - the AAP effort is
worthwhile.

The next best thing is a standard, flexible, easily-parsed syntax for
marking up the structural components that do exist so that we maximize
our ability to translate them.  This is all SGML is.  But it's a *lot*
better than the alternative - parsing troff/TeX gibberish.

>In summary, the goal of structured document systems is
>quite laudable, but I think it is necessary to distinguish a system like
>SGML from that goal.
I'll buy that.  But for the time being, structural markup is the only
safe way to store text for which there may be unanticipated future uses.
This includes nearly all on-line text.
Tim Bray, New Oxford English Dictionary Project, U. of Waterloo

tut%cairo@Sun.COM (Bill "Bill" Tuthill) (07/30/88)

So I'm learning that SGML is not

	Standard	(multiple tag sets exist)
	General		(doesn't do graphics, tables, or equations)
    for	Markup		(not intended for typesetting)
      a	Language	(rather a syntax for describing a language)

Could SGML possibly be misnamed?  Was it hopelessly naive of me to assume
that the word Markup in its name indicates it is used for markup?

At least I'm learning something from this discussion.

romwa@gpu.utcs.toronto.edu (Mark Dornfeld) (08/01/88)

In article <62137@sun.uucp> tut%cairo@Sun.COM (Bill "Bill" Tuthill) writes:
>So I'm learning that SGML is not
>
>	Standard	(multiple tag sets exist)
 The standard consists of rules, not tag sets.  Similar to
 the C language "standard", which is not a set of programs, but rules
 by which to write programs.
>	General		(doesn't do graphics, tables, or equations)
 File handling and processing are taken care of by the host system.
 An SGML document could be a bitmap graphic with appropriate
 description associated with it.  The AAP implementation
 includes a very workable method for describing tables, not
 setting them.  Entity sets for describing equation symbols are
 supplied in the annex to the standard.
>    for	Markup		(not intended for typesetting)
 That's right.  Let typesetting programs do the typesetting.
>      a	Language	(rather a syntax for describing a language)
 I quote from "The Standard" (ISO 8879-1986(E), Page 1): "This
International Standard specifies a language for document
representation referred to as the "Standard Generalized Markup
Language" (SGML).  SGML can be used for publishing in its
broadest definition, ranging from single medium conventional
publishing to multi-media data base publishing.  SGML can also
be used in office document processing when the benefits of
human readability and interchange with publishing systems are
required."
>
>Could SGML possibly be misnamed?  Was it hopelessly naive of me to assume
>that the word Markup in its name indicates it is used for markup?
>
>At least I'm learning something from this discussion.

My background is Troff.  When I began to see that our
institution could not depend on a single typesetting system,
but would be using graphical programs such as Ventura
Publisher, Pagemaker, and would also have to send out material
to be typeset at commercial typesetters, the value of SGML
became clear.  Since the same editors/writers would be
producing text for any one of these systems, it was important
that we teach them a single markup system.  SGML seems to be
the one.

It is a trivial matter to filter SGML to troff, Ventura,
Pagemaker and even to some commercial typesetters' "markup."
Since many of our typesetting jobs are repetitive, but the
typesetter isn't, we can achieve a level of control not
possible before.  Our documents become more consistent and
less time is taken in editing and markup.
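
A rough Python sketch of such a filter arrangement: the structural
markup is parsed once and then rendered by a separate back end per
target.  The tag names <h> and <p>, the choice of troff -ms requests
and LaTeX as targets, and all function names are assumptions made for
this example; special characters in the text are not escaped.

```python
import re

# Hypothetical two-tag document type: <h> heading, <p> paragraph,
# with end tags omitted.
ELEMENT = re.compile(r"<(h|p)>(.*?)(?=<(?:h|p)>|\Z)", re.S)

def parse(text):
    """Return a list of (tag, content) pairs with whitespace normalized."""
    return [(m.group(1), " ".join(m.group(2).split()))
            for m in ELEMENT.finditer(text)]

def to_troff(elements):
    """Render with -ms style requests: .SH for headings, .PP for paragraphs."""
    macro = {"h": ".SH", "p": ".PP"}
    return "\n".join(f"{macro[tag]}\n{body}" for tag, body in elements)

def to_latex(elements):
    """Render headings as \\section{...} and paragraphs as plain blocks."""
    blocks = []
    for tag, body in elements:
        blocks.append(f"\\section{{{body}}}" if tag == "h" else body)
    return "\n\n".join(blocks)

source = "<h>Title\n<p>First para.\n<p>Second."
elements = parse(source)
print(to_troff(elements))
print(to_latex(elements))
```

Adding another target means adding another rendering function; the
source text and the parsing step stay the same.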

When we begin a massive records management project sometime in
the future and wish to store data on optical media, we will
want a system that is not tied to a particular processing
system.  Rather we will want the information to be described
in an independent way.  SGML again seems to fit the bill.

SGML doesn't care whether you indent your paragraphs one or
two ems.  It just wants to tell you there is a paragraph there
and leave the formatting to the designer.

I used to think if the whole world would just learn troff,
we'd be in great shape.  It's easy to get trapped by such a
flexible and powerful system as troff, but it just doesn't
answer all our needs.


Mark T. Dornfeld
Royal Ontario Museum
100 Queens Park
Toronto, Ontario, CANADA
M5S 2C6

mark@utgpu!rom      - or -     romwa@utgpu