[comp.text] SGML: a different kind of markup

dns@sq.sq.com (David Slocombe) (03/04/90)

In article <2547@castle.ed.ac.uk>
sean@castle.ed.ac.uk (S Matthews) writes:

> . . . you may think that the results of
> your favorite wysiwyg wp are up to TeX standards, but in larger
> documents you quickly start to lose stylistic coherence.
>
> To keep that coherence, you need a markup language, be it troff, sgml or
> TeX. 
>

[SGML == Standard Generalized Markup Language (IS 8879)]

SGML is not a "markup language" in the same sense as troff and TeX
are:  there is historical justification for the use of the term
"markup" in SGML, but it *does* cause much confusion.

Markup was originally the marks put on a manuscript by the copy-editor
or book-designer to indicate to the compositor how the manuscript
was to be formatted:  in words, things like "Compose this text in
10-point Times Medium Italic on a 12-point body, justified on a
22-Pica slug with indents of 1 en on left and right".

With the introduction of computers, these things got coded into
arcane escape sequences which make even raw troff or TeX
look pretty easy.  Many compositors were glad to be retiring then!

(Typically the code-set used wasn't ASCII:  it was TTS, or
"Teletypesetting code", with strange control-codes like "elevate",
"paper-feed", and "quad-left", designed for paper-tape-driven
linecasters made by Linotype and Intertype.  As people have done
with the codes below 040 in ASCII, the photocomposer manufacturers
felt free to redefine the semantics of these linecaster-specific
codes in any way they saw fit -- differently for each brand of
machine, of course!  TTS was actually a 6-bit extension of the
5-bit Baudot code, and was designed originally (I think) for
automating the typesetting of stock market quotations at major
newspapers.  At least that's the first use *I* heard of.)

Each new phototypesetter to come on the market had its own proprietary
"language" of escape sequences.  Since typesetting houses usually
stuck with one manufacturer, they didn't mind this too much (at
first).  But their clients, especially their *large* clients (like
the U.S. Govt), soon got the idea of doing the original "keyboarding"
on their own computers and sending machine-readable files to their
typesetting vendors.

Then the catch became obvious:  you couldn't keyboard a manuscript
until you knew the escape sequences you had to use (called "markup"
by analogy), and you couldn't do that until you had chosen your
typesetting vendor so you knew what "markup language" to use.  So,
on large projects, you tended to get locked into a specific vendor
because you had on-going keyboarding operations.  Not to mention
the operator-training problems if you switched vendors and had a
new phototypesetter language to learn.

So there was a movement to create a standard markup language
(initially called "GenCode") which typesetting vendors would all
be persuaded to accept as input.  Then, purchasers of typesetting
could keyboard their manuscripts in only one way regardless of the
phototypesetter that would ultimately be used.  It would be the
type houses' problem to translate the GenCode into the language of
their own photocomposer machines.

The GenCode effort was spearheaded by the Graphic Communications
Association (GCA) -- an industry group -- who own the trademark "GenCode".

The codes proposed were in plain, printable ASCII and had names
that reflected the human function of the piece of the manuscript,
just like troff macros: "<P>" instead of "ESC]@#22&" or something
similar in TTS.  But still, there was a one-to-one mapping intended
between physical changes in the typesetting parameters and the
"generic" codes that had to be typed.  The model was a state-model:
one code-string put in for each change to the state of the typesetter
(as with troff and TeX today).
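
A rough sketch of that state-model mapping follows.  The generic code
names and the device strings are invented for illustration; they are
not real GenCode or TTS values.  Each generic code is simply swapped
for one fixed device string, and ordinary text passes through:

```python
# Hypothetical generic codes, each mapped one-to-one onto a made-up
# device string for one particular phototypesetter.
DEVICE_CODES = {
    "<P>": "\x1b[PARA]",
    "<H1>": "\x1b[HEAD1]",
}


def translate(tokens):
    """Replace each generic code with its device string; keep text as-is."""
    return "".join(DEVICE_CODES.get(token, token) for token in tokens)
```

Switching vendors would then mean swapping the table, not re-keying
the manuscript.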

Concurrently, an ANSI committee started work on a standard for coding
documents, based on Charles Goldfarb's GML computer typesetting language
at IBM.  The model in GML was hierarchical rather than a state-model:
the approach was to represent the document as a hierarchical data-structure
first and foremost, then to specify how each of the pieces ("elements")
of that data-structure (within their context) was to be typeset for a
particular book-design, and then to generate the state-model codes required
by the phototypesetter to accomplish that result.
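
The three steps can be sketched as below.  This is only an
illustration under assumed names: the Element class, the DESIGN
table, and the bracketed codes are all hypothetical, and real GML
did considerably more.  The document is a tree, the book design is
looked up by element *and* parent, and the output is a flat stream
of state-model codes:

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node of the document hierarchy."""
    name: str
    children: list = field(default_factory=list)  # Element or str items


# Hypothetical book design: (parent name, element name) -> (start, end) codes.
DESIGN = {
    ("chapter", "title"): ("[FONT bold 14]", "[FONT roman 10]"),
    ("section", "title"): ("[FONT italic 11]", "[FONT roman 10]"),
    ("chapter", "para"): ("[INDENT 1en]", "[NEWLINE]"),
    ("section", "para"): ("[INDENT 2en]", "[NEWLINE]"),
}


def emit(node, parent=None):
    """Walk the tree and return the flat stream of codes and text."""
    start, end = DESIGN.get((parent, node.name), ("", ""))
    parts = [start]
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(emit(child, node.name))
    parts.append(end)
    return "".join(parts)
```

Note that the same "title" element yields different codes inside a
chapter than inside a section; the context decides, not the tag.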

Because the document was represented as a hierarchy, it was a small
step to think of it with its [structural] "markup" -- see the
evolution of the term? -- as an instance of a language with a
context-free grammar.  Then, by defining the grammar, you could
define the class of hierarchical structures that the documents
could have.  If you made this grammar user-definable, with a
parser-generator to create a parser for that grammar,  you gained
*both* enormous flexibility in the kinds of document-structures
you could cope with in practice, *and*, at the same time, an
elegant way to validate documents for consistent structural usage
before they were typeset.
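
A much-simplified sketch of such validation is given below.  It is
not SGML syntax and not how any particular parser works: the
CONTENT_MODELS table and the tuple-based tree are assumptions made
for the example.  Each element type gets a rule, written as a
regular expression over the names of its children, and a document
is checked against those rules before any formatting is attempted:

```python
import re

# Hypothetical document type: for each element, the allowed sequence of
# child elements, written as a regular expression over "name " tokens.
CONTENT_MODELS = {
    "chapter": r"title (para )*(section )*",
    "section": r"title (para )+",
    "title": r"",
    "para": r"",
}


def validate(tree):
    """Check a (name, [child trees]) tuple; return a list of error messages."""
    name, children = tree
    errors = []
    model = CONTENT_MODELS.get(name)
    child_names = "".join(child[0] + " " for child in children)
    if model is None:
        errors.append(f"unknown element: {name}")
    elif not re.fullmatch(model, child_names):
        errors.append(f"bad content in {name}: {child_names.strip()}")
    for child in children:
        errors.extend(validate(child))
    return errors
```

Changing the table changes the class of documents accepted, which
is the flexibility described above.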

At this point the GCA committee working on GenCode and the ANSI
committee working on a standard based on GML combined their efforts,
the work was upgraded to the ISO level, and Standard Generalized
Markup Language (SGML) was on the way to being born.

BUT.... how things had changed by this point!  SGML had become a
way of encoding the logical [hierarchical] structure of a document
and there was no mention at all of how the individual parts of that
structure were to be represented on the printed page!  Somehow,
*that* had become an exercise left to the implementor!

There is, today, *no* standard method for encoding the way things
are to *look* on the page.  SGML *is* a major achievement, because
-- apart from its obvious uses in document databases -- it at least
provides well-behaved "handles" for each piece of a document that
a formatter needs to refer to, and an SGML parser can guarantee
for the formatter that there are no surprises in the
structure of the input document.  But it provides no way to specify
the *design* of the printed document (i.e. how it is to be formatted).
This is ironic, because the whole SGML effort started out with
formatting very definitely in the forefront.

So there is a *new* ISO committee.  (Some members of the old SGML
committee simply moved over to this new committee when SGML was
finished.)  It meets under the auspices of ISO/IEC JTC1/SC18/WG8 and it
is attempting to define a second, complementary, standard with the
awkward title:

	Information Processing -- Text Composition --
	Document Style Semantics and Specification Language

Or DSSSL for short.

The job of this new standard is to define a language that can be
used to specify -- for any document that is an instance of
a *given* SGML document-type-definition (i.e. grammar) -- how that
document is to be transformed.  Usually (but not necessarily) the
transformation wanted is into a visual representation of the
document, or, rather, into a data-file that directly describes that
visual representation.

The form of the data-file usually assumed in the committee's
deliberations is "SPDL":  Standard Page Description Language --
think "PostScript/Interpress" and you've got the idea.  SPDL is
the product of yet another ISO standards committee with co-chairpersons
who just happen to work for Adobe and Xerox respectively.

SPDL is very close to completion.

In my opinion, DSSSL is a long way from completion.

But, when it is done, you'll have:

	-- a standard way (SGML) to specify the logical structure
	   of a class of documents, as well as a way to encode
	   a document which is an instance of that class, and

	-- a standard way (DSSSL) to specify how documents
	   in a given SGML class are to be represented visually, and

	-- a standard way (SPDL) to represent in the computer
	   the concrete visualization of a specific document.

	SGML-document ==> parser/formatter ==> SPDL-print-file.
	                    |         |		      
	                   SGML      DSSSL
	                 doctype   specification

SGML parser-generators -- lex/yacc just isn't up to the task --
are not *really* hard to write.  (The hard part is understanding
the Standard!)  But DSSSL formatters do not yet exist, and the
technology for writing them has yet to be invented.

In article <RUSTY.90Mar1112619@garnet.berkeley.edu>
rusty@garnet.berkeley.edu (rusty wright) writes:

> (1) With a wysiwyg system you are constantly made aware of the
> formatting; some would say that you are being distracted by the
> formatting.  During the writing you should only be worrying about the
> content and leave worrying about the appearance until just before the
> final draft.
>

One of the proven advantages of SGML is precisely that writers can
keep their minds off formatting issues while doing their jobs.

Let writers write, and designers design!

----------------------------------------------------------------
David Slocombe				(416) 963-8337
Vice-President, Research & Development  (800) 387-2777 (from U.S. only)
SoftQuad Inc.				uucp: {uunet,utzoo}!sq!dns
720 Spadina Ave.			Internet: dns@sq.com
Toronto, Ontario, Canada M5S 2T9	Fax: (416) 963-9575

ken@cs.rochester.edu (Ken Yap) (03/04/90)

|SGML is not a "markup language" in the same sense as troff and TeX
|are:  there is historical justification for the use of the term
|"markup" in SGML, but it *does* cause much confusion.

Coombs, in his CACM article on markup languages of about two years ago,
calls these descriptive markup and procedural markup respectively.

I wish I had the reference online. I also wish there was a standard for
machine readable references and that journals would use this standard
so that we could run our light pens over the strips and thousands of
readers worldwide would not have to key these in manually.

cso@organ.cis.ohio-state.edu (Conleth O'Connell) (03/05/90)

In article <1990Mar4.045813.14391@cs.rochester.edu> ken@cs.rochester.edu writes:
>|SGML is not a "markup language" in the same sense as troff and TeX
>|are:  there is historical justification for the use of the term
>|"markup" in SGML, but it *does* cause much confusion.
>
>Coombs, in his CACM article on markup languages of about two years ago
>calls these descriptive markup and procedural markup respectively.
>
>I wish I had the reference online. I also wish there was a standard for
>machine readable references and that journals would use this standard
>so that we could run our light pens over the strips and thousands of
>readers worldwide would not have to key these in manually.

The Coombs reference is:
   author ="J.H. Coombs and A.H. Renear and S.J. DeRose",
   title  ="Markup Systems and the Future of Scholarly Text Processing",
   year   ="1987",
   month  ="November",
   journal="Communications of the ACM",
   volume ="30",
   number ="11",
   pages  ="933-947",

Hope this helps,
Con
-=-
Conleth S. O'Connell	Department of Computer and Information Science
				 The Ohio State University
cso@cis.ohio-state.edu	  2036 Neil Ave., Columbus, OH USA 43210-1277

pedersen@philmtl.philips.ca (Paul Pedersen) (03/05/90)

In article <1990Mar3.224625.2621@sq.sq.com> dns@sq.com (David Slocombe) writes:

> [a lot of stuff deleted..]

>There is, today, *no* standard method for encoding the way things
>are to *look* on the page.  SGML *is* a major achievement, because

> [more stuff deleted..]

I must object. There is such a standard: ISO 8613, "Office Document
Architecture", or, if you prefer, CCITT Rec. T.41x, "Open Document
Architecture". In this
standard, both the logical structure and the "layout structure" are part of
the interchanged document, including "styles" for presentation and selection
of layout.

While ODA (currently) cannot handle the complexity of layout hoped for in
DSSSL, it is quite suitable for normal "rectangular" layout.

Paul