[comp.text.sgml] What is SGML?

E.Ireland@massey.ac.nz (Evan Ireland) (02/07/91)

Hi,

I recently saw mention of SGML in comp.emacs, but the poster of that article
could not provide me with a reference, and suggested that I ask here.


(1) ISO standard number and ordering information (I was told there is
    a related ISO standard).

(2) Reasonably comprehensive information that could be emailed to me.

(3) Other references.

	Thanks in advance,

-------------------------------------------------------------------------------
E.Ireland@massey.ac.nz          Evan Ireland, School of Information Sciences,
  (063) 69-099 x8541          Massey University, Palmerston North, New Zealand.

enag@ifi.uio.no (Erik Naggum) (02/08/91)

Anybody want to take up the task of writing an authoritative answer to
this question?  Perhaps we need a comp.text.sgml FAQ, just like some
other groups?

Evan, this is of course not to deride your question, but it seems that
it's very difficult to write a good summary of what SGML is, and
spread it around sufficiently.  I'll try, after a brief pause (a.k.a.
sleep).  Stay tuned.

To answer your questions:

(1) ISO 8879:1986 + ISO 8879A1 (year forgotten).  You can order ISO
    standards from your national standards association or body.  I
    have no idea who this is for New Zealand.  However, see below
    (particularly the lines with --> on them).

(2) Forthcoming...

(3) SGML: The/An Author's Guide, Martin Bryan, Addison Wesley, 1987;
    a book which tries to encompass everything in the standard and
    doesn't succeed very well.  I wouldn't recommend this book, but
    it was one of the earliest out, and many people have read it.
    Its fixation on keeping the language and acronymitis of the
    standard itself is its main drawback.

    Practical SGML, Eric van Herwijnen (sp?), Kluwer, 1990; a more
    practical book (!) with a more limited scope and much more suc-
    cessful than the above at meeting its stated goal.  There are lots
    of small, annoying inaccuracies in it (I'm preparing a bug list),
    but you get a real good grip on SGML through it.  It's quite easy
    reading, comes in three parts for escalating user needs (SGML
    user, SGML application, SGML parser), but is a little "simplistic"
    on some of the difficult points.  It skips the things you don't
    (or won't) need to know, and this is very important with an unruly
    beast like SGML.

    The SGML Handbook, Charles F. Goldfarb, Oxford, 1991; an
    incredibly thorough book on the standard itself.  I've found one
    very minor (but consistent) cosmetic bug, and a reference to H.400
    which should have been X.400.  That's the bugs.  Now the reasons
    I've fallen in love with the book: ISO 8879 is a jungle.  ISO 8879
    has an amendment, no more than a diff in Unix parlance, which you
    have to consult to be certain of what "the" standard says.  ISO
    8879 has no index, no cross-references, separates the concrete
    reference syntax and the syntactic variables to an extent which
    makes you need a blackboard on which to scrawl "MDO = '<!', MSC =
    ']]', ...", or take numerous copies of pages from the standard and
    hang them on your wall and squint at, and you end up cluttering
    the standard with Post-It notes all over the place.  CFG has done
    all that for you, with an expert hand.  His book is the first I've
    seen with "paper hypertext", having "buttons" with page numbers
    and syntax production numbers on them, which makes it possible to
    look them up easily.  It also has small boxes with the reference
    concrete syntax representation of the delimiter roles in the right
    margin whenever the roles appear.  The book contains a heavily
    annotated, complete ISO 8879 plus amendment, plus suggested
    improvements.  The text is full of wit and humor, which is badly
    needed with ISO standards in general.  You'll find what you're
    looking for in this book, if your interest lies with the standard
    itself.  It is no beginner's book, however.

--> Several people have predicted that The SGML Handbook will replace
--> the ISO standard as the reference work of choice.  I fully support
--> this prediction.

    The bibliographies in Eric van Herwijnen's book leave little to be
    desired (except, perhaps, issue numbers for <TAG>The SGML News-
    letter), and the section on SGML sources in The SGML Handbook is
    nearly complete.  It's some time ago that I read Martin Bryan's
    book, but I was not that impressed with the bibliography, either.

I hope this has been of some help.  Your request for "reasonably
comprehensive information" I shall attempt to fulfill by the end of
the day (T minus 21 hours, and counting...).

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

ct2g@plaid.acc.Virginia.EDU (Cheng Tang) (02/13/91)

I am trying to find anonymous ftp sites with
software proclaiming to adhere to SGML, or
with the same functionality as SGML.

I do not know of the ambiguity of my request,
but am willing to learn.

Any thoughts/comments/answers are welcome.

Thanks,
---Cheng

enag@ifi.uio.no (Erik Naggum) (02/19/91)

Hi, all, and sorry for the delay.

The following is an unfinished introductory article that I don't have
the time to finish quite as soon as I had hoped.  Please comment on
it by sending me a message at <erik@naggum.uu.no>.  Thanks.

Evan, I hope this will give you a flavor, if not the Whole Answer.

[Erik]


What is SGML?

SGML is the abbreviation for Standard Generalized Markup Language.  If
you know what Generalized Markup is and what Markup Languages are and
do, the only remaining element is Standard.  SGML is International
Standard 8879 as issued by ISO - the International Organization for
Standardization - in 1986 and later amended.  Many people apparently
know what Generalized Markup and Markup Language mean, since SGML
quickly became popular and is rumored to be one of ISO's best selling
(or fastest selling, according to other rumors) standards.  However,
it seems even more people do not know what generalized markup is or
what markup languages are or do, but nonetheless, they've heard that
SGML is going to have an impact on how they do their work.  Those who
work with SGML certainly hope so.

In this article I will try to explain some of the key concepts that
led to the birth of SGML and how they manifest themselves in the
language.  In addition, some key concepts of the language are
presented for the casual observer.  Later articles will deal with
these issues in more detail.


Markup

The specification of typefaces, sizes, weights, spacing, indentation,
justification, etc, for a given text before it is printed (or typeset)
is called marking up the text.  Traditionally, this has been done by
writing small cryptic signs in the margin of the copy (text), much
like proofreaders' marks, for those who have seen them.  For instance,
the abbreviation "SGML" could be set in small capitals, keywords could
be set in italics, paragraph titles in boldface, a point or two
smaller than running text.  Examples could be set in some monospaced
font such as courier, quotes in the same typeface as running text but
two points smaller on a narrower line spacing with a shorter line
length and a little indented from the left margin.  All of these
rather detailed specifications were instructions to the typesetter.
Incidentally, we now call such markup "specific markup".

Authors have generally not had to know anything too specific about
typography.  Most of them used typewriters or had someone type their
manuscript from dictated tapes, if they didn't write it out in
handwriting.  However, they had to communicate the structure of the
text to the person doing the markup, even though that person added
many choices of his own.  A minimum would be to separate paragraphs
from each other, the headings from the text, examples from running
text, etc.  Some of the more obnoxious authors no doubt had specific
ideas on how it should look, as well.

As one comes to learn of the art of printing, one becomes awestruck
(or should, anyhow) by the magnitude of the possibilities and choices.
Typographers in the field of advertising exploit this to the fullest,
but if you produced a book with the same typography as in ads, it
would look horrible, and be unreadable.  Typographers deal with the
aesthetics of the printed page, while the copywriters concentrate on
the message.  Similarly, authors concentrate on the content of books
and articles, deferring typographical decisions to people who master
that part of the process.  This is not always the case, as we shall
see, and some software for popular hardware platforms deliberately
confuses the issues of content and style.


Information and presentation

Ironically, in a world of ever-increasing specialization, it's
important to distinguish between two aspects of any communicated piece
of information that few would have thought could be confused.  When you
pick up a newspaper or a book, you don't do so to admire the clean
typography, the beauty of the typeface, or the orderly layout (unless,
of course, you've spent some time under the wing of an old-school
typographer, and write articles about what SGML is); you pick it up
because the information looks interesting and worth reading.  Many
would argue that people are unlikely to do so if a whole 100-page
newspaper consisted of page upon page of 8 columns of text packed
together with no change in type, no superfluous spacing, no paragraph
breaks, etc.  (The Frankfurter Allgemeine to the contrary notwith-
standing.)

On the printed page, information and presentation work together to
bring the author's ideas and observations to the reader, as they do
with the lecture, where a live lecturer with a rich voice and varied
diction can do wonders to otherwise heavy material.  In the result,
information and presentation invariably coexist, and it would indeed
be difficult to imagine either without the other.  This does not imply
that the two are not radically different processes.  The basic tenet
is that information must have some form of presentation, but it can be
any form of presentation, and that a form of presentation must be
applied to some information, but it can be any information.  Thus we
deal with easily abstractable concepts.  Let's take them each in
their turn.

Information usually has some structure.  Books, e.g., may have parts,
chapters, sections, and paragraphs.  They may have front matter, such
as a foreword, a preface, table of contents, etc, and back matter such
as bibliography, author biography, index, etc.  Further, paragraphs
can't contain chapters or a table of contents, but may contain other
tables, figures, quotes, footnotes, etc.  This forms a hierarchy of
elements.  (There are also non-hierarchical elements, such as index
entries, references to chapters, paragraphs, tables, figures, and so
on.)  This structure is almost always in the head of the author, and
not put down on paper in detail.  This unfortunately lends itself to
inconsistencies, particularly where special conventions are used
to indicate different parts of the structure.

A style, or consistent form of presentation, has a close relationship
to the structure of the information to which it applies.  Titles of
chapters look the same, as do figure and table captions.  Footnotes
obey a certain regularity, etc.  It should be evident that a style
associates each element in the structure with a particular set of
typographic instructions or specific markup as mentioned above.

Because of the relationship between structure and style, we talk of
text which carries only indications of the structure as being marked up
with "generalized markup".  The implication is that we can choose any
style for the structure in question, and that the author won't have to
make any changes to the document if he (or the typographer) changes
his mind about, say, footnote presentations, or the particular size of
the typeface chosen for chapter titles.  It should even be possible to
produce the table of contents directly from the document by listing
the titles of the chapters, sections, paragraphs, etc, and ignoring
all other content.  The same goes, of course, for lists of figures and
tables.
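
The idea can be sketched in a few lines of Python.  This is only an
illustration under stated assumptions: the element names, the two
style tables, and the functions apply_style and table_of_contents are
all invented here, and no real formatter works on so simple a model.
The point is that the document is written once, either style can be
applied to it unchanged, and the table of contents falls out of the
structure.

```python
# A document carrying only structural (generalized) markup.
# Each entry is (element name, text); the element names are hypothetical.
DOCUMENT = [
    ("chapter-title", "Markup"),
    ("paragraph", "The specification of typefaces is called marking up."),
    ("footnote", "Traditionally done in the margin of the copy."),
    ("chapter-title", "Markup Languages"),
    ("paragraph", "Instances of markup languages abound."),
]

# Two interchangeable styles: each maps an element to specific markup.
# The typographic instructions are made up for the example.
STYLE_BOOK = {
    "chapter-title": "bold 14pt",
    "paragraph": "roman 10pt",
    "footnote": "roman 8pt",
}
STYLE_DRAFT = {
    "chapter-title": "underlined 12pt",
    "paragraph": "monospace 12pt",
    "footnote": "monospace 10pt",
}


def apply_style(document, style):
    """Pair every piece of text with the specific markup its element gets."""
    return [(style[element], text) for element, text in document]


def table_of_contents(document):
    """List the chapter titles and ignore all other content."""
    return [text for element, text in document if element == "chapter-title"]


if __name__ == "__main__":
    for instructions, text in apply_style(DOCUMENT, STYLE_BOOK):
        print(f"[{instructions}] {text}")
    print(table_of_contents(DOCUMENT))
```

Changing one's mind about footnotes means editing one entry in a style
table; DOCUMENT itself is never touched.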

It should be quite obvious that if all a document contains is specific
markup, it would be impossible, without vast amounts of human inter-
vention and artificial intelligence, to figure out what is a chapter
title and what is a figure caption, mechanically.  To the reader, this
is of course very simple, but we must remember that this is courtesy
of a long tradition among typographers who have managed to present it
all to us in a clean and consistent fashion.  Humans also recognize
what's a title and what's content by the form of the text therein,
which a computer program would be hard pressed to do.  Physical
arrangement on the page also helps, especially when more novel or
experimental layout is used to represent keywords in the margin,
titles in boxes on one side of the page, and so on.

The separation of information and presentation is thus most helpful
for machine processing of documents.  The argument is that most new
documents are written with the aid of computers and that even for
small amounts of machine processing, consistency is a virtue.  It is
cruel to demand of the user that he maintain the consistency with no
aid from the otherwise useful computer.  That is why languages are
designed to help the author inform the computer of his intentions, so
that the computer can both check consistency and be of the most help.
After all, such chores are what computers are well suited to perform.


Markup Languages

Before discussing generalized markup languages, a quick presentation
of markup languages in general is called for.  Instances of markup
languages abound, and many have seen WordStar's dot-commands, TeX's
backslash-introduced magic, and the intertwined text and control lines
of nroff and troff.  Less intuitively, the special codes used by the
soi disant WYSIWYG ("What You See Is What You Get") editors are very
simple markup languages, because only the editor is supposed to see
them.  Users are "relieved" from knowing what's going on, and are thus
not able to use the document in anything but the environment in which
it was created.  Personally, I regard such closed environments as an
abomination, and want nothing to do with them.  Text and documents are much too
important to be left to "integrated packages".

Take nroff and troff as one example, or two of a kind, really, since
they share the same input language and prepare output for two
different classes of devices.  Their input is a document consisting of
interspersed text lines and control lines, where the control lines are
distinguished from the text lines by having a period or an apostrophe
(or acute accent) as the first character on the line.  Other
lines are text lines and are processed according to the instructions
provided by the control lines.  The primitive operations in nroff and
troff are based on typesetter primitives, in addition to some powerful
operations on registers, text justification and line drawing.  The
input language also supports expressions and conditionals.  Another
important feature of the input language is the ability to define
macros for oft-repeated sequences of primitives.  More than that, the
macros may take arguments and pave the way for a generalized markup.
Indeed, several such exist for many applications.  E.g., the UNIX
manual pages are all described with a highly generalized markup.
Unfortunately, the language is somewhat restricted and is considered
difficult to learn by the people who decide such things.  There exist
preprocessors for equations, tables, and pictures, and the result
can be found in a large number of books.  Most notably, the results
are in line with sound typographical conventions, and even though it's
possible to recognize that a book was set with the aid of troff,
typographers are pleased with the result.
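
The distinction between control lines and text lines described above
can be sketched in Python.  This is a minimal illustration, not a
model of what nroff or troff actually do: the function name
classify_line and the sample input are invented here, the requests in
the sample are only classified and never interpreted, and the sketch
assumes the control characters have not been redefined.

```python
def classify_line(line):
    """Return "control" if the line starts with a period or an apostrophe,
    otherwise "text" (an empty line counts as text in this sketch)."""
    return "control" if line[:1] in (".", "'") else "text"


# Sample input: .ce (centre) and .sp (space) are troff requests,
# but nothing here acts on them.
SAMPLE = [
    ".ce 1",
    "What is SGML?",
    ".sp 2",
    "SGML is the abbreviation for",
    "Standard Generalized Markup Language.",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(f"{classify_line(line):8}{line}")
```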

TeX, on the other hand, contains a gargantuan number of primitives (or
not-so-primitives) and is a vast and complex language which is hard to
learn and master, but which is said to be particularly well suited to
typeset mathematics.  It is extremely cumbersome to use in its raw
version.  TeX also supports macro definitions, and indeed this is what
most TeX coding is about.  However, the confusion of name spaces in
TeX (something which is not present in nroff or troff), makes reading
the macro libraries hard and changing them error-prone.  Large macro
packages and preprocessors exist, e.g., LaTeX, which function as
generalized markup, except that it is difficult to use the input for
anything but the particular preprocessing software.  From the stand-
point of a seasoned typographer, the output is gross and can easily be
recognized even by the untrained eye.

Both of these, and undoubtedly most other, markup languages are
concerned with producing output on some (range of) devices.  In this
manner, they are really specific markup languages with generalization
support, instead of being generalized markup languages with specific
markup support.

The Standard Generalized Markup Language is the first major attempt at
a truly generalized markup language, the main feature of which is the
ability to specify the structure and check it for consistency.  Any
document which uses a particular structure can be validated with
respect thereto.  SGML does not concern itself with producing output
on any device; that is handled by an SGML application, external to
both the language and the document.  What, then, does SGML do?

An SGML document consists of several parts, which need not be found in
the same file.  (Powerful means exist to enable any organization of
the document.)  From SGML's point of view, the DTD (Document Type
Definition) is the most important.  It defines a type of document in
terms of its structure.  From the author's point of view, the
document instance is the most important.  This is the author's
content within the structure defined by the DTD.
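
The relationship between a document type and a document instance can
be sketched in Python.  This is not SGML syntax and not how an SGML
parser works; it is a toy model under stated assumptions.  The table
DOCUMENT_TYPE, the element names, and the function validate are all
invented for the illustration, and a real DTD expresses a good deal
more than which element may appear inside which.

```python
# Hypothetical document type: which elements each element may contain.
DOCUMENT_TYPE = {
    "book": {"chapter"},
    "chapter": {"title", "paragraph"},
    "title": set(),
    "paragraph": {"footnote"},
    "footnote": set(),
}


def validate(element, document_type, path=""):
    """Check an element, given as (name, children), against the document type.

    Returns a list of error messages; an empty list means the structure
    conforms to this toy model.
    """
    name, children = element
    here = f"{path}/{name}"
    if name not in document_type:
        return [f"{here}: undeclared element"]
    errors = []
    for child in children:
        child_name = child[0]
        if child_name not in document_type[name]:
            errors.append(f"{here}: {child_name} not allowed here")
        else:
            errors.extend(validate(child, document_type, here))
    return errors


# Two document instances: one that fits the type, one with a chapter
# inside a paragraph.
GOOD = ("book", [("chapter", [("title", []), ("paragraph", [("footnote", [])])])])
BAD = ("book", [("chapter", [("paragraph", [("chapter", [])])])])

if __name__ == "__main__":
    print(validate(GOOD, DOCUMENT_TYPE))
    print(validate(BAD, DOCUMENT_TYPE))
```

The same document type serves any number of instances, which is what
makes it possible to check them all for consistency mechanically.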

<< More to come >>