[comp.text.sgml] Attributes: For or Against?

erik@naggum.uu.no (Erik Naggum) (05/14/91)

James Clark's recent question prompted me to look for articles on the
use of attributes in general, and I've found a number of references
which appear negative or take a firm stand against using them.

Without mentioning names, two large projects in which SGML is used
take a firm stand against using attributes.  The main argument in both
is that the application will be able to sort out what's content and
what's not, and that using attributes only complicates both the
parser-application interface and the parsing process.

1.  What do you think about attributes versus nesting tags?

2.  If you have decided against using attributes, what were the main
    argument(s) against using them?  How did you represent the data
    that would otherwise have been in attributes?

3.  Since NOTATION info is frequently provided in an attribute, does
    this mean you don't use NOTATIONs?

4.  Are there parsers or other SGML software out there unable to
    process attributes?

5.  If you have used attributes extensively, what were the major
    design choices you made between content and attribute for a given
    piece of information?


I'm hoping for a discussion of the de/merits of attributes, and will
summarize any private replies I get.

Thanks for your time.

--
[Erik Naggum]           Professional Programmer        <enag@ifi.uio.no>
Naggum Software             Electronic Text          <erik@naggum.uu.no>
0118 OSLO, NORWAY       Computer Communications            +47-2-836-863

sjd%kirk%ebt-inc@uunet.UU.NET (Steve DeRose) (05/16/91)

Erik:  could you cross-post this to comp.text.sgml for me?  I'm
having mailer troubles here.  Thanks!

Steve

[Done.  Headers juggled so replies should now go to the author.  </Erik>]
- - - - - -

Usually there is another agenda behind outlawing attributes (although
there are legitimate arguments in that direction).
One agenda is to simplify parsers, but that is a silly place to simplify
SGML; parsing attributes is trivial compared with several other, less
useful, constructs (I know, since I've written such parsers).  The same
goes for the interface; there are *much* better places to simplify.

Another is to simplify searching; people using systems like PAT, which
operates on strings rather than tags per se, are less happy with things
that can move around (the order of attributes in a given start-tag
is arbitrary).  This is not PAT's fault; it's just an observation heard
from some later users of it who want to do certain kinds of searches.
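
To make the ordering point concrete, here is a made-up illustration
(the element and attribute names are only for the example).  An SGML
parser treats these two start-tags as identical, but a literal string
search for "<P LEVEL=EXPERT" matches only the first:

```sgml
<P LEVEL=EXPERT SECURITY=TOP>
<P SECURITY=TOP LEVEL=EXPERT>
```

A search that wants both would have to allow for every possible
ordering, or work from the parsed tags rather than the raw string.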

I'm surprised that someone argues that *avoiding* attributes would make
it easier to separate content from non-content.  Consider this example:
    <LIST TYPE=BULLETED>
       <LI>...</LI>
       ...
    </LIST>

versus the attribute-less:
    <LIST>
       <TYPE>BULLETED</>
       <LI>...</LI>
       ...
    </LIST>

Eliminating the attribute in this way *worsens* the confusion of markup
with content, by putting the word "bulleted" in content.  This, to
my mind, is very bad indeed.  The only alternatives that don't have 
this problem involve multiplying the number of tags, for example:
    <LIST>
       <BULLETED>
       <LI>...</LI>
       ...
    </LIST>
 or
    <BULLETLIST>
       <LI>...</LI>
       ...
    </BULLETLIST>

This class of approaches needlessly complicates the DTD, and also
loses the generalization that there is a class of object we might call
a list, which has sub-types.

Also, the former of these last two examples has the bad additional
effect of dissociating the attribute's scope of applicability
from that of its element.  One of the criteria for attribute-hood
(IMHO) should be that the attribute share precisely the scope of the
element it modifies; only by making it part of the same tag can that
relationship be syntactically demonstrated.
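
As a rough sketch of the generalization argument, the declarations
below compare the two approaches.  The sub-type names NUMBERED and
PLAIN and the element names NUMBERLIST and PLAINLIST are invented for
illustration, and LI is simplified to plain text:

```sgml
<!-- Attribute approach: one element type, sub-types as values -->
<!ELEMENT LIST - - (LI+)>
<!ELEMENT LI   - - (#PCDATA)>
<!ATTLIST LIST TYPE (BULLETED|NUMBERED|PLAIN) BULLETED>

<!-- Attribute-less approach: one element type per sub-type -->
<!ELEMENT BULLETLIST - - (LI+)>
<!ELEMENT NUMBERLIST - - (LI+)>
<!ELEMENT PLAINLIST  - - (LI+)>
```

In the first form a new sub-type is one more value in the TYPE group,
and content models that allow LIST are untouched.  In the second, each
new sub-type needs its own element declaration and, typically, a change
to every content model where lists may occur.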

The sub-type argument is especially important with things like tables,
where each instance may have locally-significant information to be
specified, such as column placements (I'd rather avoid putting that in
the file at all, but this is what everyone does, including AAP, CALS,
telecommunications, legal, and many other important DTDs).  Certainly
one shouldn't have to touch the DTD in order to define a new sort of table.


Another major argument in favor of attributes is the need for unique
identifiers for referencing.  Heaven help us if we have to put anchor
IDs for links into content!
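
A minimal sketch of what this looks like, assuming hypothetical XREF
and TARGET names (only the declared values ID and IDREF come from the
standard):

```sgml
<!ELEMENT SEC  - - (#PCDATA|XREF)*>
<!ELEMENT XREF - O EMPTY>
<!ATTLIST SEC  ID     ID    #IMPLIED>
<!ATTLIST XREF TARGET IDREF #REQUIRED>
...
<SEC ID=INTRO>Usage is covered in <XREF TARGET=USAGE>.</SEC>
<SEC ID=USAGE>The background is in <XREF TARGET=INTRO>.</SEC>
```

The identifiers never appear in the text itself, and a validating
parser can check that each TARGET names an ID that actually exists.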

As for NOTATION, there are two ways it's typically used.  In the first,
you declare a notation, then use tags that stand for the non-SGML data
and carry attributes (which may take entities as values) specifying
where the data is:
   <ART filename=fig27>

In the second, you have an NDATA entity, which is the sole representation
of the non-SGML data (no attribute needed):
   <!ENTITY fig27 SYSTEM "/bin/fig27" NDATA tiff>
   ...
   &fig27;

The latter eventually becomes messy because:
   1) you cannot associate any differential information per instance
      of the notation; you cannot, for example, attach any information
      about how you want 'tiff' things treated in the document.  Some
      figures can be floated to the next page if space is short, and
      some cannot; the only way to encode this is to put an ad hoc tag
      near the entity reference, with ad hoc semantics that only your
      application (not SGML) understands.
   2) you have to add an ENTITY declaration for every new figure; 
      authors shouldn't have to know that there even is such a thing
      as a markup declaration, and commonly they have neither the
      knowledge nor the software to do so.
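
For comparison, here is a hedged sketch of the first approach carrying
per-instance information.  The FLOAT attribute and its values are
invented for the example, and the notation's system identifier is a
placeholder:

```sgml
<!NOTATION tiff  SYSTEM "tiff">
<!ENTITY   fig27 SYSTEM "/bin/fig27" NDATA tiff>
<!ELEMENT  ART   - O EMPTY>
<!ATTLIST  ART   FILENAME ENTITY   #REQUIRED
                 FLOAT    (YES|NO) YES>
...
<ART FILENAME=fig27 FLOAT=NO>
```

This answers point 1, since each ART can say whether it may float.  It
does not answer point 2, because an ENTITY-valued attribute still needs
a declaration per figure; declaring FILENAME as CDATA would avoid that,
at the cost of leaving the interpretation of the name to the
application.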

In passing, I should mention that there is one syntactic hassle with
attributes which I hope will be addressed in a later version of the
standard.  There is no way, in a single line, to declare that a 
particular attribute can either
   a) apply to any tag at all, or
   b) apply to a tag which has already had attributes declared.

Thus, you cannot say:
   <!ATTLIST #ANY ID ID #IMPLIED>
or
   <!ATTLIST p  type CDATA #IMPLIED>
      and then add elsewhere as an afterthought:
   <!ATTLIST (p|q)  type CDATA #IMPLIED>
This is merely syntactic convenience; you can still get the effect.
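
One way to get the effect, sketched here with a hypothetical parameter
entity name, is to keep the shared attribute definitions in one place
and reference them from each element's single ATTLIST:

```sgml
<!ENTITY % common "ID   ID    #IMPLIED
                   TYPE CDATA #IMPLIED">
<!ATTLIST P %common;
            LEVEL NAME #IMPLIED>
<!ATTLIST Q %common;>
```

Each element still gets exactly one attribute definition list, and an
afterthought goes into the entity rather than into a second
declaration.  The cost is that every element that needs the shared
attributes has to be listed explicitly.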


As for Erik's excellent question about real software, awareness of
attributes varies greatly between products.

Certainly someone will correct me vociferously if I'm wrong, but the
last I knew the Chameleon system was reluctant to handle attributes.
Many authoring systems cannot decide formatting based on attributes,
but that is typically seen as a limitation rather than a desire.
My company's product lets you refer to attributes in style sheets,
so that you can apply a completely different style definition to
instances of an element that have different attribute values.  I believe
there are other products that can do this also.  A few useful cases
are shown below.

Some encoding guidelines, including those of the Text Encoding Initiative,
presume some such capability, and use attributes extensively.  This is
true of the majority of the large SGML projects I have worked with.
   <SEC   SECURITY=TOP>
   <P     LEVEL=EXPERT>
   <LIST  TYPE=BULLETED>
   <TABLE LAYOUT=PARTSLIST>

In short, I would say attributes are essential, although they should
be used judiciously.

Steven J. DeRose, Ph.D.
Sr. System Architect
Electronic Book Technologies
One Richmond Square
Providence, RI 02906
(401) 421-9550, fax -9551
sjd%ebt-inc@uunet.uu.net