[net.internat] Keeping information where it belongs.

kosower@harvard.UUCP (David A. Kosower) (02/09/86)

   In article <343@ur-tut.UUCP> tuba@ur-tut.UUCP writes:
>What's all this business about linkers that only support monofont
>symbols?  When I name a variable "counter" in 10 point Helvetica,
>the last thing I mean is "counter" in 14 point Courier.  Linkers ...

   In article <832@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) adds,
> But you are *WRONG* if you think that is a joke. Tuba is RIGHT
> RIGHT RIGHT, except for missing out the other essential attributes that we
> will need if we are going to get an environment fit for the 1960's, let
> alone the late 70's or even 80's. What about colour? I want to be able to grep
> for all the 10 point helvetica non-alphabetic strings in red, because that
> is the way that my application displays error messages.

   NO!!!!  I must disagree.  As I will demonstrate below, you are confusing
two different aspects of a document (or pieces thereof): the `intrinsic'
content, and its formatting.  The boundary between the two may be a little
fuzzy, but things like the font, point size, and color are most definitely
formatting information, not intrinsic content.

   Before giving my criterion for distinguishing the two aspects, let's 
take the idea proposed in the quoted articles a little further, to truly
silly extremes.  Now, on dumb terminals, you can't adjust the spacing
between letters in a variable name to arbitrary accuracy; you can only
insert `hard' (i.e. required) spaces.  But in the brave new world of modern
laser printers, you could insert, say, a 1 pt space between the first and
the third letters.  Or even a 0.1 pt space, on a laser typesetter.  
Does that mean we need a new character to indicate the spacing?  Or that
each character should now be preceded by a *floating point* number
indicating the amount of spacing preceding it?  Do you truly believe
that `InformUser' with 0.1 pts of space between the `r' and the `m' means
something different from plain old `InformUser'?  Or how about the attributes
of the surrounding text, say the inter-line spacing?  Isn't that something
one ought to be able to grep in our utopian operating system?  It's clear
that such an approach quickly sucks you into a specific WYSIWYG
representation of text, and is completely inadequate for anything else. 
Embedding this information in every character would also result in a
grossly inefficient encoding; space may be cheap on modern computers,
but it isn't THAT cheap. [For those souls who don't know what a printer's
point is: there are ~72/inch, or ~2.8/mm, so that 0.1 pts of anything are
virtually invisible, and in any event unprintable on a printer with only
300 dots/inch resolution.]
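   To put a rough number on `grossly inefficient', here is a
back-of-the-envelope sketch in C.  The particular fields and their
widths are my own guesses, not anyone's actual proposal; the point is
only the ratio.  (And note that at 300 dots/inch there are 300/72 ~
4.2 dots per point, so a 0.1 pt space is under half a dot.)

#include <stdio.h>

/* A plain character, versus a character dragging its own
   formatting around with it.  Field widths are guesses. */
struct fat_char {
    unsigned short code;       /* the character itself            */
    unsigned short font;       /* which typeface                  */
    float          size;       /* point size, e.g. 13.99          */
    float          pre_space;  /* extra spacing before, in points */
    unsigned char  r, g, b;    /* 24-bit colour                   */
};

int main(void)
{
    printf("plain char:      %lu byte(s)\n", (unsigned long)sizeof(char));
    printf("attributed char: %lu bytes\n",
           (unsigned long)sizeof(struct fat_char));
    /* Typically 16 bytes against 1: a sixteen-fold blow-up, and we
       have not yet mentioned inter-line spacing or kerning pairs. */
    return 0;
}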

   And consider for a minute the specific proposals made by the two
authors quoted above.  What *exactly* does `14-point' Courier mean,
anyway?  Does it mean `somewhere between 13.75 and 14.02-point Courier,
as suits my laser printer' or does it mean `14.00-point Courier'?
What does `Courier' mean?  Does it mean the IBM version?  Or the
Little Font Company's version?  Does a variable name no longer mean
the same thing if it isn't *exactly* the same font?  And if it still
*is* the same variable when the font changes, *how* much change can
occur before the variable name changes in meaning?  Color, likewise,
depends very much on the output device.  It's all very nice to grep
for things in red sitting in front of my shiny new Sun-3, but what
does that mean when I'm back in front of my neighbor's Sun-2, which
is black-and-white?  Does `red' still mean red (i.e. an invisible
attribute), or is it subverted into `black'?  And of course, we
could argue about shades of red until the sun comes up.

   What do all these attributes have in common?  They vary continuously,
they depend on detailed properties of the output device, and (in spite
of the claim by tuba@ur-tut) they *modify* rather than *create* information.
Furthermore, that modification is often context-sensitive.  Conversely,
removing these additional attributes may *degrade* but should not *remove*
information entirely.  If you had to read a typeset document at your
terminal, you might lose some of the nuances, but the meaning (if any...)
would not disappear entirely.  If your laser printer were missing `14-point
Helvetica', and were forced to substitute `12-point Helvetica' instead,
surely you do not claim you will lose as much information as you would
if, say, your line printer were missing the letters a-m, and were forced
to substitute the letters n-z instead...

   In contrast to a possible disagreement over the meaning of `14-point
Courier', we will all agree on what an English `a' is.  It doesn't 
matter how it's represented (printed, hand-written, ASCII, EBCDIC, &c.),
it corresponds to the same abstract object.  Indeed (unlike the notion
of a specific font), `a' already *is* in some ways an abstract object.

   The basic criterion I propose for distinguishing `intrinsic' aspects of
a document from its formatting is thus as follows: those aspects of a
document which change discretely, have abstract properties independent
of an output device, and contain `inherent information', should be
considered the `intrinsic properties' of a document.  These, and only
these, should be embedded in the character codes.  Everything else is
formatting information, which may be (and will be, in reasonable systems)
supplied in a variety of manners (via WYSIWYG editors, sophisticated
formatters, &c.), but should *NOT* be embedded in the character codes.
These `intrinsic properties' include (of course) the abstract character
represented, and *perhaps* the language, but not much else.
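   To make the criterion concrete, here is one way (of many) that a
reasonable system might keep the two apart, sketched in C.  The names
and layout are purely illustrative, not a proposal: the intrinsic
text is an ordinary character array, and the formatting lives in a
separate list of runs pointing into it.

#include <stddef.h>

struct format_run {
    size_t      start, length;  /* span of text this run decorates */
    const char *font;           /* "Helvetica", "Courier", ...     */
    double      points;         /* 10.0, 13.75, ...                */
    unsigned    rgb;            /* 0xFF0000 for red, say           */
};

struct document {
    char              *text;  /* intrinsic content: what grep and
                                 the linker look at                */
    struct format_run *runs;  /* formatting: what the printer and
                                 the WYSIWYG editor look at        */
    size_t             nruns;
};

Throw the runs away and the rendering degrades, but the content
survives intact, which is exactly the asymmetry argued above.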

   All of this may sound rather formal and dry.  Here are a few examples
to play with.  Ask yourself: is "counter" in 14-point Helvetica really
completely different from "counter" in 10-point Helvetica?  How about
"counter" in 14.00-point Helvetica vs "counter" in 13.75-point Helvetica?
13.99 point Helvetica?  How about "counter" in red vs. "counter" in
green, seen on a black-and-white display?

   Now consider examples that are `intrinsically' different: does "counter"
mean something different from "aardvark"?  Is a "counter" different from
a "count"?  And here's a more subtle example to mull over: capitalization
is NOT a formatting attribute.  It conveys inherent meaning: an "Afghan"
is a person from Afghanistan, while an "afghan" is a blanket.  They are
different words.  My favorite examples along these lines (flames to 
/dev/null, please) are those where capitalizing a word turns it into
its antonym (or nearly so): e.g. "catholic" vs. "Catholic".
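   In code, the distinction is precisely that between an exact
comparison and a case-folding one.  A small sketch (I am assuming the
BSD/POSIX strcasecmp() for the case-folding comparison):

#include <stdio.h>
#include <string.h>
#include <strings.h>  /* strcasecmp() */

int main(void)
{
    /* Case lives in the character code, so strcmp() preserves the
       distinction between the two words... */
    printf("strcmp:     %s\n",
           strcmp("Afghan", "afghan") ? "different" : "same");
    /* ...while folding case away destroys that information. */
    printf("strcasecmp: %s\n",
           strcasecmp("Afghan", "afghan") ? "different" : "same");
    return 0;
}

This prints `different' and then `same': the fold is lossy.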

   To summarize: many of the visual aspects of documents produced
via modern printing devices are determined by formatting.  Formatting
information does not belong in the character set encoding.  Period.

                                       David A. Kosower
                                       kosower@harvard.Harvard.Edu.Arpa

lamy@utai.UUCP (Jean-Francois Lamy) (02/11/86)

It may already be too late for clear-cut distinctions. My file names on my
Macintosh already have accents in them, and my Pascal programs at Universite
de Montreal have variable names with accents (which are simply folded onto the
corresponding unaccented letter, which would not work for all languages).
Given ISO Latin 1, doing this folding, or even defining a proper collating
sequence, is essentially trivial: a simple index into a language-dependent
table (none is needed for English).
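Concretely, the folding is a single table lookup per character.  A
sketch in C, with only a handful of French entries filled in (a full
table would cover all 256 ISO Latin 1 codes; German's eszett, which
folds to the two letters "ss", is one example of why a one-to-one
table does not work for every language):

static unsigned char fold_fr[256];

static void init_fold_fr(void)
{
    int i;
    for (i = 0; i < 256; i++)
        fold_fr[i] = (unsigned char)i;  /* identity by default */
    fold_fr[0xE9] = 'e';                /* e-acute             */
    fold_fr[0xE8] = 'e';                /* e-grave             */
    fold_fr[0xEA] = 'e';                /* e-circumflex        */
    fold_fr[0xE0] = 'a';                /* a-grave             */
    fold_fr[0xE7] = 'c';                /* c-cedilla           */
}

/* Fold a NUL-terminated ISO Latin 1 string in place. */
static void fold(unsigned char *s)
{
    for ( ; *s; s++)
        *s = fold_fr[*s];
}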

The advent of cheap typography and high-resolution bit-mapped displays means
that it is becoming customary to see programs where keywords appear in
bold on the screen (e.g. MacPascal); typographic conventions for displaying
programs and emphasizing their structure may involve displaying comments in
italic, or strings in a "typewriter" font.  In such a case delimiters become
superfluous.  But I can see a lot of resistance to their removal, e.g. we
still use begin-end pairs even though indentation alone provides sufficient
clues to establish the block structure.

Similarly, if I remember properly, the Occam language manual distinguishes
terminals from non-terminals in syntax specifications by using colour.

I guess the point I'm trying to make is that there is indeed information in
the font, style variation, and colour/grey scale used.  But then even 16 bits
are not enough, so keeping it in every character is not a sensible idea.
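A rough bit budget makes the arithmetic explicit; the field widths
below are my guesses, and generous ones:

/* Bits needed per character once the attributes move in. */
enum {
    CHAR_BITS  = 16,  /* a large multi-language repertoire */
    FONT_BITS  = 10,  /* ~1000 typefaces                   */
    SIZE_BITS  = 10,  /* point size in tenths of a point   */
    STYLE_BITS =  4,  /* bold, italic, underline, ...      */
    RGB_BITS   = 24,  /* colour                            */
    TOTAL_BITS = CHAR_BITS + FONT_BITS + SIZE_BITS
               + STYLE_BITS + RGB_BITS  /* 64: eight bytes
                                           per character   */
};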
-- 

Jean-Francois Lamy              
Department of Computer Science, University of Toronto,         
Departement d'informatique et de recherche operationnelle,  U. de Montreal.

CSNet: lamy@toronto.csnet  UUCP: {utzoo,ihnp4,decwrl,uw-beaver}!utcsri!utai!lamy
EAN: lamy@iro.udem.cdn     ARPA: lamy%toronto.csnet@CSNET-RELAY.arpa