kosower@harvard.UUCP (David A. Kosower) (02/09/86)
In article <343@ur-tut.UUCP> tuba@ur-tut.UUCP writes:

>What's all this business about linkers that only support monofont
>symbols? When I name a variable "counter" in 10 point Helvetica,
>the last thing I mean is "counter" in 14 point Courier. Linkers ...

In article <832@inset.UUCP>, mikeb@inset.UUCP (Mike Banahan) adds,

> But you are *WRONG* if you think that is a joke. Tuba is RIGHT
> RIGHT RIGHT, except for missing out the other essential attributes that we
> will need if we are going to get an environment fit for the 1960's, let
> alone the late 70's or even 80's. What about colour? I want to be able to grep
> for all the 10 point helvetica non-alphabetic strings in red, because that
> is the way that my application displays error messages.

NO!!!! I must disagree. As I will demonstrate below, you are confusing two different aspects of a document (or pieces thereof): the `intrinsic' content and its formatting. The boundary between the two may be a little fuzzy, but things like the font, point size, and color are most definitely formatting information, not intrinsic content.

Before giving my criterion for distinguishing the two aspects, let's take the idea proposed in the quoted articles a little further, to truly silly extremes. Now, on dumb terminals, you can't adjust the spacing between letters in a variable name to arbitrary accuracy; you can only insert `hard' (i.e. required) spaces. But in the brave new world of modern laser printers, you could insert, say, a 1 pt space between the first and the third letters. Or even a 0.1 pt space, on a laser typesetter. Does that mean we need a new character to indicate the spacing? Or that each character should now be preceded by a *floating point* number indicating the amount of spacing preceding it? Do you truly believe that `InformUser' with 0.1 pts of space between the `r' and the `m' means something different from plain old `InformUser'?
Or how about the attributes of the surrounding text, say the inter-line spacing? Isn't that something one ought to be able to grep in our utopian operating system? It's clear that such an approach quickly sucks you into a specific WYSIWYG representation of text, and is completely inadequate for anything else. Embedding this information in every character would also result in a grossly inefficient encoding; space may be cheap on modern computers, but it isn't THAT cheap.

[For those souls who don't know what a printer's point is: there are ~72/inch, or ~2.8/mm, so that 0.1 pts of anything are virtually invisible, and in any event unprintable on a printer with only 300 dots/inch resolution.]

And consider for a minute the specific proposals made by the two authors quoted above. What *exactly* does `14-point' Courier mean, anyway? Does it mean `somewhere between 13.75 and 14.02-point Courier, as suits my laser printer', or does it mean `14.00-point Courier'? What does `Courier' mean? Does it mean the IBM version? Or the Little Font Company's version? Does a variable name no longer mean the same thing if it isn't *exactly* the same font? And if it still *is* the same variable when the font changes, *how* much change can occur before the variable name changes in meaning?

Color, likewise, depends very much on the output device. It's all very nice to grep for things in red sitting in front of my shiny new Sun-3, but what does that mean when I'm back in front of my neighbor's Sun-2, which is black-and-white? Does `red' still mean red (i.e. an invisible attribute), or is it subverted into `black'? And of course, we could argue about shades of red until the sun comes up.

What do all these attributes have in common? They vary continuously, they depend on detailed properties of the output device, and (in spite of the claim by tuba@ur-tut) they *modify* rather than *create* information. Furthermore, that modification is often context-sensitive.
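The bracketed arithmetic about point sizes is easy to check. A quick sketch, using the same ~72-points-per-inch approximation as above (the function names are mine, purely for illustration):

```python
POINTS_PER_INCH = 72.0  # the approximation used above
MM_PER_INCH = 25.4

def points_to_mm(pts):
    """Convert printer's points to millimetres."""
    return pts * MM_PER_INCH / POINTS_PER_INCH

def points_to_dots(pts, dpi=300):
    """How many device dots a length in points spans at a given resolution."""
    return pts * dpi / POINTS_PER_INCH

# A 0.1 pt space is a few hundredths of a millimetre,
# and less than half of one dot on a 300 dpi laser printer:
print(round(points_to_mm(0.1), 3))    # 0.035 (mm)
print(round(points_to_dots(0.1), 2))  # 0.42 (dots)
```

Since a 300 dpi printer cannot render a fraction of a dot, a 0.1 pt space is indeed unprintable there, exactly as claimed.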
Conversely, removing these additional attributes may *degrade* but should not *remove* information entirely. If you had to read a typeset document at your terminal, you might lose some of the nuances, but the meaning (if any...) would not disappear entirely. If your laser printer were missing `14-point Helvetica', and were forced to substitute `12-point Helvetica' instead, surely you do not claim you would lose as much information as you would if, say, your line printer were missing the letters a-m, and were forced to substitute the letters n-z instead...

In contrast to a possible disagreement over the meaning of `14-point Courier', we will all agree on what an English `a' is. It doesn't matter how it's represented (printed, hand-written, ASCII, EBCDIC, &c.); it corresponds to the same abstract object. Indeed (unlike the notion of a specific font), `a' already *is* in some ways an abstract object.

The basic criterion I propose for distinguishing the `intrinsic' aspects of a document from its formatting is thus as follows: those aspects of a document which change discretely, have abstract properties independent of an output device, and contain `inherent information' should be considered the `intrinsic properties' of a document. These, and only these, should be embedded in the character codes. Everything else is formatting information, which may be (and will be, in reasonable systems) supplied in a variety of ways (via WYSIWYG editors, sophisticated formatters, &c.), but should *NOT* be embedded in the character codes. These `intrinsic properties' include (of course) the abstract character represented, and *perhaps* the language, but not much else.

All of this may sound rather formal and dry. Here are a few examples to play with. Ask yourself: is "counter" in 14-point Helvetica really completely different from "counter" in 10-point Helvetica? How about "counter" in 14.00-point Helvetica vs. "counter" in 13.75-point Helvetica? 13.99-point Helvetica?
How about "counter" in red vs. "counter" in green, seen on a black-and-white display? Now consider examples that are `intrinsically' different: does "counter" mean something different from "aardvark"? Is a "counter" different from a "count"?

And here's a more subtle example to mull over: capitalization is NOT a formatting attribute. It conveys inherent meaning: an "Afghan" is a person from Afghanistan, while an "afghan" is a sweater. They are different words. My favorite examples along these lines (flames to /dev/null, please) are those where capitalizing a word turns it into its antonym (or nearly so): e.g. "catholic" vs. "Catholic".

To summarize: many of the visual aspects of documents produced via modern printing devices are determined by formatting. Formatting information does not belong in the character set encoding. Period.

David A. Kosower
kosower@harvard.Harvard.Edu.Arpa
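The separation Kosower argues for can be sketched as a data structure: character codes carry only the intrinsic content, and formatting lives in a parallel list of style runs. Everything here (the `StyleRun` record, the field names) is a hypothetical illustration, not any real system's format:

```python
from dataclasses import dataclass

@dataclass
class StyleRun:
    start: int    # offset into the plain text
    length: int
    font: str
    size_pt: float
    color: str

# Intrinsic content: just the character codes.
text = "InformUser(counter);"

# Formatting: supplied separately, never embedded in the characters.
runs = [
    StyleRun(0, 10, "Helvetica", 10.0, "black"),
    StyleRun(11, 7, "Courier", 14.0, "red"),
]

# grep/search and the linker see only the intrinsic content:
assert "counter" in text

# Discarding the runs degrades the display but removes no content.
print(text)  # same characters, default rendering
```

The design choice is exactly the article's criterion: the discrete, device-independent part is searchable on its own, and the continuous, device-dependent part can be lost or substituted without destroying meaning.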
lamy@utai.UUCP (Jean-Francois Lamy) (02/11/86)
It may already be too late for clear-cut distinctions. My file names on my Macintosh already have accents in them, and my Pascal programs at Universite de Montreal have variable names with accents (which are simply folded onto the corresponding unaccented letter, which would not work for all languages). Given ISO Latin 1, doing this folding, or even defining a proper collating sequence, is essentially trivial (a simple indexation in a language-dependent table -- none needed for English).

The advent of cheap typography and high-resolution bit-mapped displays means that it is becoming customary to see programs where keywords appear in bold on the screen (e.g. MacPascal); typographic conventions for displaying programs and emphasizing their structure may involve displaying comments in italic, or strings in a "typewriter" font. In such a case delimiters become superfluous. But I can see a lot of resistance to their removal; e.g. we still use begin-end pairs even though indentation alone provides sufficient clues to establish the block structure. Similarly, if I remember properly, the Occam language manual distinguishes terminals from non-terminals in syntax specifications by using colour.

I guess the point I'm trying to make is that there is indeed information in the font, style variation, and colour/grey scale used. But then even 16 bits are not enough, so keeping it in every character is not a sensible idea.
--
Jean-Francois Lamy
Department of Computer Science, University of Toronto,
Departement d'informatique et de recherche operationnelle, U. de Montreal.

CSNet: lamy@toronto.csnet
UUCP:  {utzoo,ihnp4,decwrl,uw-beaver}!utcsri!utai!lamy
EAN:   lamy@iro.udem.cdn
ARPA:  lamy%toronto.csnet@CSNET-RELAY.arpa
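Lamy's folding -- "a simple indexation in a language-dependent table" -- can be sketched for ISO Latin 1. The table below is a deliberately partial, hypothetical example covering only a few accented code points, not a complete Latin-1 mapping:

```python
# Partial ISO Latin 1 fold table: accented code point -> base letter.
FOLD = {
    0xE0: "a", 0xE1: "a", 0xE2: "a",  # a-grave, a-acute, a-circumflex
    0xE8: "e", 0xE9: "e", 0xEA: "e",  # e-grave, e-acute, e-circumflex
    0xE7: "c",                        # c-cedilla
    0xF4: "o",                        # o-circumflex
}

def fold(s):
    """Fold accented Latin-1 letters onto unaccented ones; others pass through."""
    return "".join(FOLD.get(ord(ch), ch) for ch in s)

print(fold("Universit\xe9 de Montr\xe9al"))  # -> "Universite de Montreal"
```

A collating sequence works the same way: index a per-language table mapping each code point to its sort key; for English the identity table suffices. As Lamy notes, this single-table folding breaks down for languages where an accented letter is a distinct letter with its own collation position.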