kosower@harvard.ARPA (David A. Kosower) (11/03/85)
[Munch, munch] Recently, there has been a fair amount of discussion in this newsgroup on the dual subjects of [human] languages and character sets. Several points that ought to be made have not been, and so I would like to make them here.

At the moment, there are two basic kinds of activities people use computers for: programming, and text-processing. This is a crass generalization, but I believe it captures the distinction between handling material intended primarily for a mechanical or technical audience and handling material intended primarily for a human audience.

There is no question that the primary human language of the first activity is English, and that the widely-used computer languages have their roots in an English-speaking milieu. There is little doubt in my mind that this situation will continue for a long time, certainly well into the coming century. The reasons are varied but powerful: the near-universal use of English in scientific research of any consequence; the enormous size of, and intense activity within, the American computer industry and markets; the relative ease with which one can introduce new terminology into the language; and the lack of barriers to acquiring a basic facility in English. This suggests that every reasonable computer system in coming years will have to handle English and the ASCII character set (witness that even IBM was forced to use ASCII for its ventures into the PC market -- and no other company is in a position to impose its own low-level standards on some other segment of the market). Programming and system internals will continue to be done as they are done in the US, if only because the overwhelming fraction of programs written will continue to be written by English-speaking individuals. Folks doing things differently elsewhere would be wasting their time and dooming themselves to incompatibility for its own sake.

On the other hand, there is a vast Babel of languages used by people to communicate with other people, and I do not expect that situation to change in my lifetime, either. Nor would I want it to; we would lose many cultural riches were that to happen. In order that non-technical folk be willing to use computers as intermediaries, computers must be able to handle their *language*. This certainly includes the ability to handle non-ASCII character sets, but much more besides: editors that give help messages in the native language, dictionaries, spelling and syntax checkers for that language, text processors that understand how to hyphenate in the native language, and so on. This suggests that foreign-language utilities will appear as a veneer on top of an English-based system. (The Xerox Star is an example of a well-designed system of this kind.)

People will certainly want the ability to use *languages* and character sets other than their own (this is far more important outside the US than inside). But the importance of such support, and the need for it to be efficient, are not independent of distance: it is much more important for a Dane, say, to be able to write in German than in Korean. Thus the Dane's text-handling utilities should be optimized for handling European languages and character sets, though they should be *able* to handle any [reasonable] language or character set. The situation is not symmetrical, because of the world-wide importance of English, but the general idea is that "local" languages should be handled more fully and more efficiently.
This has direct bearing on the question of character-set representation; it is likely that the correct choice for intra-computer representation of a text differs on machines in different parts of the world. The Dane might use an 8-bit representation for characters, with escape sequences for Oriental characters, while the Korean would use 16-bit characters internally, perhaps mixed with an 8-bit representation for English. As an aside, note that the information density comes out roughly the same (actually, I can't vouch for Korean, but this is more-or-less true for Japanese and Chinese): although each character takes up more bits, it also conveys more information, typically a phoneme rather than merely a letter -- one 16-bit character against the 2-3 eight-bit letters, or 16-24 bits, of the English equivalent.

What about inter-computer transmission of information? Where should the conversion take place? On both sides, of course: neither representation is the most efficient for transmission, so a standard allowing for data compression should be used for that purpose. Such a standard must also handle the meta-questions of transmission: what happens if the recipient cannot represent all of the information in the file being transmitted? This could happen because the recipient's system is more limited in its ability to handle foreign character sets, or because the transmitted information contained characters private to the transmitting site (e.g. logos -- even 16 bits won't be enough if we want our universal character set to include everyone's logos and special symbols!).

What I am suggesting is that the low-level internal character-set standards -- the questions of which bit patterns represent which abstract characters, beyond the omnipresent ASCII -- are not likely to be consistent from one machine to another. Nor does that really matter much; the more abstract issues of designing and implementing foreign-language and multi-lingual applications to sit on top of the operating system are far more crucial. There is, for example, a large body of knowledge built up over the years on the proper and elegant way to hyphenate English automatically, and a variety of algorithms and methods that text formatters use. Unless I am mistaken, a good deal less is known about such questions even in most European languages, let alone others (say, Hebrew). It is these questions we ought to be applying ourselves to.

To pick another example, lexical sorting, a trivial task in ASCII-encoded English, becomes somewhat non-trivial for foreign languages. Of course, for a *single* foreign language, one can handle this by replacing the character-comparison loop with a table-lookup scheme, or a mixed scheme. There are even ways to order kanji (after all, the Japanese have dictionaries, too), and these can be taught to a computer. The real problems arise in mixed-language situations: what is the appropriate ordering then? One way out is to have a [hierarchical] notion of `default' or `environment' language. Thus a few foreign words, representable in a Latin alphabet, appearing in an English document on an American computer system should probably sort in the index as though they were English words; this is what is most intuitive to an English speaker. A user on a foreign system might want the index sorted in the same fashion (on the grounds that he must understand English anyhow), or he might want it sorted according to his native language's sorting rules, since he feels more comfortable with those.
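Before going on, let me make the representation question above concrete. Here is a minimal sketch (in Python) of a mixed 8-bit/16-bit scheme of the sort the Dane or the Korean might adopt internally. The scheme is invented purely for illustration -- bytes below 0x80 are plain ASCII, and a byte with its high bit set opens a two-byte, 16-bit character -- and is not any existing standard:

    def decode_mixed(data):
        """Decode a mixed 8-bit/16-bit byte stream into integer codes."""
        codes, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                       # plain ASCII: one byte
                codes.append(b)
                i += 1
            else:                              # 16-bit character: two bytes
                if i + 1 >= len(data):
                    raise ValueError("truncated 16-bit character")
                codes.append(((b & 0x7F) << 8) | data[i + 1])
                i += 2
        return codes

    def encode_mixed(codes):
        """Inverse of decode_mixed; codes must fit in 15 bits."""
        out = bytearray()
        for c in codes:
            if c < 0x80:                       # ASCII passes through
                out.append(c)
            elif c < 0x8000:                   # 15 bits of code + flag bit
                out.append(0x80 | (c >> 8))
                out.append(c & 0xFF)
            else:
                raise ValueError("code too large for this scheme")
        return bytes(out)

    # Round trip: "ab" followed by a hypothetical 16-bit character 0x3042.
    assert decode_mixed(encode_mixed([0x61, 0x62, 0x3042])) == [0x61, 0x62, 0x3042]

A real scheme would need genuine escape sequences or more codespace; the point is only that this choice is a purely local optimization, invisible to the applications sitting above it.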
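The transmission meta-question can be sketched too. One plausible answer -- my suggestion, not an existing standard -- is graceful degradation on the receiving side: codes the local system cannot display are replaced by a visible, reversible escape rather than being silently dropped. The escape syntax here is invented for the example:

    def degrade(codes, can_represent):
        """Render integer character codes on a limited system.

        can_represent -- predicate: does the local system have this glyph?
        """
        out = []
        for c in codes:
            if can_represent(c):
                out.append(chr(c))
            else:
                out.append("\\x{%04X}" % c)    # visible, reversible escape
        return "".join(out)

    # An ASCII-only recipient sees:
    print(degrade([0x61, 0x62, 0x3042], lambda c: c < 0x80))   # ab\x{3042}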
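And here is the table-lookup sorting scheme in miniature, with the `environment' language selecting the table. The weight tables are made up for the example; real national collation rules must also cope with digraphs, case, and accents, and I make no claim that the folding shown is official Danish or English practice:

    # Hypothetical weight tables (illustrative only). Danish sorts
    # ae-ligature, o-slash, and a-ring after 'z'.
    AE, OSLASH, ARING = "\u00e6", "\u00f8", "\u00e5"
    DANISH = {ch: w for w, ch in
              enumerate("abcdefghijklmnopqrstuvwxyz" + AE + OSLASH + ARING)}
    ENGLISH = {ch: w for w, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
    # In an English environment, fold the foreign letters into English ones:
    ENGLISH.update({AE: ENGLISH["a"], OSLASH: ENGLISH["o"], ARING: ENGLISH["a"]})

    def sort_key(word, weights):
        # Characters missing from the table sort after all known ones.
        return [weights.get(ch, len(weights)) for ch in word.lower()]

    words = [ARING + "got", "zebra", "able"]
    print(sorted(words, key=lambda w: sort_key(w, DANISH)))    # a-ring after z
    print(sorted(words, key=lambda w: sort_key(w, ENGLISH)))   # a-ring as 'a'

The same three words sort differently depending on the environment language -- precisely the context-sensitivity I have in mind.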
It is quite likely that different users will feel differently about such sorting issues; indeed, even the same user might make different decisions about different documents. Text-processing applications will thus have to be more flexible in dealing with such issues; a sorting order cannot be hard-coded into a specific bit representation, since it may be context-sensitive.

Implicit in this attitude, incidentally, is the viewpoint that a document manipulated on a computer is a far more fluid and flexible object than some instantiation of it produced by a laser printer. After all, we may have different, even mutually incompatible, views of a given document in different contexts; why shouldn't our computers be the same? (This is NOT a plea for some utopian God-and-reality AI system; I am not expecting the computer to *understand* my document, merely to be able to show it differently when contexts change.)

I close with a list, intended to be thought-provoking rather than exhaustive, of problems for automated or automation-assisted tools that I believe are interesting and merit attention:

   - Multi-lingual sorting.
   - Hyphenation in foreign languages.
   - Checking spelling, syntax, agreement, and usage in foreign
     languages and in multi-lingual documents.
   - Language-to-language dictionaries and other translation aids.
   - Aesthetics of text formatting, especially in Oriental languages.
   - Handling dialects.
   - Transmission of information to and through systems with more
     limited linguistic abilities.
   - What is a property of a document, and what is a property of a
     local computer system? (E.g., do the special symbols and fonts
     associated with a document travel along with it?)
   - Efficient search and match algorithms in foreign languages,
     especially Oriental languages, where users may want to search
     for characters that are part of other characters, perhaps in a
     [visually] altered form.

                                        David A. Kosower
                                        kosower@harvard.HARVARD.EDU.ARPA