[net.internat] Languages, Computers, and Problems

kosower@harvard.ARPA (David A. Kosower) (11/03/85)

[Munch, munch]

   Recently, there has been a fair amount of discussion in this 
newsgroup on the dual subjects of [human] languages and character
sets.  Several points that ought to be made have not been, and so
I would like to make them here.

   At the moment, there are two basic kinds of activities people
use computers for: programming, and text-processing.  This
is a crass generalization, but I believe it captures the distinction
between handling material intended primarily for a mechanical or
technical audience and handling material intended primarily for a
human audience.  

   There is no question that the primary human language in the
first activity is English, and that the widely-used computer languages
have their roots in an English-speaking milieu.  There is little doubt
in my mind that this situation will continue for a long time,
certainly well into the coming century.  The reasons are varied but
powerful, including the near-universal use of English in scientific
research of any consequence, and the enormous size of and intense
activity within the American computer industry and markets.  The
relative ease with which one can introduce new terminology into
the language and the low barrier to acquiring a basic
facility in English also play a role.

   This suggests that every reasonable computer system in coming
years will have to handle English and the ASCII character set
(witness that even IBM was forced to use ASCII for
its ventures into the PC market -- and there aren't any other
companies able to impose their own low-level standards on some
other segment of the market).  Programming and system internals
will continue to be done as they are done in the US, if only 
because the overwhelming fraction of programs written will 
continue to be written by English-speaking individuals.  Folks
doing things differently elsewhere would be wasting their time
and dooming themselves to incompatibility for its own sake.

   On the other hand, there is a vast Babel of languages used by 
people to communicate with other people, and I do not expect that
situation to change in my lifetime, either.  I wouldn't want it to;
we would lose many cultural riches were that to happen.  In order
that non-technical folk be willing to use computers as intermediaries,
computers must be able to handle their *language*.  This certainly
includes the ability to handle non-ASCII character sets, but much
more than that: editors that give help messages in the native language,
dictionaries, spelling and syntax checkers in that language, text
processors that understand how to hyphenate in the native language,
and much more.
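
   (As an aside on the mechanics: one simple way to get help and
error messages out in the user's own language is a message catalog,
in which utilities look their messages up by number instead of
embedding English strings.  The sketch below, in C, is only an
illustration; the strings, the table, and the language-selection
mechanism are invented for the example.)

    #include <stdio.h>

    /*
     * A minimal message catalog: the editor asks for a message by
     * symbolic number, so the same program can speak Danish (or
     * anything else) to the user.  Everything here is invented.
     */
    enum { MSG_SAVED, MSG_NO_SUCH_FILE };

    static const char *catalog[][2] = {
        /*  English            Danish               */
        { "File saved.",      "Filen er gemt."      },
        { "No such file.",    "Filen findes ikke."  },
    };

    static const char *message(int lang, int msg)
    {
        return catalog[msg][lang];
    }

    int main(void)
    {
        int lang = 1;               /* pretend the user chose Danish */

        printf("%s\n", message(lang, MSG_SAVED));
        printf("%s\n", message(lang, MSG_NO_SUCH_FILE));
        return 0;
    }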

   This suggests that foreign-language utilities will appear as a
veneer on top of an English-based system.  (The Xerox Star is an example of
a well-designed system of this kind).

   People will certainly want the ability to use *languages* and
character sets other than their own (this is far more important
outside the US than inside).
But the importance of such support, and the need for efficiency, are
not independent of distance: it is much more important for a Dane,
say, to be able to write in German than it is for him to be able to
write in Korean.
Thus the Dane's text-handling utilities should be optimized for
handling European languages and character sets, though they should
be *able* to handle any [reasonable] language or character set.
The situation is not symmetrical, because of the world-wide 
importance of English, but the general idea is that "local"
languages should be handled more fully and more efficiently.
This has direct bearing on the question of character set
representation; it is likely that the correct choice for intra-
computer representation of a text is different on machines in
different parts of the world: the Dane might use an 8-bit
representation for characters, with escape sequences for Oriental
characters, while the Korean would use 16-bit characters internally,
perhaps mixed with an 8-bit representation for English.  As an aside,
note that the information density comes out to be roughly the same
(actually, I can't vouch for Korean, but this is more-or-less true
for Japanese and Chinese): although each character takes up more
bits, it also conveys more information, typically a whole syllable
or morpheme (2-3 letters in English) rather than merely a letter.
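
   To make the mixed representation concrete, here is a small sketch,
in C, of one hypothetical scheme of this sort: a byte below 0x80 is an
ordinary ASCII character, and a byte with the high bit set introduces
a two-byte `wide' character.  The byte values are invented for the
example; real mixed encodings differ in detail.

    #include <stdio.h>

    /*
     * Count characters in a hypothetical mixed 8/16-bit encoding:
     * a byte below 0x80 is a one-byte (ASCII) character; a byte
     * with the high bit set introduces a two-byte "wide" character.
     */
    static int count_chars(const unsigned char *s, int nbytes)
    {
        int i = 0, nchars = 0;

        while (i < nbytes) {
            i += (s[i] & 0x80) ? 2 : 1;   /* wide char: two bytes */
            nchars++;
        }
        return nchars;
    }

    int main(void)
    {
        /* "abc" followed by two invented wide characters */
        unsigned char text[] = { 'a', 'b', 'c', 0xB0, 0xA1, 0xB0, 0xA2 };

        printf("%d bytes, %d characters\n",
               (int)sizeof text, count_chars(text, (int)sizeof text));
        return 0;
    }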

   What about inter-computer transmission of information?  Where
should the conversion take place?  On both sides, of course: neither
representation is the most efficient for transmitting information,
so a standard allowing for data compression should be utilized for
this purpose.
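
   A sketch of the sending side, with the general-purpose zlib
compressor standing in for whatever compressed interchange format
such a standard might actually specify:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>                     /* link with -lz */

    /*
     * Compress a text before transmission.  zlib is only a stand-in
     * here; the point is that the wire format need not be either
     * machine's internal representation.  (Very short strings may
     * grow; the gain comes on documents of realistic length.)
     */
    int main(void)
    {
        const char *doc = "Tekst, tekst, og atter tekst til overforsel.";
        uLong  doclen    = (uLong)strlen(doc);
        Bytef  packed[256];
        uLongf packedlen = sizeof packed;

        if (compress(packed, &packedlen,
                     (const Bytef *)doc, doclen) != Z_OK)
            return 1;

        printf("%lu bytes of text, %lu bytes on the wire\n",
               (unsigned long)doclen, (unsigned long)packedlen);
        return 0;
    }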

   Such a standard must also be able to handle the meta-questions
of transmission: what happens if the recipient cannot represent all
of the information in the file being transmitted?  This could happen
because the recipient's system is more limited in its ability to
handle foreign character sets, or because the information transmitted
contained characters private to the transmitting site (e.g. logos --
even 16 bits won't be enough if we want our universal character set
to include everyone's logos and special symbols!).
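
   On the receiving side, the mildest reasonable behavior is to
substitute a visible placeholder for anything that cannot be
represented, rather than dropping it silently.  A sketch, assuming
(for the example only) that the recipient can display nothing beyond
7-bit ASCII:

    #include <stdio.h>

    /*
     * Replace every character the recipient cannot represent with a
     * visible placeholder.  For this example only, anything outside
     * 7-bit ASCII is assumed to be unrepresentable at the far end.
     */
    static void degrade(const unsigned char *in, char *out)
    {
        for (; *in != '\0'; in++)
            *out++ = (*in < 0x80) ? (char)*in : '?';
        *out = '\0';
    }

    int main(void)
    {
        unsigned char original[] = "K\xF8" "benhavn";   /* Latin-1 */
        char received[sizeof original];

        degrade(original, received);
        printf("sent:     %s\nreceived: %s\n",
               (char *)original, received);
        return 0;
    }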

   I am suggesting that the low-level internal character-set
standards -- the questions of which bit patterns represent which
abstract characters -- are not, beyond the omnipresent ASCII,
likely to be consistent from one machine to another.  Nor is it
really that important; the more abstract issues of designing and 
implementing foreign-language and multi-lingual applications to sit 
on top of the operating system are much more crucial.  There is, for 
example, a large body of knowledge that has been built up over the years
on the proper and elegant way to handle hyphenation automatically
in English.  There are a variety of algorithms and methods that
text formatters use.  Unless I am mistaken, there is a good deal
less known about such questions even in most European languages,
let alone others (say, Hebrew).  It is these questions we ought
to be applying ourselves to.
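
   As one small illustration of the mechanics, here is a toy sketch of
the dictionary side of hyphenation: an exception table listing the
permissible break points for a handful of words.  Real formatters
back such a table with pattern-based rules (TeX's method is the
best-known example) for words not listed; the entries below are
invented.

    #include <stdio.h>
    #include <string.h>

    /*
     * A toy exception dictionary for hyphenation: each entry lists
     * the character offsets at which a word may be broken
     * (0-terminated).  The entries are invented for the example.
     */
    struct hyph_entry {
        const char *word;
        int breaks[8];                       /* offsets, 0-terminated */
    };

    static const struct hyph_entry table[] = {
        { "hyphenation", { 2, 6, 0 } },      /* hy-phen-ation */
        { "formatter",   { 3, 6, 0 } },      /* for-mat-ter   */
    };

    static void show_breaks(const char *word)
    {
        size_t i, j;
        int k;

        for (i = 0; i < sizeof table / sizeof table[0]; i++) {
            if (strcmp(word, table[i].word) != 0)
                continue;
            for (j = 0, k = 0; word[j] != '\0'; j++) {
                if (j > 0 && table[i].breaks[k] == (int)j) {
                    putchar('-');
                    k++;
                }
                putchar(word[j]);
            }
            putchar('\n');
            return;
        }
        printf("%s  (no entry -- fall back to pattern rules)\n", word);
    }

    int main(void)
    {
        show_breaks("hyphenation");
        show_breaks("formatter");
        show_breaks("gedankenexperiment");
        return 0;
    }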

  To pick another example, lexical sorting, a trivial task in
ASCII-encoded English, becomes somewhat non-trivial for foreign
languages.  Of course, for a *single* foreign language, one can
handle this by replacing the character-comparison loop with a table-
lookup scheme, or a mixed scheme.  There are even ways to order kanji
(after all, the Japanese have dictionaries, too), and these can be
taught to a computer.  The real problems arise in mixed-language situations;
what is the appropriate ordering in that case?  One way out is
to have a [hierarchical] notion of `default' or `environment'
language.  Thus, a few foreign words, representable in a Latin
alphabet, appearing in an English document on an American computer
system, should probably sort in the index as though they were
English words; this is what is most intuitive to an English
speaker.  A user on a foreign system might want the index
to be sorted in the same fashion (on the grounds that he
must understand English anyhow), or he might want it to be sorted
according to his native language's sorting rules, since he feels
more comfortable with those.  It is quite likely that different
users will feel differently about such issues; indeed, even the
same user might make different decisions about different documents.
The text-processing applications will thus have to be more
flexible in dealing with such issues; a sorting order cannot be
hard-coded into a specific bit representation, since it may be
context-sensitive.  Implicit in this attitude, incidentally, is
the viewpoint that a document manipulated on a computer is a far
more fluid and flexible object than some instantiation of it
produced by a laser printer.  After all, we may have different,
even mutually incompatible, views of a given document in different
contexts; why shouldn't our computers be the same?  (This is NOT
a plea for some utopian God-and-reality AI system; I am not expecting
the computer to *understand* my document, merely to be able to
show it differently when contexts change).
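
   As a sketch of the table-lookup scheme mentioned above, here is a
comparison routine driven by a collation table rather than by raw
byte values.  The weights are a crude approximation of Danish order
in an 8-bit Latin character set, and are meant only to show the
mechanism, not to be a correct Danish collation:

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sorting driven by a collation table instead of raw byte values.
     * The weights are a rough stand-in for Danish order: ae (0xE6),
     * o-slash (0xF8), and a-ring (0xE5) sort after 'z'.
     */
    static int weight[256];

    static void init_weights(void)
    {
        int i;

        for (i = 0; i < 256; i++)
            weight[i] = i;                   /* default: encoding order */
        weight[0xE6] = 'z' + 1;              /* ae ligature             */
        weight[0xF8] = 'z' + 2;              /* o with stroke           */
        weight[0xE5] = 'z' + 3;              /* a with ring             */
    }

    static int da_compare(const void *a, const void *b)
    {
        const unsigned char *s =
            (const unsigned char *)*(const char *const *)a;
        const unsigned char *t =
            (const unsigned char *)*(const char *const *)b;

        while (*s != '\0' && weight[*s] == weight[*t]) {
            s++;
            t++;
        }
        return weight[*s] - weight[*t];
    }

    int main(void)
    {
        const char *words[] = { "zebra", "\xE5rhus", "aften", "\xF8l" };
        size_t n = sizeof words / sizeof words[0], i;

        init_weights();
        qsort(words, n, sizeof words[0], da_compare);
        for (i = 0; i < n; i++)
            printf("%s\n", words[i]);   /* aften, zebra, then the two
                                           non-ASCII words             */
        return 0;
    }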

  I close with a list, intended to be thought-provoking, not
exhaustive, of problems for automated or automation-assisted tools
that I believe are interesting and merit attention:

   Multi-lingual sorting.
   Hyphenation in foreign languages.
   Checking spelling, syntax, agreement, and usage in foreign
      languages and in multi-lingual documents.
   Language-to-language dictionaries and other translation aids.
   Aesthetics of text formatting, especially in Oriental languages.
   Handling dialects.
   Transmission of information to and through systems with a more
      limited linguistic ability.
   What is a property of a document, and what is a property of
      a local computer system?  (E.g., do the special symbols and fonts
      associated with a document travel along with it?)
   Efficient search and match algorithms in foreign languages.
      (Especially in Oriental languages, where users may want to search
       for characters that are part of other characters, perhaps in
       a [visually] altered form; a boundary-respecting sketch follows
       this list).
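
   The sketch promised in the last item: a substring search over the
hypothetical mixed 8/16-bit encoding sketched earlier, in which a
candidate match may begin only at the start of a character, so that
the trailing byte of a wide character is never mistaken for an ASCII
letter.  A plain byte-by-byte search over the same text would report
a false match one byte into the wide character.

    #include <stdio.h>
    #include <string.h>

    /*
     * Substring search that respects character boundaries in the
     * hypothetical mixed 8/16-bit encoding: a candidate match may
     * start only at the beginning of a character.
     */
    static int find_at_boundary(const unsigned char *text, int textlen,
                                const unsigned char *pat, int patlen)
    {
        int i = 0;

        while (i + patlen <= textlen) {
            if (memcmp(text + i, pat, (size_t)patlen) == 0)
                return i;                     /* byte offset of match */
            i += (text[i] & 0x80) ? 2 : 1;    /* step one character   */
        }
        return -1;
    }

    int main(void)
    {
        /* one invented wide character whose trailing byte happens to
           be 'a', followed by "bc abc" */
        unsigned char text[] = { 0xB0, 'a', 'b', 'c', ' ', 'a', 'b', 'c' };
        unsigned char pat[]  = { 'a', 'b', 'c' };

        printf("match begins at byte %d\n",
               find_at_boundary(text, (int)sizeof text,
                                pat, (int)sizeof pat));
        return 0;
    }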

                                     David A. Kosower
                                     kosower@harvard.HARVARD.EDU.ARPA