[comp.std.internat] What is a byte

dupuy@amsterdam.columbia.edu (Alexander Dupuy) (01/01/70)

In article <20131@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP
 (David Phillip Oster) writes:

>  There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>meaning that the next 16-bit word had the 65534 next most common, and
>so on.  
>
>That way, the average length of a run of chinese text is
>likely to be about 10 bits per ideogram, and any single ideogram would
>have canonical 64 bit representation: its bit pattern in the left of
>the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
>patterns and padded out with filler nybbles.

This underscores the central tradeoff in a code for Chinese or Chinese/Japanese
- compact representation to save disk space versus consistent (same character
size) representation for processing.

But there is really no reason we have to trade these off against each other.
We can just define a consistent representation for processing (24 or 32 bits
will suffice - I don't think we need 64) and use a compression algorithm
(Lempel-Ziv, Huffman, whatever, as long as it's standard, and not too expensive
to decode/encode) when we aren't manipulating individual characters.  Some
languages even have rudimentary forms of support for this (packed array of char
vs. array of char in Pascal).
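The split between a fixed-size processing form and a packed storage form can be
sketched in a few lines of C.  The packing scheme below is invented purely for
illustration (it is not any standard, and a real implementation would use
Lempel-Ziv or Huffman as suggested above); the point is only that the program
manipulates fixed 32-bit units and packs/unpacks at the I/O boundary:

```c
#include <stddef.h>

typedef unsigned long wchar32;   /* fixed-width unit used for processing */

/* Pack one character into out[]; returns the number of bytes written
 * (1 to 5).  Values below 0x80 take a single byte; larger values get a
 * length byte 0x80+n followed by n payload bytes, most significant
 * first.  (Sketch: assumes values fit in 32 bits.) */
size_t pack_char(wchar32 c, unsigned char *out)
{
    size_t n, i;

    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    for (n = 1; n < 4 && (c >> (8 * n)) != 0; n++)
        ;
    out[0] = (unsigned char)(0x80 + n);
    for (i = 0; i < n; i++)
        out[1 + i] = (unsigned char)((c >> (8 * (n - 1 - i))) & 0xff);
    return 1 + n;
}

/* Unpack one character from in[]; stores it in *c, returns bytes consumed. */
size_t unpack_char(const unsigned char *in, wchar32 *c)
{
    size_t n, i;

    if (in[0] < 0x80) {
        *c = in[0];
        return 1;
    }
    n = in[0] - 0x80;
    *c = 0;
    for (i = 0; i < n; i++)
        *c = (*c << 8) | in[1 + i];
    return 1 + n;
}
```

Everything between a pack and an unpack can assume one character per array
element, which is exactly the consistency property wanted for processing.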

It's clear that operating system support has to be much better than it is now
for there to be any hope of writing programs which are portable between
Latin-only, Chinese/Japanese-only, and Chinese/Japanese/Latin environments.
I don't see the programming language constructs as being the major problem.

@alex
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia, and i

lambert@cwi.nl (Lambert Meertens) (01/01/70)

In article <970@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
) In article <51@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
) >There seems to be a good reason for [using Kanji]: after romanization,
) >words written differently in Kanji may become the same.
) 
) The romanized form is phonetic, right?  I presume that Japanese speakers can
) understand each other when conversing by telephone; doesn't this have the same
) level of ambiguity?

In some cases the intonation pattern (accent) may help to disambiguate words
that are otherwise homonyms.  Generally, spoken text has more clues to help
interpretation than written text (sentence melody, pauses, stresses).  And
it tends to be more redundant anyway, and in a telephone conversation the
other party continually signals to you if they are still with you (which
Japanese speakers tend to do much more strongly than English speakers).

Nevertheless, one native Japanese speaker told me that he expected to be
able to figure out the meaning of a technical text written in say hiragana,
on the condition that at least the word boundaries are marked (which is not
done in normal Japanese writing).  It seems to be a matter of preference
rather than of strict necessity.

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

andersa@kuling.UUCP (Anders Andersson) (01/01/70)

In article <1384@ogcvax.UUCP> dinucci@ogcvax.UUCP (David C. DiNucci) writes:
>I do not know how the final Kanji is actually stored, but it could
>conceivably be stored as the sequence of hiragana followed by some
>special index telling which "view" of that sequence should be used
>when displaying the character.  This would seem to take care of some
>of the problems discussed in this group.  It could cause problems if

I'm not sure I understand exactly what problems this solution would
take care of. How long can a hiragana word be? The entire word (plus
the "view" index) would be the key when looking up the display bitmap,
which seems pretty much like using any variable-bytesize character set
to represent Kanji. Is this representation particularly suited to
sorting and/or searching Japanese text?
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)

karl@haddock.UUCP (08/07/87)

[I probably should have included comp.std.internat earlier, but I didn't think
of it.  c.s.internat readers can pick up context from comp.lang.c if desired.]

In article <6216@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>>[For example,] on a bit-addressible machine in an Arabic- or Japanese-
>>language environment, one might have "short char" be 1 bit, "char" be 8,
>>and "long char" be 16.
>
>... I would prefer that a (char) be capable of holding an entire basic
>textual unit, since many applications are already based on that assumption.
>...might as well simply make (char) be the right thing and not introduce a
>new type. ... most international implementations could make (short char)
>8 bits and (char) or (long char) 16 bits.

>>If this is to be phased in without breaking a lot of programs, X3J11 should
>>immediately bless all three names, but insist that they all be the same size.
>>(Which restriction should be deprecated, to be removed in the next standard.)
>
>I don't think it's within the realm of practical politics to say that the
>problem will not be solved until the next issue of the standard.

The problem with your proposal is that it would break existing code that
assumes sizeof(char) == 1.  If a user wants to write a portable program that
refers to objects smaller than 16 bits%, he can't use (short char) because
existing compilers won't accept it, and he can't use (char) because new ones
might make it too big.  That's why I suggested the temporary restriction.
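The existing code in question is easy to exhibit.  A minimal sketch (the
function name is hypothetical, but the idiom is everywhere): strlen() counts
chars while malloc() counts bytes, and the two are interchangeable only while
sizeof(char) == 1:

```c
#include <stdlib.h>
#include <string.h>

/* strlen() yields a count of chars, malloc() wants a count of bytes, and
 * the "+ 1" for the terminator is likewise a char count.  If (char) grew
 * to 16 bits under a byte-counting malloc(), the buffer below would be
 * half the needed size. */
char *dup_string(const char *s)
{
    char *p = malloc(strlen(s) + 1);   /* tacitly: chars == bytes */

    if (p != NULL)
        strcpy(p, s);
    return p;
}

/* The spelling that survives a bigger (char), which existing code almost
 * never bothers with: */
char *dup_string_portable(const char *s)
{
    char *p = malloc((strlen(s) + 1) * sizeof(char));

    if (p != NULL)
        strcpy(p, s);
    return p;
}
```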

Also, in the world of international text processing I don't think we have all
the questions yet, let alone the answers.  I figure X3J11 should take care of
one thing we do know (that "char" as commonly implemented nowadays won't
suffice) and pave the way for a real fix later.

(Hmm.  If I were a Japanese user, using a VAX, and I was told that, because
Japanese characters require more than 8 bits, and because (char) is the
obvious datatype for characters, and because C requires that nothing be
smaller than (char), my compiler couldn't address individual bytes, then I
think I'd start looking for a new vendor or a new programming language.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
%Assuming the implementation allows such an object to exist at all.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/08/87)

In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>The problem with your proposal is that it would break existing code that
>assumes sizeof(char) == 1.

Of course, such code is already broken in the international environment.
In fact, in an 8-bit (char) implementation, such code would continue to
work.  In other words, something has to give for internationalized
implementations; the question is what?  With my proposal,
sizeof(short char)==1, so there could be a transition period during
which implementations would make sizeof(char)==sizeof(short char) until
application source has been cleaned up.  (Some developers have been
careful to not rely on sizeof(char)==1 all along, anticipating the day
when this assumption may have to be changed.)

>If I were a Japanese user, using a VAX, and I was told that, because
>Japanese characters require more than 8 bits, and because (char) is the
>obvious datatype for characters, and because C requires that nothing be
>smaller than (char), my compiler couldn't address individual bytes, then I
>think I'd start looking for a new vendor or a new programming language.

That's why something has to be done.

As I reported recently, X3J11 has agreed in principle with Bill Plauger's
proposal for a typedef letter_t and a few conversion-oriented functions,
but NO library for letter_t analogous to the standard str*() routines.
This necessitates source-level kludgery for any application for which
portability into a multi-byte character environment is a possibility.
I don't like that very much, but since I'm not expecting to sell software
products to the Japanese I'll go along with it if the vendors think it
will fly.  This seems to be another case of not wanting to do things
technically correctly if that would require a radical change to previous
practice.  That's a legitimate concern, of course.

If *I* were a Japanese programmer, I think I'd resent being treated as
a second-class citizen by the programming language.

kent@xanth.UUCP (Kent Paul Dolan) (08/09/87)

While we're developing nightmares about the number of bits the Japanese
need in a char, remember for text processing that for 1 billion of the
earth's residents, the smallest unit of text processing is the ideograph,
and that even 21 bits is probably barely sufficient to represent the number
of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
want 24 bit ones!  ;-)

(Of course, one _could_ always write off the market, but a billion customers
is rather a lot at which to turn up one's nose!)

Kent, the man from xanth.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/10/87)

In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>While we're developing nightmares about the number of bits the Japanese
>need in a char, remember for text processing that for 1 billion of the
>earth's residents, the smallest unit of text processing is the ideograph ...

I'm no expert, but I seem to recall that Chinese ideographs (which
as I understand it come in several varieties) are pretty much made
from a (relatively) small set of basic strokes placed in different
positions.  I think there are even Chinese typewriters, or at least
type compositors.  If this is correct, then one possibility would
be to devise a suitable (acceptable to technical Chinese)
representation for ideographs in terms of basic strokes and
placement instructions, which could be treated as text units.

After all, the letter "w" doesn't mean much when taken out of
English context; we too need the whole word-symbol, not just a
letter-component to express a meaning.  It's just that our
alphabet is simpler and is combined in 1 dimension instead of 2.

lambert@cwi.nl (Lambert Meertens) (08/10/87)

In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
) 
) While we're developing nightmares about the number of bits the Japanese
) need in a char, remember for text processing that for 1 billion of the
) earth's residents, the smallest unit of text processing is the ideograph,
) and that even 21 bits is probably barely sufficient to represent the number
) of written words in Chinese.

Are you suggesting that there are more than 2**20 = 1048576 different
written words in Chinese?  At typically 60 entries on a page, their
dictionaries must have then some 17500 pages or more.  I think that 16 bits
are enough to accommodate all Chinese characters, and certainly ample for
the about 5000 that are in actual use.

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

dougg@vice.TEK.COM (Doug Grant) (08/10/87)

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> 
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
> want 24 bit ones!  ;-)

Great idea, Kent!  But with so many characters and attributes in common
usage, even 32 bits isn't enough for everyone to communicate exactly what
they mean.

I would like to propose an ASCII-compatible 64-bit character set (really!).

Here's my suggestion for how to divvy up the bits:

	24 bits - character
	 8 bits - font
	 8 bits - size
	 8 bits - color
	 4 bits - intensity (boldness)
	 2 bits - blink rate (00 = don't blink)
	 1 bit  - normal/reverse
	 8 bits - sync
	 1 bit  - left over - any suggestions?

Here's how it would be ASCII compatible:

	The eighth bit of the first byte received would be used as an
	ASCII/extended character set flag.  If it is zero, the character
	is normal 7-bit ascii.  If it is 1, the next seven bytes are used
	to complete the eight-byte character.  Only the eighth bit of the first
	byte is set to one - the eighth bit of the remaining seven bytes
	is set to zero, thus assuring that when "Extended Character Set"
	characters come in, their bytes can be kept in sync.
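	As a sketch (my reading of the framing rule above, not an
	implementation of any standard), the resynchronization property
	can be written as a short C routine: from any position in a
	buffer, scanning back at most seven bytes for a high-bit lead
	byte finds the start of the character.

```c
#include <stddef.h>

/* Find the start of the character containing buf[pos], under the framing
 * rule above: a byte with the high bit set begins an eight-byte extended
 * character; its seven continuation bytes, like plain ASCII bytes, have
 * the high bit clear.  Scanning back at most seven bytes for a lead byte
 * therefore resynchronizes from any position. */
size_t char_start(const unsigned char *buf, size_t pos)
{
    size_t back;

    for (back = 0; back <= 7 && back <= pos; back++)
        if (buf[pos - back] & 0x80)
            return pos - back;   /* inside (or at) an extended character */
    return pos;                  /* a plain 7-bit ASCII byte */
}
```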

	For those who say "but too much bandwidth would be used for
	64-bit characters!" I say hang on - fiber optic communications
	are coming!

	I'd sure like to see one standard character set that can
	accommodate the whole world!

Doug Grant
dougg @vice.TEK.COM

disclaimer:  These opinions are my own, but my employer is welcome
	     to adopt them.

guy%gorodish@Sun.COM (Guy Harris) (08/11/87)

> Are you suggesting that there are more than 2**20 = 1048576 different
> written words in Chinese?  At typically 60 entries on a page, their
> dictionaries must have then some 17500 pages or more.  I think that 16 bits
> are enough to accommodate all Chinese characters, and certainly ample for
> the about 5000 that are in actual use.

According to a document called "USMARC Character Set: Chinese Japanese Korean",
from the Library of Congress, Washington, a 24-bit character was developed to
"represent and store in machine-readable form all the Chinese, Japanese, and
Korean characters used with the USMARC format."

It says that the character sets incorporated into this character set (the
RLIN - Research Libraries Information Network - East Asian Character Code, or
REACC) are:

	+ *Symbol and Character Tables of Chinese Character Code for
	  Information Interchange*, vol. 1 and 2 (2nd ed., Nov. 1982) and
	  *Variant Forms of Chinese Character Code for Information Interchange*
	  (2nd ed., Dec. 1982) (CCCII)  Editor:  The Chinese Character Analysis
	  Group.  Total:  33,000 characters.

	  REACC contains all of the 4,807 "most common" Chinese characters in
	  volume 1 (as listed by the Ministry of Education in Taiwan) and about
	  5,000 of the 17,000 characters taken from a compilation of data from
	  different computer centers (mostly personal names) in volume 2.
	  REACC also contains about 3,000 of the approximately 11,000
	  characters in the CCCII *Variant Forms*, which lists PRC simplified
	  forms and other variants, some of which are also used in modern
	  Japanese.

	+ *Code of Chinese Graphic Character Set for Information Interchange
	  Primary Set:  The People's Republic of China National Standard* (GB
	  2312-80) (1st ed., 1981).  Total:  6,763 characters.  All the
	  characters in this set are in REACC.

	+ *Code of the Japanese Graphic Character Set for Information
	  Interchange:  Japanese Industrial Standard* (JIS C 6226)  (1983).
	  Total:  6,349 characters.  All the characters in this set are in
	  REACC.

	+ *Korean Information Processing System* (KIPS).  Total: 2,392 Chinese
	  characters and 2,058 Korean Hangul.  Chinese characters in this set
	  are in REACC; all hangul are also incorporated in REACC, as well as
	  some hangul *not* in KIPS.

One characteristic of this character set is that it tries to permit a simple
rule to get the codes for various variant forms of characters from the code for
the traditional form of the character.

So, while you can probably stuff the major Chinese characters into 16 bits (the
CCCII, including variant characters, contains 33,000 characters), you may not
want to.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

howard@COS.COM (Howard C. Berkowitz) (08/11/87)

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
> want 24 bit ones!  ;-)


I worked at the Library of Congress in the late 70's, and was 
responsible for the hardware and systems software aspects of
experimental terminals for the 140 or so fonts (700 or so
languages and dialects) in which the Library has materials.

Chinese, of course, was the nightmare.  Several authorities
said we should assume about 50K distinct ideographs, but the
language scholars in the Orientalia Division said 100K was
a more correct number.  When the outside experts challenged
this, saying that the additional 50K appear in only esoteric
documents used by very specialized scholars, Orientalia responded
with "who do you think use the Orientalia collection at the
Library of Congress?"

It developed, however, that the Chinese ideograph problem could
be simplified.  While there are a very large number of distinct
ideographs, these ideographs are composed of a much smaller
(<100) number of superimposed radicals.  Chinese dictionaries
use radicals as a means of lexical ordering.  

While I am out of touch with current research, it was felt at
the time that Chinese (and full Japanese Kanji) could be approached
by using a mixture of codes for common ideographs and escapes
to strings of radicals (to be superimposed), or purely by
radical strings.

When discussing the Oriental language problem, do distinguish
the linguistic problem of ideograph uniqueness from the graphic
problem of ideograph display.  This differentiation is similar
to the difference between a code and a cipher.

-- 
-- howard(Howard C. Berkowitz) @cos.com
 {seismo!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER:  I explicitly identify COS official positions.

john@frog.UUCP (John Woods, Software) (08/12/87)

In article <34@piring.cwi.nl>, lambert@cwi.nl (Lambert Meertens) writes:
I>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
N>)While we're developing nightmares about the number of bits the Japanese
C>)need in a char, remember for text processing that for 1 billion of the
L>)earth's residents, the smallest unit of text processing is the ideograph,
U>)and that even 21 bits is probably barely sufficient to represent the number
D>)of written words in Chinese.
E> 
D>Are you suggesting that there are more than 2**20 = 1048576 different
 >written words in Chinese?  At typically 60 entries on a page, their
T>dictionaries must have then some 17500 pages or more.  I think that 16 bits
E>are enough to accommodate all Chinese characters, and certainly ample for
X>the about 5000 that are in actual use.
T> 
In the English dictionary that the documentation department here uses, there
are 320,000 words.  I am told that the Oxford English Dictionary has
approaching 1,000,000 words, and that the total English language has just
over 1,000,000 words.  Chinese is probably about the same.

I can see asking the Chinese to adopt some limited alphabet scheme (such as
Romaji used by the Japanese (if I remember correctly, a 3-Roman-character
spelling for each syllable of Kanji), or perhaps Roman phonetic spelling),
but telling them that some microscopic fraction of their language has to be
selected for interaction with computers is just flatly bogus.

(a side note to provoke more chuckles than thought:  are ideographs the CISCs
of language?  Perhaps that makes Morse code the RISC...)

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"The Unicorn is a pony that has been tragically
disfigured by radiation burns." -- Dr. Science

henry@utzoo.UUCP (Henry Spencer) (08/12/87)

>	 8 bits - color

Surely you jest.  Any color-graphics type will tell you that you need at
least 24 bits, maybe 36 or 48. :-)

More seriously, your all-inclusive scheme falls down like this in several
areas.  8 bits may not be enough for a font size when things like fractional
sizes come in (yes, there are fractional sizes).  8 bits certainly is not
enough for a font in demanding applications -- ever looked at a font catalog?
Finally, it's not common to change things like color and font from one
character to the next (unless one is a Mac user intoxicated with the joy of
font changing, sigh...), so a lot of those bits are being wasted.  Better
to use some sort of font-switch (etc.) sequences, simultaneously giving
more compact coding and more flexibility.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

peter@sugar.UUCP (Peter da Silva) (08/13/87)

In Japan programming languages are the least of the problems their written
language causes them. An incredible amount of data is never stored anywhere
but on the original form, photocopies of said form, or faxed copies of said
form. Even with the best tools available it's just too hard to keypunch.

This, of course, makes it even more amazing that they have been so successful
in the world community. It seems likely to me, though, that at some point
they're going to have to break down and drop Kanji for professional use.
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter (I said, NO PHOTOS!)

pom@under..ARPA (Peter O. Mikes) (08/13/87)

To: henry@utzoo.UUCP
Subject: Re: What is a byte
Newsgroups: comp.lang.c,comp.std.internat
In-Reply-To: <8404@utzoo.UUCP>
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP>
Organization: S-1 Project, LLNL
 
In article <8404@utzoo.UUCP> you (Henry Spencer) write:
>
>font changing, sigh...), so a lot of those bits are being wasted.  Better
>to use some sort of font-switch (etc.) sequences, simultaneously giving
>more compact coding and more flexibility.

  This is an IMPORTANT IDEA :-|.  Consider the so-called "daisy sort": the
  sequence of characters on a printwheel is optimized, using the frequency
  of bigrams in the English language, so that characters which are frequent
  neighbours sit near each other (which makes for a faster printer).  NOW,
  if I recall correctly, about 90% of movements are within a ten-spoke
  distance, and (another statistical fact) the special symbols and capitals
  are so rare that their spacing is irrelevant (except that digits tend to
  follow digits, so you place all digits next to each other).

    => It is very wasteful to store English text using ASCII.

  ergo: there are really just a few "rational alternatives" for storing text:

    1) 4 bits: sign + 3-bit distance in the sort (of an imaginary standard
       printwheel), with one code (0+000) reserved to mean: the following
       4-bit word has another meaning (e.g. a long jump), or: jump to
       another subset of the character set (switch UPPER/lower case,
       digits + arithmetic signs, carriage motion controls, ...).

    2) 6 bits: 1-bit sign + 1 bit (distance/font switch) + 4 bits (either
       the distance to the next character within the given sort, or one of
       16 other subfont sorts).

    3) ...

  Naturally, languages such as C would have different statistics and should
  probably merit a special sort, which would be marked by a (six-bit?) code
  at the beginning of the file/document (since the unix command 'file'
  would not work {it does not work too well anyway}), specifying the type
  of the file, i.e. the appropriate daisy sort: english text, numerical
  data, PostScript file, c-source, ...

    => It is ALSO very wasteful to store numerical data sets using ASCII.

  Of course, in the numerical-data character subset we need characters for
  overflow and undefined (NaN, infinity, missing data point, end-of-file,
  end-of-row = end-of-vector, another-data-set, ...), and characters for
  the decimal point/comma, E, and a triplet separator, so that I can write

    6_234,567 = 6_234.567 = 0.623_456_7E4

  to mean six thousand two hundred thirty-four point five six seven.  (The
  decimal comma (European way) is preferred by the ISO SI standard, while
  the decimal point (US way) is tolerated.)  And the current ISO triplet
  separator (namely blank, i.e. 1 000 for one thousand) MUST be changed,
  since blank is used in parsing.

  Perhaps 1_000 = E3 (and 10 = E3.101?), or 1:000 = 1E3 (with / only being
  used for division?).

  Actually, for speed of parsing it would be highly preferable to AVOID
  alphabetic separators (. and ,) and letters to express numbers.  Perhaps
  we can write 3:456::4 to express three thousand four hundred fifty-six
  and four tenths, and perhaps 1:+3 = 1:000 (1E3) and 5:-3 for .005 (5e-3)?

  In any case, we should be able to express all numbers using sixteen
  digit-type characters: + -, 0..9, (decimal sign), (exponent sign)
  (that's 13 or 14), and then perhaps either | or { } for c-style sets,
  and one triplet separator (e.g. : or _, but not blank).  We could then
  represent infinity as ::: or +++ and NaN as +_+, etc.

  Anyway, I just wanted to say that Henry's pertinent reminder that
  character sets and the grouping of characters into sets (or subfonts)
  affect the compactness of information storage really points the way to
  an objective measure of the suitability of different coding methods for
  different uses - and that several categories of use, namely

    1) english text, or just any plain text (i.e. prose)   (4 or 6 bits)
    2) numerical data sets (i.e. number or point sets)     (4 bits)
    3) c, or just "any programming language"
    4) carriage motions (tabs, form feeds, cursor addressing, ...) and
       modifiers (highlight, underline, typography, ...)

    are frequent enough and universal enough to merit their own
    character families or subfonts, binary representations,
    and an international standard.
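  As a toy illustration of the distance idea in C (using the familiar
  "etaoin shrdlu" frequency order as a stand-in; a real daisywheel sort
  would be optimized for bigram adjacency, and any particular wheel here
  is hypothetical):

```c
#include <string.h>

/* A hypothetical "wheel": the 26 lowercase letters in rough English
 * frequency order. */
static const char wheel[] = "etaoinshrdlucmfwypvbgkqjxz";

/* Signed spoke distance from lowercase letter a to lowercase letter b,
 * normalized to the shorter way around the 26-spoke wheel.  Inputs are
 * assumed to be lowercase letters. */
int wheel_delta(char a, char b)
{
    int pa = (int)(strchr(wheel, a) - wheel);
    int pb = (int)(strchr(wheel, b) - wheel);
    int d = pb - pa;

    if (d > 13)
        d -= 26;
    else if (d < -13)
        d += 26;
    return d;
}
```

  The claim above amounts to saying that for English text most of these
  deltas fit in a sign bit plus three bits of magnitude.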

henry@utzoo.UUCP (Henry Spencer) (08/14/87)

> In the English dictionary that the documentation department here uses, there
> are 320,000 words.  I am told that the Oxford English Dictionary has
> approaching 1,000,000 words, and that the the total English language has just
> over 1,000,000 words.  Chinese is probably about the same.

Remember that the OED includes an awful lot of words that are obsolete or
terminally obscure by anyone's standards.  It is not a dictionary of current
English.

I would also wonder about the assumption that Chinese would be about the
same size.  I have heard it said that English has the largest vocabulary of
any human language by a wide margin, because of its dominant position and
its unusually extensive borrowing from other languages.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

devine@vianet.UUCP (Bob Devine) (08/15/87)

>   The so called 'Daisy Sort' [...]    points a way to an
>   objective measure of suitability of different coding  methods for different
>   uses - and that several categories  of use [text, programming code]
>   are frequent enough and universal enough  to merit their own 
>   character families or subfonts, binary representations
>   and an international standard.

  Unfortunately, it would be very difficult to write any general
text manipulating programs.  With ASCII encoding it is very easy
to yank sections out of random files and assemble a consistent
file.  The problems of dealing with a mixed-mode file will quickly
eliminate advantages gained in the single-mode case.  Likewise
consider how a program like 'grep' would operate; it would need
a switch to handle different representations of the same string.

oster@dewey.soe.berkeley.edu (David Phillip Oster) (08/15/87)

In article <8409@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Remember that the OED includes an awful lot of words that are obsolete or
>terminally obscure by anyone's standards.  It is not a dictionary of current
>English.

That's part of the point. Would you support an encoding scheme that
prevented me from using English documents, even those containing
obsolete or obscure words, on my computer? Well if we are going to
standardize on an encoding for Chinese, it should be able to cover ALL
of Chinese.  There is no reason why we couldn't use a huffman encoding
scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
pattern is a filler, and the 16th pattern means that the next byte
encodes the 254 next most common ideograms, the 255 bit pattern
meaning that the next 16-bit word had the 65534 next most common, and
so on.  

That way, the average length of a run of chinese text is
likely to be about 10 bits per ideogram, and any single ideogram would
have canonical 64 bit representation: its bit pattern in the left of
the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
patterns and padded out with filler nybbles.
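As a sketch, the escape scheme decodes like this in C (the frequency tables
are hypothetical, so codes here are just ranks in the assumed frequency
order, and the 8-bit value 254 is left unused for simplicity):

```c
#include <stddef.h>

/* Pull the next 4-bit code from a nybble stream, high nybble of each
 * byte first; *bitpos advances by 4 and stays a multiple of 4. */
unsigned next_nybble(const unsigned char *buf, size_t *bitpos)
{
    unsigned v = (buf[*bitpos / 8] >> (4 - *bitpos % 8)) & 0x0f;

    *bitpos += 4;
    return v;
}

/* Decode one code from the stream, returning its rank in the assumed
 * frequency order (0 = most common ideogram), or (unsigned long)-1 for
 * the filler code. */
unsigned long decode_one(const unsigned char *buf, size_t *bitpos)
{
    unsigned v = next_nybble(buf, bitpos);
    unsigned long code;
    int i;

    if (v < 14)
        return v;                   /* the 14 most common ideograms */
    if (v == 14)
        return (unsigned long)-1;   /* filler */

    /* v == 15: escape to an 8-bit code */
    code = (unsigned long)next_nybble(buf, bitpos) << 4;
    code |= next_nybble(buf, bitpos);
    if (code < 254)
        return 14 + code;           /* the next 254 most common */

    /* code == 255: escape to a 16-bit code (254 left unused here) */
    code = 0;
    for (i = 0; i < 4; i++)
        code = (code << 4) | next_nybble(buf, bitpos);
    return 14 + 254 + code;         /* the next 65534; further escapes
                                     * would continue the same pattern */
}
```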

 
Now, all we have to do is pick an ideogram frequency standard.  Say,
this idea would also work for English. Assuming that the average
English word takes 6*8 bits (average length of 5 + terminating space
* 8 bit ascii) you could cut the disk space required for computer
storage by a factor of close to 5 by using this encoding scheme. Too
bad that you'd have a mammoth word list in main memory to unpack it
speedily. Might be a nice way to increase the effective bandwidth of
all those modems pushing UseNet around though.
--- David Phillip Oster            --My Good News: "I'm a perfectionist."
Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
Uucp: {seismo,decvax,...}!ucbvax!oster%dewey.soe.berkeley.edu

guy@gorodish.UUCP (08/16/87)

> In Japan programming languages are the least of the problems their written
> language causes them. An incredible amount of data is never stored anywhere
> but on the original form, photocopies of said form, or faxed copies of said
> form. Even with the best tools available it's just too hard to keypunch.
> 
> This, of course, makes it even more amazing that they have been so successful
> in the world community. It seems likely to me, though, that at some point
> they're going to have to break down and drop Kanji for professional use.

I don't know about that.  More and more machines are adding support for Kanji.
There are a large number of Japan-only (Japan-mostly?  I seem to remember Jun
Murai saying these groups were forwarded to Carnegie-Mellon) newsgroups in which
most of the traffic is in Japanese, represented in Kanji.  (He said they added
Kanji support to X.10, including a "jterm" variant of "xterm" that emulated a
Kanji terminal.)  The NEC PC also includes Kanji support; it is often used as a
Kanji terminal.

These machines may not be able to handle every single Kanji character, but the
90/10 rule may apply.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

lambert@mcvax.UUCP (08/16/87)

[I have removed comp.std.c from the Newsgroups line and added sci.lang.japan]

In article <479@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
) This, of course, makes it even more amazing that they have been so successful
) in the world community. It seems likely to me, though, that at some point
) they're going to have to break down and drop Kanji for professional use.

There seems to be a good reason for not doing this: after romanization,
words written differently in Kanji may become the same.  Although
ambiguities caused by homonymy occur in all languages (like English "drill"
= 1. [the use of] a tool for boring holes, metaphorically also boring
exercise; 2. [[a tool for] sowing in] a furrow; 3. twilled cotton; 4. a
baboon), these seem nothing compared to what the Japanese would face.

For example, the word "kanji" itself can mean: 1. feeling, sensation,
impression; 2. Kanji (Chinese character); 3. manager, secretary;
4. inspector, superintendent; 5. smilingly.  These are all written
differently now.

A particularly bad example: "ko^ka" = 1. Faculty of Engineering;
2. consideration of services; 3. a high price; 4. an official price;
5. overhead, elevated; 6. merits and demerits; 7. effect, efficiency;
8. descent, fall; 9. the marriage of an Imperial princess to a subject;
10. mineralization; 11. colloid degeneration, gelatination; 12. hardening,
cementation, vulcanization, stiffening; 13. hard money, cash; 14. a leave
of absence; 15. taxes; 16. an evil effect; 17. the Yellow Peril; 18. an
unfortunate slip of the tongue; 19. amalgamation; 20. a school song; 21. a
high-rise building.

This high degree of ambiguity is the combined result of two characteristics
of Japanese.  One is that there are say 1850 Kanji characters in common
use, each having an independent semantic content and usually a one-syllable
"reading", the so-called On reading, derived from the Chinese
pronunciation.  There may be more than one On reading and there are some
bisyllabic On readings.  There is also a Kun (original Japanese) reading,
which is completely unrelated (like On = "chu^", Kun = "hiru"), and which
more often than not is polysyllabic, but most single syllables occur as a
Kun reading.  I haven't counted them, but say there are about ninety
syllables for readings of these 1850 characters, so typically a single
syllable may be the reading of 20 different characters.  The second
characteristic that is important here is the ease with which compound words
are formed in Japanese, often by stringing some Ons together.  Thus, all
the different "ko^ka"s above are the result of combining a highly ambiguous
"ko^" with a highly ambiguous "ka", and there are hundreds of other
potential meanings for this compound than the few given above (culled from
a dictionary).  Written in Kanji, there is no ambiguity.

Not all words are that ambiguous if spelled in Romaji, but glancing through
my dictionary I estimate that about one third to one half of the entries
have the same romanization as another entry, and the number of clusters of
four or five homonymous entries may be as many as one thousand (I find at
least one on almost every page of its 1000 pages, sometimes two or three).

It may be that I am overestimating the problem and that the context would
suffice well enough to disambiguate romanized Japanese to make it
acceptable for professional use.  Perhaps a Japanese reader of this article
may care to comment.

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

henry@utzoo.UUCP (Henry Spencer) (08/17/87)

> >Remember that the OED includes an awful lot of words that are obsolete or
> >terminally obscure by anyone's standards.  It is not a dictionary of current
> >English.
> 
> That's part of the point. Would you support an encoding scheme that
> prevented me from using English documents, even those containing
> obsolete or obscure words, on my computer?...

Depends.  The current encoding scheme (ASCII) already prevents you from
using English documents beyond a certain age -- the English alphabet has
changed!  Just try typing in a document that uses the thorn (vertical
stroke with a loop on the side) or the long 's' (like an integral sign).
The lack of these symbols is a real nuisance to certain scholars, but
somehow it doesn't interfere with most uses of ASCII.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

dinucci@ogcvax.UUCP (David C. DiNucci) (08/17/87)

<Rain-iitaa, tabenasai>

In article <piring.51> lambert@cwi.nl (Lambert Meertens) writes:
>
>In article <479@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>) This, of course, makes it even more amazing that they have been so successful
>) in the world community. It seems likely to me, though, that at some point
>) they're going to have to break down and drop Kanji for professional use.
>
>There seems to be a good reason for not doing this: after romanization,
>words written differently in Kanji may become the same.  Although

A fairly important breakthrough was made in the
area of Japanese word processing some years ago when it was realized
that characters could be translated from a phonetic alphabet to Kanji
after entering the word.  Japanese word processors now allow the
user to enter a word as one or more hiragana (one of the phonetic
alphabets with only about 55 basic characters), then at the touch of a
key, cycle through the Kanji corresponding to that pronunciation,
starting with the most commonly used.  The user stops when the desired
Kanji appears, then continues with the next word. 

I do not know how the final Kanji is actually stored, but it could
conceivably be stored as the sequence of hiragana followed by some
special index telling which "view" of that sequence should be used
when displaying the character.  This would seem to take care of some
of the problems discussed in this group.  It could cause problems if
the "reference" dictionary (conversion tables from kana to Kanji)
changed, but I don't think this is likely to happen often.  A standard
set of reference tables would need to be adopted (if it hasn't already
been) if this were to actually be used as a data interchange format.

A Chinese word processor would be a different story, however, since
I do not believe Chinese has a phonetic alphabet like Japanese's
hiragana and katakana.

More info about a specific Japanese word processor is available to us
English-only readers by reading a review of one for the Mac a year or
so ago (I don't remember exactly which magazine it was in).

Disclaimer:  I know only a little nihongo, but my wife is a native
Japanese, and often uses a Japanese word processor made by Fujitsu.
-- 
Dave DiNucci                    UUCP:  ..ucbvax!tektronix!ogcvax!dinucci
Oregon Graduate Center          CSNET: dinucci@Oregon-Grad.csnet
Beaverton, Oregon

tony@artecon.UUCP (08/18/87)

In article <25736@sun.uucp>, guy%gorodish@Sun.COM (Guy Harris) writes:

>In article Peter Da Silva writes:

> > In Japan programming languages are the least of the problems their written
> > language causes them. An incredible amount of data is never stored anywhere
> > but on the original form, photocopies of said form, or faxed copies of said
> > form. Even with the best tools available it's just too hard to keypunch.
> > 
> > This, of course, makes it even more amazing that they have been so successful
> > in the world community. It seems likely to me, though, that at some point
> > they're going to have to break down and drop Kanji for professional use.
> 
> I don't know about that.  More and more machines are adding support for Kanji.
> There are a large number of Japan-only (Japan-mostly?  I seem to remember Jun
> Murai saying these groups were forwarded to Carnegie-Mellon) newsgroups in which
> most of the traffic is in Japanese, represented in Kanji.  (He said they added
> Kanji support to X.10, including a "jterm" variant of "xterm" that emulated a
> Kanji terminal.)  The NEC PC also includes Kanji support; it is often used as a
> Kanji terminal.
> 
> These machines may not be able to handle every single Kanji character, but the
> 90/10 rule may apply.
> 	Guy Harris

Yes, it is true that Kanji is getting more support.  Hewlett-Packard has
a new drafting plotter (HP-7595) which has a Kanji option.  The form of
specification is that when you invoke the Kanji font, you go into a two-byte
mode.  That is, it takes two bytes to specify one Kanji character.  Control
bytes are used as control bytes, but the 94 printing bytes are used in the
Kanji specification.  So, 94 * 94 = 8836 different characters you can use.
This is a good way of doing it since you never know how your OS is going
to muck with control codes or full 8-bit binary data going to I/O devices.
I believe that this is a fairly standard way of doing this for printers.

8836 may not seem like a lot of Kanji (which is known to go to about 50000 in
Japanese), but only 1850 are needed to graduate from high school, and usually
about 3000 are used in college texts.  There are two "JIS" standards set
by the Japanese Ministry of Education.  JIS level 1 is about 3000 characters
(including the basic 1850, KataKana, HiraGana, English alphabet, Cyrillic,
Greek, and special symbols), and JIS level 2 is about 8000 (including the
3000 JIS level 1).  As a rule, one is supposed to try to stick to JIS level 1,
but use JIS level 2 for proper names and just a few other exceptions.

So, in reply to above: 

	1) You may not be able to handle all 50000 Kanji, but JIS level 2 
	   is more than enough, 

	2) It really isn't that difficult to implement because:

		a)  It is a well defined font, accessed easily in two-byte
			sequencing (you don't even need 8-bits, 7 will do)

		b)  You can get already masked ROMS which contain Kanji
		    in a rasterized form for raster printers.

		c)  The Japanese are more than happy to help you implement
		    Kanji in your products.  They will digitize Kanji for
		    whatever reasonable form you need it.

-- Tony

BTW, I am not Japanese...but.."I think I'm turning Japanese, I really think so!"

"Konnichi-wa"
-- 
**************** Insert 'Standard' Disclaimer here:  OOP ACK! *****************
*  Tony Parkhurst -- {hplabs|sdcsvax|ncr-sd|hpfcla|ihnp4}!hp-sdd!artecon!adp  *
*                -OR-      hp-sdd!artecon!adp@nosc.ARPA                       *
*******************************************************************************

john@frog.UUCP (John Woods, Software) (08/18/87)

In article <8409@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
(and many others as well)
>>[English has] over 1,000,000 words.  Chinese is probably about the same.
> 
Many people (including Henry) have pointed out that (A) English is larger
than most languages (having borrowed "one of everything" from everyone), and
(B) Chinese ideographs are not one-per-word, but one-per-concept (hence most
words are two or more ideographs).  So, I went back to the source I first
read about this topic in:  "Multilingual Word Processing", Joseph D. Becker,
Scientific American July 1984.

In this article, he doesn't give an actual count of Chinese ideographs (just
the statement "tens of thousands"), but in the "flexible encoding" he and
other Xerox denizens developed (using alphabet shift-codes), to encode Chinese
you send the "shift superalphabet (for 16 bit characters)", the 8-bit "super-
alphabet number", and then 16-bit character sequences.  "The main
superalphabet, designated 00000000, is all one needs except for very rare
Chinese characters."  A little later in the article is the implication that
about 7000 ideographs are "commonly seen" in Chinese publishing.

So, there we have it:  not as bad as I thought, but still indicating that
8 bits is woefully inadequate.

Also, I seem to have slipped up in my understanding of Kanji:  Kanji is the
set of Chinese ideographs borrowed by the Japanese, of which "about 3500"
are in common use (and the number is declining).  The phonetic letters (which
can spell words in entirety, and are used to indicate grammatical endings for
Kanji roots) are collectively called "kana", and come in two sets:  "hiragana"
and "katakana" (it is probably more complicated than that, but that is about
all the article gives).  There used to be Kanji "typewriters" which scarcely
anyone used (using several hundred keys); now, computerised systems exist in
which one can type phonetic hiragana symbols (or, for those who prefer, the
Romaji phonetics), and press a "lookup key" to have the computer turn the
just-typed word into proper Kanji.

The Bibliography in that Scientific American says the following publications
may be helpful:

_Writing Systems of the World_, Akira Nakanishi.  Charles E. Tuttle, 1980.
"A Historical Study of Typewriters and Typing Methods:  From the Position
of Planning Japanese Parallels", Hisao Yamada in _Journal of Information
Processing_, Vol. 2, No. 4, pp 175-202; February, 1980.

Can we all now consider the statement "7 bits is enough" most sincerely dead?

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

"The Unicorn is a pony that has been tragically
disfigured by radiation burns." -- Dr. Science

karl@haddock.ISC.COM (Karl Heuer) (08/19/87)

In article <51@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
>There seems to be a good reason for [using Kanji]: after romanization,
>words written differently in Kanji may become the same.  [Gives examples of
>Japanese homonymy]

The romanized form is phonetic, right?  I presume that Japanese speakers can
understand each other when conversing by telephone; doesn't this have the same
level of ambiguity?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

peterg@felix.UUCP (Peter Gruenbeck) (08/20/87)

---------------
Disclaimer: I may not know what I'm talking about. Batteries not included.
---------------

I have difficulty getting used to the idea of a 32 bit byte. What would
happen to the nybble (half a byte - get it?)? Would we be biting off
more than we could chew? I would be in favor of leaving a byte as
8 bits and using the term WORD to represent a unit of addressable
memory. That way we limit the confusion of how many bits something
has to a term which is already confusing.

For example:
   Many small computers (6502, 808x, 68000) have a word = 8 bits.

   Some older mainframe systems (IBM 370, Cyber, DEC) have word
   lengths of 32 bits, 12 bits, 60 bits.

   Some specialized engines (Itel 370 compatible) have 80 bit
   words for the microcode interpreters. Also, some PC ram disks
   may be considered to have 128 byte words since that is what you
   address and then take the rest sequentially.

New machines to handle the complex multi-language problem may
have a 32 bit word if that is what a character takes. Maybe we should
define a new term called a CHAR to define the number of bits required
to represent a character.  

I'm told I'm quite a character myself at times. I suspect you'll need
more than 32 bits to define me (I hope). This is not to say that
this is the final WORD.

mouse@mcgill-vision.UUCP (08/23/87)

In article <6252@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
[stuff about assuming sizeof(char)==1]
[>> uses an example of a Japanese programmer having problems]

> If *I* were a Japanese programmer, I think I'd resent being treated
> as a second-class citizen by the programming language.

If you insist on taking a language designed in the English world for
the English world and using it in Japan, it wouldn't surprise me a bit
if it made a poor showing.

Why do we all assume that C must be twisted and bent to fit the
international environment?  Are there *no* computer languages designed
by Japanese for a Japanese environment (or Chinese or Arabic or Hindi
or etc)?  Perhaps it is time for one.

(Not that I have anything against extending C to such an environment; I
like C too.  But it's beginning to look as though the result of such
attempts "ain't C", to coin a phrase.)

					der Mouse

				(mouse@mcgill-vision.uucp)

guy@gorodish.UUCP (08/31/87)

> Why do we all assume that C must be twisted and bent to fit the
> international environment?

Gee, *I* don't assume that.  Making the language support comments in foreign
languages doesn't seem too hard; with some more work (and cooperating linkers -
I suspect the UNIX linker can handle 8-bit characters in symbol names) you
could even have it support object names in foreign languages (but then again, a
hell of a lot of object names are in a foreign language NOW; quick, in what
natural language is "strpbrk" a word?).  It might even be possible to support
*keywords* in foreign languages - I'm told there are compilers for some
languages that do this - but the trouble with C is that a lot of "keywords" are
routine names, and it'd be kind of a pain to put N different translations of
"exit" into the C library.

As for making programs written in the language support foreign languages,
there are no massive changes to C required here, either.  Most of the support
can be done in library routines.  It is not *required* that "char" be increased
in size beyond one byte to support other languages, nor would it be *required*
that "strcmp" understand the collation rules for all languages.

> Are there *no* computer languages designed by Japanese for a Japanese
> environment (or Chinese or Arabic or Hindi or etc)?

Not that I know of.  There may be some, but I suspect they are VERY minor
languages.

> Perhaps it is time for one.

The trouble is that "one" wouldn't be enough!  You'd need languages for
Japanese *and* Chinese *and* Korean *and* Arabic *and* Hebrew *and* Hindi
*and*... if the languages were really designed for *that particular* language's
environment.

If this were the *only* way to write programs that can handle those languages,
you would have to write the same program N times over for all those
environments.  If you wanted your system to be sold in all those different
environments, you would either have to supply compilers for *all* those
languages or make the compiler suite be something selected on a per-country
basis.  I can't see how this would do anything other than impose costs that far
outweigh the putative benefits of such a scheme.

> (Not that I have anything against extending C to such an environment; I
> like C too.  But it's beginning to look as though the result of such
> attempts "ain't C", to coin a phrase.)

No, the "ain't C" phrase is properly applied only when something contradicts
some C standard, or perhaps when it grossly violates assumptions made by
reasonable C programmers.  There may be some problems with changing the mapping
between "char"s and bytes (problems caused by C's unfortunate conflation of the
notion of "byte", "very small integral type", and "character" into the type
"char" - they should have been separate types), but I see no contradiction or
gross violation in the internationalization proposals I've seen.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

gwyn@brl-smoke.UUCP (08/31/87)

In article <866@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
>Why do we all assume that C must be twisted and bent to fit the
>international environment?

First, I am not proposing that C be "bent and twisted".  I think it is
possible to cleanly address the needs of the international programming
community.  (I think my proposal for this was much cleaner than the one
that is likely to be adopted, but at least you can ignore the latter if
you're sure that you don't need to worry about such matters.)

Second, if you rephrase the question "Why do we have to consider the
international community?", the answer is:  Because ISO or JIS will come
up with something for THEIR version of the C standard if we don't come
up with it for OURS.  Having different standards, particularly if one
of them is likely to be unappealing to us, is a situation to be avoided.

You should also observe that most large companies you think of as U.S.-
based actually have a significant percentage of their market overseas.
They certainly feel the need for international standards.

peter@sugar.UUCP (Peter da Silva) (09/02/87)

> languages that do this - but the trouble with C is that a lot of "keywords" are
> routine names, and it'd be kind of a pain to put N different translations of
> "exit" into the C library.

Simple...

#include <japan.h>
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
--                  U   <--- not a copyrighted cartoon :->