[comp.lang.c] What is a byte

dupuy@amsterdam.columbia.edu (Alexander Dupuy) (01/01/70)

In article <20131@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP
 (David Phillip Oster) writes:

>  There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>meaning that the next 16-bit word had the 65534 next most common, and
>so on.  
>
>That way, the average length of a run of chinese text is
>likely to be about 10 bits per ideogram, and any single ideogram would
>have a canonical 64-bit representation: its bit pattern in the left of
>the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
>patterns and padded out with filler nybbles.

This underscores the central tradeoff in a code for Chinese or Chinese/Japanese
- compact representation to save disk space versus consistent (same character
size) representation for processing.

But there is really no reason we have to trade these off against each other.
We can just define a consistent representation for processing (24 or 32 bits
will suffice - I don't think we need 64) and use a compression algorithm
(Lempel-Ziv, Huffman, whatever, as long as it's standard, and not too expensive
to decode/encode) when we aren't manipulating individual characters.  Some
languages even have rudimentary forms of support for this (packed array of char
vs. array of char in Pascal).
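
To make that concrete, a sketch in C, where the 32-bit unit and all the
names are mine, and a trivial byte-packing routine stands in for the real
compression pass that would run on disk I/O:

#include <stddef.h>

typedef unsigned long textunit;            /* fixed 32-bit processing unit */

/* Pack Latin-only text one byte per unit for storage; returns bytes
 * written, or 0 if some unit needs the full-width representation. */
size_t pack_latin(const textunit *in, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (in[i] > 0xFF)
            return 0;                      /* fall back to the wide form */
        out[i] = (unsigned char)in[i];
    }
    return n;
}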

It's clear that operating system support has to be much better than it is now
for there to be any hope of writing programs which are portable between
Latin-only, Chinese/Japanese-only, and Chinese/Japanese/Latin environments.
I don't see the programming language constructs as being the major problem.

@alex
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia, and i

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/02/87)

In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>I haven't given this issue a whole lot of thought, but it seems to me that
>"short char" should be the smallest object which is addressible in C, and
>should define the units of sizeof; "long char" should denote whatever is
>necessary to represent the native character set.  On a bit-addressable machine
>in an Arabic- or Japanese-language environment, one might have "short char" be
>1 bit, "char" be 8, and "long char" be 16.

That is a bit more generous than my proposal, but it follows the same line
of thought.  I would prefer that a (char) be capable of holding an entire
basic textual unit, since many applications are already based on that
assumption.  A separate (long char) would necessitate a whole extra
collection of str*()-like library routines, which the portable programmer
would have to be careful to use instead of the str*() functions; might as
well simply make (char) be the right thing and not introduce a new type.

Using all three possible char lengths would not pose a serious problem
if the str*() functions were changed to require (long char) and if
implementations made (char) and (long char) the same size, at least for
now.  There aren't many bit-addressable architectures at present (more's
the pity), so most international implementations could make (short char)
8 bits and (char) or (long char) 16 bits.
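
To see the cost, here is roughly what the extra library would look like
(names invented; a typedef stands in for (long char), since no current
compiler accepts the type itself):

#include <stddef.h>

typedef unsigned short lchar;      /* stand-in for the proposed (long char) */

size_t lstrlen(const lchar *s)     /* twin of strlen() */
{
    const lchar *p = s;

    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

lchar *lstrcpy(lchar *dst, const lchar *src)   /* twin of strcpy() */
{
    lchar *d = dst;

    while ((*d++ = *src++) != 0)
        ;
    return dst;
}

Every one of the str*() routines would need such a twin, which is exactly
the duplication I would like to avoid.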

>If this is to be phased in without breaking a lot of programs, X3J11 should
>immediately bless all three names, but insist that they all be the same size.
>(Which restriction should be deprecated, to be removed in the next standard.)

I don't think it's within the realm of practical politics to say that the
problem will not be solved until the next issue of the standard.  It would
be better if it can be solved now without too much breakage of existing
non-internationalized code.  (Internationalized code is already vendor-
specific, due to lack of agreement on a universal approach.  Any good
solution will require at least some vendors to eventually change.)

On a related note, I see that the /usr/group people are trying to change
regular expressions from character-code based to language alphabet based
(as though there were always a universal collating order for a given
language!).  This is a most unfortunate direction, since it ruins simple,
well-behaved algorithms written in terms of (sufficiently wide) (char)s.
I wish they would not plow ahead with this until the multi-byte character
issue is resolved, since that may well affect the practical possibilities.

karl@haddock.UUCP (08/07/87)

[I probably should have included comp.std.internat earlier, but I didn't think
of it.  c.s.internat readers can pick up context from comp.lang.c if desired.]

In article <6216@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>>[For example,] on a bit-addressable machine in an Arabic- or Japanese-
>>language environment, one might have "short char" be 1 bit, "char" be 8,
>>and "long char" be 16.
>
>... I would prefer that a (char) be capable of holding an entire basic
>textual unit, since many applications are already based on that assumption.
>...might as well simply make (char) be the right thing and not introduce a
>new type. ... most international implementations could make (short char)
>8 bits and (char) or (long char) 16 bits.

>>If this is to be phased in without breaking a lot of programs, X3J11 should
>>immediately bless all three names, but insist that they all be the same size.
>>(Which restriction should be deprecated, to be removed in the next standard.)
>
>I don't think it's within the realm of practical politics to say that the
>problem will not be solved until the next issue of the standard.

The problem with your proposal is that it would break existing code that
assumes sizeof(char) == 1.  If a user wants to write a portable program that
refers to objects smaller than 16 bits%, he can't use (short char) because
existing compilers won't accept it, and he can't use (char) because new ones
might make it too big.  That's why I suggested the temporary restriction.
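
A sketch of what I mean (the commented-out form is the one that stays
correct if some future compiler makes sizeof(char) greater than 1):

#include <stdlib.h>
#include <string.h>

char *copy_string(const char *s)
{
    /* Fragile: correct only while sizeof(char) == 1 holds. */
    char *p = malloc(strlen(s) + 1);
    /* Robust under such a proposal: count storage in sizeof(char) units:
     * char *p = malloc((strlen(s) + 1) * sizeof(char));
     */
    if (p != NULL)
        strcpy(p, s);
    return p;
}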

Also, in the world of international text processing I don't think we have all
the questions yet, let alone the answers.  I figure X3J11 should take care of
one thing we do know (that "char" as commonly implemented nowadays won't
suffice) and pave the way for a real fix later.

(Hmm.  If I were a Japanese user, using a VAX, and I was told that, because
Japanese characters require more than 8 bits, and because (char) is the
obvious datatype for characters, and because C requires that nothing be
smaller than (char), my compiler couldn't address individual bytes, then I
think I'd start looking for a new vendor or a new programming language.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
%Assuming the implementation allows such an object to exist at all.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/08/87)

In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>The problem with your proposal is that it would break existing code that
>assumes sizeof(char) == 1.

Of course, such code is already broken in the international environment.
In fact, in an 8-bit (char) implementation, such code would continue to
work.  In other words, something has to give for internationalized
implementations; the question is what?  With my proposal,
sizeof(short char)==1, so there could be a transition period during
which implementations would make sizeof(char)==sizeof(short char) until
application source has been cleaned up.  (Some developers have been
careful to not rely on sizeof(char)==1 all along, anticipating the day
when this assumption may have to be changed.)

>If I were a Japanese user, using a VAX, and I was told that, because
>Japanese characters require more than 8 bits, and because (char) is the
>obvious datatype for characters, and because C requires that nothing be
>smaller than (char), my compiler couldn't address individual bytes, then I
>think I'd start looking for a new vendor or a new programming language.

That's why something has to be done.

As I reported recently, X3J11 has agreed in principle with Bill Plauger's
proposal for a typedef letter_t and a few conversion-oriented functions,
but NO library for letter_t analogous to the standard str*() routines.
This necessitates source-level kludgery for any application for which
portability into a multi-byte character environment is a possibility.
I don't like that very much, but since I'm not expecting to sell software
products to the Japanese I'll go along with it if the vendors think it
will fly.  This seems to be another case of not wanting to do things
technically correctly if that would require a radical change to previous
practice.  That's a legitimate concern, of course.
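
As I understand it, the shape would be roughly this; the names, the
signature, and the toy two-byte encoding are my guesses for illustration,
not Plauger's actual text:

#include <stddef.h>

typedef unsigned short letter_t;   /* wide enough for one textual unit */

/* Convert at most n bytes of a multi-byte sequence to one letter;
 * return the number of bytes consumed, or -1 on error. */
int btol(letter_t *lp, const char *s, size_t n)
{
    const unsigned char *u = (const unsigned char *)s;

    if (n == 0)
        return -1;
    if (u[0] < 0x80) {                 /* plain one-byte character */
        *lp = u[0];
        return 1;
    }
    if (n < 2)
        return -1;                     /* truncated two-byte sequence */
    *lp = (letter_t)(((u[0] & 0x7F) << 8) | u[1]);
    return 2;
}

Note what is missing: there would be no lstrcmp(), no lstrcpy(), and so
on, which is precisely the source-level kludgery I referred to.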

If *I* were a Japanese programmer, I think I'd resent being treated as
a second-class citizen by the programming language.

kent@xanth.UUCP (Kent Paul Dolan) (08/09/87)

While we're developing nightmares about the number of bits the Japanese
need in a char, remember for text processing that for 1 billion of the
earth's residents, the smallest unit of text processing is the ideograph,
and that even 21 bits is probably barely sufficient to represent the number
of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
want 24 bit ones!  ;-)

(Of course, one _could_ always write off the market, but a billion customers
is rather a lot at which to turn up one's nose!)

Kent, the man from xanth.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/10/87)

In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>While we're developing nightmares about the number of bits the Japanese
>need in a char, remember for text processing that for 1 billion of the
>earth's residents, the smallest unit of text processing is the ideograph ...

I'm no expert, but I seem to recall that Chinese ideographs (which
as I understand it come in several varieties) are pretty much made
from a (relatively) small set of basic strokes placed in different
positions.  I think there are even Chinese typewriters, or at least
type compositors.  If this is correct, then one possibility would
be to devise a suitable (acceptable to technical Chinese)
representation for ideographs in terms of basic strokes and
placement instructions, which could be treated as text units.
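
In data-structure terms, something like this (the layout is pure
invention, only to make the idea concrete):

struct stroke {
    unsigned char kind;          /* which basic stroke, from the small set */
    unsigned char row, col;      /* placement within the character cell */
};

struct ideograph {
    unsigned char nstrokes;
    struct stroke strokes[24];   /* the whole array is one text unit */
};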

After all, the letter "w" doesn't mean much when taken out of
English context; we too need the whole word-symbol, not just a
letter-component to express a meaning.  It's just that our
alphabet is simpler and is combined in 1 dimension instead of 2.

lambert@cwi.nl (Lambert Meertens) (08/10/87)

In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
) 
) While we're developing nightmares about the number of bits the Japanese
) need in a char, remember for text processing that for 1 billion of the
) earth's residents, the smallest unit of text processing is the ideograph,
) and that even 21 bits is probably barely sufficient to represent the number
) of written words in Chinese.

Are you suggesting that there are more than 2**20 = 1048576 different
written words in Chinese?  At typically 60 entries on a page, their
dictionaries must have then some 17500 pages or more.  I think that 16 bits
are enough to accommodate all Chinese characters, and certainly ample for
the about 5000 that are in actual use.

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

dougg@vice.TEK.COM (Doug Grant) (08/10/87)

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> 
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
> want 24 bit ones!  ;-)

Great idea, Kent!  But with so many characters and attributes in common
usage, even 32 bits isn't enough for everyone to communicate exactly what
they mean.

I would like to propose an ASCII-compatible 64-bit character set (really!).

Here's my suggestion for how to divvy up the bits:

	24 bits - character
	 8 bits - font
	 8 bits - size
	 8 bits - color
	 4 bits - intensity (boldness)
	 2 bits - blink rate (00 = don't blink)
	 1 bit  - normal/reverse
	 8 bits - sync
	 1 bit  - left over - any suggestions?

Here's how it would be ASCII compatible:

	The eighth bit of the first byte received would be used as an
	ASCII/extended character set flag.  If it is zero, the character
	is normal 7-bit ascii.  If it is 1, the next seven bytes are used
	to complete the eight-byte character.  Only the eighth bit of the first
	byte is set to one - the eighth bit of the remaining seven bytes
	is set to zero, thus assuring that when "Extended Character Set"
	characters come in, their bytes can be kept in sync.
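
	In C, the transmit side of that rule might look as follows (a
	sketch: the 56 payload bits are the 64 bits above less the 8
	sync bits, and "unsigned long long" is assumed wide enough):

#include <stdio.h>

void put_extended(unsigned long long payload)  /* low 56 bits used */
{
    int shift;

    if (payload < 0x80) {              /* normal ASCII: one byte, bit 8 zero */
        putchar((int)payload);
        return;
    }
    /* first byte: flag bit set, plus the top 7 of the 56 payload bits */
    putchar((int)(0x80 | ((payload >> 49) & 0x7F)));
    for (shift = 42; shift >= 0; shift -= 7)       /* seven more bytes */
        putchar((int)((payload >> shift) & 0x7F)); /* eighth bit zero */
}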

	For those who say "but too much bandwidth would be used for
	64-bit characters!" I say hang on - fiber optic communications
	are coming!

	I'd sure like to see one standard character set that can
	accommodate the whole world!

Doug Grant
dougg @vice.TEK.COM

disclaimer:  These opinions are my own, but my employer is welcome
	     to adopt them.

guy%gorodish@Sun.COM (Guy Harris) (08/11/87)

> Are you suggesting that there are more than 2**20 = 1048576 different
> written words in Chinese?  At typically 60 entries on a page, their
> dictionaries must have then some 17500 pages or more.  I think that 16 bits
> are enough to accommodate all Chinese characters, and certainly ample for
> the about 5000 that are in actual use.

According to a document called "USMARC Character Set: Chinese Japanese Korean",
from the Library of Congress, Washington, a 24-bit character was developed to
"represent and store in machine-readable form all the Chinese, Japanese, and
Korean characters used with the USMARC format."

It says that the character sets incorporated into this character set (the
RLIN - Research Libraries Information Network - East Asian Character Code, or
REACC) are:

	+ *Symbol and Character Tables of Chinese Character Code for
	  Information Interchange*, vol. 1 and 2 (2nd ed., Nov. 1982) and
	  *Variant Forms of Chinese Character Code for Information Interchange*
	  (2nd ed., Dec. 1982) (CCCII)  Editor:  The Chinese Character Analysis
	  Group.  Total:  33,000 characters.

	  REACC contains all of the 4,807 "most common" Chinese characters in
	  volume 1 (as listed by the Ministry of Education in Taiwan) and about
	  5,000 of the 17,000 characters taken from a compilation of data from
	  different computer centers (mostly personal names) in volume 2.
	  REACC also contains about 3,000 of the approximately 11,000
	  characters in the CCCII *Variant Forms*, which lists PRC simplified
	  forms and other variants, some of which are also used in modern
	  Japanese.

	+ *Code of Chinese Graphic Character Set for Information Interchange
	  Primary Set:  The People's Republic of China National Standard* (GB
	  2312-80) (1st ed., 1981).  Total:  6,763 characters.  All the
	  characters in this set are in REACC.

	+ *Code of the Japanese Graphic Character Set for Information
	  Interchange:  Japanese Industrial Standard* (JIS C 6226)  (1983).
	  Total:  6,349 characters.  All the characters in this set are in
	  REACC.

	+ *Korean Information Processing System* (KIPS).  Total: 2,392 Chinese
	  characters and 2,058 Korean Hangul.  Chinese characters in this set
	  are in REACC; all hangul are also incorporated in REACC, as well as
	  some hangul *not* in KIPS.

One characteristic of this character set is that it tries to permit a simple
rule to get the codes for various variant forms of characters from the code for
the traditional form of the character.

So, while you can probably stuff the major Chinese characters into 16 bits (the
CCCII, including variant characters, contains 33,000 characters), you may not
want to.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

howard@COS.COM (Howard C. Berkowitz) (08/11/87)

In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese.  Anyone for 32 bit characters?  I sure don't
> want 24 bit ones!  ;-)


I worked at the Library of Congress in the late 70's, and was 
responsible for the hardware and systems software aspects of
experimental terminals for the 140 or so fonts (700 or so
languages and dialects) in which the Library has materials.

Chinese, of course, was the nightmare.  Several authorities
said we should assume about 50K distinct ideographs, but the
language scholars in the Orientalia Division said 100K was
a more correct number.  When the outside experts challenged
this, saying that the additional 50K appear in only esoteric
documents used by very specialized scholars, Orientalia responded
with "who do you think use the Orientalia collection at the
Library of Congress?"

It developed, however, that the Chinese ideograph problem could
be simplified.  While there are a very large number of distinct
ideographs, these ideographs are composed of a much smaller
(<100) number of superimposed radicals.  Chinese dictionaries
use radicals as a means of lexical ordering.  

While I am out of touch with current research, it was felt at
the time that Chinese (and full Japanese Kanji) could be approached
by using a mixture of codes for common ideographs and escapes
to strings of radicals (to be superimposed), or purely by
radical strings.
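
In data terms, the mixture might look like this (all values invented
for illustration):

#define IDEO_ESCAPE 0xFFFF    /* reserved 16-bit code: radicals follow */

struct radical_string {
    unsigned char count;      /* how many radicals to superimpose */
    unsigned char radical[8]; /* codes from the <100 basic radicals */
};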

When discussing the Oriental language problem, do distinguish
the linguistic problem of ideograph uniqueness from the graphic
problem of ideograph display.  This differentiation is similar
to the difference between a code and a cipher.

-- 
-- howard(Howard C. Berkowitz) @cos.com
 {seismo!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER:  I explicitly identify COS official positions.

rlk@think.COM (Robert Krawitz) (08/11/87)

In article <1804@vice.TEK.COM> dougg@vice.TEK.COM (Doug Grant) writes:

]	24 bits - character
]	 8 bits - font
]	 8 bits - size
]	 8 bits - color
]	 4 bits - intensity (boldness)
]	 2 bits - blink rate (00 = don't blink)
]	 1 bit  - normal/reverse
]	 8 bits - sync
]	 1 bit  - left over - any suggestions?

8 bits really isn't enough for color, and it may not be enough for
font.  There's plenty of screens out there with 24 bit planes.  You
don't really want to lock out these folks, do you?

]	The eighth bit of the first byte received would be used as an
]	ASCII/extended character set flag.  If it is zero, the character
]	is normal 7-bit ascii.  If it is 1, the next seven bytes are used
]	to complete the eight-byte character.  Only the eighth bit of the first
]	byte is set to one - the eighth bit of the remaining seven bytes
]	is set to zero, thus assuring that when "Extended Character Set"
]	characters come in, their bytes can be kept in sync.

This is worse!  Now you have only 7 bits available.  And why do you
need the sync byte?  The hardware should take care of that!

]	For those who say "but too much bandwidth would be used for
]	64-bit characters!" I say hang on - fiber optic communications
]	are coming!

Well, I suppose memory is cheap these days.  Still, we're only gaining
a factor of 4 every 3 years or so in memory capacity, and even less in
disk storage.

]	I'd sure like to see one standard character set that can
]	accommodate the whole world!

What next, GASCII (Galactic Standard Code for Information
Interchange)?  Can't afford to be parochial about such things.
Robert^Z

john@frog.UUCP (John Woods, Software) (08/12/87)

In article <34@piring.cwi.nl>, lambert@cwi.nl (Lambert Meertens) writes:
I>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
N>)While we're developing nightmares about the number of bits the Japanese
C>)need in a char, remember for text processing that for 1 billion of the
L>)earth's residents, the smallest unit of text processing is the ideograph,
U>)and that even 21 bits is probably barely sufficient to represent the number
D>)of written words in Chinese.
E> 
D>Are you suggesting that there are more than 2**20 = 1048576 different
 >written words in Chinese?  At typically 60 entries on a page, their
T>dictionaries must have then some 17500 pages or more.  I think that 16 bits
E>are enough to accommodate all Chinese characters, and certainly ample for
X>the about 5000 that are in actual use.
T> 
In the English dictionary that the documentation department here uses, there
are 320,000 words.  I am told that the Oxford English Dictionary has
approaching 1,000,000 words, and that the total English language has just
over 1,000,000 words.  Chinese is probably about the same.

I can see asking the Chinese to adopt some limited alphabet scheme (such as
Romaji used by the Japanese (if I remember correctly, a 3-Roman-character
spelling for each syllable of Kanji), or perhaps Roman phonetic spelling),
but telling them that some microscopic fraction of their language has to be
selected for interaction with computers is just flatly bogus.

(a side note to provoke more chuckles than thought:  are ideographs the CISCs
of language?  Perhaps that makes Morse code the RISC...)

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"The Unicorn is a pony that has been tragically
disfigured by radiation burns." -- Dr. Science

kent@xanth.UUCP (Kent Paul Dolan) (08/12/87)

In article <34@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>) 
>) While we're developing nightmares about the number of bits the Japanese
>) need in a char, remember for text processing that for 1 billion of the
>) earth's residents, the smallest unit of text processing is the ideograph,
>) and that even 21 bits is probably barely sufficient to represent the number
>) of written words in Chinese.
>
>Are you suggesting that there are more than 2**20 = 1048576 different
>written words in Chinese?  At typically 60 entries on a page, their
>dictionaries must have then some 17500 pages or more.  I think that 16 bits
>are enough to accommodate all Chinese characters, and certainly ample for
>the about 5000 that are in actual use.
>--
>Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

Surely not!  My own English active/passive vocabularies are 100,000/250,000
words.  The Oxford Dictionary of the English Language fills a five foot book
shelf and contains well over a million entries.  The Chinese have had a LONG
time to work with a written language; I would expect their numbers to exceed
these.  There seems little chance that 65536 ideographs would suffice.

(Comments from members of the Chinese community who know the answers to such
questions would save a lot of fruitless debate here!  ;-)

Kent, the man from xanth.

adam@misoft.UUCP (08/12/87)

|In article <34@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
|>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
|>) 
|>) and that even 21 bits is probably barely sufficient to represent the number
|>) of written words in Chinese.
|>
|>I think that 16 bits
|>are enough to accommodate all Chinese characters, and certainly ample for
|>the about 5000 that are in actual use.
|
|There seems little chance that 65536 ideographs would suffice.

A Chinese word is made up of one or more characters. Each character has a value
of one phoneme. There are about 5000 phonemes in common use, and there is an
agreed 16-bit code for the most common phonemes. The codes are then combined
to make words.

I've been on the net for a while now and I didn't know that comp.lang.c was
short for comp.lang.chinese (though looking at that convoluted employment test
had me wondering......)

       -Adam.

/* If at first it don't compile, kludge, kludge again.*/

henry@utzoo.UUCP (Henry Spencer) (08/12/87)

>	 8 bits - color

Surely you jest.  Any color-graphics type will tell you that you need at
least 24 bits, maybe 36 or 48. :-)

More seriously, your all-inclusive scheme falls down like this in several
areas.  8 bits may not be enough for a font size when things like fractional
sizes come in (yes, there are fractional sizes).  8 bits certainly is not
enough for a font in demanding applications -- ever looked at a font catalog?
Finally, it's not common to change things like color and font from one
character to the next (unless one is a Mac user intoxicated with the joy of
font changing, sigh...), so a lot of those bits are being wasted.  Better
to use some sort of font-switch (etc.) sequences, simultaneously giving
more compact coding and more flexibility.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

peter@sugar.UUCP (Peter da Silva) (08/13/87)

In Japan programming languages are the least of the problems their written
language causes them. An incredible amount of data is never stored anywhere
but on the original form, photocopies of said form, or faxed copies of said
form. Even with the best tools available it's just too hard to keypunch.

This, of course, makes it even more amazing that they have been so successful
in the world community. It seems likely to me, though, that at some point
they're going to have to break down and drop Kanji for professional use.
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter (I said, NO PHOTOS!)

pom@under..ARPA (Peter O. Mikes) (08/13/87)

To: henry@utzoo.UUCP
Subject: Re: What is a byte
Newsgroups: comp.lang.c,comp.std.internat
In-Reply-To: <8404@utzoo.UUCP>
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP>
Organization: S-1 Project, LLNL
 
In article <8404@utzoo.UUCP> you  ( Henry Spencer ) write:
>
>font changing, sigh...), so a lot of those bits are being wasted.  Better
>to use some sort of font-switch (etc.) sequences, simultaneously giving
>more compact coding and more flexibility.
>--    ^ 
  This | is an IMPORTANT IDEA  :-| .  The so-called 'Daisy Sort' - the
  sequence of characters on the printwheel - is optimized, using the
  frequency of bigrams in the English language, in such a manner that
  characters which are frequent neighbours are near each other (that makes
  for a faster printer).  NOW, if I recall correctly, about 90% of movements
  are within a ten-spoke distance, and (another statistical fact) the
  special symbols and capitals are so rare that their spacing is irrelevant
  (except that digits tend to follow digits - so you place all the digits
  next to each other).
  | |
   v
    => It is very wasteful to store English text using ASCII.

  ergo:
        There are really just a few 'rational alternatives' for storing text:

    1) 4 bits: sign + 3-bit distance in the sort (of an imaginary standard
       printwheel), with one code (0+000) reserved to mean: the following
       4-bit word has another meaning (namely, e.g., a long jump), or jump
       to another subset of the character set (such as: switch the cases
       UPPER/lower, digits + arithmetic signs, carriage motion controls...)

    2) 6 bits: 1-bit sign + 1 bit (distance/font switch) + 4 bits (either
       the distance to the next character within the given sort, or one of
       16 other subfont sorts)

    3) ...

	Naturally, languages such as C would have different statistics and
  should probably merit a special sort - which would be marked by a six(?)-
  bit code at the beginning of the file/document (since the unix command
  'file' would not work {it does not work too well anyway}), specifying
  (type of the file) = (appropriate daisy sort), such as: English text,
  numerical data, PostScript file, C source, ...
    => It is ALSO very wasteful to store numerical data sets using ASCII.

     Of course, in the numerical_data character-subset we need characters
  for over-flow and undefined (NaN, Infinity, missing data point,
  end-of-file, end-of-row = end-of-vector, another-data-set, ...),
  and characters for the decimal point/comma, E, and a triplet-separator,
  so that I can write

   6_234,567 = 6_234.567 = 0.623_456_7E4 to mean

  six thousand and 234.567.  (The decimal comma (European way) is preferred
  by the ISO SI standard, while the decimal point (US way) is tolerated.)
  And the (current ISO) triplet separator (namely blank, i.e. 1 000 for
  one thousand) MUST be changed (since blank is used in parsing).

 Perhaps 1_000=E3 (and 10 = E3.101 ?) or 1:000 = 1E3
                                    (with / only, being used for division?)

 Actually, for speed of parsing it would be highly preferable to AVOID
 alphabetic separators (. and ,) and letters to express numbers.

  Perhaps we can write   3:456::4   to express
     three thousand four hundred fifty-six and four tenths,

    and perhaps 1:+3 = 1:000 (1E3) and 5:-3 for .005 (5e-3)?

 In any case, we should be able to express all numbers using sixteen
 digit-type characters: + -  0..9, (decimal sign), (exponent sign)
 (that's 13 or 14), and then perhaps either | or { } for c-style sets,
 and (one triplet separator) (e.g. : or _) (not blank).
 We then can represent Infinity as ::: or +++ and NaN as +_+ etc.

  Anyway, I just wanted to say that Henry's pertinent reminder - that
  character sets and the grouping of characters into sets (or sub-fonts)
  affect the compactness of information storage - really points a way to an
  objective measure of the suitability of different coding methods for
  different uses, and that several categories of use, namely

1) English text, or just any plain text (i.e. prose)   (4 or 6 bits)
2) numerical data sets (i.e. number or point sets)     (4 bits)
3) C, or just 'any programming language'
4) carriage motions (tabs, form feeds, cursor addressing, ...) and
   modifiers (highlight, underline, typography, ...)
?) ...?...

    are frequent enough and universal enough to merit their own
    character families or subfonts, binary representations,
    and an international standard.
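
  A sketch of alternative 1) in C (the wheel order below is invented;
  a real one would come from the bigram statistics, and the long-jump
  escape format is just one possibility):

#include <stdio.h>
#include <string.h>

static const char wheel[] = "etaoinshrdlucmfwypvbgkjqxz";

static void emit(unsigned nyb) { printf("%X", nyb & 0xFu); }

void daisy_encode(const char *text)
{
    int prev = 0;                        /* start at spoke 0 */

    for (; *text != '\0'; text++) {
        const char *p = strchr(wheel, *text);
        int pos, d;

        if (p == NULL)
            continue;                    /* outside this toy alphabet */
        pos = (int)(p - wheel);
        d = pos - prev;
        if (d != 0 && d >= -7 && d <= 7) {
            /* sign bit + 3-bit distance along the wheel */
            emit(d < 0 ? 8u | (unsigned)-d : (unsigned)d);
        } else {
            emit(0);                     /* 0+000: escape (long jump) */
            emit((unsigned)pos >> 4);    /* absolute spoke, two nybbles */
            emit((unsigned)pos & 0xFu);
        }
        prev = pos;
    }
    putchar('\n');
}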

franka@mmintl.UUCP (Frank Adams) (08/14/87)

In article <1549@frog.UUCP> john@frog.UUCP (John Woods, Software) writes:
>(a side note to provoke more chuckles than thought:  are ideographs the CISCs
>of language?  Perhaps that makes Morse code the RISC...)

No, Morse code is the PDP-8 of language -- an extremely sparse
implementation imposed by severe technical limitations.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

devine@vianet.UUCP (Bob Devine) (08/15/87)

>   The so called 'Daisy Sort' [...]    points a way to an
>   objective measure of suitability of different coding  methods for different
>   uses - and that several categories  of use [text, programming code]
>   are frequent enough and universal enough  to merit their own 
>   character families or subfonts, binary representations
>   and an international standard.

  Unfortunately, it would be very difficult to write any general
text manipulating programs.  With ASCII encoding it is very easy
to yank sections out of random files and assemble a consistent
file.  The problems of dealing with a mixed-mode file will quickly
eliminate advantages gained in the single-mode case.  Likewise
consider how a program like 'grep' would operate; it would need
a switch to handle different representations of the same string.

oster@dewey.soe.berkeley.edu (David Phillip Oster) (08/15/87)

In article <8409@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Remember that the OED includes an awful lot of words that are obsolete or
>terminally obscure by anyone's standards.  It is not a dictionary of current
>English.

That's part of the point. Would you support an encoding scheme that
prevented me from using English documents, even those containing
obsolete or obscure words, on my computer?  Well, if we are going to
standardize on an encoding for Chinese, it should be able to cover ALL
of Chinese.  There is no reason why we couldn't use a huffman encoding
scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
pattern is a filler, and the 16th pattern means that the next byte
encodes the 254 next most common ideograms, the 255 bit pattern
meaning that the next 16-bit word had the 65534 next most common, and
so on.  

That way, the average length of a run of chinese text is
likely to be about 10 bits per ideogram, and any single ideogram would
have a canonical 64-bit representation: its bit pattern in the left of
the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
patterns and padded out with filler nybbles.
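
The decoding loop is simple.  A sketch (the nybble-at-a-time input
routine is assumed, and the frequency tables themselves are left
abstract; the function returns the ideogram's frequency rank):

extern unsigned next_nybble(void);   /* assumed: the input, 4 bits at a time */

unsigned long next_ideogram(void)
{
    unsigned n;
    int i;

    while ((n = next_nybble()) == 14)       /* 15th pattern: filler, skip */
        ;
    if (n < 14)
        return n;                           /* one of the 14 most common */
    n = next_nybble();                      /* 16th pattern: escape to byte */
    n = (n << 4) | next_nybble();
    if (n != 255)
        return 14 + n;                      /* the next 254 most common
                                             * (one byte value to spare) */
    for (n = 0, i = 0; i < 4; i++)          /* pattern 255: escape to a */
        n = (n << 4) | next_nybble();       /* 16-bit word */
    return 14 + 254 + n;
}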

 
Now, all we have to do is pick an ideogram frequency standard.  Say,
this idea would also work for English. Assuming that the average
English word takes 6*8 bits (average length of 5 plus a terminating space,
at 8 bits of ASCII each), you could cut the disk space required for computer
storage by a factor of close to 5 by using this encoding scheme. Too
bad that you'd have a mammoth word list in main memory to unpack it
speedily. Might be a nice way to increase the effective bandwidth of
all those modems pushing UseNet around though.
--- David Phillip Oster            --My Good News: "I'm a perfectionist."
Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
Uucp: {seismo,decvax,...}!ucbvax!oster%dewey.soe.berkeley.edu

guy@gorodish.UUCP (08/16/87)

> In Japan programming languages are the least of the problems their written
> language causes them. An incredible amount of data is never stored anywhere
> but on the original form, photocopies of said form, or faxed copies of said
> form. Even with the best tools available it's just too hard to keypunch.
> 
> This, of course, makes it even more amazing that they have been so successful
> in the world community. It seems likely to me, though, that at some point
> they're going to have to break down and drop Kanji for professional use.

I don't know about that.  More and more machines are adding support for Kanji.
There are a large number of Japan-only (Japan-mostly?  I seem to remember Jun
Murai saying these groups were forwarded to Carnegie-Mellon) newsgroups in which
most of the traffic is in Japanese, represented in Kanji.  (He said they added
Kanji support to X.10, including a "jterm" variant of "xterm" that emulated a
Kanji terminal.)  The NEC PC also includes Kanji support; it is often used as a
Kanji terminal.

These machines may not be able to handle every single Kanji character, but the
90/10 rule may apply.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

henry@utzoo.UUCP (Henry Spencer) (08/17/87)

> >Remember that the OED includes an awful lot of words that are obsolete or
> >terminally obscure by anyone's standards.  It is not a dictionary of current
> >English.
> 
> That's part of the point. Would you support an encoding scheme that
> prevented me from using English documents, even those containing
> obsolete or obscure words, on my computer?...

Depends.  The current encoding scheme (ASCII) already prevents you from
using English documents beyond a certain age -- the English alphabet has
changed!  Just try typing in a document that uses the thorn (vertical
stroke with a loop on the side) or the long 's' (like an integral sign).
The lack of these symbols is a real nuisance to certain scholars, but
somehow it doesn't interfere with most uses of ASCII.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

tony@artecon.UUCP (08/18/87)

In article <25736@sun.uucp>, guy%gorodish@Sun.COM (Guy Harris) writes:

>In article Peter Da Silva writes:

> > In Japan programming languages are the least of the problems their written
> > language causes them. An incredible amount of data is never stored anywhere
> > but on the original form, photocopies of said form, or faxed copies of said
> > form. Even with the best tools available it's just too hard to keypunch.
> > 
> > This, of course, makes it even more amazing that they have been so successful
> > in the world community. It seems likely to me, though, that at some point
> > they're going to have to break down and drop Kanji for professional use.
> 
> I don't know about that.  More and more machines are adding support for Kanji.
> There are a large number of Japan-only (Japan-mostly?  I seem to remember Jun
> Murai saying these groups were forwarded to Carnegie-Mellon) newsgroups in which
> most of the traffic is in Japanese, represented in Kanji.  (He said they added
> Kanji support to X.10, including a "jterm" variant of "xterm" that emulated a
> Kanji terminal.)  The NEC PC also includes Kanji support; it is often used as a
> Kanji terminal.
> 
> These machines may not be able to handle every single Kanji character, but the
> 90/10 rule may apply.
> 	Guy Harris

Yes, it is true that Kanji is getting more support.  Hewlett-Packard has
a new drafting plotter (HP-7595) which has a Kanji option.  The form of
specification is that when you invoke the Kanji font, you go into a two-byte
mode.  That is, it takes two bytes to specify one Kanji character.  Control
bytes are used as control bytes, but the 94 printing bytes are used in the
Kanji specification.  So, 94 * 94 = 8836 different characters you can use.
This is a good way of doing it since you never know how your OS is going
to muck with control codes or full 8-bit binary data going to I/O devices.
I believe that this is a fairly standard way of doing this for printers.
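
The mapping is trivial to compute.  A sketch, assuming the 94 printing
bytes are 0x21 through 0x7E (the usual range; I haven't checked what the
plotter actually uses):

#include <assert.h>

void index_to_twobyte(int index, unsigned char out[2])
{
    assert(index >= 0 && index < 94 * 94);         /* 8836 code points */
    out[0] = (unsigned char)(0x21 + index / 94);   /* first byte */
    out[1] = (unsigned char)(0x21 + index % 94);   /* second byte */
}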

8836 may not seem like a lot of Kanji (which is known to run to about 50,000 in
Japanese), but only 1850 are needed to graduate from high school, and usually
about 3000 are used in college texts.  There are two "JIS" standards set
by the Japanese Ministry of Education.  JIS level 1 is about 3000 characters
(including the basic 1850, KataKana, HiraGana, English alphabet, Cyrillic,
Greek, and special symbols), and JIS level 2 is about 8000 (including the
3000 JIS level 1).  As a rule, one is supposed to try to stick to JIS level 1,
but use JIS level 2 for proper names and just a few other exceptions.

So, in reply to above: 

	1) You may not be able to handle all 50000 Kanji, but JIS level 2 
	   is more than enough, 

	2) It really isn't that difficult to implement because:

		a)  It is a well defined font, accessed easily in two-byte
			sequencing (you don't even need 8 bits, 7 will do)

		b)  You can get already-masked ROMs which contain Kanji
		    in a rasterized form for raster printers.

		c)  The Japanese are more than happy to help you implement
		    Kanji in your products.  They will digitize Kanji for
		    whatever reasonable form you need it.

-- Tony

BTW, I am not Japanese...but.."I think I'm turning Japanese, I really think so!"

"Konnichi-wa"
-- 
**************** Insert 'Standard' Disclaimer here:  OOP ACK! *****************
*  Tony Parkhurst -- {hplabs|sdcsvax|ncr-sd|hpfcla|ihnp4}!hp-sdd!artecon!adp  *
*                -OR-      hp-sdd!artecon!adp@nosc.ARPA                       *
*******************************************************************************

john@frog.UUCP (John Woods, Software) (08/18/87)

In article <8409@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
(and many others as well)
>>[English has] over 1,000,000 words.  Chinese is probably about the same.
> 
Many people (including Henry) have pointed out that (A) English is larger
than most languages (having borrowed "one of everything" from everyone), and
(B) Chinese ideographs are not one-per-word, but one-per-concept (hence most
words are two or more ideographs).  So, I went back to the source I first
read about this topic in:  "Multilingual Word Processing", Joseph D. Becker,
Scientific American July 1984.

In this article, he doesn't give an actual count of Chinese ideographs (just
the statement "tens of thousands"), but in the "flexible encoding" he and
other Xerox denizens developed (using alphabet shift-codes), to encode Chinese
you send the "shift superalphabet (for 16 bit characters)", the 8-bit "super-
alphabet number", and then 16-bit character sequences.  "The main
superalphabet, designated 00000000, is all one needs except for very rare
Chinese characters."  A little later in the article is the implication that
about 7000 ideographs are "commonly seen" in Chinese publishing.
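
From that description, sending a run of Chinese would look roughly like
this (the shift value is a placeholder of mine, not Xerox's actual code):

#include <stdio.h>
#include <stddef.h>

#define SHIFT_16 0xFF   /* hypothetical "shift superalphabet" code */

void emit_chinese(const unsigned short *chars, size_t n, int superalphabet)
{
    size_t i;

    putchar(SHIFT_16);              /* announce 16-bit characters */
    putchar(superalphabet);         /* 8-bit superalphabet number; 0 = main */
    for (i = 0; i < n; i++) {       /* then the 16-bit character codes */
        putchar(chars[i] >> 8);
        putchar(chars[i] & 0xFF);
    }
}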

So, there we have it:  not as bad as I thought, but still indicating that
8 bits is woefully inadequate.

Also, I seem to have slipped up in my understanding of Kanji:  Kanji is the
set of Chinese ideographs borrowed by the Japanese, of which "about 3500"
are in common use (and the number is declining).  The phonetic letters (which
can spell words in entirety, and are used to indicate grammatical endings for
Kanji roots) are collectively called "kana", and come in two sets:  "hiragana"
and "katakana" (it is probably more complicated than that, but that is about
all the article gives).  There used to be Kanji "typewriters" which scarcely
anyone used (using several hundred keys); now, computerised systems exist in
which one can type phonetic hiragana symbols (or, for those who prefer, the
Romaji phonetics), and press a "lookup key" to have the computer turn the
just-typed word into proper Kanji.

The Bibliography in that Scientific American says the following publications
may be helpful:

_Writing Systems of the World_, Akira Nakanishi.  Charles E. Tuttle, 1980.
"A Historical Study of Typewriters and Typing Methods:  From the Position
of Planning Japanese Parallels", Hisao Yamada in _Journal of Information
Processing_, Vol. 2, No. 4, pp 175-202; February, 1980.

Can we all now consider the statement "7 bits is enough" most sincerely dead?

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

"The Unicorn is a pony that has been tragically
disfigured by radiation burns." -- Dr. Science

edstrom%UNCAEDU.BITNET@wiscvm.wisc.EDU (08/20/87)

In article <1804@vice.TEK.COM> dougg@vice.TEK.COM (Doug Grant) writes:

>    For those who say "but too much bandwidth would be used for
>    64-bit characters!" I say hang on - fiber optic communications
>    are coming!

I'm one of those who find 64 bit characters unacceptable just because they
would choke the wires. The other arguments against the idea are as good
but this is enough.

Fiber optic communication is coming, but no one will say when or for
how much. When it does, all of those 9600 baud modems would look like
1200 baud terminals, and 1200 baud modems would.... well, it's too ugly
to think about. A standard 64 bit character would cost billions to implement.

Besides, it's overkill. Single characters are relatively unimportant.
The rare occasions where one changes font in the middle of a word are
so few and far between that I can't remember the last time I missed
that feature.

I could see something like that at the word or sentence level. Perhaps
define a data type called "form_string" whose first 4 or 8 bytes would
be formatting information.
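
Something like this, perhaps (field widths illustrative only):

struct form_string {
    unsigned char font;       /* one block of formatting state up front... */
    unsigned char size;
    unsigned char color;
    unsigned char flags;      /* bold, reverse, blink, ... */
    char text[1];             /* ...instead of 56 extra bits per character */
};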

peterg@felix.UUCP (Peter Gruenbeck) (08/20/87)

---------------
Disclaimer: I may not know what I'm talking about. Batteries not included.
---------------

I have difficulty getting used to the idea of a 32 bit byte. What would
happen to the nybble (half a byte - get it?)? Would we be biting off
more than we could chew? I would be in favor of leaving a byte as
8 bits and using the term WORD to represent a unit of addressable
memory. That way we limit the confusion of how many bits something
has to a term which is already confusing.

For example:
   Many small computers (6502, 808x, 68000) have a word = 8 bits.

   Some older mainframe systems (IBM 370, Cyber, DEC) have word
   lengths of 32, 60, and 12 bits, respectively.

   Some specialized engines (Itel 370 compatible) have 80 bit
   words for the microcode interpreters. Also, some PC ram disks
   may be considered to have 128 byte words since that is what you
   address and then take the rest sequentially.

New machines to handle the complex multi-language problem may
have a 32 bit word if that is what a character takes. Maybe we should
define a new term called a CHAR to define the number of bits required
to represent a character.  

I'm told I'm quite a character myself at times. I suspect you'll need
more than 32 bits to define me (I hope). This is not to say that
this is the final WORD.

henry@utzoo.uucp (Henry Spencer) (08/20/87)

> In the English dictionary that the documentation department here uses, there
> are 320,000 words.  I am told that the Oxford English Dictionary has
> approaching 1,000,000 words, and that the total English language has just
> over 1,000,000 words.  Chinese is probably about the same.

Remember that the OED includes an awful lot of words that are obsolete or
terminally obscure by anyone's standards.  It is not a dictionary of current
English.

I would also wonder about the assumption that Chinese would be about the
same size.  I have heard it said that English has the largest vocabulary of
any human language by a wide margin, because of its dominant position and
its unusually extensive borrowing from other languages.
-- 
Support sustained spaceflight: fight |  Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"!  | {allegra,ihnp4,decvax,utai}!utzoo!henry

mouse@mcgill-vision.UUCP (08/23/87)

In article <6252@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
[stuff about assuming sizeof(char)==1]
[>> uses an example of a Japanese programmer having problems]

> If *I* were a Japanese programmer, I think I'd resent being treated
> as a second-class citizen by the programming language.

If you insist on taking a language designed in the English world for
the English world and using it in Japan, it wouldn't surprise me a bit
if it made a poor showing.

Why do we all assume that C must be twisted and bent to fit the
international environment?  Are there *no* computer languages designed
by Japanese for a Japanese environment (or Chinese or Arabic or Hindi
or etc)?  Perhaps it is time for one.

(Not that I have anything against extending C to such an environment; I
like C too.  But it's beginning to look as though the result of such
attempts "ain't C", to coin a phrase.)

					der Mouse

				(mouse@mcgill-vision.uucp)

chips@usfvax2.UUCP (Chip Salzenberg) (08/24/87)

[I've added comp.text to the list.]

In article <8892@brl-adm.ARPA>, edstrom%UNCAEDU.BITNET@wiscvm.wisc.EDU writes:
} 
} In article <1804@vice.TEK.COM> dougg@vice.TEK.COM (Doug Grant) writes:
} 
} >    For those who say "but too much bandwidth would be used for
} >    64-bit characters!" I say hang on - fiber optic communications
} >    are coming!
} 
} [...] it's overkill. Single characters are relatively unimportant.
} The rare occasions where one changes font in the middle of a word are
} so few and far between that I can't remember the last time I missed
} that feature.
} 
} I could see something like that at the word or sentence level. Perhaps
} define a data type called "form_string" whose first 4 or 8 bytes would
} be formatting information.

Actually, it's rather presumptuous to define n bits for color, m bits for
font, etc. -- how many fonts will _you_ be using in five years?  Nobody knows.

It would be much simpler to store text in the "form_string" mentioned above.
But why exclude font changes in the middle of a word?

If we need a common interchange format for formatted text, let's use
PostScript.  It already exists and it's been implemented in several places.

-- 
Chip Salzenberg            UUCP: "uunet!ateng!chip"  or  "chips@usfvax2.UUCP"
A.T. Engineering, Tampa    Fidonet: 137/42    CIS: 73717,366
"Use the Source, Luke!"    My opinions do not necessarily agree with anything.

V4007%TEMPLEVM.BITNET@WISCVM.WISC.EDU (Mike Brower) (08/28/87)

Dear Sir,
    I am sorry, but I am not Mike Brower. Mike has moved on in life. He has
graduated from Temple and no longer owns this account. This account is now
owned by Franky Choi. I am willing to accept interesting mail.

guy@gorodish.UUCP (08/31/87)

> Why do we all assume that C must be twisted and bent to fit the
> international environment?

Gee, *I* don't assume that.  Making the language support comments in foreign
languages doesn't seem too hard; with some more work (and cooperating linkers -
I suspect the UNIX linker can handle 8-bit characters in symbol names) you
could even have it support object names in foreign languages (but then again, a
hell of a lot of object names are in a foreign language NOW; quick, in what
natural language is "strpbrk" a word?).  It might even be possible to support
*keywords* in foreign languages - I'm told there are compilers for some
languages that do this - but the trouble with C is that a lot of "keywords" are
routine names, and it'd be kind of a pain to put N different translations of
"exit" into the C library.

As for making programs written in the language support foreign languages,
there are no massive changes to C required here, either.  Most of the support
can be done in library routines.  It is not *required* that "char" be increased
in size beyond one byte to support other languages, nor would it be *required*
that "strcmp" understand the collation rules for all languages.

> Are there *no* computer languages designed by Japanese for a Japanese
> environment (or Chinese or Arabic or Hindi or etc)?

Not that I know of.  There may be some, but I suspect they are VERY minor
languages.

> Perhaps it is time for one.

The trouble is that "one" wouldn't be enough!  You'd need languages for
Japanese *and* Chinese *and* Korean *and* Arabic *and* Hebrew *and* Hindi
*and*... if the languages were really designed for *that particular* language's
environment.

If this were the *only* way to write programs that can handle those languages,
you would have to write the same program N times over for all those
environments.  If you wanted your system to be sold in all those different
environments, you would either have to supply compilers for *all* those
languages or make the compiler suite be something selected on a per-country
basis.  I can't see how this would do anything other than impose costs that far
outweigh the putative benefits of such a scheme.

> (Not that I have anything against extending C to such an environment; I
> like C too.  But it's beginning to look as though the result of such
> attempts "ain't C", to coin a phrase.)

No, the "ain't C" phrase is properly applied only when something contradicts
some C standard, or perhaps when it grossly violates assumptions made by
reasonable C programmers.  There may be some problems with changing the mapping
between "char"s and bytes (problems caused by C's unfortunate conflation of the
notion of "byte", "very small integral type", and "character" into the type
"char" - they should have been separate types), but I see no contradiction or
gross violation in the internationalization proposals I've seen.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

gwyn@brl-smoke.UUCP (08/31/87)

In article <866@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
>Why do we all assume that C must be twisted and bent to fit the
>international environment?

First, I am not proposing that C be "bent and twisted".  I think it is
possible to cleanly address the needs of the international programming
community.  (I think my proposal for this was much cleaner than the one
that is likely to be adopted, but at least you can ignore the latter if
you're sure that you don't need to worry about such matters.)

Second, if you rephrase the question "Why do we have to consider the
international community?", the answer is:  Because ISO or JIS will come
up with something for THEIR version of the C standard if we don't come
up with it for OURS.  Having different standards, particularly if one
of them is likely to be unappealing to us, is a situation to be avoided.

You should also observe that most large companies you think of as U.S.-
based actually have a significant percentage of their market overseas.
They certainly feel the need for international standards.

fbaube@note.nsf.GOV (09/01/87)

[feed the line eater]

in article <650@gec-mi-at.co.uk> adam@gec-mi-at.co.uk writes:

> A Chinese word is made up of one or more characters. Each character
> has a value of one phoneme. There are about 5000 phonemes in common use,
> and there is an agreed 16-bit code for the most common phonemes.

If *characters* means *Chinese* characters, this is a very
simplified description of Chinese etymology.  Compound characters
are formed in different ways.  For some, the constituent
characters' phonetic values, taken together, equate to the
phonetic value of the word represented by the compound
character.  In other compound characters the constituent
characters combine in a concrete way (e.g. band + saw =>
bandsaw), and in others the combination is more abstract or
metaphorical (e.g. muck + raker => Woodstein).

> I've been on the net for
> a while now and I didn't know that comp.lang.c was short for
> comp.lang.chinese (though looking at that convoluted employment
> test had me wondering......)
>
>      -Adam.

It beats debating FORTRASH  :-)

peter@sugar.UUCP (Peter da Silva) (09/02/87)

> languages that do this - but the trouble with C is that a lot of "keywords" are
> routine names, and it'd be kind of a pain to put N different translations of
> "exit" into the C library.

Simple...

#include <japan.h>
-- 
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
--                  U   <--- not a copyrighted cartoon :->

henry@utzoo.UUCP (Henry Spencer) (09/02/87)

> I've been on the net for a while now and I didn't know that comp.lang.c was
> short for comp.lang.chinese...

No, you're thinking of comp.lang.apl!  (APL also being known, by those who
aren't too keen on it, as "Chinese BASIC".)
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry