dupuy@amsterdam.columbia.edu (Alexander Dupuy) (01/01/70)
In article <20131@ucbvax.BERKELEY.EDU> oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster) writes:
>There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
>meaning that the next 16-bit word had the 65534 next most common, and
>so on.
>
>That way, the average length of a run of chinese text is
>likely to be about 10 bits per ideogram, and any single ideogram would
>have canonical 64 bit representation: its bit pattern in the left of
>the 64 bits, including any nybble-shift, byte-shift, or word-shift bit
>patterns and padded out with filler nybbles.

This underscores the central tradeoff in a code for Chinese or Chinese/Japanese: compact representation to save disk space versus consistent (same character size) representation for processing.

But there is really no reason we have to trade these off against each other. We can just define a consistent representation for processing (24 or 32 bits will suffice - I don't think we need 64) and use a compression algorithm (Lempel-Ziv, Huffman, whatever, as long as it's standard and not too expensive to decode/encode) when we aren't manipulating individual characters. Some languages even have rudimentary forms of support for this (packed array of char vs. array of char in Pascal).

It's clear that operating system support has to be much better than it is now for there to be any hope of writing programs which are portable between Latin-only, Chinese/Japanese-only, and Chinese/Japanese/Latin environments. I don't see the programming language constructs as being the major problem.

@alex
---
arpanet: dupuy@columbia.edu
uucp:    ...!seismo!columbia, and i
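To make the processing/storage split concrete, here is a small C sketch. A fixed-width type serves for in-memory manipulation, and a byte-oriented variable-length code stands in for the real compressor (which could be Huffman, Lempel-Ziv, or anything standard). The type name, the code-point value, and the 7-bits-per-byte format are all invented for illustration; none of them come from any standard.

    #include <stdio.h>

    typedef unsigned long tchar;  /* fixed-width processing form, >= 32 bits */

    /* Store one code point as 1-5 bytes: 7 data bits per byte,
       with the high bit set on every byte except the last. */
    static int put_compact(tchar c, unsigned char *out)
    {
        int n = 0;
        do {
            out[n++] = (unsigned char)((c & 0x7F) | (c > 0x7F ? 0x80 : 0));
            c >>= 7;
        } while (c != 0);
        return n;
    }

    static int get_compact(const unsigned char *in, tchar *c)
    {
        int n = 0, shift = 0;
        *c = 0;
        do {
            *c |= (tchar)(in[n] & 0x7F) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        return n;
    }

    int main(void)
    {
        unsigned char buf[5];
        tchar back, code = 40000UL;   /* some ideogram's fixed-width code */
        int n = put_compact(code, buf);

        (void) get_compact(buf, &back);
        printf("%lu stored in %d byte(s), recovered as %lu\n", code, n, back);
        return 0;
    }

Common characters cost one or two bytes on disk, yet every character is a plain fixed-width tchar while being processed, which is exactly the separation argued for above.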
lambert@cwi.nl (Lambert Meertens) (01/01/70)
In article <970@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
) In article <51@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
) >There seems to be a good reason for [using Kanji]: after romanization,
) >words written differently in Kanji may become the same.
)
) The romanized form is phonetic, right? I presume that Japanese speakers can
) understand each other when conversing by telephone; doesn't this have the same
) level of ambiguity?

In some cases the intonation pattern (accent) may help to disambiguate words that are otherwise homonyms. Generally, spoken text has more clues to help interpretation than written text (sentence melody, pauses, stresses). It also tends to be more redundant anyway, and in a telephone conversation the other party continually signals whether they are still with you (which Japanese speakers tend to do much more strongly than English speakers).

Nevertheless, one native Japanese speaker told me that he expected to be able to figure out the meaning of a technical text written in, say, hiragana, on the condition that at least the word boundaries are marked (which is not done in normal Japanese writing). It seems to be a matter of preference rather than of strict necessity.
--
Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl
andersa@kuling.UUCP (Anders Andersson) (01/01/70)
In article <1384@ogcvax.UUCP> dinucci@ogcvax.UUCP (David C. DiNucci) writes:
>I do not know how the final Kanji is actually stored, but it could
>conceivably be stored as the sequence of hiragana followed by some
>special index telling which "view" of that sequence should be used
>when displaying the character. This would seem to take care of some
>of the problems discussed in this group. It could cause problems if

I'm not sure I understand exactly what problems this solution would take care of. How long can a hiragana word be? The entire word (plus the "view" index) would be the key when looking up the display bitmap, which seems pretty much like using any variable-bytesize character set to represent Kanji. Is this representation particularly suited to sorting and/or searching Japanese text?
--
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)
karl@haddock.UUCP (08/07/87)
[I probably should have included comp.std.internat earlier, but I didn't think of it. c.s.internat readers can pick up context from comp.lang.c if desired.]

In article <6216@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>>[For example,] on a bit-addressable machine in an Arabic- or Japanese-
>>language environment, one might have "short char" be 1 bit, "char" be 8,
>>and "long char" be 16.
>
>... I would prefer that a (char) be capable of holding an entire basic
>textual unit, since many applications are already based on that assumption.
>...might as well simply make (char) be the right thing and not introduce a
>new type. ... most international implementations could make (short char)
>8 bits and (char) or (long char) 16 bits.
>
>>If this is to be phased in without breaking a lot of programs, X3J11 should
>>immediately bless all three names, but insist that they all be the same size.
>>(Which restriction should be deprecated, to be removed in the next standard.)
>
>I don't think it's within the realm of practical politics to say that the
>problem will not be solved until the next issue of the standard.

The problem with your proposal is that it would break existing code that assumes sizeof(char) == 1. If a user wants to write a portable program that refers to objects smaller than 16 bits%, he can't use (short char) because existing compilers won't accept it, and he can't use (char) because new ones might make it too big. That's why I suggested the temporary restriction.

Also, in the world of international text processing I don't think we have all the questions yet, let alone the answers. I figure X3J11 should take care of one thing we do know (that "char" as commonly implemented nowadays won't suffice) and pave the way for a real fix later.

(Hmm. If I were a Japanese user, using a VAX, and I was told that, because Japanese characters require more than 8 bits, and because (char) is the obvious datatype for characters, and because C requires that nothing be smaller than (char), my compiler couldn't address individual bytes, then I think I'd start looking for a new vendor or a new programming language.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
%Assuming the implementation allows such an object to exist at all.
gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/08/87)
In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>The problem with your proposal is that it would break existing code that
>assumes sizeof(char) == 1.

Of course, such code is already broken in the international environment. In fact, in an 8-bit (char) implementation, such code would continue to work. In other words, something has to give for internationalized implementations; the question is what? With my proposal, sizeof(short char)==1, so there could be a transition period during which implementations would make sizeof(char)==sizeof(short char) until application source has been cleaned up. (Some developers have been careful not to rely on sizeof(char)==1 all along, anticipating the day when this assumption may have to be changed.)

>If I were a Japanese user, using a VAX, and I was told that, because
>Japanese characters require more than 8 bits, and because (char) is the
>obvious datatype for characters, and because C requires that nothing be
>smaller than (char), my compiler couldn't address individual bytes, then I
>think I'd start looking for a new vendor or a new programming language.

That's why something has to be done. As I reported recently, X3J11 has agreed in principle with Bill Plauger's proposal for a typedef letter_t and a few conversion-oriented functions, but NO library for letter_t analogous to the standard str*() routines. This necessitates source-level kludgery for any application for which portability into a multi-byte character environment is a possibility. I don't like that very much, but since I'm not expecting to sell software products to the Japanese I'll go along with it if the vendors think it will fly.

This seems to be another case of not wanting to do things technically correctly if that would require a radical change to previous practice. That's a legitimate concern, of course. If *I* were a Japanese programmer, I think I'd resent being treated as a second-class citizen by the programming language.
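For readers following along, the assumption at issue looks like this in everyday code. The program below is a made-up illustration, not from either posting; note that in the ANSI standard as eventually shaped, sizeof(char) is 1 by definition, which is exactly the rule these proposals debate relaxing.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *s = "kanji";

        /* The common idiom: silently assumes one char per allocation
           unit, i.e. sizeof(char) == 1. */
        char *p = malloc(strlen(s) + 1);

        /* The defensive spelling some developers have used all along,
           anticipating that the assumption might someday be relaxed. */
        char *q = malloc((strlen(s) + 1) * sizeof(char));

        if (p == NULL || q == NULL)
            return 1;
        strcpy(p, s);
        strcpy(q, s);
        printf("sizeof(char) here: %u\n", (unsigned) sizeof(char));
        free(p);
        free(q);
        return 0;
    }

Under any proposal where sizeof(char) could exceed 1, the first allocation undercounts and the second survives; that difference is the "cleanup" Gwyn's transition period would have to cover.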
kent@xanth.UUCP (Kent Paul Dolan) (08/09/87)
While we're developing nightmares about the number of bits the Japanese need in a char, remember for text processing that for 1 billion of the earth's residents, the smallest unit of text processing is the ideograph, and that even 21 bits is probably barely sufficient to represent the number of written words in Chinese. Anyone for 32 bit characters? I sure don't want 24 bit ones! ;-)

(Of course, one _could_ always write off the market, but a billion customers is rather a lot at which to turn up one's nose!)

Kent, the man from xanth.
gwyn@brl-smoke.ARPA (Doug Gwyn ) (08/10/87)
In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
>While we're developing nightmares about the number of bits the Japanese
>need in a char, remember for text processing that for 1 billion of the
>earth's residents, the smallest unit of text processing is the ideograph ...

I'm no expert, but I seem to recall that Chinese ideographs (which as I understand it come in several varieties) are pretty much made from a (relatively) small set of basic strokes placed in different positions. I think there are even Chinese typewriters, or at least type compositors. If this is correct, then one possibility would be to devise a suitable (acceptable to technical Chinese) representation for ideographs in terms of basic strokes and placement instructions, which could be treated as text units.

After all, the letter "w" doesn't mean much when taken out of English context; we too need the whole word-symbol, not just a letter-component, to express a meaning. It's just that our alphabet is simpler and is combined in 1 dimension instead of 2.
lambert@cwi.nl (Lambert Meertens) (08/10/87)
In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
)
) While we're developing nightmares about the number of bits the Japanese
) need in a char, remember for text processing that for 1 billion of the
) earth's residents, the smallest unit of text processing is the ideograph,
) and that even 21 bits is probably barely sufficient to represent the number
) of written words in Chinese.
Are you suggesting that there are more than 2**20 = 1048576 different
written words in Chinese? At typically 60 entries on a page, their
dictionaries must have then some 17500 pages or more. I think that 16 bits
are enough to accommodate all Chinese characters, and certainly ample for
the about 5000 that are in actual use.
--
Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl
dougg@vice.TEK.COM (Doug Grant) (08/10/87)
In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
>
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese. Anyone for 32 bit characters? I sure don't
> want 24 bit ones! ;-)

Great idea, Kent! But with so many characters and attributes in common usage, even 32 bits isn't enough for everyone to communicate exactly what they mean. I would like to propose an ASCII-compatible 64-bit character set (really!). Here's my suggestion for how to divvy up the bits:

    24 bits - character
     8 bits - font
     8 bits - size
     8 bits - color
     4 bits - intensity (boldness)
     2 bits - blink rate (00 = don't blink)
     1 bit  - normal/reverse
     8 bits - sync
     1 bit  - left over - any suggestions?

Here's how it would be ASCII-compatible: the eighth bit of the first byte received would be used as an ASCII/extended-character-set flag. If it is zero, the character is normal 7-bit ASCII. If it is 1, the next seven bytes are used to complete the eight-byte character. Only the eighth bit of the first byte is set to one - the eighth bit of the remaining seven bytes is set to zero, thus assuring that when "Extended Character Set" characters come in, their bytes can be kept in sync.

For those who say "but too much bandwidth would be used for 64-bit characters!" I say hang on - fiber optic communications are coming!

I'd sure like to see one standard character set that can accommodate the whole world!

Doug Grant   dougg@vice.TEK.COM
disclaimer: These opinions are my own, but my employer is welcome to adopt them.
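As a sketch only, the proposed layout could be written down as a C bit-field struct. The field order and packing shown here are just one possibility (bit-field layout is implementation-defined, and the struct may well be padded beyond 8 bytes); the field names are invented from the list above.

    #include <stdio.h>

    /* One possible packing of the proposed 64-bit character. */
    struct xchar {
        unsigned int character : 24;
        unsigned int font      : 8;
        unsigned int size      : 8;
        unsigned int color     : 8;
        unsigned int intensity : 4;
        unsigned int blink     : 2;   /* 00 = don't blink */
        unsigned int reverse   : 1;   /* normal/reverse   */
        unsigned int sync      : 8;
        unsigned int spare     : 1;   /* the leftover bit */
    };

    int main(void)
    {
        struct xchar c = {0};

        c.character = 'A';    /* plain ASCII fits in the low field */
        c.color = 0x1F;
        printf("declared bits: %d, sizeof(struct xchar) == %u bytes\n",
               24 + 8 + 8 + 8 + 4 + 2 + 1 + 8 + 1,
               (unsigned) sizeof(struct xchar));
        return 0;
    }

The declared fields do sum to 64 bits, which at least confirms the arithmetic of the proposal, whatever one thinks of blinking characters.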
guy%gorodish@Sun.COM (Guy Harris) (08/11/87)
> Are you suggesting that there are more than 2**20 = 1048576 different
> written words in Chinese? At typically 60 entries on a page, their
> dictionaries must have then some 17500 pages or more. I think that 16 bits
> are enough to accommodate all Chinese characters, and certainly ample for
> the about 5000 that are in actual use.

According to a document called "USMARC Character Set: Chinese Japanese Korean", from the Library of Congress, Washington, a 24-bit character was developed to "represent and store in machine-readable form all the Chinese, Japanese, and Korean characters used with the USMARC format." It says that the character sets incorporated into this character set (the RLIN - Research Libraries Information Network - East Asian Character Code, or REACC) are:

+ *Symbol and Character Tables of Chinese Character Code for Information Interchange*, vol. 1 and 2 (2nd ed., Nov. 1982) and *Variant Forms of Chinese Character Code for Information Interchange* (2nd ed., Dec. 1982) (CCCII). Editor: The Chinese Character Analysis Group. Total: 33,000 characters. REACC contains all of the 4,807 "most common" Chinese characters in volume 1 (as listed by the Ministry of Education in Taiwan) and about 5,000 of the 17,000 characters taken from a compilation of data from different computer centers (mostly personal names) in volume 2. REACC also contains about 3,000 of the approximately 11,000 characters in the CCCII *Variant Forms*, which lists PRC simplified forms and other variants, some of which are also used in modern Japanese.

+ *Code of Chinese Graphic Character Set for Information Interchange Primary Set: The People's Republic of China National Standard* (GB 2312-80) (1st ed., 1981). Total: 6,763 characters. All the characters in this set are in REACC.

+ *Code of the Japanese Graphic Character Set for Information Interchange: Japanese Industrial Standard* (JIS C 6226) (1983). Total: 6,349 characters. All the characters in this set are in REACC.

+ *Korean Information Processing System* (KIPS). Total: 2,392 Chinese characters and 2,058 Korean Hangul. Chinese characters in this set are in REACC; all Hangul are also incorporated in REACC, as well as some Hangul *not* in KIPS.

One characteristic of this character set is that it tries to permit a simple rule to get the codes for various variant forms of characters from the code for the traditional form of the character. So, while you can probably stuff the major Chinese characters into 16 bits (the CCCII, including variant characters, contains 33,000 characters), you may not want to.

Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.com
howard@COS.COM (Howard C. Berkowitz) (08/11/87)
In article <2034@xanth.UUCP>, kent@xanth.UUCP (Kent Paul Dolan) writes:
> While we're developing nightmares about the number of bits the Japanese
> need in a char, remember for text processing that for 1 billion of the
> earth's residents, the smallest unit of text processing is the ideograph,
> and that even 21 bits is probably barely sufficient to represent the number
> of written words in Chinese. Anyone for 32 bit characters? I sure don't
> want 24 bit ones! ;-)

I worked at the Library of Congress in the late 70's, and was responsible for the hardware and systems software aspects of experimental terminals for the 140 or so fonts (700 or so languages and dialects) in which the Library has materials. Chinese, of course, was the nightmare. Several authorities said we should assume about 50K distinct ideographs, but the language scholars in the Orientalia Division said 100K was a more correct number. When the outside experts challenged this, saying that the additional 50K appear in only esoteric documents used by very specialized scholars, Orientalia responded with "who do you think use the Orientalia collection at the Library of Congress?"

It developed, however, that the Chinese ideograph problem could be simplified. While there are a very large number of distinct ideographs, these ideographs are composed of a much smaller (<100) number of superimposed radicals. Chinese dictionaries use radicals as a means of lexical ordering. While I am out of touch with current research, it was felt at the time that Chinese (and full Japanese Kanji) could be approached by using a mixture of codes for common ideographs and escapes to strings of radicals (to be superimposed), or purely by radical strings.

When discussing the Oriental language problem, do distinguish the linguistic problem of ideograph uniqueness from the graphic problem of ideograph display. This differentiation is similar to the difference between a code and a cipher.
--
howard@cos.com (Howard C. Berkowitz)
{seismo!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER: I explicitly identify COS official positions.
john@frog.UUCP (John Woods, Software) (08/12/87)
In article <34@piring.cwi.nl>, lambert@cwi.nl (Lambert Meertens) writes:
I>In article <2034@xanth.UUCP> kent@xanth.UUCP (Kent Paul Dolan) writes:
N>)While we're developing nightmares about the number of bits the Japanese
C>)need in a char, remember for text processing that for 1 billion of the
L>)earth's residents, the smallest unit of text processing is the ideograph,
U>)and that even 21 bits is probably barely sufficient to represent the number
D>)of written words in Chinese.
E>
D>Are you suggesting that there are more than 2**20 = 1048576 different
>written words in Chinese? At typically 60 entries on a page, their
T>dictionaries must have then some 17500 pages or more. I think that 16 bits
E>are enough to accommodate all Chinese characters, and certainly ample for
X>the about 5000 that are in actual use.
T>

In the English dictionary that the documentation department here uses, there are 320,000 words. I am told that the Oxford English Dictionary has approaching 1,000,000 words, and that the total English language has just over 1,000,000 words. Chinese is probably about the same.

I can see asking the Chinese to adopt some limited alphabet scheme (such as Romaji used by the Japanese (if I remember correctly, a 3-Roman-character spelling for each syllable of Kanji), or perhaps Roman phonetic spelling), but telling them that some microscopic fraction of their language has to be selected for interaction with computers is just flatly bogus.

(a side note to provoke more chuckles than thought: are ideographs the CISCs of language? Perhaps that makes Morse code the RISC...)
--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"The Unicorn is a pony that has been tragically disfigured by radiation burns." -- Dr. Science
henry@utzoo.UUCP (Henry Spencer) (08/12/87)
> 8 bits - color
Surely you jest. Any color-graphics type will tell you that you need at
least 24 bits, maybe 36 or 48. :-)
More seriously, your all-inclusive scheme falls down like this in several
areas. 8 bits may not be enough for a font size when things like fractional
sizes come in (yes, there are fractional sizes). 8 bits certainly is not
enough for a font in demanding applications -- ever looked at a font catalog?
Finally, it's not common to change things like color and font from one
character to the next (unless one is a Mac user intoxicated with the joy of
font changing, sigh...), so a lot of those bits are being wasted. Better
to use some sort of font-switch (etc.) sequences, simultaneously giving
more compact coding and more flexibility.
--
Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"! | {allegra,ihnp4,decvax,utai}!utzoo!henry
peter@sugar.UUCP (Peter da Silva) (08/13/87)
In Japan programming languages are the least of the problems their written language causes them. An incredible amount of data is never stored anywhere but on the original form, photocopies of said form, or faxed copies of said form. Even with the best tools available it's just too hard to keypunch.

This, of course, makes it even more amazing that they have been so successful in the world community. It seems likely to me, though, that at some point they're going to have to break down and drop Kanji for professional use.
--
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
(I said, NO PHOTOS!)
pom@under..ARPA (Peter O. Mikes) (08/13/87)
To: henry@utzoo.UUCP
Subject: Re: What is a byte
Newsgroups: comp.lang.c,comp.std.internat
In-Reply-To: <8404@utzoo.UUCP>
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP>
Organization: S-1 Project, LLNL

In article <8404@utzoo.UUCP> you (Henry Spencer) write:
>font changing, sigh...), so a lot of those bits are being wasted. Better
>to use some sort of font-switch (etc.) sequences, simultaneously giving
>more compact coding and more flexibility.

This is an IMPORTANT IDEA :-|

The so-called 'Daisy Sort' - the sequence of characters on the printwheel - is optimized using the frequency of bigrams in the English language, in such a manner that characters which are frequent neighbours are near to each other (that makes for a faster printer). Now, if I recall correctly, about 90% of movements are within ten spokes' distance, and (another statistical fact) the special symbols and capitals are so rare that their spacing is irrelevant (except that digits tend to follow digits - so you place all digits next to each other).

=> It is very wasteful to store English text using ASCII.

Ergo: there are really just a few 'rational alternatives' for storing text:

1) 4 bits: sign + 3-bit distance in the sort (of an imaginary standard printwheel), with one code (0+000) reserved to mean: the following 4-bit word has another meaning (namely, e.g., a long jump, or a jump to another subset of the character set, such as switching the cases UPPER/lower, digits plus arithmetic signs, carriage motion controls, ...).

2) 6 bits: 1-bit sign + 1 bit (distance/font switch) + 4 bits (either the distance to the next character within a given sort, or one of 16 other subfont sorts).

3) ...

Naturally, languages such as C would have different statistics and should probably merit a special sort, which would be marked by a six(?) bit code at the beginning of the file/document (since the unix command 'file' would not work {it does not work too well anyway}), specifying the (type of the file) = (appropriate daisy sort), such as: english text, numerical data, PostScript file, c-source, ...

=> It is ALSO very wasteful to store numerical data sets using ASCII.

Of course, in the numerical-data character subset we need characters for overflow and undefined (NaN, Infinity, missing data point, end-of-file, end-of-row = end-of-vector, another-data-set, ...) and characters for the decimal point/comma, E, and a triplet separator, so that I can write 6_234,567 = 6_234.567 = 0.623_567_7E4 to mean six thousand and 234.567. (The decimal comma (European way) is preferred by the ISO SI standard, while the decimal point (US way) is tolerated.) And the current ISO triplet separator (namely blank, i.e. 1 000 for one thousand) MUST be changed (since blank is used in parsing). Perhaps 1_000=E3 (and 10 = E3.101?) or 1:000 = 1E3 (with /, only, being used for division?).

Actually, for speed of parsing it would be highly preferable to AVOID alphabetic separators (. and ,) and letters to express numbers. Perhaps we can write 3:456::4 to express three thousand four hundred fifty-six and four tenths, and perhaps 1:+3 = 1:000 (1E3) and 5:-3 for .005 (5e-3)? In any case, we should be able to express all numbers using sixteen digit-type characters: + - 0..9, (decimal sign), (exponent sign) (that's 13 or 14), and then perhaps either | or { } for c-style sets, and one triplet separator (e.g. : or _, not blank). We then can represent Infinity as ::: or +++ and NaN as +_+ etc.

Anyway, I just wanted to say that Henry's pertinent reminder - that character sets and the grouping of characters into sets (or subfonts) affect the compactness of information storage - really points a way to an objective measure of the suitability of different coding methods for different uses, and that several categories of use, namely

1) english text, or just any plain text (i.e. prose) (4 or 6 bits)
2) numerical data sets (i.e. number or point sets) (4 bits)
3) c, or just 'any programming language'
4) carriage motions (tabs, form feeds, cursor addressing, ...? modifiers (highlight, underline, typography...)?)
...?...

are frequent enough and universal enough to merit their own character families or subfonts, binary representations, and an international standard.
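A toy rendering of the distance idea in C: each character after the first is coded as a signed distance between positions on a frequency-ordered "wheel". The wheel ordering below is invented for the example (a real one would come from bigram statistics), and distances beyond +/-7 would use the escape code described above.

    #include <stdio.h>
    #include <string.h>

    /* An invented "print wheel": letters arranged so that frequent
       neighbours sit close together. */
    static const char wheel[] = "etaoinshrdlucmfwypvbgkjqxz";

    static int pos(int c)
    {
        const char *p = strchr(wheel, c);
        return p ? (int)(p - wheel) : -1;
    }

    int main(void)
    {
        const char *text = "ratio";
        int here = pos(text[0]), i;

        /* The first character would need an absolute code in a real
           scheme; here we just print its wheel position. */
        printf("'%c' starts at spoke %d\n", text[0], here);
        for (i = 1; text[i] != '\0'; i++) {
            int next = pos(text[i]);
            int d = next - here;            /* the signed distance to encode */
            printf("'%c' distance %+d %s\n", text[i], d,
                   (d >= -7 && d <= 7) ? "(fits in sign+3 bits)"
                                       : "(needs the escape code)");
            here = next;
        }
        return 0;
    }

For this (contrived) word every step fits in four bits, which is the effect the scheme is counting on for ordinary prose.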
henry@utzoo.UUCP (Henry Spencer) (08/14/87)
> In the English dictionary that the documentation department here uses, there
> are 320,000 words. I am told that the Oxford English Dictionary has
> approaching 1,000,000 words, and that the total English language has just
> over 1,000,000 words. Chinese is probably about the same.

Remember that the OED includes an awful lot of words that are obsolete or terminally obscure by anyone's standards. It is not a dictionary of current English. I would also wonder about the assumption that Chinese would be about the same size. I have heard it said that English has the largest vocabulary of any human language by a wide margin, because of its dominant position and its unusually extensive borrowing from other languages.
--
Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"! | {allegra,ihnp4,decvax,utai}!utzoo!henry
devine@vianet.UUCP (Bob Devine) (08/15/87)
> The so-called 'Daisy Sort' [...] points a way to an
> objective measure of suitability of different coding methods for different
> uses - and that several categories of use [text, programming code]
> are frequent enough and universal enough to merit their own
> character families or subfonts, binary representations
> and an international standard.

Unfortunately, it would be very difficult to write any general text-manipulating programs. With ASCII encoding it is very easy to yank sections out of random files and assemble a consistent file. The problems of dealing with a mixed-mode file will quickly eliminate the advantages gained in the single-mode case. Likewise, consider how a program like 'grep' would operate; it would need a switch to handle different representations of the same string.
oster@dewey.soe.berkeley.edu (David Phillip Oster) (08/15/87)
In article <8409@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Remember that the OED includes an awful lot of words that are obsolete or
>terminally obscure by anyone's standards. It is not a dictionary of current
>English.

That's part of the point. Would you support an encoding scheme that prevented me from using English documents, even those containing obsolete or obscure words, on my computer? Well, if we are going to standardize on an encoding for Chinese, it should be able to cover ALL of Chinese.

There is no reason why we couldn't use a huffman encoding scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th pattern is a filler, and the 16th pattern means that the next byte encodes the 254 next most common ideograms, the 255 bit pattern meaning that the next 16-bit word had the 65534 next most common, and so on. That way, the average length of a run of chinese text is likely to be about 10 bits per ideogram, and any single ideogram would have a canonical 64 bit representation: its bit pattern at the left of the 64 bits, including any nybble-shift, byte-shift, or word-shift bit patterns, and padded out with filler nybbles. Now, all we have to do is pick an ideogram frequency standard.

Say, this idea would also work for English. Assuming that the average English word takes 6*8 bits (average length of 5 + terminating space * 8 bit ascii) you could cut the disk space required for computer storage by a factor of close to 5 by using this encoding scheme. Too bad that you'd have a mammoth word list in main memory to unpack it speedily. Might be a nice way to increase the effective bandwidth of all those modems pushing UseNet around, though.
---
David Phillip Oster                --My Good News: "I'm a perfectionist."
Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
Uucp: {seismo,decvax,...}!ucbvax!oster%dewey.soe.berkeley.edu
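Here is one possible reading of that scheme as a decoding sketch in C. The rank arithmetic (where the byte-escaped and word-escaped ranges begin) is a guess at the intent, since the posting leaves those details open; the nybble stream is held one 4-bit value per array element purely for legibility.

    #include <stdio.h>

    /* Decode one symbol: nybbles 0-13 are the 14 most common ideograms,
       14 is filler, 15 escapes to a byte (two nybbles); byte value 255
       escapes further to a 16-bit word (four nybbles). */
    static long decode_one(const unsigned v[], int *i, int n)
    {
        long byte, word;

        if (*i >= n) return -2;                    /* end of stream */
        if (v[*i] < 14) return (long)v[(*i)++];    /* rank 0..13    */
        if (v[*i] == 14) { (*i)++; return -1; }    /* filler        */
        (*i)++;                                    /* 15: escape    */
        byte = v[*i] * 16L + v[*i + 1];
        *i += 2;
        if (byte != 255) return 14 + byte;         /* rank 14..267  */
        word = ((long)v[*i] << 12) | ((long)v[*i + 1] << 8)
             | ((long)v[*i + 2] << 4) | (long)v[*i + 3];
        *i += 4;
        return 14 + 254 + word;                    /* rank 268 up   */
    }

    int main(void)
    {
        /* common(5), filler, byte-escape for 3, word-escape for 0x0102 */
        unsigned s[] = { 5, 14, 15, 0, 3, 15, 15, 15, 0, 1, 0, 2 };
        int i = 0, n = (int)(sizeof s / sizeof s[0]);
        long r;

        while ((r = decode_one(s, &i, n)) != -2)
            if (r >= 0)
                printf("ideogram rank %ld\n", r);
        return 0;
    }

On the sample stream this prints ranks 5, 17, and 526: a one-nybble symbol, a byte-escaped symbol, and a word-escaped symbol, matching the three tiers of the proposal.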
guy@gorodish.UUCP (08/16/87)
> In Japan programming languages are the least of the problems their written
> language causes them. An incredible amount of data is never stored anywhere
> but on the original form, photocopies of said form, or faxed copies of said
> form. Even with the best tools available it's just too hard to keypunch.
>
> This, of course, makes it even more amazing that they have been so successful
> in the world community. It seems likely to me, though, that at some point
> they're going to have to break down and drop Kanji for professional use.

I don't know about that. More and more machines are adding support for Kanji. There are a large number of Japan-only (Japan-mostly? I seem to remember Jun Murai saying these groups were forwarded to Carnegie-Mellon) newsgroups in which most of the traffic is in Japanese, represented in Kanji. (He said they added Kanji support to X.10, including a "jterm" variant of "xterm" that emulated a Kanji terminal.) The NEC PC also includes Kanji support; it is often used as a Kanji terminal.

These machines may not be able to handle every single Kanji character, but the 90/10 rule may apply.

Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.com
lambert@mcvax.UUCP (08/16/87)
[I have removed comp.std.c from the Newsgroups line and added sci.lang.japan]
In article <479@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
) This, of course, makes it even more amazing that they have been so successful
) in the world community. It seems likely to me, though, that at some point
) they're going to have to break down and drop Kanji for professional use.
There seems to be a good reason for not doing this: after romanization,
words written differently in Kanji may become the same. Although
ambiguities caused by homonymy occur in all languages (like English "drill"
= 1. [the use of] a tool for boring holes, metaphorically also boring
exercise; 2. [[a tool for] sowing in] a furrow; 3. twilled cotton; 4. a
baboon), these seem nothing compared to what the Japanese would face.
For example, the word "kanji" itself can mean: 1. feeling, sensation,
impression; 2. Kanji (Chinese character); 3. manager, secretary;
4. inspector, superintendent; 5. smilingly. These are all written
differently now.
A particularly bad example: "ko^ka" = 1. Faculty of Engineering;
2. consideration of services; 3. a high price; 4. an official price;
5. overhead, elevated; 6. merits and demerits; 7. effect, efficiency;
8. descent, fall; 9. the marriage of an Imperial princess to a subject;
10. mineralization; 11. colloid degeneration, gelatination; 12. hardening,
cementation, vulcanization, stiffening; 13. hard money, cash; 14. a leave
of absence; 15. taxes; 16. an evil effect; 17. the Yellow Peril; 18. an
unfortunate slip of the tongue; 19. amalgamation; 20. a school song; 21. a
high-rise building.
This high degree of ambiguity is the combined result of two characteristics
of Japanese. One is that there are say 1850 Kanji characters in common
use, each having an independent semantic content and usually a one-syllable
"reading", the so-called On reading, derived from the Chinese
pronunciation. There may be more than one On reading and there are some
bisyllabic On readings. There is also a Kun (original Japanese) reading,
which is completely unrelated (like On = "chu^", Kun = "hiru"), and which
more often than not is polysyllabic, but most single syllables occur as a
Kun reading. I haven't counted them, but say there are about ninety
syllables for readings of these 1850 characters, so typically a single
syllable may be the reading of 20 different characters. The second
characteristic that is important here is the ease with which compound words
are formed in Japanese, often by stringing some Ons together. Thus, all
the different "ko^ka"s above are the result of combining a highly ambiguous
"ko^" with a highly ambiguous "ka", and there are hundreds of other
potential meanings for this compound than the few given above (culled from
a dictionary). Written in Kanji, there is no ambiguity.
Not all words are that ambiguous if spelled in Romaji, but glancing through
my dictionary I estimate that about one third to one half of the entries
have the same romanization as another entry, and the number of clusters of
four or five homonymous entries may be as many as one thousand (as I find
one on almost every page of 1000 pages, sometimes two or three).
It may be that I am overestimating the problem and that the context would
suffice well enough to disambiguate romanized Japanese to make it
acceptable for professional use. Perhaps a Japanese reader of this article
may care to comment.
--
Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl
henry@utzoo.UUCP (Henry Spencer) (08/17/87)
> >Remember that the OED includes an awful lot of words that are obsolete or
> >terminally obscure by anyone's standards. It is not a dictionary of current
> >English.
>
> That's part of the point. Would you support an encoding scheme that
> prevented me from using English documents, even those containing
> obsolete or obscure words, on my computer?...

Depends. The current encoding scheme (ASCII) already prevents you from using English documents beyond a certain age -- the English alphabet has changed! Just try typing in a document that uses the thorn (vertical stroke with a loop on the side) or the long 's' (like an integral sign). The lack of these symbols is a real nuisance to certain scholars, but somehow it doesn't interfere with most uses of ASCII.
--
Support sustained spaceflight: fight | Henry Spencer @ U of Toronto Zoology
the soi-disant "Planetary Society"! | {allegra,ihnp4,decvax,utai}!utzoo!henry
dinucci@ogcvax.UUCP (David C. DiNucci) (08/17/87)
<Rain-iitaa, tabenasai>

In article <piring.51> lambert@cwi.nl (Lambert Meertens) writes:
>In article <479@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>) This, of course, makes it even more amazing that they have been so successful
>) in the world community. It seems likely to me, though, that at some point
>) they're going to have to break down and drop Kanji for professional use.
>
>There seems to be a good reason for not doing this: after romanization,
>words written differently in Kanji may become the same. Although

A fairly important breakthrough was made in the area of Japanese word processing some years ago when it was realized that characters could be translated from a phonetic alphabet to Kanji after entering the word. Japanese word processors now allow the user to enter a word as one or more hiragana (one of the phonetic alphabets, with only about 55 basic characters), then, at the touch of a key, cycle through the Kanji corresponding to that pronunciation, starting with the most commonly used. The user stops when the desired Kanji appears, then continues with the next word.

I do not know how the final Kanji is actually stored, but it could conceivably be stored as the sequence of hiragana followed by some special index telling which "view" of that sequence should be used when displaying the character. This would seem to take care of some of the problems discussed in this group. It could cause problems if the "reference" dictionary (conversion tables from kana to Kanji) changed, but I don't think this is likely to happen often. A standard set of reference tables would need to be adopted (if it hasn't already been) if this were to actually be used as a data interchange format.

A Chinese word processor would be a different story, however, since I do not believe Chinese has a phonetic alphabet like Japanese's hiragana and katakana.

More info about a specific Japanese word processor is available to us English-only readers in a review of one for the Mac a year or so ago (I don't remember exactly which magazine it was in).

Disclaimer: I know only a little nihongo, but my wife is a native Japanese, and often uses a Japanese word processor made by Fujitsu.
--
Dave DiNucci            UUCP: ..ucbvax!tektronix!ogcvax!dinucci
Oregon Graduate Center  CSNET: dinucci@Oregon-Grad.csnet
Beaverton, Oregon
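If one wanted to experiment with the storage idea DiNucci describes, the stored record might look something like this in C. This is purely a sketch of the idea: the field names and sizes are invented, the kana are shown romanized for legibility, and no real word processor is claimed to work this way.

    #include <stdio.h>
    #include <string.h>

    /* One conceivable stored form: the phonetic (kana) spelling plus
       an index saying which Kanji "view" of that spelling to display. */
    struct jword {
        char kana[16];        /* phonetic spelling, romanized here   */
        unsigned char view;   /* which Kanji rendering of that sound */
    };

    /* Stand-in for the display step: a real system would look the
       (kana, view) pair up in the kana-to-Kanji conversion tables. */
    static void display(const struct jword *w)
    {
        printf("glyph(%s, #%u)\n", w->kana, w->view);
    }

    int main(void)
    {
        struct jword w;

        strcpy(w.kana, "kanji");
        w.view = 2;    /* say, the third Kanji offered for this reading */
        display(&w);
        return 0;
    }

Note that this layout makes Andersson's earlier question concrete: the sort/search key is the whole (kana, view) pair, variable-length in effect, rather than a single fixed-width character code.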
tony@artecon.UUCP (08/18/87)
In article <25736@sun.uucp>, guy%gorodish@Sun.COM (Guy Harris) writes:
>In article Peter da Silva writes:
> > In Japan programming languages are the least of the problems their written
> > language causes them. An incredible amount of data is never stored anywhere
> > but on the original form, photocopies of said form, or faxed copies of said
> > form. Even with the best tools available it's just too hard to keypunch.
> >
> > This, of course, makes it even more amazing that they have been so successful
> > in the world community. It seems likely to me, though, that at some point
> > they're going to have to break down and drop Kanji for professional use.
>
> I don't know about that. More and more machines are adding support for Kanji.
> There are a large number of Japan-only newsgroups in which most of the traffic
> is in Japanese, represented in Kanji. The NEC PC also includes Kanji support;
> it is often used as a Kanji terminal.
>
> These machines may not be able to handle every single Kanji character, but the
> 90/10 rule may apply.
> Guy Harris

Yes, it is true that Kanji is getting more support. Hewlett-Packard has a new drafting plotter (HP-7595) which has a Kanji option. The form of specification is that when you invoke the Kanji font, you go into a two-byte mode. That is, it takes two bytes to specify one Kanji character. Control bytes are used as control bytes, but the 94 printing bytes are used in the Kanji specification. So, 94 * 94 = 8836 different characters you can use. This is a good way of doing it, since you never know how your OS is going to muck with control codes or full 8-bit binary data going to I/O devices. I believe that this is a fairly standard way of doing this for printers.

8836 may not seem like a lot of Kanji (which is known to go to about 50000 in Japanese), but only 1850 are needed to graduate from high school, and usually about 3000 are used in college texts. There are two "JIS" standards set by the Japanese Ministry of Education. JIS level 1 is about 3000 characters (including the basic 1850, Katakana, Hiragana, the English alphabet, Cyrillic, Greek, and special symbols), and JIS level 2 is about 8000 (including the 3000 of JIS level 1). As a rule, one is supposed to try to stick to JIS level 1, but use JIS level 2 for proper names and just a few other exceptions.

So, in reply to the above:

1) You may not be able to handle all 50000 Kanji, but JIS level 2 is more than enough;
2) It really isn't that difficult to implement, because:
   a) It is a well-defined font, accessed easily in two-byte sequencing (you don't even need 8 bits, 7 will do);
   b) You can get already-masked ROMs which contain Kanji in rasterized form for raster printers;
   c) The Japanese are more than happy to help you implement Kanji in your products. They will digitize Kanji for whatever reasonable form you need it.
--
Tony

BTW, I am not Japanese...but.."I think I'm turning Japanese, I really think so!" "Konnichi-wa"
**************** Insert 'Standard' Disclaimer here: OOP ACK! *****************
* Tony Parkhurst -- {hplabs|sdcsvax|ncr-sd|hpfcla|ihnp4}!hp-sdd!artecon!adp   *
*                       -OR- hp-sdd!artecon!adp@nosc.ARPA                     *
*******************************************************************************
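The arithmetic of that two-byte mode is easy to check in C. Assuming the 94 printing bytes are the ASCII range 0x21-0x7E (the usual choice in such schemes; the HP manual is not quoted here), a two-byte sequence maps onto one of 8836 cells:

    #include <stdio.h>

    /* Map a two-byte Kanji sequence to a linear table index.  The
       linear numbering is one natural choice; the actual table layout
       is the vendor's business. */
    static int kanji_index(unsigned char b1, unsigned char b2)
    {
        if (b1 < 0x21 || b1 > 0x7E || b2 < 0x21 || b2 > 0x7E)
            return -1;                       /* not a printing byte */
        return (b1 - 0x21) * 94 + (b2 - 0x21);
    }

    int main(void)
    {
        printf("first cell   = %d\n", kanji_index(0x21, 0x21));  /* 0    */
        printf("last cell    = %d\n", kanji_index(0x7E, 0x7E));  /* 8835 */
        printf("control byte = %d\n", kanji_index(0x1B, 0x21));  /* -1   */
        return 0;
    }

Because both bytes stay in the printing range, control codes pass through untouched, which is the robustness property Tony describes.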
john@frog.UUCP (John Woods, Software) (08/18/87)
In article <8409@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes (and many others as well):
>>[English has] over 1,000,000 words. Chinese is probably about the same.

Many people (including Henry) have pointed out that (A) English is larger than most languages (having borrowed "one of everything" from everyone), and (B) Chinese ideographs are not one-per-word, but one-per-concept (hence most words are two or more ideographs). So, I went back to the source I first read about this topic in: "Multilingual Word Processing", Joseph D. Becker, Scientific American, July 1984.

In this article, he doesn't give an actual count of Chinese ideographs (just the statement "tens of thousands"), but in the "flexible encoding" he and other Xerox denizens developed (using alphabet shift-codes), to encode Chinese you send the "shift superalphabet (for 16-bit characters)" code, the 8-bit "superalphabet number", and then 16-bit character sequences. "The main superalphabet, designated 00000000, is all one needs except for very rare Chinese characters." A little later in the article is the implication that about 7000 ideographs are "commonly seen" in Chinese publishing. So, there we have it: not as bad as I thought, but still indicating that 8 bits is woefully inadequate.

Also, I seem to have slipped up in my understanding of Kanji: Kanji is the set of Chinese ideographs borrowed by the Japanese, of which "about 3500" are in common use (and the number is declining). The phonetic letters (which can spell words in their entirety, and are used to indicate grammatical endings for Kanji roots) are collectively called "kana", and come in two sets: "hiragana" and "katakana" (it is probably more complicated than that, but that is about all the article gives). There used to be Kanji "typewriters" which scarcely anyone used (using several hundred keys); now, computerised systems exist in which one can type phonetic hiragana symbols (or, for those who prefer, the Romaji phonetics), and press a "lookup key" to have the computer turn the just-typed word into proper Kanji.

The bibliography in that Scientific American says the following publications may be helpful:

_Writing Systems of the World_, Akira Nakanishi. Charles E. Tuttle, 1980.

"A Historical Study of Typewriters and Typing Methods: From the Position of Planning Japanese Parallels", Hisao Yamada, in _Journal of Information Processing_, Vol. 2, No. 4, pp. 175-202; February 1980.

Can we all now consider the statement "7 bits is enough" most sincerely dead?
--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

"The Unicorn is a pony that has been tragically disfigured by radiation burns." -- Dr. Science
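The shift-code idea can be sketched as a little state machine in C. The specific code values below are invented (Becker's article describes the concept, not a byte-level format), and a real design would have to keep shift bytes from colliding with the bytes of 16-bit characters, which this toy does not attempt.

    #include <stdio.h>

    /* Hypothetical shift codes; real values would come from the spec. */
    enum { SHIFT_SUPER = 0xFE, SHIFT_BACK = 0xFF };

    static void scan(const unsigned char *s, int n)
    {
        int i = 0, wide = 0;    /* wide: inside a 16-bit superalphabet */

        while (i < n) {
            if (s[i] == SHIFT_SUPER) {       /* shift + superalphabet no. */
                printf("shift to superalphabet %d\n", s[i + 1]);
                wide = 1;
                i += 2;
            } else if (s[i] == SHIFT_BACK) { /* back to 8-bit characters */
                wide = 0;
                i += 1;
            } else if (wide) {               /* one 16-bit character */
                printf("16-bit char %#06x\n", (s[i] << 8) | s[i + 1]);
                i += 2;
            } else {
                printf("8-bit char '%c'\n", s[i]);
                i += 1;
            }
        }
    }

    int main(void)
    {
        const unsigned char text[] =
            { 'A', SHIFT_SUPER, 0x00, 0x4E, 0x2D, SHIFT_BACK, 'B' };

        scan(text, (int)(sizeof text));
        return 0;
    }

The attraction of the scheme is visible even in the toy: runs of text in one alphabet cost one byte (or two) per character, and the shift overhead is paid only at alphabet boundaries.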
karl@haddock.ISC.COM (Karl Heuer) (08/19/87)
In article <51@piring.cwi.nl> lambert@cwi.nl (Lambert Meertens) writes:
>There seems to be a good reason for [using Kanji]: after romanization,
>words written differently in Kanji may become the same. [Gives examples of
>Japanese homonymy]

The romanized form is phonetic, right? I presume that Japanese speakers can understand each other when conversing by telephone; doesn't this have the same level of ambiguity?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
peterg@felix.UUCP (Peter Gruenbeck) (08/20/87)
---------------
Disclaimer: I may not know what I'm talking about. Batteries not included.
---------------

I have difficulty getting used to the idea of a 32 bit byte. What would happen to the nybble (half a byte - get it?)? Would we be biting off more than we could chew?

I would be in favor of leaving a byte as 8 bits and using the term WORD to represent a unit of addressable memory. That way we limit the confusion of how many bits something has to a term which is already confusing. For example: many small computers (6502, 808x, 68000) have a word = 8 bits. Some older mainframe systems (IBM 370, Cyber, DEC) have word lengths of 32 bits, 60 bits, and 12 bits. Some specialized engines (Itel 370-compatible) have 80 bit words for the microcode interpreters. Also, some PC ram disks may be considered to have 128 byte words, since that is what you address and then take the rest sequentially. New machines to handle the complex multi-language problem may have a 32 bit word if that is what a character takes.

Maybe we should define a new term called a CHAR to define the number of bits required to represent a character. I'm told I'm quite a character myself at times. I suspect you'll need more than 32 bits to define me (I hope).

This is not to say that this is the final WORD.
mouse@mcgill-vision.UUCP (08/23/87)
In article <6252@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn) writes:
> In article <899@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
[stuff about assuming sizeof(char)==1]
[>> uses an example of a Japanese programmer having problems]
> If *I* were a Japanese programmer, I think I'd resent being treated
> as a second-class citizen by the programming language.

If you insist on taking a language designed in the English world for the English world and using it in Japan, it wouldn't surprise me a bit if it made a poor showing. Why do we all assume that C must be twisted and bent to fit the international environment? Are there *no* computer languages designed by Japanese for a Japanese environment (or Chinese or Arabic or Hindi or etc)? Perhaps it is time for one. (Not that I have anything against extending C to such an environment; I like C too. But it's beginning to look as though the result of such attempts "ain't C", to coin a phrase.)

der Mouse (mouse@mcgill-vision.uucp)
guy@gorodish.UUCP (08/31/87)
> Why do we all assume that C must be twisted and bent to fit the
> international environment?

Gee, *I* don't assume that. Making the language support comments in foreign languages doesn't seem too hard; with some more work (and cooperating linkers - I suspect the UNIX linker can handle 8-bit characters in symbol names) you could even have it support object names in foreign languages (but then again, a hell of a lot of object names are in a foreign language NOW; quick, in what natural language is "strpbrk" a word?). It might even be possible to support *keywords* in foreign languages - I'm told there are compilers for some languages that do this - but the trouble with C is that a lot of "keywords" are routine names, and it'd be kind of a pain to put N different translations of "exit" into the C library.

As for making programs written in the language support foreign languages, there are no massive changes to C required here, either. Most of the support can be done in library routines. It is not *required* that "char" be increased in size beyond one byte to support other languages, nor would it be *required* that "strcmp" understand the collation rules for all languages.

> Are there *no* computer languages designed by Japanese for a Japanese
> environment (or Chinese or Arabic or Hindi or etc)?

Not that I know of. There may be some, but I suspect they are VERY minor languages.

> Perhaps it is time for one.

The trouble is that "one" wouldn't be enough! You'd need languages for Japanese *and* Chinese *and* Korean *and* Arabic *and* Hebrew *and* Hindi *and*... if the languages were really designed for *that particular* language's environment. If this were the *only* way to write programs that can handle those languages, you would have to write the same program N times over for all those environments. If you wanted your system to be sold in all those different environments, you would either have to supply compilers for *all* those languages or make the compiler suite something selected on a per-country basis. I can't see how this would do anything other than impose costs that far outweigh the putative benefits of such a scheme.

> (Not that I have anything against extending C to such an environment; I
> like C too. But it's beginning to look as though the result of such
> attempts "ain't C", to coin a phrase.)

No, the "ain't C" phrase is properly applied only when something contradicts some C standard, or perhaps when it grossly violates assumptions made by reasonable C programmers. There may be some problems with changing the mapping between "char"s and bytes (problems caused by C's unfortunate conflation of the notions of "byte", "very small integral type", and "character" into the type "char" - they should have been separate types), but I see no contradiction or gross violation in the internationalization proposals I've seen.

Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.com
gwyn@brl-smoke.UUCP (08/31/87)
In article <866@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
>Why do we all assume that C must be twisted and bent to fit the
>international environment?

First, I am not proposing that C be "bent and twisted". I think it is possible to cleanly address the needs of the international programming community. (I think my proposal for this was much cleaner than the one that is likely to be adopted, but at least you can ignore the latter if you're sure that you don't need to worry about such matters.)

Second, if you rephrase the question as "Why do we have to consider the international community?", the answer is: because ISO or JIS will come up with something for THEIR version of the C standard if we don't come up with it for OURS. Having different standards, particularly if one of them is likely to be unappealing to us, is a situation to be avoided.

You should also observe that most large companies you think of as U.S.-based actually have a significant percentage of their market overseas. They certainly feel the need for international standards.
peter@sugar.UUCP (Peter da Silva) (09/02/87)
> languages that do this - but the trouble with C is that a lot of "keywords" are
> routine names, and it'd be kind of a pain to put N different translations of
> "exit" into the C library.

Simple...

    #include <japan.h>
--
-- Peter da Silva `-_-' ...!seismo!soma!uhnix1!sugar!peter
-- U <--- not a copyrighted cartoon :->