[comp.arch] Kanji processing

jaw@eos.UUCP (James A. Woods) (03/03/88)

From article <1460@thorin.cs.unc.edu>, by ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi):
> There are a few variations for encoding Kanji (unbounded, but about 7k-8k for
> business use), Kanas (2 different sets of about 60 each), and alphanumerics
> intermixed. Scanning these 8-bit (ASCII) / 16-bit (JIS; Japanese Industrial
> Standard) mixed strings is a pain (a simple FSM, sometimes).  You need a new
> set of C string libraries. Also, in these codes, bit 7 (the 8th bit) is used,
> which is messy with some UNIX utilities.  Want to try Boyer-Moore string
> matching with an alphabet size of 10K?  It will be an interesting exercise.
> 
The last release of [ef]?grep a year ago (in the source archives) adapts
the Boyer-Moore-Gosper algorithm to Kanji, at least for simple strings.
It does *not* require extending the precomputed BMG tables to deal with
16-bit entities (large alphabets).  One simply searches for a 16-bit
Kanji character as if it were two consecutive 8-bit bytes, which it
is, modulo the SS1 and SS2 codes.  Though the "obvious" Kanji software
would process codes serially to disambiguate things, BMG instead does
partial matches on known content first, and only then backs up to process
the ASCII/Kanji delimiters.
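
To make the idea concrete, here is a rough Python sketch; it is not the
[ef]?grep code.  It assumes a simplified EUC-style encoding in which ASCII
bytes have the high bit clear and every Kanji is exactly two bytes with the
high bit set, and it ignores the single-shift codes entirely.  It also uses
the simpler Horspool variant of Boyer-Moore rather than BMG.  The names
horspool_table, is_char_boundary, and find_kanji are invented for this
example.

```python
def horspool_table(pattern: bytes) -> list[int]:
    """Bad-character shifts over the 256 byte values (not 10K characters)."""
    m = len(pattern)
    shift = [m] * 256
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    return shift


def is_char_boundary(text: bytes, pos: int) -> bool:
    """Back up from pos over the run of high-bit bytes.

    Assumption: two-byte characters only, so an even-length run before
    pos means pos starts a character rather than splitting one.
    """
    run = 0
    i = pos - 1
    while i >= 0 and text[i] & 0x80:
        run += 1
        i -= 1
    return run % 2 == 0


def find_kanji(text: bytes, pattern: bytes) -> int:
    """Return the offset of the first aligned match, or -1 if none."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = horspool_table(pattern)
    pos = 0
    while pos + m <= n:
        # Match on raw bytes first; check alignment only on a candidate hit.
        if text[pos:pos + m] == pattern and is_char_boundary(text, pos):
            return pos
        pos += shift[text[pos + m - 1]]
    return -1


# Two Kanji (A1 B2)(C3 D4) embedded in ASCII.
text = b"ab" + bytes([0xA1, 0xB2, 0xC3, 0xD4]) + b"cd"
aligned = find_kanji(text, bytes([0xC3, 0xD4]))      # a real character
straddling = find_kanji(text, bytes([0xB2, 0xC3]))   # spans two characters
```

Here the search for C3 D4 finds the second Kanji at offset 4, while B2 C3,
which matches bytewise at offset 3 but straddles two characters, is rejected
by the backup check and yields -1.  The table stays at 256 entries whatever
the size of the character set.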

> Chinese and several other languages need larger character sets, too.
> If you want to export computers and operating systems to these countries,
> you had better not forget (along with I/O devices, of course. Who buys a
> business computer which prints bills in unreadable characters?)
> 
I wholeheartedly agree.  Any vendor who ignores Kanji in their Unix
port is conceding half the market.  But whether RISC machines need
special instructions for Kanji is still an open question.

TRON or Hitachi G-series (64-bit chip) designers are certainly
welcome to post information about any such architectural considerations.

ames!jaw