jwb@monu6.cc.monash.edu.au (Jim Breen) (05/10/91)
Archive-name: text/japanese/jdic/1991-04-26
Archive: monu6.cc.monash.edu.au:/pub/Nihongo/jdic10.zoo [130.194.32.106]
Original-posting-by: jwb@monu6.cc.monash.edu.au (Jim Breen)
Original-subject: PC Japanese/English Dictionary Display Program
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)

If anyone is interested, I have written a PC program capable of fast
ascii/kana searches and displays of a dictionary file. It is oriented
around using the "EDICT" dictionary in MOKE, but would probably work
for any line-oriented dictionary file.

It is in the pub/Nihongo directory in the anonymous FTP area of
monu6.cc.monash.edu.au (130.194.32.106), along with the MOKE1.1 files.
Look for "jdic10.zoo".

I have included the ".doc" file in this posting.

Jim Breen

=====================================================================

J D I C

Simple English Japanese Dictionary Display
Version 1.0
======================================================

Introduction
------------

This program provides a simple English/Japanese (kana & kanji) display
of selected entries of a dictionary file. While it will work (more or
less) with any text file containing a mix of Japanese and English
words, it has been designed to operate on a dictionary in the "EDICT"
format used by the MOKE (Mark's Own Kanji Editor) Japanese text editor.
All the Japanese is in kana and kanji, so if you cannot read at least
hiragana and katakana, give up now.

I wrote this program to gain experience in handling and displaying the
Japanese character set, and to exploit the dictionary that came with my
copy of MOKE. I also wanted to brush up my C skills. I make no claims
for it, but I am pleased with how it turned out.

The executable code and documentation are being released to the "public
domain". All usage of this program is at the user's risk, and there is
no warranty on its performance. I will consider releasing the source
(if anyone is actually interested in it) at a later date.
I welcome suggestions, comments and constructive criticism.

Operation
---------

JDIC must operate on a PC or AT with a graphics card. It has been
written using Turbo C 2.0, and has been tested on VGA, CGA and HERC
cards. Auto-detection is used to determine the type of graphics card.

In this release it must operate from a directory containing the BGI
files, the EDICT file, and the 16-bit JIS font files "k16jis1.fnt" and
"k16jis2.fnt". These font files are not included in this distribution.
If you use MOKE, you have them already. If not, you will need to track
them down at one of the FTP sites.

JDIC also needs an index file "EDICT.JDX". If it is not present it will
be created. JDIC saves the length of the EDICT file in EDICT.JDX, and
if it detects that the size of EDICT has changed, it will insist on
recreating the index file.

Operation is very simple. At the "Enter Search String:" prompt, type a
few letters from the *start* of the word(s) you are seeking. JDIC does
not match on strings in the middle of words. The scan is
case-insensitive.

You will notice an "(A)" in the top left-hand corner of the screen.
This indicates that you are entering search strings in ascii (i.e. in
English). Pressing F3 cycles between ascii, hiragana and katakana.
(Why F3? Well, that is the key that MOKE uses for this function.)

To enter a search string in kana, type it in romaji and it will be
converted to kana for the search. The romaji->kana translation is as
similar to that used in MOKE as I can make it, i.e. for a small "tsu"
you can type "shippai" or "shit-pai", and for "n" you can type "n'" if
necessary. Most of the time just typing ordinary Hepburn or kunrei
romaji works. In fact, I have made the matching of kana strings
insensitive to whether they are katakana or hiragana. The ONE
difference between them is that typing a "-" in hiragana gets a "u",
and in katakana gets a "-", just as in MOKE.

The display is in "dictionary" order for the words matched, i.e.
alphabetical for the ascii search, and EUC/kana for the kana search.

Dictionary
----------

Clearly, to be of any use, JDIC must have a reasonably good dictionary.
Unfortunately there are no good machine-readable dictionary files in
the public domain yet. Included with this distribution is the tiny
EDICT file from MOKE 1.1 (the shareware version). There is a bigger,
but still rather limited, EDICT supplied with the MOKE 2.0 release;
however, Mark Edwards, the author, has not placed it in the public
domain. I am compiling a supplement to MOKE (2.0)'s EDICT which will
fill in the gaps, but unless you buy MOKE 2.0 (after all, it's only
$US50) you will miss out on a lot.

(If anyone feels like contributing to a public domain dictionary in
EDICT format, I am willing to collate and distribute it. Just mail me
the pieces.)

The dictionary file must use the "EUC" coding for Japanese characters.
MOKE's EDICT does this, so that was the coding I adopted in JDIC. Files
using JIS codings can be converted to EUC using MOKE itself, or Ken
Lunde's "JIS.C" program.

The format of each entry of EDICT is:

    Japanese [yomikata] /english1/english2/..../

If the word is in kana alone, the yomikata is omitted.

Technical
---------

JDIC holds the complete dictionary in RAM, along with the first 3490
bitmaps of the JIS character set and the index table. The index table
contains an entry for each word in the dictionary, sorted in alpha/kana
order. This enables a fast search to be done, and the display to be in
alphabetical order by keyword. Common words like "of", "to", "the",
etc. and grammatical terms like "adj", "vi", "vt", etc. are not
indexed.

If a kanji is required that is not in the ~3000 most common ones, it is
read from disk into a circular cache buffer. This happens rarely.

JDIC can cope with dictionaries up to about 250 kbytes (MOKE's EDICT is
about 60k). If I ever need to handle a bigger one I can easily leave
the dictionary on disk.
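
As an illustration of how a sorted index table allows a fast
start-of-word search, here is a minimal sketch in C. It is not taken
from JDIC's source, which has not been released. The names
jdx_first_match and cmp_prefix, and the flat array of keyword pointers,
are assumptions made for the example. It assumes the keys are sorted
byte-wise after folding ASCII case, and it does not attempt the
hiragana/katakana folding that JDIC performs.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compare the first n bytes of key against prefix, ignoring ASCII case.
 * The caller passes n == strlen(prefix), so a key shorter than the
 * prefix compares as smaller (its terminating 0 byte sorts first).
 * Bytes with the high bit set (e.g. EUC kana) are compared as-is. */
static int cmp_prefix(const char *key, const char *prefix, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        int a = tolower((unsigned char)key[i]);
        int b = tolower((unsigned char)prefix[i]);
        if (a != b)
            return a - b;
    }
    return 0;
}

/* Hypothetical lookup: binary search for the first key that starts
 * with prefix. Returns its position, or count if nothing matches. */
size_t jdx_first_match(const char *const *keys, size_t count,
                       const char *prefix)
{
    size_t n = strlen(prefix);
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cmp_prefix(keys[mid], prefix, n) < 0)
            lo = mid + 1;       /* first match lies to the right */
        else
            hi = mid;           /* mid matches or is past the match */
    }
    if (lo < count && cmp_prefix(keys[lo], prefix, n) == 0)
        return lo;
    return count;
}
```

With keys sorted as dictionary, dog, Japan, Japanese, kana, kanji, a
search for "jap" returns position 2. A display routine could then walk
forward from that position for as long as cmp_prefix still returns 0,
which yields the matches already in dictionary order.
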
With an on-disk dictionary, the parsing and sort to set up the index
table would be slower, but the searching would still be quite fast.

Next Version
------------

If I ever get to another version, I will look at the moke.rc file (if
any) and/or at the same environment variables as MOKE to establish the
path(s) to the ".bgi" and ".fnt" files.

I hope that eventually I can compile a big enough dictionary that I am
forced to scan it from disk.

Acknowledgements
----------------

I wrote about two-thirds of this program. Great lumps of it were lifted
with minor modifications from "KD" (Kanji Driver), which was written by
Izumi Ohzawa at Berkeley, in particular the JIS handling module
(kjis.c), which was a port of "jis.pas" by Seiichi Nomura and Seke Wei.

Ken Lunde's "japan.inf" and his elegant "jis.c" explained the workings
of EUC and old/new JIS codes.

Mark Edwards' MOKE remains the tour de force in this field, and an
inspiration for us all. I regard JDIC as a humble and minor accessory
to MOKE. (I use tables lifted from two of the ".hlp" files in MOKE to
drive the romaji->kana code.)

Jim Breen
Department of Robotics & Digital Technology
Monash University
Melbourne, Australia
(jwb@monu6.cc.monash.edu.au)
April 1991

--
Jim Breen                        AARNet:jwb@monu6.cc.monash.edu.au
Department of Robotics & Digital Technology. Monash University.
PO Box 197 Caulfield East VIC 3145 Australia
(ph) +61 3 573 2552 (fax) +61 3 573 2745  JIS: ジム　ブリーン

-- comp.archives file verification
monu6.cc.monash.edu.au
-rw-r--r--  1 886  729  83701 Apr 26 10:02 /pub/Nihongo/jdic10.zoo
found jdic ok
monu6.cc.monash.edu.au:/pub/Nihongo/jdic10.zoo