[comp.archives] [sci.lang.japan] PC Japanese/English Dictionary Display Program

jwb@monu6.cc.monash.edu.au (Jim Breen) (05/10/91)

Archive-name: text/japanese/jdic/1991-04-26
Archive: monu6.cc.monash.edu.au:/pub/Nihongo/jdic10.zoo [130.194.32.106]
Original-posting-by: jwb@monu6.cc.monash.edu.au (Jim Breen)
Original-subject: PC Japanese/English Dictionary Display Program
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)

If anyone is interested, I have written a PC program capable of
fast ascii/kana searches and displays of a dictionary file. It is
oriented around using the "EDICT" dictionary in MOKE, but would
probably work for any line-oriented dictionary file.

It is in the pub/Nihongo directory in the anonymous FTP area of
monu6.cc.monash.edu.au (130.194.32.106), along with the MOKE1.1
files. Look for "jdic10.zoo".


I have included the ".doc" file in this posting.

Jim Breen

=====================================================================

J D I C

Simple English Japanese Dictionary Display Version 1.0
======================================================

Introduction
------------

This program provides a simple English/Japanese (kana & kanji)
display of selected entries of a dictionary file. While it will work
(more or less) with any text file containing a mix of Japanese and
English words, it has been designed to operate on a dictionary in
the "EDICT" format used by the MOKE (Mark's Own Kanji Editor) Japanese 
text editor. All the Japanese is in kana and kanji, so if you cannot read 
at least hiragana and katakana, give up now.

I wrote this program to gain experience in handling and displaying the
Japanese character set, and to exploit the dictionary that came with
my copy of MOKE. I also wanted to brush up my C skills. I make no claims
for it, but I am pleased with how it turned out. The executable code and
documentation are being released to the "public domain". All usage of this
program is at the user's risk, and there is no warranty on its performance.
I will consider releasing the source (if anyone is actually interested in 
it) at a later date.

I welcome suggestions, comments and constructive criticism.

Operation
---------

JDIC must operate on a PC or AT with a graphics card. It has been written 
using Turbo C 2.0, and has been tested on VGA, CGA and HERC cards. 
Auto-detection is used to determine the type of graphics card. It must, 
in this release, operate from a directory containing the BGI files, 
the EDICT file, and the 16-bit JIS font files "k16jis1.fnt" and
"k16jis2.fnt". These latter files are not included in this distribution.
If you use MOKE, you have them already. If not, you will need to track them
down at one of the FTP sites.

JDIC also needs an index file "EDICT.JDX". If it is not present, it will be
created. JDIC saves the length of the EDICT file in EDICT.JDX, and if it 
detects that the size of EDICT has changed, it will insist on recreating
the index file.
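
As an illustration only (this is not JDIC's code, which is written in
Turbo C and has not been released), the size check could be sketched in
Python as below. The layout of EDICT.JDX is not described in this
document, so the 4-byte little-endian length header and both function
names are assumptions.

```python
import os
import struct

def index_is_stale(dict_path, index_path):
    """Return True if the index is missing or records a different dictionary size."""
    if not os.path.exists(index_path):
        return True
    with open(index_path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return True  # truncated index: rebuild it
    (saved_size,) = struct.unpack("<I", header)
    return saved_size != os.path.getsize(dict_path)

def write_index_header(dict_path, index_path):
    """Hypothetical helper: store the dictionary size at the start of the index."""
    with open(index_path, "wb") as f:
        f.write(struct.pack("<I", os.path.getsize(dict_path)))
```

A size comparison is cheap but will miss an edit that leaves the file
length unchanged.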

Operation is very simple. At the "Enter Search String:" prompt, type
a few letters from the *start* of the word(s) you are seeking. JDIC does 
not match on strings in the middle of words. The scan is case-insensitive.

You will notice an "(A)" in the top left-hand corner of the screen. This is
to indicate you are entering search strings in ascii (i.e. in English).
Pressing F3 cycles between ascii, hiragana and katakana. (Why F3? Well,
that is the key that MOKE uses for this function.) To enter a search
string in kana, type it in romaji and it will be converted to kana for
the search. The romaji->kana translation is as similar to that used in
MOKE as I can make it, e.g. for a small "tsu" you can type "shippai" or
"shit-pai", and for a syllabic "n" you can type "n'" if necessary. Most of
the time just typing ordinary Hepburn or kunrei romaji works.
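
A rough Python sketch of this kind of table-driven romaji->kana
conversion follows. It is not the MOKE or JDIC table: the tiny KANA
table, the function name, and the reading of "t-" as a small "tsu"
(taken from the "shit-pai" example above) are assumptions for
illustration.

```python
# Deliberately tiny table; a real one covers the whole syllabary.
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "shi": "し", "si": "し", "tsu": "つ", "tu": "つ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "pa": "ぱ", "pi": "ぴ", "pu": "ぷ", "pe": "ぺ", "po": "ぽ",
}

def romaji_to_hiragana(text):
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        # "t-" or a doubled consonant stands for a small tsu.
        if text.startswith("t-", i):
            out.append("っ")
            i += 2
            continue
        if i + 1 < len(text) and text[i] == text[i + 1] and text[i] not in "aiueon'":
            out.append("っ")
            i += 1
            continue
        # Otherwise take the longest romaji chunk found in the table.
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += size
                break
        else:
            if text[i] != "n":
                raise ValueError("cannot convert %r" % text[i:])
            # A bare "n" is the syllabic n; an apostrophe after it is consumed.
            out.append("ん")
            i += 1
            if text.startswith("'", i):
                i += 1
    return "".join(out)
```

With this table, "shippai" and "shit-pai" both give しっぱい, while
"kinen" gives きねん and "kin'en" gives きんえん, which is why the "n'"
form is occasionally needed.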

In fact, I have made the matching of kana strings insensitive to whether they
are katakana or hiragana. The ONE difference between them is that typing a 
"-" in hiragana gets a "u", and in katakana gets a "-", just as in MOKE.
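
One way to get this kind of kana-insensitive matching, sketched in
Python, is to fold katakana onto hiragana before comparing. In EUC the
hiragana row has lead byte 0xA4 and the katakana row 0xA5, with matching
second bytes up to 0xF3. Whether JDIC does it this way is not stated;
the function name is made up, and the sketch assumes the input contains
only ASCII and two-byte JIS X 0208 characters.

```python
def fold_kana(euc):
    """Map EUC katakana (lead byte 0xA5) onto hiragana (lead byte 0xA4)."""
    out = bytearray()
    i = 0
    while i < len(euc):
        b = euc[i]
        if b >= 0x80 and i + 1 < len(euc):
            # Two-byte character: inspect the lead byte only.
            lead, trail = b, euc[i + 1]
            if lead == 0xA5 and trail <= 0xF3:
                lead = 0xA4  # bytes above 0xF3 have no hiragana twin
            out += bytes((lead, trail))
            i += 2
        else:
            out.append(b)  # plain ASCII byte
            i += 1
    return bytes(out)
```

Two keys then match whenever fold_kana(a) == fold_kana(b).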

The display is in "dictionary" order for the words matched, i.e. alphabetical
for the ascii search, and EUC/kana for the kana search.

Dictionary
----------

Clearly to be of any use, JDIC must have a reasonably good dictionary.
Unfortunately there are no good machine readable dictionary files in
the public domain yet. Included with this distribution is the tiny
EDICT file from MOKE 1.1 (the shareware version). There is a bigger,
but still rather limited, EDICT supplied with the MOKE 2.0 release; however,
Mark Edwards, the author, has not placed it in the public domain. I am
compiling a supplement to MOKE (2.0)'s EDICT which will fill in the gaps, 
but unless you buy MOKE 2.0 (after all, it's only $US50) you will miss
out on a lot. (If anyone feels like contributing to a public domain 
dictionary in EDICT format, I am willing to collate and distribute it. 
Just mail me the pieces.)

The dictionary file must use the "EUC" coding for Japanese characters. 
MOKE's EDICT does this, so that was the coding I adopted in JDIC. Files 
using JIS codings can be converted to EUC using MOKE itself, or Ken 
Lunde's "JIS.C" program.
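
For readers without either tool, the core of a JIS-to-EUC conversion is
small: inside the two-byte escape sequences, set the high bit of every
byte. The Python sketch below is an illustration, not a substitute for
"JIS.C"; the function name is made up and it handles only the common
ESC $ B / ESC $ @ (kanji in) and ESC ( B / ESC ( J (kanji out) escapes.

```python
KANJI_IN = (b"\x1b$B", b"\x1b$@")    # switch to two-byte JIS
KANJI_OUT = (b"\x1b(B", b"\x1b(J")   # switch back to ASCII / JIS-Roman

def jis_to_euc(data):
    out = bytearray()
    two_byte = False
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in KANJI_IN:
            two_byte = True
            i += 3
        elif seq in KANJI_OUT:
            two_byte = False
            i += 3
        elif two_byte and 0x21 <= data[i] <= 0x7E:
            out.append(data[i] | 0x80)  # EUC is JIS with the high bit set
            i += 1
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Python's own codecs do the same job, e.g.
data.decode("iso2022_jp").encode("euc_jp"), which is handy for checking
a hand-written converter.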

The format of each entry in EDICT is:

Japanese [yomikata] /english1/english2/..../

If the word is in kana alone, the yomikata is omitted.
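
A minimal Python sketch of splitting one such line into its parts is
given below. The regular expression and the function name are
assumptions based only on the format shown above, and the line is
assumed to have been decoded from EUC already.

```python
import re

# word, optional [reading], then /gloss1/gloss2/.../
ENTRY = re.compile(r"^(\S+)\s+(?:\[([^\]]*)\]\s+)?/(.*)/\s*$")

def parse_entry(line):
    """Return (word, reading or None, list of English glosses)."""
    m = ENTRY.match(line)
    if m is None:
        raise ValueError("not an EDICT entry: %r" % line)
    word, reading, glosses = m.groups()
    return word, reading, [g for g in glosses.split("/") if g]
```

For a kana-only entry the reading comes back as None, matching the
omitted yomikata.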

Technical
---------

JDIC holds the complete dictionary in RAM, along with the first 3490 
bitmaps of the JIS character set and the index table. The index table
contains an entry for each word in the dictionary, sorted in alpha/kana
order. This enables a fast search to be done, and for the display to be 
in alphabetical order by keyword. Common words like: "of", "to", "the", etc.
and grammatical terms like: "adj", "vi", "vt", etc. are not indexed. 
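
The idea of a sorted word index with a skip list can be sketched in
Python as follows. This is an illustration under assumptions, not
JDIC's data structure: the index here holds (word, line number) pairs
for English words only, STOP_WORDS contains just the examples named
above, and the function names are made up.

```python
import bisect
import re

STOP_WORDS = {"of", "to", "the", "adj", "vi", "vt"}

def build_index(lines):
    """Return a sorted list of (lowercased word, line number) pairs."""
    index = []
    for lineno, line in enumerate(lines):
        for word in re.findall(r"[A-Za-z]+", line):
            key = word.lower()
            if key not in STOP_WORDS:
                index.append((key, lineno))
    index.sort()
    return index

def prefix_search(index, lines, prefix):
    """Return lines with a word starting with prefix, in index order."""
    prefix = prefix.lower()
    # Binary search for the first index entry not less than the prefix.
    start = bisect.bisect_left(index, (prefix,))
    hits = []
    for key, lineno in index[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later entry can match
        if lines[lineno] not in hits:
            hits.append(lines[lineno])
    return hits
```

Because the index is sorted, all matches for a prefix are adjacent,
which gives both the fast lookup and the alphabetical display order.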

If a kanji is required that is not in the ~3000 most common ones, it is
read from disk into a circular cache buffer. This happens rarely.
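
A circular cache of this sort can be sketched in Python as below. The
class name, the slot count and the load_from_disk callback are
hypothetical; the font file layout is not described here, so the disk
read is left to the caller.

```python
class GlyphCache:
    """Fixed number of slots; the oldest slot is overwritten on a miss."""

    def __init__(self, slots, load_from_disk):
        self.codes = [None] * slots
        self.bitmaps = [None] * slots
        self.next_slot = 0          # slot to overwrite next
        self.load = load_from_disk  # caller-supplied: code -> bitmap bytes

    def get(self, code):
        if code in self.codes:
            return self.bitmaps[self.codes.index(code)]
        bitmap = self.load(code)
        self.codes[self.next_slot] = code
        self.bitmaps[self.next_slot] = bitmap
        # Advance and wrap around: this is what makes the buffer circular.
        self.next_slot = (self.next_slot + 1) % len(self.codes)
        return bitmap
```

Replacement is strictly first-in, first-out, which is simple and
adequate when misses are rare.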

JDIC can cope with dictionaries up to about 250 kbytes (MOKE's EDICT is
about 60k). If I ever need to handle a bigger one I can easily leave the
dictionary on disk. The parsing and sort to set up the index table would
be slower, but the searching would still be quite fast.

Next Version
------------

If I ever get to another version, I will look at the moke.rc file (if any) 
and/or at the same environment variables as MOKE to establish the path(s) 
to the ".bgi" and ".fnt" files. I hope that eventually I can compile a
big enough dictionary that I am forced to scan it from disk.

Acknowledgements
----------------

I wrote about two-thirds of this program. Great lumps of it were lifted
with minor modifications from "KD" (Kanji Driver), which was written by 
Izumi Ohzawa at Berkeley, in particular the JIS handling module (kjis.c) 
which was a port of "jis.pas" by Seiichi Nomura and Seke Wei.

Ken Lunde's "japan.inf" and his elegant "jis.c" explained the workings
of EUC and old/new JIS codes.

Mark Edwards' MOKE remains the tour de force in this field, and an inspiration
for us all. I regard JDIC as a humble and minor accessory to MOKE. (I use
tables lifted from two of the ".hlp" files in MOKE to drive the 
romaji->kana code.)

Jim Breen
Department of Robotics & Digital Technology
Monash University
Melbourne, Australia
(jwb@monu6.cc.monash.edu.au)

April 1991
-- 
Jim Breen                                   AARNet:jwb@monu6.cc.monash.edu.au  
Department of Robotics & Digital Technology. 
Monash University. PO Box 197 Caulfield East VIC 3145 Australia
(ph) +61 3 573 2552 (fax) +61 3 573 2745           JIS:$B%8%`!!%V%j!<%s(J

-- comp.archives file verification
monu6.cc.monash.edu.au
-rw-r--r--  1 886      729         83701 Apr 26 10:02 /pub/Nihongo/jdic10.zoo
found jdic ok
monu6.cc.monash.edu.au:/pub/Nihongo/jdic10.zoo