rcpt@rw7.urc.tue.nl (Piet Tutelaers) (04/19/91)
I have spent some time looking into whether `ispell' can be enhanced to
support languages other than English (and American English). Ispell
as it is now available has several drawbacks in a
multi-language environment:
* the reduction rules are based on English and work badly for Dutch,
French, etc. (see further on)
* the builtin `detex' (if the filename ends with .tex) does not
recognize LaTeX constructs like \begin{quote}
* There is no support for diacritical symbols like "i. The Dutch words:
ge\"{\i}nfecteerd
ge\"\i nfecteerd
ge"infecteerd (supported by the new dutch.sty)
should all be accepted as valid derivations of the verb
`infecteren'. At present each of them is split into `ge' and
`nfecteerd'. In the future (X11.5?) there will hopefully be better
support for European languages.
* Ispell cannot skip over text. If your document contains more than
one language it makes no sense to check French words against an
English dictionary. It would be great if ispell offered a facility
to skip forward and backward in the text being checked. (Try to
check this text with ispell!)
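
To make the diacritics point concrete, here is a minimal sketch (in
Python, not part of ispell) of a preprocessing step that maps the three
spellings above onto one form before the word reaches the dictionary
lookup. The names `normalize_diaeresis', `DIAERESIS' and `_PATTERN' are
made up for this illustration, and the sketch assumes that a bare "
followed by a vowel is always the dutch.sty shorthand, which is not
true for ordinary quoted text.

```python
import re

# Hypothetical helper, not an ispell feature: collapse three TeX
# spellings of a diaeresis into one precomposed character.
DIAERESIS = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü"}

_PATTERN = re.compile(
    r'''
    \\"\{\\i\}          # \"{\i}  dotless i in braces
  | \\"\\i\ ?           # \"\i    dotless i, optional terminating space
  | \\"\{([aeiou])\}    # \"{e}   accent with braced letter
  | \\"([aeiou])        # \"e     accent with bare letter
  | "([aeiou])          # "e      assumed dutch.sty shorthand
    ''',
    re.VERBOSE,
)


def _replace(match):
    # The dotless-i alternatives capture nothing, so fall back to "i".
    letter = match.group(1) or match.group(2) or match.group(3) or "i"
    return DIAERESIS[letter]


def normalize_diaeresis(text):
    """Return text with the TeX diaeresis notations replaced."""
    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    for spelling in (r'ge\"{\i}nfecteerd',
                     r'ge\"\i nfecteerd',
                     r'ge"infecteerd'):
        print(normalize_diaeresis(spelling))
```

All three inputs come out as the single word `geïnfecteerd', so the
checker would see one token instead of `ge' and `nfecteerd'. A real
implementation would also have to cover the other accents and would
need to know whether the " shorthand is active in the document.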
A PD Dutch word list containing 150000 words (1.7 Mbytes) processed by
`buildhash' without word reduction gives a 3.5 Mbyte dictionary;
reduced by `munchlist' it is still 2.6 Mbytes. Because this hash table
is loaded into memory, the program behaves like a snail.
To improve the situation for Dutch we need a reduced word list. The
best I can think of is to automatically reduce a word list. It should
be possible to write a program that reduces words like:
    Dutch       French      German      English
    gespeeld    jou'e       gespielt    played
    speelt      joue        spielt      plays
    speelde     jouait      spielte     playing
    spelen      jouer       spielen     play
                jouerai
                jouons
                etc.
into their basic stems (e.g. spelen, jouer, spielen, play) together
with a function `reduce' (that given the word `gespeeld' responds with
`spelen') and a function `expand' (that given the word `spelen' returns
the words `gespeeld', `speelt', `speelde' etc.). Does anybody have
pointers to an existing method or literature on this subject? Perhaps a
nice problem for a class on software engineering?
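
To show what I mean by `reduce' and `expand', here is a rough sketch in
Python. It is only an illustration under strong assumptions: the `Rule'
tuple and the `DUTCH_ELEN' table are invented for this example and
cover nothing beyond verbs shaped like `spelen'. The hard part, deriving
such rules automatically from a word list, is not attempted here.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    prefix: str   # text put in front of the derived form
    strip: str    # ending removed from the base word
    add: str      # ending appended in its place


# Toy rules, an assumption for illustration only: they fit Dutch verbs
# that end in "elen" and behave like `spelen'.
DUTCH_ELEN = [
    Rule("", "elen", "eelt"),     # spelen -> speelt
    Rule("", "elen", "eelde"),    # spelen -> speelde
    Rule("ge", "elen", "eeld"),   # spelen -> gespeeld
]


def expand(base, rules):
    """Return the derived forms of a base word, in rule order."""
    forms = []
    for rule in rules:
        if base.endswith(rule.strip):
            root = base[: len(base) - len(rule.strip)]
            forms.append(rule.prefix + root + rule.add)
    return forms


def reduce(word, lexicon, rules):
    """Return the base word in lexicon that word derives from, or None."""
    if word in lexicon:
        return word
    for rule in rules:
        long_enough = len(word) >= len(rule.prefix) + len(rule.add)
        if (long_enough and word.startswith(rule.prefix)
                and word.endswith(rule.add)):
            root = word[len(rule.prefix): len(word) - len(rule.add)]
            base = root + rule.strip
            if base in lexicon:
                return base
    return None


if __name__ == "__main__":
    lexicon = {"spelen"}
    print(expand("spelen", DUTCH_ELEN))
    print(reduce("gespeeld", lexicon, DUTCH_ELEN))
```

With such a table the dictionary only has to store `spelen'; the other
forms are recognized by undoing a rule and looking up the result. Note
that `reduce' only accepts a candidate when the base word is really in
the lexicon, so an unknown word is still reported as unknown.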
--Piet
internet: rcpt@urc.tue.nl | Piet Tutelaers Room RC 1.90
bitnet: rcpt@heithe5.BITNET | Eindhoven University of Technology
phone: +31 (0)40 474541 | P.O. Box 513, 5600 MB Eindhoven, NL