[comp.text.tex] Improving ispell for different languages

rcpt@rw7.urc.tue.nl (Piet Tutelaers) (04/19/91)

I have spent some time looking at whether `ispell' can be extended to
support languages other than English (including American English).
Ispell as currently distributed has several drawbacks in a
multi-language environment:
  * the reduction rules are based on English and work poorly for Dutch,
    French, etc. (see below)
  * the built-in `detex' (used when the filename ends in .tex) does not
    recognize LaTeX constructs such as \begin{quote}
  * There is no support for diacritical marks such as "i. The Dutch words:
	ge\"{\i}nfecteerd
	ge\"\i nfecteerd
	ge"infecteerd			(supported by the new dutch.sty)
    should all be accepted as valid forms of the verb `infecteren'.
    At present each of them is split into `ge' and `nfecteerd'.
    In the future (X11.5?) there will hopefully be better support for
    European languages.
  * Ispell cannot skip over text. If your document contains more than
    one language, it makes no sense to check French words against an
    English dictionary. It would be great if ispell offered a facility
    to skip forward and backward in the text being checked. (Try
    checking this text with ispell!)
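
To make the diacritics point concrete, here is a minimal sketch (in
Python, not part of ispell) of a pre-pass that maps the three TeX
spellings above onto one internal character before dictionary lookup.
The name `normalize_word' is hypothetical, and the rule for the "i
shorthand assumes dutch.sty is active; otherwise it would also eat an
opening double quote that happens to precede an `i'.

```python
import re

# Hypothetical pre-pass: the three TeX spellings of i-with-diaeresis
# are rewritten to a single character so that one dictionary entry
# matches all of them.
_DIAERESIS_I = re.compile(
    r'\\"\{\\i\}'   # \"{\i}
    r'|\\"\\i ?'    # \"\i followed by an optional space
    r'|"i'          # "i  (dutch.sty shorthand; assumption, see lead-in)
)

def normalize_word(text: str) -> str:
    """Replace every recognised TeX spelling with U+00EF."""
    return _DIAERESIS_I.sub("\u00ef", text)

if __name__ == "__main__":
    for spelling in (r'ge\"{\i}nfecteerd', r'ge\"\i nfecteerd', 'ge"infecteerd'):
        print(normalize_word(spelling))
```

A real solution would need one such rule per accent and base letter,
and would have to map the corrected word back to the notation the
author used.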
  
A PD Dutch word list containing 150000 words (1.7 Mbytes), processed by
`buildhash' without word reduction, gives a 3.5 Mbyte dictionary; after
reduction by `munchlist' it is still 2.6 Mbytes. Because this hash table
is loaded into memory, the program behaves like a snail.

To improve the situation for Dutch we need a reduced word list. The
best approach I can think of is to reduce the word list automatically.
It should be possible to write a program that reduces words like:
	gespeeld	jou'e		gespielt	played
	speelt		joue		spielt		plays
	speelde		jouait		spielte		playing
	spelen		jouer		spielen		play
			jouerai
			jouons
			etc.

into their basic stems (e.g. spelen, jouer, spielen, play) together
with a function `reduce' (which, given the word `gespeeld', responds
with `spelen') and a function `expand' (which, given the word `spelen',
returns the words `gespeeld', `speelt', `speelde' etc.). Does anybody
have pointers to an existing method or literature on this subject?
Perhaps a nice problem for a class on software engineering?
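
To show what is meant by `reduce' and `expand', here is a toy sketch in
Python. It is not an existing method: the helper names `root' and
`build_index' are made up, and the single paradigm covers only regular
weak verbs of the `spelen' type. Real Dutch needs many more rules (the
t/d choice, consonant doubling, strong verbs, separable prefixes).

```python
VOWELS = "aeiou"

def root(infinitive):
    """spelen -> speel: drop -en, then double a single long vowel."""
    base = infinitive[:-2]                       # 'spelen' -> 'spel'
    if (len(base) >= 2 and base[-1] not in VOWELS and base[-2] in "aeou"
            and (len(base) < 3 or base[-3] not in VOWELS)):
        base = base[:-1] + base[-2] + base[-1]   # 'spel' -> 'speel'
    return base

def expand(infinitive):
    """Return the inflected forms generated from one stem entry."""
    r = root(infinitive)
    # Assumption: the past tense always takes -d/-de (true for spelen only).
    return ["ge" + r + "d", r + "t", r + "de", infinitive]

def build_index(stems):
    """Invert expand(): map every generated form back to its stem."""
    index = {}
    for stem in stems:
        for form in expand(stem):
            index[form] = stem
    return index

def reduce(word, index):
    """Return the stem of a known form, or None if it is not generated."""
    return index.get(word)

if __name__ == "__main__":
    index = build_index(["spelen"])
    print(expand("spelen"))
    print(reduce("gespeeld", index))
```

Only the stems would need to be stored in the word list. The index here
is built in memory for clarity; stripping affixes at lookup time, as
ispell does for English, would avoid that cost.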

--Piet

internet: rcpt@urc.tue.nl       | Piet Tutelaers        Room  RC 1.90
bitnet:   rcpt@heithe5.BITNET   | Eindhoven University of  Technology
phone:    +31 (0)40 474541      | P.O. Box 513, 5600 MB Eindhoven, NL