[comp.text] wanted "French digital dictionary"

clement@opus.cs.mcgill.ca (Clement Pellerin) (01/11/90)

We would buy a French digital dictionary if only we could find one.
We need a list of words like /usr/dict/words together
with a spelling program to access the French dictionary.
This is necessary to process the accents properly.
We are also interested in a complete dictionary with definitions
similar to webster on the NeXT.

If you have heard of such a product, please mail me the reference, as I
don't read this newsgroup.

Clement Pellerin, McGill University, Montreal, Canada
clement@opus.cs.mcgill.ca
clement@musocs.bitnet
...uunet!musocs!opus!clement

-- 
news <clement

gamin@ireq-robot.UUCP (Martin Boyer) (01/11/90)

clement@opus.UUCP (Clement Pellerin) writes:
>
>We would buy a French digital dictionary if only we could find one.
>We need a list of words like /usr/dict/words together
>with a spelling program to access the French dictionary.
>This is necessary to process the accents properly.

I have been looking for such a product for about two years and I haven't
found anything directly usable on Unix; I looked at the dictionary files
from various (French) PC word processors and the format is *totally*
obscure (I couldn't identify a single word in it).  I will try to do the
same with the dictionary from the French version of Interleaf but I
suspect that it isn't usable either.

Possible solutions are:

1. Get in touch with the company that did it for Interleaf and ask them to
package and sell the dictionary by itself (a *lot* of people would *pay*
for this).

2. Do it ourselves.  Namely, set up a mailing list and ask contributors to
send lists of (correct) words.  It is simple to assemble, sift, and sort.
Then, look at something like ispell (from GNU) and see what can be done to
customize it.  From what I know about Unix spell, this is no easy task, as
much of English grammar is built into spell; French probably has more
exceptions than English, making the algorithms intricate.

3. I heard that Sun Montreal is working on a French version of Unix;
perhaps they'd care to help (are you listening?).


My experience tells me that for anything this specific, nobody else is
going to do it.  Therefore, I am willing to start on solution 2 if there is
enough interest.  Start sending me mail at the address below if you are
willing to contribute.  DON'T send anything if you just want the results;
I'll post it if something gets done (in a year or two...).

Now, how about pursuing the rest of this discussion in French?
Or in another forum (mail), before it gets too specific.
-- 
Martin Boyer                         ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU
Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp
Varennes, QC, Canada   J3X 1S1
+1 514 652-8136

rodgers@clausius.mmwb.ucsf.edu (R. P. C. Rodgers) (01/12/90)

gamin@ireq-robot.UUCP (Martin Boyer) writes:

>clement@opus.UUCP (Clement Pellerin) writes:
>>
>>We would buy a French digital dictionary if only we could find one.

>I have been looking for such a product for about two years and I haven't
>found anything directly usable on Unix [deletions]

>Possible solutions are:

>2. Do it ourselves.  Namely, set up a mailing list and ask contributors to
>send lists of (correct) words.  It is simple to assemble, sift, and sort.

>My experience tells me that for anything this specific, nobody else is
>going to do it.  Therefore, I am willing to start on solution 2 if there is
>enough interest.  Start sending me mail at the address below if you are
>willing to contribute.

>-- 
>Martin Boyer                         ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU
>Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp

Great idea, but before plunging into this big project, I think you should
lay out some ground rules.  How are you going to handle accent marks, for
example?  Will you use the troff -ms (.AM) conventions, or what?  Presumably
you will want only stem words, and will recognize derivatives of these by
the application of grammatical rules.
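
A minimal sketch of the "stem words plus grammatical rules" idea, assuming a
rule is just an ending to strip and a list of endings to add.  The names
`ER_PRESENT` and `expand` are hypothetical, and the single rule shown (present
tense of regular -er verbs) is only an illustration, not a description of how
spell or ispell work internally.

```python
# Hypothetical rule: (ending to strip from the stem word, endings to add).
# Covers only the present indicative of regular -er verbs.
ER_PRESENT = ("er", ["e", "es", "e", "ons", "ez", "ent"])


def expand(stem_word, rule):
    """Return the set of derived forms the rule accepts for stem_word."""
    strip, endings = rule
    if not stem_word.endswith(strip):
        return set()  # the rule does not apply to this word
    base = stem_word[: -len(strip)]
    return {base + ending for ending in endings}


forms = expand("parler", ER_PRESENT)
print(sorted(forms))
```

A checker built this way stores only "parler" and accepts "parlons" by rule;
the cost is that every irregular verb needs its own rule or its own entries.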

Good Luck!
R. P. C. Rodgers, M.D.         (415)476-8910 (work) 664-0560 (home)
UCSF Laurel Heights Campus     UUCP: ...ucbvax.berkeley.edu!cca.ucsf.edu!rodgers
3333 California St., Suite 102 ARPA: rodgers@maxwell.mmwb.ucsf.edu
San Francisco CA 94118 USA     BITNET: rodgers@ucsfcca

lamy@cs.utoronto.ca (Jean-Francois Lamy) (01/12/90)

Spelling checking is more difficult in languages where number and gender
agreement is an issue.  A simple-minded approach like that of spell or ispell
would give you an immense number of false errors.  In the case of French you
would at least need all conjugated variants of French verbs, and a way to
deal with accents properly, and I claim that would still not be enough.

I am aware of one effort in the early 80's to build a full morphological
dictionary (i.e. one that has enough data to support conjugation -- If I
remember well there are over 180 forms of verb conjugation in French -- forget
about those 3 groups and a few irregular ones you learned about in High School
:-), and lemmatisation (i.e. given the word "fusse", figure out that it is a
form of the verb "e^tre").  As far as I recall, the project got mired in a
feud about copyright/licensing issues, and never got distributed or
commercialized.
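
To make "lemmatisation" concrete, here is a minimal sketch that assumes the
morphological dictionary is available as a plain table from inflected form to
possible analyses.  The table `FORMS` and the functions `lemmatise` and
`lemmas` are hypothetical, and the three entries are hand-picked examples, not
data from the project mentioned above.

```python
# Hypothetical table: inflected form -> list of (lemma, description).
# A tiny hand-picked sample, not taken from any real dictionary.
FORMS = {
    "fusse": [("être", "imperfect subjunctive, 1st person singular")],
    "suis": [
        ("être", "present indicative, 1st person singular"),
        ("suivre", "present indicative, 1st or 2nd person singular"),
    ],
    "avons": [("avoir", "present indicative, 1st person plural")],
}


def lemmatise(word):
    """Return every (lemma, description) analysis known for word."""
    return FORMS.get(word.lower(), [])


def lemmas(word):
    """Return the set of possible base forms for word."""
    return {lemma for lemma, _ in lemmatise(word)}


print(lemmas("fusse"))
print(lemmas("suis"))
```

Note that "suis" has two analyses ("je suis" from "e^tre" and from "suivre"),
so a table lookup alone cannot always pick a single lemma.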

The reason I bring this up is that someone tried to do spelling verification
using that data, and found out that it is a much harder problem than one might
think it is.  So let me go on record as extremely skeptical that anything
useful would come out of a simple minded approach, and that what works for
English (spell/ispell) will not carry over to other languages (like French)
where word morphology is subject to weird and wonderful transmutations when
changing gender, number or tense.

Jean-Francois Lamy               lamy@cs.utoronto.ca, uunet!cs.utoronto.ca!lamy
Department of Computer Science, University of Toronto, Canada M5S 1A4

clement@opus.cs.mcgill.ca (Clement Pellerin) (01/12/90)

In article <90Jan11.141206est.2694@neat.cs.toronto.edu>  J.-F. Lamy writes:
> Spelling checking is more difficult in languages where number and gender
> agreement is an issue.  A simple-minded approach like that of spell or ispell
> would give you an immense number of false errors.  In the case of French you
> would at least need all conjugated variants of French verbs, and a way to
> deal with accents properly, and I claim that would still not be enough.

> remember well there are over 180 forms of verb conjugation in French -- forget
> about those 3 groups and a few irregular ones you learned about in High School

There are entire books on conjugation; even we can't get it right,
which is why we need a spelling checker :-)

> The reason I bring this up is that someone tried to do spelling verification
> using that data, and found out that it is a much harder problem than one might
> think it is.  So let me go on record as extremely skeptical that anything
> useful would come out of a simple minded approach, and that what works for
> English (spell/ispell) will not carry over to other languages (like French)
> where word morphology is subject to weird and wonderful transmutations when
> changing gender, number or tense.

Granted, spell works because of the simple rules of English.  A good
French spelling checker would have to do a great deal of syntactical
analysis before even coming close to what spell can achieve.  I was
well aware of the difficulties, and that's the reason I am only asking
for a simple minded solution.  Doing it right is simply not possible.

Nevertheless, I consider that simple minded help is better than nothing.
I would settle for anything that would look up every word in the dictionary
to see if it is present or not.  You seem to imply that even this does not
work.  Obviously, errors of number, gender and tense will go unnoticed, but
it will at least catch spelling mistakes in the roots of the words.  Can you
expand on your colleague's experiments?  I don't see how he would conclude
that this simple-minded tool is not worth it.
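
The simple-minded lookup can be sketched as follows, assuming the word list is
a text file with one word per line (in the style of /usr/dict/words) and that
accented letters are stored as Unicode characters.  The function names
`normalize`, `load_words` and `unknown_words` are hypothetical, and the five
sample words stand in for a real list.

```python
import re
import unicodedata


def normalize(word):
    """Compose accents (NFC) and lowercase, so 'e' + combining mark == 'é'."""
    return unicodedata.normalize("NFC", word).lower()


def load_words(lines):
    """Build the lookup set from an iterable of lines, one word per line."""
    return {normalize(line.strip()) for line in lines if line.strip()}


def unknown_words(text, words):
    """Return the words of text, in order, that are absent from the set."""
    text = unicodedata.normalize("NFC", text)  # compose before tokenizing
    tokens = re.findall(r"[^\W\d_]+", text)    # runs of letters, accents included
    return [t for t in tokens if normalize(t) not in words]


# Stand-in for a real word list.
words = load_words(["le", "chat", "est", "très", "fâché"])
flagged = unknown_words("Le chat est tres fa\u0302che\u0301.", words)
print(flagged)
```

With a real file, `load_words(open(path, encoding="utf-8"))` would build the
set.  The sample shows why accents matter: "tres" is flagged because only
"très" is listed, while "fâché" typed with combining accents is accepted.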

Let me restate that we are also looking for a machine-readable
dictionary with definitions of the words.  Fast indexing is a problem,
but one that should be easy to solve.  Webster on the NeXT does it
very well indeed.
-- 
news <clement

gamin@ireq-robot.UUCP (Martin Boyer) (01/12/90)

clement@opus.UUCP (Clement Pellerin):

>We would buy a French digital dictionary if only we could find one.

gamin@ireq-robot.UUCP (Martin Boyer):

>[Let's get organized and see how we can hack a version of spell to handle
>French]

lamy@cs.utoronto.ca (Jean-Francois Lamy):

>Spelling checking is more difficult in languages where number and gender
>agreement is an issue.  A simple-minded approach like that of spell or ispell
>would give you an immense number of false errors.
>[...]
>So let me go on record as extremely skeptical that anything
>useful would come out of a simple minded approach [...]

clement@opus.cs.mcgill.ca (Clement Pellerin):

>[...]
>Nevertheless, I consider that simple minded help is better than nothing.  I
>would settle for anything that would lookup every word in the dictionary to
>see if it is present or not.  [...]

I am an optimist by nature and sometimes by choice because it helps to get
things done.  I would go with Clement Pellerin and say that even an
incomplete solution would help.  I would be happy with a database of nouns
and no verbs except the numerous forms of "avoir" (to have) and "^etre"
(to be), because this is where we would get the most for our investment.

Jean-Francois is probably right, however, in pointing out that a
simple-minded approach will yield an immense number of false errors.  It is
quite possible that running a perfectly correct text through such a
filter would produce the list of all the words, or variations of words,
that our French speller doesn't know about.

Is there a way to have a "loose" checker that would only flag words that
are misspelled "for sure" and disregard those words that it doesn't know
about?  For instance, something like the Soundex algorithm, which
hashes words based on their pronunciation instead of their spelling, could
recognize that a word is "close enough" to a dictionary entry but not
exactly the same, possibly because it is misspelled.  Words that have no
close match in the dictionary are simply "unknown".  Perhaps certain
features of French can be used (like the fact that you can't have four
consonants in a row and three in only certain cases, or that certain
sequences of letters are not part of any French word).
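
A minimal sketch of such a "loose" checker, under stated assumptions: instead
of Soundex (which is tuned to English pronunciation) it uses the string
similarity from Python's standard `difflib` module as the "close enough" test.
The function `classify`, the 0.8 cutoff and the three-word list are all
hypothetical choices made for illustration.

```python
import difflib


def classify(word, words, cutoff=0.8):
    """Return (status, suggestion) for word against a list of known words.

    status is "known", "suspect" (close to an entry, so probably a
    misspelling) or "unknown" (nothing similar; do not flag it).
    """
    if word in words:
        return ("known", None)
    close = difflib.get_close_matches(word, words, n=1, cutoff=cutoff)
    if close:
        return ("suspect", close[0])
    return ("unknown", None)


# Stand-in for a real word list.
words = ["dictionnaire", "fromage", "ordinateur"]
for w in ["fromage", "dictionaire", "xylophone"]:
    print(w, classify(w, words))
```

Only "suspect" words would be reported to the user.  Comparing against every
entry is slow for a large dictionary, so a real tool would need an index (a
phonetic hash being one candidate) to narrow the search first.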

I'd like to hear comments about the "brute force" approach: listing all
non-trivial words in the dictionary.  How big would it be?  Even if slow,
is it practical?
-- 
Martin Boyer                         ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU
Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp
Varennes, QC, Canada   J3X 1S1
+1 514 652-8136