clement@opus.cs.mcgill.ca (Clement Pellerin) (01/11/90)
We would buy a French digital dictionary if only we could find one. We need a list of words like /usr/dict/words together with a spelling program to access the French dictionary. This is necessary to process the accents properly.

We are also interested in a complete dictionary with definitions, similar to Webster on the NeXT.

If you have heard of such a product, please mail me the reference, as I don't read this newsgroup.

Clement Pellerin, McGill University, Montreal, Canada
clement@opus.cs.mcgill.ca  clement@musocs.bitnet  ...uunet!musocs!opus!clement
-- news <clement
gamin@ireq-robot.UUCP (Martin Boyer) (01/11/90)
clement@opus.UUCP (Clement Pellerin) writes:
>
>We would buy a French digital dictionary if only we could find one.
>We need a list of words like /usr/dict/words together
>with a spelling program to access the French dictionary.
>This is necessary to process the accents properly.

I have been looking for such a product for about two years and I haven't found anything directly usable on Unix. I looked at the dictionary files from various (French) PC word processors and the format is *totally* obscure (I couldn't identify a single word in them). I will try the same with the dictionary from the French version of Interleaf, but I suspect that it isn't usable either.

Possible solutions are:

1. Get in touch with the company that did it for Interleaf and ask them to package and sell the dictionary by itself (a *lot* of people would *pay* for this).

2. Do it ourselves. Namely, set up a mailing list and ask contributors to send lists of (correct) words. These are simple to assemble, sift, and sort. Then, look at something like ispell (from GNU) and see what can be done to customize it. From what I know about Unix spell, this is no easy task, as much of English grammar is built into spell; French probably has more exceptions than English, making the algorithms intricate.

3. I heard that Sun Montreal is working on a French version of Unix; perhaps they'd care to help (m'entendez-vous? [can you hear me?]).

My experience tells me that for anything this specific, nobody else is going to do it. Therefore, I am willing to start on solution 2 if there is enough interest. Start sending me mail at the address below if you are willing to contribute. DON'T send anything if you just want the results; I'll post them if something gets done (in a year or two...).

Now, how about pursuing the rest of this discussion in French? Or in another forum (mail), before it gets too specific.
-- Martin Boyer ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp Varennes, QC, Canada J3X 1S1 +1 514 652-8136
rodgers@clausius.mmwb.ucsf.edu (R. P. C. Rodgers) (01/12/90)
gamin@ireq-robot.UUCP (Martin Boyer) writes:
>clement@opus.UUCP (Clement Pellerin) writes:
>>
>>We would buy a French digital dictionary if only we could find one.
>I have been looking for such a product for about two years and I haven't
>found anything directly usable on Unix
[deletions]
>Possible solutions are:
>2. Do it ourselves. Namely, set up a mailing list and ask contributors to
>send lists of (correct) words. It is simple to assemble, sift, and sort.
>My experience tells me that for anything this specific, nobody else is
>going to do it. Therefore, I am willing to start on solution 2 if there is
>enough interest. Start sending me mail at the address below if you are
>willing to contribute.
>--
>Martin Boyer ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU
>Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp

Great idea, but before plunging into this big project, I think you should lay out some ground rules. How are you going to handle accent marks, for example? Will you use the troff -ms (.AM) conventions, or what? Presumably you will want only stem words, and will recognize derivatives of these by the application of grammatical rules. Good Luck!

R. P. C. Rodgers, M.D.           (415)476-8910 (work) 664-0560 (home)
UCSF Laurel Heights Campus       UUCP: ...ucbvax.berkeley.edu!cca.ucsf.edu!rodgers
3333 California St., Suite 102   ARPA: rodgers@maxwell.mmwb.ucsf.edu
San Francisco CA 94118 USA       BITNET: rodgers@ucsfcca
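One way to make the accent question concrete is to pick an ASCII convention for contributed word lists and convert it to accented characters before the words are sorted or compared. The sketch below is only an illustration, not a convention anyone in this thread agreed on: it handles the circumflex alone, written either after the letter ("e^tre") or before it ("^etre"), and the function names decode_postfix and decode_prefix are hypothetical. The troff -ms conventions are not modelled here.

```python
import re
import unicodedata

# Unicode combining circumflex; NFC normalization merges it with the
# preceding base letter when a precomposed character exists.
COMBINING_CIRCUMFLEX = "\u0302"


def decode_postfix(word):
    """Assumed convention: the caret FOLLOWS its letter, e.g. 'e^tre'."""
    marked = re.sub(r"([A-Za-z])\^",
                    lambda m: m.group(1) + COMBINING_CIRCUMFLEX,
                    word)
    return unicodedata.normalize("NFC", marked)


def decode_prefix(word):
    """Assumed convention: the caret PRECEDES its letter, e.g. '^etre'."""
    marked = re.sub(r"\^([A-Za-z])",
                    lambda m: m.group(1) + COMBINING_CIRCUMFLEX,
                    word)
    return unicodedata.normalize("NFC", marked)


if __name__ == "__main__":
    print(decode_postfix("e^tre"))
    print(decode_prefix("^etre"))
```

Whatever convention is chosen, the point is that every contributor's list has to be decoded the same way, otherwise the same word ends up in the merged list under several spellings.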
lamy@cs.utoronto.ca (Jean-Francois Lamy) (01/12/90)
Spelling checking is more difficult in languages where number and gender agreement is an issue. A simple-minded approach like that of spell or ispell would give you an immense number of false errors. In the case of French you would at least need all conjugated variants of French verbs, and a way to deal with accents properly, and I claim that would still not be enough.

I am aware of one effort in the early 80's to build a full morphological dictionary, i.e. one that has enough data to support both conjugation and lemmatisation (given the word "fusse", figure out that it is a form of the verb "e^tre"). If I remember correctly there are over 180 forms of verb conjugation in French -- forget about those 3 groups and a few irregular ones you learned about in High School :-) As far as I recall the project got mired in a feud about copyright/licensing issues, and never got distributed or commercialized.

The reason I bring this up is that someone tried to do spelling verification using that data, and found out that it is a much harder problem than one might think. So let me go on record as extremely skeptical that anything useful would come out of a simple-minded approach, and as doubting that what works for English (spell/ispell) will carry over to other languages (like French) where word morphology is subject to weird and wonderful transmutations when changing gender, number or tense.

Jean-Francois Lamy               lamy@cs.utoronto.ca, uunet!cs.utoronto.ca!lamy
Department of Computer Science, University of Toronto, Canada M5S 1A4
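To illustrate what lemmatisation with a full-form dictionary amounts to, here is a minimal sketch in Python. The table holds three hand-picked entries only and is an assumption made for the example; it is not data from the project mentioned above, and the names FORM_TO_LEMMAS and lemmas are hypothetical. Note that a form can belong to more than one verb, which is one reason the problem is harder than a plain word list suggests.

```python
# Tiny full-form table: each inflected form maps to the set of verbs
# (lemmas) it can belong to. Illustrative entries only.
FORM_TO_LEMMAS = {
    "fusse": {"être"},            # imperfect subjunctive of "être"
    "suis": {"être", "suivre"},   # "je suis": "I am" or "I follow"
    "ai": {"avoir"},              # "j'ai": "I have"
}


def lemmas(form):
    """Return the set of candidate lemmas for an inflected form.

    An empty set means the form is not in the table, which says nothing
    about whether it is correct French.
    """
    return FORM_TO_LEMMAS.get(form.casefold(), set())


if __name__ == "__main__":
    for form in ("fusse", "suis", "chantassions"):
        print(form, "->", sorted(lemmas(form)))
```

A real morphological dictionary would generate such a table from conjugation paradigms rather than list every form by hand.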
clement@opus.cs.mcgill.ca (Clement Pellerin) (01/12/90)
In article <90Jan11.141206est.2694@neat.cs.toronto.edu> J.-F. Lamy writes:
> Spelling checking is more difficult in languages where number and gender
> agreement is an issue. A simple-minded approach like that of spell or ispell
> would give you an immense number of false errors. In the case of French you
> would at least need all conjugated variants of French verbs, and a way to
> deal with accents properly, and I claim that would still not be enough.

> remember well there are over 180 forms of verb conjugation in French -- forget
> about those 3 groups and a few irregular ones you learned about in High School

There are complete books on conjugations. Even we can't get it right; that's why we need a spelling checker :-)

> The reason I bring this up is that someone tried to do spelling verification
> using that data, and found out that it is a much harder problem than one might
> think it is. So let me go on record as extremely skeptical that anything
> useful would come out of a simple minded approach, and that what works for
> English (spell/ispell) will not carry over to other languages (like French)
> where word morphology is subject to weird and wonderful transmutations when
> changing gender, number or tense.

Granted, spell works because of the simple rules of English. A good French spelling checker would have to do a great deal of syntactic analysis before even coming close to what spell can achieve. I was well aware of the difficulties, and that's the reason I am only asking for a simple-minded solution. Doing it right is simply not possible.

Nevertheless, I consider that simple-minded help is better than nothing. I would settle for anything that would look up every word in the dictionary to see if it is present or not. You seem to imply that even this does not work. Obviously, errors of number, gender and tense will go unnoticed. It will at least catch spelling mistakes in the roots of the words. Can you expand on that person's experiments? I don't see how he could conclude that this simple-minded tool is not worth it.

Let me reiterate that we are also looking for a machine-readable dictionary with definitions of the words. There is a problem of fast indexing, but that should be easy to solve. Webster on the NeXT does it very well indeed.
-- news <clement
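The simple-minded tool described above (look up every word and report the ones that are absent) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an existing program: the word list is assumed to hold one word per line in the style of /usr/dict/words, the function names load_words and unknown_words are hypothetical, and single letters left over from elisions such as "l'" are simply skipped.

```python
import re
import unicodedata


def load_words(lines):
    """Build a lookup set from an iterable of lines, one word per line."""
    return {
        unicodedata.normalize("NFC", line.strip()).casefold()
        for line in lines
        if line.strip()
    }


def unknown_words(text, known):
    """Return the words of `text` that are absent from `known`.

    Words are maximal runs of letters (accented letters included), so
    "tres" and "très" are different words. Each unknown word is reported
    once, in order of first appearance.
    """
    text = unicodedata.normalize("NFC", text)
    reported = []
    for token in re.findall(r"[^\W\d_]+", text):
        key = token.casefold()
        if len(key) > 1 and key not in known and key not in reported:
            reported.append(key)
    return reported


if __name__ == "__main__":
    known = load_words(["le", "chat", "est", "très", "petit"])
    print(unknown_words("Le chat est tres petit.", known))
```

As expected from the discussion, this catches a missing accent or a misspelled root, and it flags every correct inflected form that happens not to be in the list.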
gamin@ireq-robot.UUCP (Martin Boyer) (01/12/90)
clement@opus.UUCP (Clement Pellerin):
>We would buy a French digital dictionary if only we could find one.

gamin@ireq-robot.UUCP (Martin Boyer):
>[Let's get organized and see how we can hack a version of spell to handle
>French]

lamy@cs.utoronto.ca (Jean-Francois Lamy):
>Spelling checking is more difficult in languages where number and gender
>agreement is an issue. A simple-minded approach like that of spell or ispell
>would give you an immense number of false errors.
>[...]
>So let me go on record as extremely skeptical that anything
>useful would come out of a simple minded approach [...]

clement@opus.cs.mcgill.ca (Clement Pellerin):
>[...]
>Nevertheless, I consider that simple minded help is better than nothing. I
>would settle for anything that would lookup every word in the dictionary to
>see if it is present or not. [...]

I am an optimist by nature, and sometimes by choice, because it helps to get things done. I would go with Clement Pellerin and say that even an incomplete solution would help. I would be happy with a database of nouns and no verbs except the numerous variations of "avoir" (to have) and "^etre" (to be), because this is where we would get the most for our investment.

Jean-Francois is probably right, however, in pointing out that a simple-minded approach will yield an immense number of false errors. It is quite possible that running a perfectly correct text through such a filter would produce the list of all the words, or variations of words, that our French speller doesn't know about.

Is there a way to have a "loose" checker that would only flag words that are misspelled "for sure" and disregard those words that it doesn't know about? For instance, something like the soundex algorithm, which hashes words based on their pronunciation instead of their spelling, could recognize that a word is "close enough" to a dictionary entry but not exactly the same, and is therefore probably misspelled. Words that have no close match in the dictionary are simply "unknown". Perhaps certain features of French can be used (like the fact that you can't have four consonants in a row and three in only certain cases, or that certain sequences of letters are not part of any French word).

I'd like to hear comments about the "brute force" approach: listing all non-trivial words in the dictionary. How big would it be? Even if slow, is it practical?
-- Martin Boyer ireq-robot!mboyer@Larry.McRCIM.McGILL.EDU Institut de recherche d'Hydro-Quebec mboyer@ireq-robot.uucp Varennes, QC, Canada J3X 1S1 +1 514 652-8136
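The "loose" checker asked about above can be sketched as a three-way classification: a word is accepted if it is in the list, flagged if it is close to an entry without matching it, and otherwise left alone as unknown. The sketch below is an assumption-laden illustration: it swaps in spelling similarity from Python's difflib for the soundex idea (soundex is designed around English pronunciation), the 0.8 cutoff is an arbitrary choice, and the name classify is hypothetical.

```python
import difflib


def classify(word, known, cutoff=0.8):
    """Classify `word` against a list of known words.

    Returns a pair (status, suggestion):
      ("ok", None)           the word is in the list
      ("misspelled", entry)  not in the list, but close to `entry`
      ("unknown", None)      nothing similar; the checker stays silent
    """
    key = word.casefold()
    if key in known:
        return ("ok", None)
    # get_close_matches ranks candidates by SequenceMatcher ratio and
    # keeps only those scoring at least `cutoff`.
    close = difflib.get_close_matches(key, known, n=1, cutoff=cutoff)
    if close:
        return ("misspelled", close[0])
    return ("unknown", None)


if __name__ == "__main__":
    known = sorted(["avoir", "dictionnaire", "français"])
    for w in ("avoir", "dictionaire", "unix"):
        print(w, classify(w, known))
```

With a large list the linear scan in get_close_matches becomes slow, so a practical version would need an index of some kind; the sketch only shows the classification logic. A higher cutoff flags fewer words and produces fewer false errors, at the price of missing more real mistakes.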