chuq@plaid.UUCP (06/08/87)
Date: Fri, 5 Jun 87 19:12:35 PDT
From: dick@ccb.ucsf.edu (Dick Karpinski)

Yep, storing the dictionary is the heart of the problem. There is a guy at Stanford who did his thesis on the subject. I forget his name but his thesis is published, so people at Stanford should be able to locate it. I'd wager that there is available software to demonstrate his thesis. He found some pretty neat ways to compress the dictionary.

Dick Karpinski

----------------------------------------
Submissions to: desktop%plaid@sun.com -OR- sun!plaid!desktop
Administrivia to: desktop-request%plaid@sun.com -OR- sun!plaid!desktop-request
Paths: {ihnp4,decwrl,hplabs,seismo,ucbvax}!sun
Chuq Von Rospach chuq@sun.COM Delphi: CHUQ

Now, where did my ex-wife put my Fairy Dust?
chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)
From: rokicki@rocky.stanford.edu (Tomas Rokicki)
Date: 8 Jun 87 22:52:39 GMT
Organization: Stanford University Computer Science Department

In article <20583@sun.uucp>, chuq%plaid@Sun.COM (Chuq Von Rospach) writes:
> Yep, storing the dictionary is the heart of the problem. There is a
> guy at Stanford who did his thesis on the subject. I forget his name
> but his thesis is published, so people at Stanford should be able to
> locate it. I'd wager that there is available software to demonstrate
> his thesis. He found some pretty neat ways to compress the dictionary.

Please pick up a copy of Computers & Typesetting, Volume B, entitled TeX: The Program, by Don Knuth, published by Addison-Wesley Publishing Company. I paid $34.95 for my copy. You should start at section 919, page 386, and read. It describes an implementation of Frank M. Liang's hyphenation algorithm, which is described more fully in Liang's PhD thesis from Stanford University. (If anyone wants a copy of this, mail me and I'll give you details.)

The dictionary is not compressed; rather, patterns are found and used. These patterns are amazingly regular and dependable. An exception dictionary is used for those few exceptions (I believe a few dozen have been found.)

Please, anyone writing or considering writing a typesetting program, consider using these algorithms. They are fast, small, use little data space, and *work*. There is no excuse for the poor hyphenation so many systems give you, and also no reason to have to put in hyphens by hand. If you desire, you can rip the code right out of TeX, since the source code is so available and readable; Don Knuth specifically allows this.

Also, when using systems with automatic hyphenation, please look over the hyphens you are given and ensure they are reasonable. There are many cases where a hyphen might be okay in general, but not in that particular context. For instance,

    Automatic hyphenation systems used in auto-

Here the context and hyphenation lead the reader to expect the word to be `automatic', but the word might be `automobiles', jarring the reader ever so slightly. There are much better examples, but none spring to mind at the moment.

Liang's algorithm can be adapted to foreign languages fairly easily. Some languages, such as German, which change the spelling of words when hyphenated, can cause some difficulty.

So, here's a question for y'all. How should eighteen be hyphenated? Dictionaries disagree; I want explanations of your choice. Neither eight-een nor eigh-teen looks quite right. I vote for eight-teen, but this violates accepted English hyphenation rules; this is one of those words you should avoid hyphenating.

Bad hyphenation exists all over, and can be quite comical. The text `Introducing Artificial Intelligence' by G. L. Simons hyphenates vie-wed. That word, again, was viewed. Took you a second shot to read it, eh? `Logic Design Principles' by Edward J. McCluskey abounds with bad hyphenations; pick up a copy and look at a random page. For instance, `wav-eform.'

And, lastly, for the sake of Pete, do not break paragraphs into lines a line at a time!

Sorry for the length; I get carried away.

-tom
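As a rough illustration of the pattern approach described above, here is a minimal Python sketch. It is not the code from TeX: The Program, which packs its patterns into a trie; this version just scans a plain list. The function names (`parse_pattern`, `hyphenate`), the `left_min`/`right_min` parameters, and the nine-entry pattern list are assumptions made for the example; a real pattern file has thousands of entries. Each pattern is a run of letters with digits between them. Where several patterns overlap, the largest digit at each position wins, and an odd result permits a break there.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    digits[k] is the value sitting just before letter k; the final
    entry is the value after the last letter.
    """
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            pos += 1
    return letters, digits


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` between permitted break points.

    `patterns` is a list of (letters, digits) pairs from parse_pattern.
    left_min/right_min are assumed minimum fragment lengths.
    """
    dotted = "." + word.lower() + "."      # dots mark the word boundaries
    points = [0] * (len(dotted) + 1)       # points[i]: value before dotted[i]
    for letters, digits in patterns:
        start = dotted.find(letters)
        while start != -1:                 # visit every occurrence
            for k, d in enumerate(digits):
                points[start + k] = max(points[start + k], d)
            start = dotted.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(left_min, len(word) - right_min + 1):
        if points[j + 1] % 2 == 1:         # odd value: break allowed before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


# Toy pattern set, just enough to handle one word (illustrative only).
PATTERNS = [parse_pattern(p) for p in
            ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
             "1tio", "2io", "o2n"]]

print("-".join(hyphenate("hyphenation", PATTERNS)))
```

With only these nine patterns the call above prints hy-phen-ation; any other word would need the full pattern set before the result means anything.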
chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)
Date: Mon, 8 Jun 87 21:25:05 PDT
From: alfke@csvax.caltech.edu (J. Peter Alfke)
Organization: California Institute of Technology

dick@ccb.ucsf.edu (Dick Karpinski) writes:
>Yep, storing the dictionary is the heart of the problem. There is a
>guy at Stanford who did his thesis on the subject. I forget his name
>but his thesis is published, so people at Stanford should be able to
>locate it. I'd wager that there is available software to demonstrate
>his thesis. He found some pretty neat ways to compress the dictionary.

This method is used in TeX (not surprising, considering that Knuth's at Stanford -- the guy was probably one of his grad students), and is described pretty fully in one of the appendices to The TeXbook.

Basically, there are a bunch of hyphenation templates, encoded in a funny way, which the word is matched against and the hyphens pop out like magic. This works properly on all but a very few words, which can be caught in a small explicit hyphenation dictionary. Both the relevant files (templates and exceptions) come with the standard TeX distribution, so all you need to do is implement the algorithm. Have fun!

--
pEtEr AlfkE @ cSVAx.cAlTEch.EDU
I'm going to have to torture you now, but I want you to know it isn't personal.
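The "small explicit hyphenation dictionary" can be sketched as a lookup that runs before the templates are tried. This is a minimal Python sketch under stated assumptions: the names `load_exceptions` and `break_offsets` are hypothetical, the entries are written with hyphens at the permitted breaks, and the sample words are chosen purely for illustration rather than copied from any distributed exception file. The `by_patterns` argument stands in for whatever template matcher you implement.

```python
def load_exceptions(entries):
    """Build a table from hyphen-marked words, e.g. 'ta-ble'.

    Maps the plain lowercase word to a list of break offsets.
    An entry with no hyphen means 'never break this word'.
    """
    table = {}
    for entry in entries.split():
        offsets, pos = [], 0
        for piece in entry.split("-")[:-1]:
            pos += len(piece)
            offsets.append(pos)
        table[entry.replace("-", "").lower()] = offsets
    return table


def break_offsets(word, exceptions, by_patterns):
    """Consult the exception table first; fall back to the templates."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return by_patterns(word)


# Illustrative entries only (assumed, not a real exception list).
EXCEPTIONS = load_exceptions("as-so-ciate ta-ble present")

# Stand-in for a real template matcher: reports no breaks at all.
no_breaks = lambda word: []

print(break_offsets("Table", EXCEPTIONS, no_breaks))    # found in the table
print(break_offsets("machine", EXCEPTIONS, no_breaks))  # falls through
```

Storing offsets rather than pre-split pieces keeps the caller's capitalization intact, since the caller slices its own copy of the word.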
chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)
Date: 9 Jun 1987 10:31-EDT
From: Tom.Lane@zog.cs.cmu.edu

>Yep, storing the dictionary is the heart of the problem. There is a
>guy at Stanford who did his thesis on the subject. [...]
>I'd wager that there is available software to demonstrate
>his thesis.

If I'm not mistaken, the guy was one of Knuth's students, and his results are implemented in TeX. If you don't want to use TeX, you could still borrow its hyphenation algorithms and dictionary; they're public domain as I understand it.

tom lane
-----
ARPA: lane@ZOG.CS.CMU.EDU
UUCP: ...!seismo!zog.cs.cmu.edu!lane
BITNET: lane%zog.cs.cmu.edu@cmuccvma