[comp.text.desktop] Hy-phen-a-tion dic-tion-ary

chuq@plaid.UUCP (06/08/87)

Date: Fri, 5 Jun 87 19:12:35 PDT
From: dick@ccb.ucsf.edu (Dick Karpinski)

Yep, storing the dictionary is the heart of the problem.  There is a
guy at Stanford who did his thesis on the subject.  I forget his name
but his thesis is published, so people at Stanford should be able to
locate it.  I'd wager that there is available software to demonstrate
his thesis.  He found some pretty neat ways to compress the dictionary.

Dick Karpinski

----------------------------------------
Submissions to:   desktop%plaid@sun.com -OR- sun!plaid!desktop
Administrivia to: desktop-request%plaid@sun.com -OR- sun!plaid!desktop-request
Paths:  {ihnp4,decwrl,hplabs,seismo,ucbvax}!sun

Chuq Von Rospach	chuq@sun.COM		Delphi: CHUQ

Now, where did my ex-wife put my Fairy Dust?

chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)

From: rokicki@rocky.stanford.edu (Tomas Rokicki)
Date: 8 Jun 87 22:52:39 GMT
Organization: Stanford University Computer Science Department

In article <20583@sun.uucp>, chuq%plaid@Sun.COM (Chuq Von Rospach) writes:
> Yep, storing the dictionary is the heart of the problem.  There is a
> guy at Stanford who did his thesis on the subject.  I forget his name
> but his thesis is published, so people at Stanford should be able to
> locate it.  I'd wager that there is available software to demonstrate
> his thesis.  He found some pretty neat ways to compress the dictionary.

Please pick up a copy of Computers & Typesetting, Volume B, entitled
TeX:  The Program, by Don Knuth, published by Addison Wesley Publishing
Company.  I paid $34.95 for my copy.  You should start at section 919,
page 386, and read.

It describes an implementation of Frank M. Liang's hyphenation algorithm,
which is described more fully in Liang's PhD thesis from Stanford
University.  (If anyone wants a copy of this, mail me and I'll give you
details.)

The dictionary is not compressed; rather, patterns are found and used.
These patterns are amazingly regular and dependable.  An exception
dictionary is used for those few exceptions (I believe a few dozen have
been found).
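
A minimal Python sketch of how such patterns can be applied.  This is not
TeX's code (TeX stores its patterns in a packed trie), and the pattern list
is a small illustrative subset, not the full pattern file.  The function
names, the one-entry exception table, and the minimum fragment lengths of
2 and 3 letters are assumptions made for the example.  Digits in a pattern
sit between letters, the largest digit seen at each position wins, and an
odd result permits a break there:

```python
import re

# Illustrative subset of patterns only; a real pattern file has thousands.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

# Hypothetical exception dictionary: words the patterns would get wrong.
EXCEPTIONS = {"table": "ta-ble"}


def parse_pattern(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]): one digit per letter gap."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            pos += 1
    return letters, digits


TABLE = dict(parse_pattern(p) for p in PATTERNS)


def hyphenate(word, min_left=2, min_right=3):
    """Return the word with '-' at every permitted break."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]

    # Dots mark the word boundaries so patterns can anchor to them.
    dotted = "." + lower + "."
    points = [0] * (len(dotted) + 1)  # points[j] is the gap before dotted[j]

    # Try every substring; keep the maximum digit seen at each gap.
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            digits = TABLE.get(dotted[start:end])
            if digits:
                for k, d in enumerate(digits):
                    points[start + k] = max(points[start + k], d)

    # An odd value allows a break; respect the minimum fragment lengths.
    pieces, last = [], 0
    for p in range(min_left, len(word) - min_right + 1):
        if points[p + 1] % 2 == 1:
            pieces.append(word[last:p])
            last = p
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation"))  # hy-phen-ation
print(hyphenate("table"))        # ta-ble (from the exception dictionary)
```

Even digits inhibit a break and odd digits allow one, which is how a
longer, more specific pattern can overrule a shorter one.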

Please, anyone writing or considering writing a typesetting program,
consider using these algorithms.  They are fast, small, use little data
space, and *work*.  There is no excuse for the poor hyphenation so many
systems give you, and also no reason to have to put in hyphens by hand.
If you desire, you can rip the code right out of TeX, since the source
code is readily available and readable; Don Knuth specifically allows this.

Also, when using systems with automatic hyphenation, please look over
the hyphens you are given and ensure they are reasonable.  There are
many words where a given break is usually fine but misleads the reader
in a particular context.  For instance,

	Automatic hyphenation systems used in auto-

Here the context and hyphenation lead the reader to expect the word
to be `automatic', but the word might be `automobiles', jarring the
reader ever so slightly.  There are much better examples, but none
spring to mind at the moment.

Liang's algorithm can be adapted to foreign languages fairly easily.
Some languages, such as German, change the spelling of words when
hyphenated (Zucker, for example, breaks as Zuk-ker), which can cause
some difficulty.

So, here's a question for y'all.  How should eighteen be hyphenated?
Dictionaries disagree; I want explanations of your choice.  Neither
eight-een nor eigh-teen looks quite right.  I vote for eight-teen, but
this violates accepted English hyphenation rules; this is one of
those words you should avoid hyphenating.

Bad hyphenation exists all over, and can be quite comical.  The text
`Introducing Artificial Intelligence' by G. L. Simons hyphenates vie-
wed.  That word, again, was viewed.  Took you a second shot to read it,
eh?  ``Logic Design Principles'' by Edward J. McCluskey abounds with
bad hyphenations; pick up a copy and look at a random page.  For
instance, `wav-eform.'

And, lastly, for the sake of Pete, do not break paragraphs into lines
a line at a time!
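
A rough Python sketch of the alternative: choose all the breaks of a
paragraph together so that the total raggedness is smallest.  This is a
simplified stand-in, not TeX's line breaker (which works with glue,
penalties and demerits).  The cost function (squared leftover space on
every line but the last), the function name, and the assumption that
every word fits within the line width are all choices made for the
example:

```python
def break_paragraph(words, width):
    """Break words into lines, minimizing the summed squared slack."""
    n = len(words)
    inf = float("inf")
    best = [inf] * (n + 1)      # best[i]: minimal cost of setting words[i:]
    next_break = [n] * (n + 1)  # next_break[i]: where the line starting at i ends
    best[n] = 0

    # Work backwards so the cost of every possible tail is already known.
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i + 1, n + 1):
            length += 1 + len(words[j - 1])
            if length > width:
                break
            slack = width - length
            cost = 0 if j == n else slack ** 2  # the last line is free
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                next_break[i] = j

    # Follow the recorded breaks from the start of the paragraph.
    lines, i = [], 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines


print(break_paragraph("aaa bb cc ddddd".split(), 6))
# ['aaa', 'bb cc', 'ddddd']
```

Filling each line greedily would give "aaa bb", "cc", "ddddd" here,
leaving a nearly empty middle line; looking at the whole paragraph
trades a slightly shorter first line for a much better second one.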

Sorry for the length; I get carried away.

							-tom


chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)

Date: Mon, 8 Jun 87 21:25:05 PDT
From: alfke@csvax.caltech.edu (J. Peter Alfke)
Organization: California Institute of Technology

dick@ccb.ucsf.edu (Dick Karpinski) writes:
>Yep, storing the dictionary is the heart of the problem.  There is a
>guy at Stanford who did his thesis on the subject.  I forget his name
>but his thesis is published, so people at Stanford should be able to
>locate it.  I'd wager that there is available software to demonstrate
>his thesis.  He found some pretty neat ways to compress the dictionary.

This method is used in TeX (not surprising, considering that Knuth's at
Stanford -- the guy was probably one of his grad students), and is described
pretty fully in Appendix H of The TeXbook.
Basically, there are a bunch of hyphenation templates, encoded in a funny way,
which the word is matched against and the hyphens pop out like magic.  This
works properly on all but a very few words, which can be caught in a small
explicit hyphenation dictionary.  Both the relevant files (templates and
exceptions) come with the standard TeX distribution, so all you need to do is
implement the algorithm.
Have fun!
-- 
							pEtEr AlfkE
I'm going to have to torture you now,				  @
but I want you to know						  cSVAx
it isn't personal.						.cAlTEch.EDU


chuq%plaid@Sun.COM (Chuq Von Rospach) (06/11/87)

Date:  9 Jun 1987 10:31-EDT 
From: Tom.Lane@zog.cs.cmu.edu

>Yep, storing the dictionary is the heart of the problem.  There is a
>guy at Stanford who did his thesis on the subject. [...]
>I'd wager that there is available software to demonstrate
>his thesis.

If I'm not mistaken, the guy was one of Knuth's students, and his
results are implemented in TeX.  If you don't want to use TeX, you
could still borrow its hyphenation algorithms and dictionary; they're
public domain as I understand it.

				tom lane
-----
ARPA: lane@ZOG.CS.CMU.EDU
UUCP: ...!seismo!zog.cs.cmu.edu!lane
BITNET: lane%zog.cs.cmu.edu@cmuccvma
