[net.internat] Hyphenation

franka@mmintl.UUCP (Frank Adams) (11/05/85)

[Not food]

In article <471@harvard.ARPA> kosower@harvard.ARPA (David A. Kosower) writes:
>There is, for 
>example, a large body of knowledge that has been built up over the years
>on the proper and elegant way to handle hyphenation automatically
>in English.  There are a variety of algorithms and methods that
>text formatters use.

Yes, and none of them are any good.  Have you seen the things those
algorithms do?  The only successful hyphenation algorithm is to look
the word up in a dictionary.

There are probably more and better on-line dictionaries available for
English than for any other language.  This is an issue that must be
addressed.

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

sommar@enea.UUCP (Erland Sommarskog) (11/09/85)

In article <773@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>algorithms do?  The only successful hyphenation algorithm is to look
>the word up in a dictionary.
>
And sometimes not even that is foolproof. I can't speak for English,
but I could give you examples from Swedish of words whose correct hyphenation
is different depending on their meaning. I would say the only sure
way to get a correct hyphenation is to have it interactive. Such things
are above the competence of a computer. 

haapanen@watdcsu.UUCP (Tom Haapanen [DCS]) (11/10/85)

In article <773@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:

>Yes, and none of them are any good.  Have you seen the things those
>algorithms do?  The only successful hyphenation algorithm is to look
>the word up in a dictionary.

>There are probably more and better on-line dictionaries available for
>English than for any other language.  This is an issue that must be
>addressed.

Yes, some languages are quite easy to hyphenate.  The FinnAPL idiom
handbook includes a one-liner in APL that correctly hyphenates about
90% of Finnish words; a friend added a few more lines and got it up
to what I estimate to be between 99.5 and 99.9%.  It still misses on
	  __ _
	"haayoaie"	(no umlauts on vt240's)
which translates to "wedding night intention".  Oh well, that is
apparently the most consecutive vowels of any Finnish word.  The
correct hyphenation is haa-yo-ai-e, incidentally.


				   \tom haapanen
				   watmath!watdcsu!haapanen
Im all lost in the Supermarket
I can no longer shop happily
I came in here for that special offer
Guaranteed personality				 (c) The Clash, 1979

trickey@alice.UucP (Howard Trickey) (11/10/85)

> Yes, and none of them are any good.

The hyphenation algorithm invented by Frank Liang, and incorporated in TeX,
is good.  It is essentially a way of converting a hyphenated wordlist
(from a dictionary, but with all forms of all words) into a
list of "patterns".  You can set parameters to trade off table size
vs. percentage of hyphens that it will find vs. error rate.
The standard TeX table takes about 20kbyte, finds 86.7% of the hyphens
in an inflected Webster's Pocket Dictionary (and all of the hyphens in
the 676 most common words), and produces no wrong hyphens.  With about 2kbyte
you could find 35.2% of the hyphens, again with no errors.
(Please note that this algorithm is in TeX82, not the original TeX.)
See "Word Hy-phen-a-tion by Com-put-er" by Frank Liang (Ph.D. thesis),
Stanford CS Dept. tech report STAN-CS-83-977 for details.

Several groups have done French hyphenation tables using this algorithm,
and found that they are typically much smaller than English ones.

pdg@ihdev.UUCP (P. D. Guthrie) (11/11/85)

>Yes, some languages are quite easy to hyphenate.  FinnAPL idiom
>handbook includes a one-liner in APL that correctly hyphenates about
>90% of Finnish words; a friend added a few more lines and got it up
>to, what I estimate to be between 99.5 and 99.9%.  
>	  __ _
I once saw a complete text editor that was a one-liner in APL :-)
Eight hundred or so symbols though.

kosower@harvard.ARPA (David A. Kosower) (11/14/85)

   In article <773@mmintl.UUCP>, Frank Adams claims that none of the
existing hyphenation algorithms are any good.  I believe this is
false, and will attempt to demonstrate this below.  But first,
a few comments on article <968@enea.UUCP>, by Erland Sommarskog.
He notes that there are words in Swedish whose hyphenation depends
on context.  This is a good point; in fact, such examples exist
in English, too: consider the word "record".  The verb is hyphenated
"re-cord", the noun "rec-ord".  This point is interesting not
because it indicates any fundamental problem preventing the
construction of an algorithmic hyphenator (it is in fact rather
irrelevant, as we shall see), but rather because it is one of
several reasons that dictionary lookup isn't a perfect solution
to the hyphenation problem, either.

   Before proceeding, we must quickly dispose of the conclusion
Sommarskog draws from this observation, namely that "the only sure
way to get a correct hyphenation is to have it interactive" [sic].
Nonsense!!  Note, for a trivial example, that on computers, the
identity (a*2)/2 = a is not valid!  Does this mean that integer
arithmetic must be done interactively??  Of course not.  It merely
means the programmer must have some way of knowing when the difference
between genuine integer arithmetic and integer arithmetic modulo
the word-size of the computer is important (easy), and that the
programmer should be able to get around this problem if he so
desires (also easy).  The parallel ingredients in our case are
the ability to recognize that a word was hyphenated incorrectly
(read the document you've written and consult a dictionary when
in doubt), and the ability to override the computer's mistake
(provided in any sensible text formatter).
[Note that the latter is in any event necessary, since different
authorities -- different dictionaries -- may disagree on the
correct hyphenation of a word: compare "in-de-pend-ent" from
the American Heritage Dictionary with "in-de-pen-dent" from 
Webster's.  British speakers may reference that greatest 
dictionary of all, the Oxford English Dictionary, to see what
it says about the word.]
In any event, surely you do not want to be asked to hyphenate any
number of words EVERY single time you run your document through
a text formatter, or every time your what-you-see-is-what-you-get
editor reads it in... good grief.
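
The arithmetic aside above can be checked with a small sketch.  Python
integers never overflow, so the helper `wrap16` (a hypothetical name) simulates
a signed machine word; the 16-bit width is an assumption made only for
illustration, not something stated in the post.

```python
def wrap16(n):
    """Reduce n to a signed 16-bit value, as a 16-bit machine word would hold it."""
    n &= 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n


def double_then_halve(a):
    """Compute (a*2)/2 where the intermediate product is stored in 16 bits."""
    return wrap16(a * 2) // 2


# Small values survive the round trip; large ones do not.
print(double_then_halve(100))     # 100
print(double_then_halve(20000))   # -12768, because 40000 does not fit in 16 bits
```

The programmer's remedy is the one described above: know when the product can
exceed the word size, and widen the intermediate value when it matters.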

   Just to beat a dead horse a little bit more, and also because
the point is important, what makes you so sure that YOU can hyphenate
words correctly?  Hyphenation, I submit, is considerably more 
difficult than spelling; given the rather large number of poor
or mediocre spellers even among native speakers of a language,
at least in an unphonetic language like English in which spelling is 
difficult, I would be quite surprised if even a relatively poor
hyphenation algorithm did not perform better than the vast
majority of native speakers.  [To see that this is NOT a trivial
point, consider the parallel situation for many syntactical or
semantic errors.]  Most native speakers will probably hyphenate
at least a fair percentage of words by... looking them up in
a printed dictionary.  To have a computer program waiting for
a human user to look up a fact in a printed dictionary and type
it in must surely be one of the better parodies of technological
progress.

   Those who do not believe the claim made in the preceding 
paragraph are invited to perform the following experiment: get
a friend to pick 20 to 30 words out of a dictionary.  Hyphenate
them, and score yourself as a "hyphenation algorithm" by the
criteria described below.  Then repeat the experiment with 10
subjects chosen at random from the local populace.

   So we really ought to have the computer do most of the hyphenation
work.  We want it to do it "well", so that we have to correct it as
little as possible.  What does "well" mean?  Obviously, we don't
just mean "yield as few incorrect hyphenations as possible", since
the trivial algorithm "yield no hyphenations" satisfies this
condition but surely is not a very useful hyphenation algorithm.
In fact, there are three significant numbers about any hyphenation
mechanism ("mechanism" here includes dictionary lookup):

   o  The percentage of incorrect hyphenations it produces.

   o  The percentage of all possible hyphenations that it actually
      finds.

   o  Its efficiency.

Both of the first two numbers should of course be measured for realistic
text samples, i.e. they should be weighted for REALISTIC frequencies
of word appearances.  We want the first number to be as close to
zero as possible, and the second number to be as close to 100%
as possible.  But while we would probably not tolerate a percentage
of incorrect hyphens greater than about 5% (remember that hyphenation
isn't all that frequent in most documents, so this already amounts
to a rather infrequent error), we might well tolerate an algorithm
that produces significantly less than 100% of all possible hyphens,
especially if the hyphens it does find break the word up into
small enough chunks; I would estimate that a figure as low as 70 to
80% might be acceptable here.
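
The first two numbers can be made concrete with a short sketch.  The names
`break_points` and `score` are hypothetical, the sample words are only
illustrations, and reading "percentage of incorrect hyphenations" as the share
of wrong hyphens among all hyphens produced is my interpretation.  The optional
`freq` table stands in for the realistic word frequencies asked for above.

```python
def break_points(hyphenated):
    """Return the set of character offsets where e.g. 'mar-ket' may be broken."""
    points, offset = set(), 0
    for piece in hyphenated.split("-")[:-1]:
        offset += len(piece)
        points.add(offset)
    return points


def score(proposed, reference, freq=None):
    """Return (error_rate, found_rate) of proposed hyphenations vs. a reference.

    error_rate: wrong hyphens / hyphens produced
    found_rate: correct hyphens / hyphens the reference allows
    freq:       optional word -> weight table (defaults to 1 per word)
    """
    found = wrong = possible = 0
    for word, ref in reference.items():
        weight = freq.get(word, 1) if freq else 1
        ref_pts = break_points(ref)
        got = break_points(proposed.get(word, word))
        found += weight * len(got & ref_pts)
        wrong += weight * len(got - ref_pts)
        possible += weight * len(ref_pts)
    produced = found + wrong
    error_rate = wrong / produced if produced else 0.0
    found_rate = found / possible if possible else 0.0
    return error_rate, found_rate


reference = {"market": "mar-ket", "demonstration": "dem-on-stra-tion",
             "record": "rec-ord"}
proposed = {"market": "mar-ket", "demonstration": "de-mon-stra-tion",
            "record": "record"}
print(score(proposed, reference))   # (0.25, 0.6)
```

The trivial "yield no hyphenations" mechanism scores an error rate of zero and
a found rate of zero here, which is why both numbers are needed.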

   Much of the remainder of this article is based upon Appendix H
of the TeXbook, by Donald Knuth, and upon the Knuth-Liang hyphenation
algorithm it describes.  The two have invented an algorithm which
satisfies the criteria described above, and its implementation in
TeX, Knuth's formatter, allows both for user modifications, and
for overriding the hyphenation decisions of the formatter in 
specific instances.  In practice, I have indeed found that the
algorithm works quite well.

  I will not describe the Liang algorithm in detail; those who
want more information should consult either the aforementioned
appendix of the TeXbook (by Donald E. Knuth, published jointly
by the American Mathematical Society and Addison-Wesley, 1984)
or Liang's Ph.D. thesis (Department of Computer Science,
Stanford University, 1983).  For our purposes, it will suffice to
note that the algorithm first looks up the word in an exception
dictionary, which is rather small (< 50 words), and for those
words not in the exception dictionary (the majority) uses tables
of word-fragment patterns to make a hyphenation decision.  These
tables occupy perhaps 20KB in compressed (but directly usable)
form.  Its vital statistics will also be useful (I have not
verified these; they are quoted directly from Appendix H
of the TeXbook):

  o   It hyphenates completely and correctly the 700 or so most
      common words in English.
  o   It finds 89.3% of the hyphens in a 115,000 word dictionary
      supplied by an unnamed publisher (I assume Merriam-Webster).
  o   It inserts NO incorrect hyphens for any of the words in
      that dictionary.
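
For readers who want a feel for the two-stage lookup just described, here is a
minimal sketch of pattern-driven hyphenation in the style of Liang's method.
It is not TeX's implementation: the patterns and the exception entry are toy
values invented for this example (the real tables are far larger), and the
names `parse_pattern`, `hyphenate`, `left_min` and `right_min` are assumptions.
Digits in a pattern score the gap between letters; the highest score at each
gap wins, odd scores permit a break and even scores forbid one.

```python
def parse_pattern(pat):
    """Split a pattern such as 'r2kb' into its letters and per-gap scores."""
    letters = "".join(ch for ch in pat if not ch.isdigit())
    scores = [0] * (len(letters) + 1)
    gap = 0
    for ch in pat:
        if ch.isdigit():
            scores[gap] = int(ch)
        else:
            gap += 1
    return letters, scores


# Toy data, invented for illustration only.
PATTERNS = dict(parse_pattern(p) for p in ["r1k", "r2kb", "k1b", "l1l"])
EXCEPTIONS = {"present": "present"}   # context-dependent, so never broken


def hyphenate(word, patterns=PATTERNS, exceptions=EXCEPTIONS,
              left_min=2, right_min=3):
    lower = word.lower()
    if lower in exceptions:              # stage 1: exception dictionary
        return exceptions[lower]
    padded = "." + lower + "."           # dots mark the word boundaries
    levels = [0] * (len(padded) + 1)     # levels[i] scores the gap before padded[i]
    for start in range(len(padded)):     # stage 2: every matching pattern votes
        for end in range(start + 1, len(padded) + 1):
            scores = patterns.get(padded[start:end])
            if scores:
                for k, s in enumerate(scores):
                    levels[start + k] = max(levels[start + k], s)
    pieces, last = [], 0
    for j in range(left_min, len(word) - right_min + 1):
        if levels[j + 1] % 2 == 1:       # odd score: a break is allowed here
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("market"))     # mar-ket
print(hyphenate("workbook"))   # work-book ('r2kb' overrides 'r1k')
```

The second call shows why the table can stay small: a short general pattern
proposes a break, and a longer, more specific pattern vetoes it only where it
would be wrong.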

Let us now compare this algorithm and the other "obvious" method
of hyphenation, dictionary lookup.

   Dictionaries are of course necessary for document preparation,
e.g. for spelling checkers, and the preparation of computer-readable
dictionaries for foreign languages is both necessary and 
desirable.  I do not wish to belittle any such undertaking in
any way; on the contrary, I encourage it strongly.  Dictionaries
could in principle also be used to store hyphenation points,
and indeed, such dictionaries are needed as adjuncts to the preparation
of tables for the Knuth-Liang algorithm, and for checking its
accuracy.  But I do not believe they provide the most efficient
way of giving a text formatter information about hyphenation in
a given language.  I know this to be true for English, and I would
expect it to be true for most Indo-European languages.  For
other languages (Semitic and Finno-Ugric are the two other major 
groups that come to mind; note that languages written in Kanji
don't have hyphenation), it's anyone's guess.

   The example given at the very beginning, of a word whose hyphenation
depends on its meaning, and therefore on the context it is present
in (hah! there's another example: "pre-sent" and "pres-ent"), shows
that a dictionary, too, cannot give all possible hyphenations.
It must omit some.  Dictionaries are rather large, and thus likely
to contain errors; this will lead to a less-than-perfect score on
avoiding incorrect hyphens.  In fact, even a small dictionary will
contain about 50K words; since the average word length in English
is slightly over 4 characters, such a dictionary needs 200KB;
the most space-efficient encoding for the hyphenation points demands
another byte per word, so our baby dictionary is already 250KB.
Even in this day and age, accessing a 20KB table, and performing
a little computation is more efficient than accessing a 250KB
table -- in fact it gets better as time goes along, simply because
silicon hardware is getting faster more quickly than mass-storage devices.
But these large dictionaries aren't large enough.  English has a lot 
more words than 50K; so we really need a larger dictionary.  How
large?   Well, we have to include derived forms (e.g. "demonstration"
from  "demonstrate") since the hyphens for derived forms CANNOT be
derived easily from those for the root word (as I've said, hyphenation
is harder than spelling!); compare "dem-on-stra-tion" with
"de-mon-stra-tive".  We also will need to store words built up by
addition of suffixes and prefixes (how can we tell those that behave
regularly from those that don't?).  The Third Edition of
Merriam-Webster's dictionary contains 450,000 words, admittedly
including many compound nouns which probably don't need to be in a
hyphenation dictionary, but also omitting many forms which ought to be
included.  Now we're talking 2.3 megabytes.  I don't know how many
words the full Oxford English Dictionary (surely the most
comprehensive dictionary of any language ever assembled) includes, but
I wouldn't be surprised if the figure were 2 or 3 million.   So now
we're up to 10 megabytes (perhaps more).  Real estate on magnetic
media is cheap, but not THAT cheap.  And 10MB is not an amount of data
that can be stored easily in RAM (and therefore accessed
efficiently).
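
The size estimates in this paragraph follow from one multiplication, sketched
here.  The function name and the rounding of the average word length to exactly
4 characters are assumptions; the post says "slightly over 4", which is why its
450,000-word figure comes out as 2.3 rather than 2.25 megabytes.

```python
def dictionary_bytes(words, avg_len=4, hyphen_bytes=1):
    """Rough storage for a hyphenation dictionary: letters plus one byte of break flags per word."""
    return words * (avg_len + hyphen_bytes)


for label, words in [("small dictionary", 50_000),
                     ("Merriam-Webster Third", 450_000),
                     ("OED, low guess", 2_000_000)]:
    print(f"{label}: {dictionary_bytes(words) / 1000:.0f} KB")
```

Against these figures the pattern tables of roughly 20KB are smaller by a
factor of more than ten even for the smallest dictionary.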

  Of course, one could stop at any stage, and either (a) give up, and
demand that the user specify explicit hyphenations for everything
that's left or (b) extend the dictionary with (horrors!) an
algorithmic method of hyphenation.  Either choice concedes that
there's really something to algorithmic hyphenation after all.  The
second choice shows that the Knuth-Liang algorithm is one particular
choice in a wide range of choices in a division of labor between a
dictionary and an algorithm.  But it is a careful choice, based on
analysis, not on a haphazard gut feeling.  It shows that a tiny
dictionary in combination with a pattern-driven algorithm can work,
and work efficiently.

   In fact, the patterns that Liang has collected contain quite a bit
of information about the structure of words in English not to be found
by lookup in any dictionary.  The last class of additions we must
consider are coined words, some of which may derive from known words,
and others which may not, and bits of common usage.  Living languages,
after all, are truly alive: they evolve and change, adding new words
and new usages as time moves on.  There is inevitably a time lag
between the introduction of a new word, and its acceptance to the
point it appears in a standardized dictionary.  But new words do
follow certain deep rules of a language; we may not understand these
rules, or be able to enunciate them; but if we find an algorithm that
captures the essence of these rules, we have found a real gem.
Liang's algorithm is such a gem.  As examples, I will consider coined
words and nonsense words, the latter precisely because they sound as
though they ought to be English (some such words, e.g.  "chortle" =
combination of "chuckle" and "snort", do eventually become
full-fledged words).  In particular, consider "yuppie" (coined),
"quining" (from "quine", coined by Douglas Hofstadter in honor of the
logician W. V. Quine), and "brillig" (nonsense word).  They  are not
in any dictionary.  But Liang's algorithm hyphenates them correctly
[in my judgement]: "yup-pie", "quin-ing", and "bril-lig".

   As I said in my earlier posting, a great deal of knowledge has been
accumulated over the course of many years' work on hyphenators for English.  
These efforts have culminated in a workable scheme.  For related languages, 
I suspect the hardest part has already been done: finding an efficient, 
successful algorithm.  One must construct the appropriate tables for the new
target language, check the accuracy of the algorithm, and build up
lists of exceptions which must be treated specially.  Other general
rules (the equivalent of "no discretionary hyphens may be placed in a
hyphenated compound noun" and "do not hyphenate after the first
character of a word" in English) must be formulated.  For unrelated
languages, the problem may as yet be unsolved.  It might even be
unsolvable, though I doubt that.  Computers have evolved to the point
where matters once beyond the "competence of computers" are amenable
to algorithmic attack.  I am specifically avoiding AI-ish approaches;
these have failed more than they have succeeded, and they have given a
bad name to any computation-based approach.  For the hyphenation
problem, this bad name is undeserved.  The moral is that we should
be looking for such algorithms for a variety of problems, not ranting and
raving about how they can't work, and how primitive, knee-jerk solutions
are "obviously" the only possible ones.

                                       David A. Kosower
                                       kosower@harvard.ARPA

inc@fluke.UUCP (Ensign Benson, Space Cadet) (11/15/85)

>> ...a large body of knowledge ... on the proper and elegant way to handle
>> hyphenation ... There are a variety of algorithms and methods ...

 
> Yes, and none of them are any good.  Have you seen the things those
> algorithms do?  The only successful hyphenation algorithm is to look
> the word up in a dictionary.
> 
> Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
> Multimate International    52 Oakland Ave North    E. Hartford, CT 06108


Oh bull-ticky, Frank. English has only around half a dozen rules for
hyphenation. They mainly deal with syllabication.

    1) Words are divided only at the end of a syllable. Corollary: Do not
    divide a one-syllable word.

    2) Do not leave fewer than 3 letters at the end of a line, or place
    fewer than that number at the beginning of a line. Corollary: Do not
    divide words of fewer than 6 letters.

    3) Proper names are never hyphenated. The easy way is just to never
    hyphenate anything that begins with an upper-case letter. This means that no
    first word of a sentence would ever be hyphenated, but hey! it's better
    to err on the side of too few hyphens than too many.

    4) The last word of a paragraph or page is never hyphenated.

    5) Multi-syllabic prefixes or suffixes are not hyphenated. (super-
    market as opposed to su-permarket)

    6) Not really a rule, but lately getting there: never hyphenate two
    consecutive lines.
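
Rules 1 to 3 can be sketched as a filter over syllable boundaries.  This is
only an illustration, not CPT's code: the function name `allowed_breaks` is
hypothetical, and the syllable offsets are supplied by the caller because the
syllabication algorithm itself is not described here.  Rules 4 to 6 depend on
line and page context or on morphology, so they are left out.

```python
def allowed_breaks(word, syllable_breaks, min_fragment=3):
    """Filter syllable break offsets by rules 1-3 above.

    syllable_breaks: offsets into word where a syllable ends (rule 1).
    min_fragment:    fewest letters allowed on either side of the hyphen (rule 2).
    """
    if word[:1].isupper():                # rule 3: treat capitalized words as proper names
        return []
    if len(word) < 2 * min_fragment:      # corollary to rule 2
        return []
    return [b for b in sorted(syllable_breaks)
            if min_fragment <= b <= len(word) - min_fragment]


# su-per-mar-ket has syllable ends at offsets 2, 5 and 8.
print(allowed_breaks("supermarket", [2, 5, 8]))   # [5, 8]
print(allowed_breaks("Boston", [3]))              # []
```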

My former employer, CPT Corporation of Minneapolis, had implemented these
rules (as well as a 99% accurate syllabication algorithm) in their word
processing software as early as 1981. So yes, I have seen what software has
done for hyphenation -- fixed it!! 




-- 
			       Ensign Benson
			       -Space Cadet-
 
    _-_-_-_-_-_-_-_-_-_-_-The Digital Circus, Sector R-_-_-_-_-_-_-_-_-_-_-_

franka@mmintl.UUCP (Frank Adams) (11/16/85)

In article <501@harvard.ARPA> kosower@harvard.ARPA writes:
>Most native speakers will probably hyphenate
>at least a fair percentage of words by... looking them up in
>a printed dictionary.

Actually, I think most native speakers will put a hyphen in at a place
where they are reasonably sure one belongs, and will achieve a rather
high success rate at doing so.  I do agree that fully interactive
hyphenation is unacceptable.  However, a reasonably sized dictionary,
with resort to interaction instead of to an algorithm, seems to me
to be a viable option in many cases.  From experience, I would say
that most words not found in a 30,000 or so word dictionary are proper
nouns, and not likely to be found even in a much larger dictionary.

>In fact, there are three significant numbers about any hyphenation
>mechanism ("mechanism" here includes dictionary lookup):
>
>   o  The percentage of incorrect hyphenations it produces.
>
>   o  The percentage of all possible hyphenations that it actually
>      finds.
>
>   o  Its efficiency.
>
>Both of the first two numbers should of course be measured for realistic
>text samples, i.e. they should be weighted for REALISTIC frequencies
>of word appearances.  We want the first number to be as close to
>zero as possible, and the second number to be as close to 100%
>as possible.  But while we would probably not tolerate a percentage
>of incorrect hyphens greater than about 5% (remember that hyphenation
>isn't all that frequent in most documents, so this already amounts
>to a rather infrequent error), we might well tolerate an algorithm
>that produces significantly less than 100% of all possible hyphens,
>especially if the hyphens it does find break the word up into
>small enough chunks; I would estimate that a figure as low as 70 to
>80% might be acceptable here.

I would quibble with these figures.  I think you want the first number
under 1% for a general purpose algorithm.  On the other hand, I think
even 50% is quite adequate for the second.  Since the Knuth-Liang
algorithm [description in original article not quoted here] apparently
meets these criteria, I will withdraw my claim.

Frank Adams                           ihpn4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

arndt@ttds.UUCP (Arndt Jonasson) (11/16/85)

In article <773@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>In article <471@harvard.ARPA> kosower@harvard.ARPA (David A. Kosower) writes:
>>There is, for 
>>example, a large body of knowledge that has been built up over the years
>>on the proper and elegant way to handle hyphenation automatically
>>in English.  There are a variety of algorithms and methods that
>>text formatters use.
>
>Yes, and none of them are any good.  Have you seen the things those
>algorithms do?  The only successful hyphenation algorithm is to look
>the word up in a dictionary.

Isn't the hyphenation method used in Knuth's TeX rather good? Most 
algorithms used in e.g. newspapers are intolerably bad, though. There are a few
basic rules, enhanced by many exceptions. Often when hyphenating a word,
although the basic rules are quite appropriate, some exception rule
(or no rule at all, it seems) is used instead.

(Another thing needed is an algorithm for where to put commas in a sentence
in Swedish. Swedish journalists haven't the faintest notion what the purpose
of a comma is.)

Arndt Jonasson

sommar@enea.UUCP (Erland Sommarskog) (11/16/85)

In article <501@harvard.ARPA> David A. Kosower discusses hyphenation,
and argues in favour of algorithmic hyphenation over interactive
and dictionary methods. Specifically, he presents the Knuth-Liang
algorithm, which is really a combination of a dictionary and an algorithm.
I shall not discuss this algorithm, since I have very little experience
of TeX.

Instead I'm going to develop my thoughts about interactive hyphenation.
David gives two main arguments against hyphenating interactively:
1) It takes too much time. Having to hyphenate the same words every time
   you format a document is not any fun.
2) Humans also make errors. If I've understood him right, he's implying
   that an average writer would make more erroneous hyphenations than
   a good computer algorithm.
   
I start with 2).
After having read the number of examples, I'm about to agree with David,
as long as we stick to English. My reference point so far was Swedish,
and I claim that hyphenation is simpler in our language. It has
some basic rules which are overridden by concatenated words (easy for a
human, a bit more difficult for a computer) and loan words (can be a 
difficulty even for a native Swede sometimes). Since concatenated words
are very frequent in Swedish, a method like Knuth-Liang using fragments
of words is probably superior to a true dictionary method.

And so to 1)
Of course it would not be acceptable for a text formatter to keep asking
about the same word in the same text. Since I'm guilty of a relatively small
text formatter, Torino, which has interactive hyphenation, let me
describe how it is implemented.
  When Torino finds a "victim" it generates a proposal according to the basic
rules in Swedish and then asks the user to hyphenate the word, using
the proposal as the default value. The hyphenation is then stored in
a library which is saved in a file which has the same name as the
document but another extension. Next time the document is formatted,
Torino first checks the library and, if nothing is found, asks
the user. When checking the library there is a user-adjustable limit
so that "inter-nationalisation" is not chosen when "internationalisa-
tion" fits into the line.
  The library file is a text file, so if an erroneous syllabication
has found its way into the library it is easily removed.
  There are also some other possibilities: You can specify an explicit
syllabication file, or none at all. You can also turn off the
interactive part, using just the library. 
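
The library-then-ask flow described above might be sketched as follows.  This
is not Torino's code: the function names are hypothetical, a plain dict stands
in for the library file, and the user-adjustable limit is omitted because its
exact behaviour is not spelled out here.

```python
def lookup(word, library, propose, ask=None):
    """Return the hyphenation of word, consulting the library before the user.

    library: dict of word -> hyphenated form (stands in for the library file)
    propose: function giving a rule-based default, e.g. from the basic rules
    ask:     function (word, default) -> user's answer, or None to turn
             the interactive part off and use just the library
    """
    if word in library:
        return library[word]
    if ask is None:
        return word                       # no stored answer, so leave unbroken
    library[word] = ask(word, propose(word))
    return library[word]


# A stand-in for the user that simply accepts every proposal.
asked = []
def accept_default(word, default):
    asked.append(word)
    return default

library = {}
propose = lambda word: "till-laga"        # toy proposal for the one demo word
lookup("tillaga", library, propose, accept_default)
lookup("tillaga", library, propose, accept_default)
print(asked)                              # ['tillaga'], asked only once
```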
  The method I've used is quite simple and can be improved, but it has
some advantages over algorithmic and dictionary methods. The main one
is that it is almost language independent. Porting Knuth-Liang to
languages other than English will involve a good deal of work. 
  It should be noted that the method I've described is not truly
interactive; rather it is a combination of all three (algorithm used for
the proposals).  

On this discussion on hyphenation and internationalisation I'd like
to add the following question: Is there any chance that nroff or TeX
or any other English-speaking formatter would ever hyphenate the
Swedish word "tillaga" correctly? Even with help from the user?
The word is hyphenated "till-laga".

wmartin@brl-tgr.ARPA (Will Martin ) (11/18/85)

Actually, it seems to me that you are using this hyphenation difficulty
in the wrong way. Instead of going to great lengths to overcome it, you
could instead use it as a tool to eliminate the bad and noxious practice
of hyphenation itself! As more and more text-production facilities
become computerized, any restraints and limitations imposed by the
computerization will become de facto industry standards. So those who
want hyphenated text for justified right margins or whatever other
reasons could eventually become segregated into the manual-production
part of the field, IF you people, who are the ones that make
computerized hyphenation possible, will simply stick together against it!

There is NO *real* reason to hyphenate words to split them across lines;
it is merely a convention, established over hundreds of years by the
printing establishment. We have a chance here to overcome this hidebound
and annoying custom, and establish instead either variable spacing for
justified right margins, or, better yet, settle on irregular right
margins as the new standard.

Don't cooperate! Instead of making an effort to get the machines to do
what the people say they want, instead expend that same effort to
convince the people that they need no longer indulge in the antiquated
custom of hyphenating at all. And if no computer types do the bidding of
those wanting hyphenation, it will simply die out as the computerized
text-handling becomes ubiquitous. (You don't spend effort to get the
trailing "s" character to print like "f", do you? Treat hyphenation the
same way!)

Will

(If you hadn't guessed by now, I am against hyphenation, and never do it
myself. :-)

spw2562@ritcv.UUCP (11/21/85)

In article <3353@brl-tgr.ARPA> wmartin@brl-tgr.ARPA (Will Martin ) writes:
>                        We have a chance here to overcome this hidebound
>and annoying custom, and establish instead either variable spacing for
>justified right margins, or, better yet, settle on irregular right
>margins as the new standard.

Why don't we just start wraping words without hyphens wherever they hit th
e end of the line?  I think this is a whole lot easier than trying to just
ify or hypenate.  8-)

>                              And if no computer types do the bidding of
>those wanting hyphenation, it will simply die out as the computerized
>text-handling becomes ubiquitous.

Good idea!  As of now, I sit here doing nothing the hyphenwanters say.

>Will
>
>(If you hadn't guessed by now, I am against hyphenation, and never do it
>myself. :-)
          ^
	  | what's that little jober there??  8-)

==============================================================================
        Steve Wall [Snoopy] @ Rochester Institute of Technology
        USnail: 6675 Crosby Rd, Lockport, NY 14094, USA
        Usenet: ...!ritcv!spw2562                       Unix 4.2 BSD
        BITNET: SPW2562@RITVAXC                         VAX/VMS 4.2
        Voice:  Yell "Hey Steve!"

    Disclaimer:  What I just said may or may not have anything to do with
                 what I meant to say...

kosower@harvard.UUCP (David A. Kosower) (11/25/85)

   In reply to the discussion presented by Sommarskog (<1090@enea.UUCP>),
I have a few comments.   The need for Sommarskog's
formatter to resort to interactive hyphenation suggests to me that
the algorithm he uses is indeed somewhat inadequate; whether this is
a result of the error rate or of an insufficient number of candidate
hyphenation points found, he has not indicated.

   If the algorithm has an error rate (number of incorrect hyphens
produced) that is too high, it is simply unacceptable.  No amount
of interactive fooling around is going to hide this from the user,
or make it more palatable.  What `too high' means here is application-
dependent;  I would tend to agree that my earlier figure of 5% as
an upper limit is too lenient, and that the upper bound ought to be
1 or 2%.
   If the algorithm has a success rate (number of hyphens, out of
all those possible) that is too low, it simply isn't very useful.
In the case of Sommarskog's formatter, this means the user will have
to construct library files for many words.  It is *dumb*, *wasteful*,
and *error-prone* to have each user do this independently; such a 
library, in any sane system, will be a system-wide resource.  It will 
be done *once*, presumably by an outside company whose business is 
producing such files.  Hmmm... haven't we seen such a creature before?
Yes!  It's our old friend the hyphenation dictionary!
   If the algorithm has a high success rate, then there's really no need
for interactive hyphenation.  One might ask the user about exceptions here,
rather than requiring him to put them into his document or into
a library file, though only at the price of disabling non-interactive
running of the formatter even for valid input files; but there is no
great advantage to doing so.
   Interactive hyphenation seems to be mostly a crutch for inadequate
hyphenation algorithms.  The solution is not to add interactive
hyphenation to such a system, but to implement adequate hyphenation
algorithms.  Such algorithms are now *known* to exist, so there isn't
really an excuse not to use them.

   Sommarskog further claims his method is language-independent.  This is
rubbish.  Hyphenation `rules' in, say, English are complicated, and
I doubt they will be embodied, explicitly or implicitly, in any algorithmic
approach that is not equivalent in complexity to the Knuth-Liang 
approach.  Porting the Knuth-Liang algorithm to other languages does
require a fair amount of work, as Sommarskog points out; but this
work only has to be done *once*, whereupon it is available to all who want
to write in that language, the residual incremental effort on each
user's part being negligible.
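
   For readers who have not seen the pattern approach, the following
is a rough sketch in Python of how Liang-style patterns are applied.
It is not TeX's implementation.  The nine patterns are the handful I
believe are commonly quoted for the word `hyphenation'; treat them as
an illustrative assumption, not a copy of the real pattern file, which
holds thousands of entries.  The minimum fragment lengths (2 letters
before a break, 3 after) are likewise assumed defaults.

```python
def parse_pattern(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    digits[i] is the value in front of letter i; the last entry is the
    value after the final letter.  Missing digits count as 0.
    """
    letters = []
    digits = [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


# Illustrative pattern set only (an assumption, see the lead-in).
RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
PATTERNS = dict(parse_pattern(p) for p in RAW_PATTERNS)


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=3):
    """Insert '-' wherever the winning inter-letter value is odd."""
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[k] sits in front of padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            digits = patterns.get(padded[start:end])
            if digits is None:
                continue
            for offset, value in enumerate(digits):
                # Where patterns overlap, the largest value wins.
                scores[start + offset] = max(scores[start + offset], value)
    pieces = []
    last = 0
    # A break in front of word[i] corresponds to scores[i + 1].
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:         # odd permits a break, even forbids it
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation"))
```

The language-specific work lives entirely in the pattern table; the
code that applies it stays the same from one language to the next.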

   As far as the question about `tillaga' is concerned (the word
hyphenates `till-laga'), the answer is that although TeX will of course
not hyphenate it automatically in this fashion (it doesn't speak
Swedish either! :-)), it can be taught to do so.  Although the
following details are rather technical, I believe they are of
sufficient interest for me to present them.  Those who are familiar
with TeX and who have read the relevant section of the TeXbook
(by Donald E. Knuth, publ. by Addison-Wesley), please forgive my
verbosity.
   In TeX, discretionary hyphens are specified using the
`\discretionary' primitive.  It takes three arguments:  the pre-break
text, the post-break text, and the no-break text.  Thus, if one
wanted to specify the hyphenation of `market' explicitly (though TeX
already knows it), one would say:
       mar\discretionary{-}{}{}ket
because it hyphenates as `mar-ket'.  To hyphenate `tillaga', one could
say:
       till\discretionary{-}{l}{}aga
  or
       til\discretionary{l-}{}{}laga
One of these options is undoubtedly linguistically more correct, but
they produce the same effect.  The reason for the two options is that
the mechanism can handle more difficult examples: one Knuth gives
is the German word `backen' which hyphenates as `bak-ken'.  One
must specify this by
       ba\discretionary{k-}{k}{ck}en
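
   The three-argument mechanism can be modelled in a few lines of
Python.  This is only a sketch of the idea, not of TeX's internals:
the names `Disc', `unbroken' and `broken' are hypothetical, and the
German word is spelled out in full here, with its trailing `en'.

```python
from typing import NamedTuple, Sequence, Tuple, Union


class Disc(NamedTuple):
    # Models \discretionary{pre}{post}{nobreak}
    pre: str      # text ending the first line if the break is taken
    post: str     # text starting the second line if the break is taken
    nobreak: str  # text used when no break is taken here


Item = Union[str, Disc]


def unbroken(items: Sequence[Item]) -> str:
    """Spell the word with no break taken anywhere."""
    return "".join(it.nobreak if isinstance(it, Disc) else it for it in items)


def broken(items: Sequence[Item], at: int) -> Tuple[str, str]:
    """Break at items[at]; every other discretionary stays unbroken."""
    disc = items[at]
    if not isinstance(disc, Disc):
        raise ValueError("items[at] is not a discretionary")
    return unbroken(items[:at]) + disc.pre, disc.post + unbroken(items[at + 1:])


market = ["mar", Disc("-", "", ""), "ket"]
tillaga_a = ["till", Disc("-", "l", ""), "aga"]
tillaga_b = ["til", Disc("l-", "", ""), "laga"]
backen = ["ba", Disc("k-", "k", "ck"), "en"]

for word in (market, tillaga_a, tillaga_b, backen):
    print(unbroken(word), "->", " / ".join(broken(word, 1)))
```

Both spellings of `tillaga' give the same unbroken and broken forms,
which is the sense in which they produce the same effect.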

  English has a slightly more subtle version of the same problem, which
has to do with ligatures; sometimes, a hyphen would split a ligature, e.g.
`...ff...' would become `...f-f...'.  Although this doesn't change
the actual characters, it does change the character widths involved,
and so the formatter must be cognizant of the fact (and TeX does 
know about such things).
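
   A small Python sketch of why the widths change.  The width table
is entirely made up (arbitrary integer units, with ligatures narrower
than their separate letters); only the mechanism matters.  The word
`difficult' and the names `glyphs' and `width' are my choices for the
illustration.

```python
# Hypothetical glyph widths in arbitrary integer units (not real font metrics).
WIDTHS = {"f": 3, "i": 3, "ff": 6, "fi": 6, "ffi": 8, "-": 3}
DEFAULT_WIDTH = 5
LIGATURES = ["ffi", "ff", "fi"]  # longest first, so 'ffi' wins over 'ff'


def glyphs(text):
    """Split text into glyphs, replacing letter runs by ligatures."""
    out = []
    i = 0
    while i < len(text):
        for lig in LIGATURES:
            if text.startswith(lig, i):
                out.append(lig)
                i += len(lig)
                break
        else:
            out.append(text[i])
            i += 1
    return out


def width(text):
    """Total width of text after ligature substitution."""
    return sum(WIDTHS.get(g, DEFAULT_WIDTH) for g in glyphs(text))


whole = "difficult"
first, second = "dif-", "ficult"   # the word broken inside its 'ffi' ligature

print(glyphs(whole), width(whole))
print(glyphs(first), width(first))
print(glyphs(second), width(second))
```

Unbroken, the word contains one `ffi' glyph; broken, it contains a
plain `f' on the first line and an `fi' ligature on the second, so
the formatter cannot simply add the width of a hyphen to the width
of the unbroken word.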

                                    David A. Kosower
                                    kosower@harvard.ARPA

andersa@kuling.UUCP (Anders Andersson) (11/25/85)

In article <3353@brl-tgr.ARPA> wmartin@brl-tgr.ARPA (Will Martin ) writes:
>There is NO *real* reason to hyphenate words to split them across lines;

>(If you hadn't guessed by now, I am against hyphenation, and never do it
>myself. :-)

That's what I learnt at school - English text is seldom (or never)
hyphenated, and the natural reason is that the average length of the
words is rather small. The situation in other languages might be
different, as it is in Swedish, where short words are concatenated
into long ones and supplied with numerous endings. As Sommarskog
(that's ten letters, quite a normal length of a name in my opinion!)
points out, newspapers would look terrible without hyphenation
(that is, worse than they do today). Do we need to discuss Finnish..?

Conclusion: Even if *you* don't need hyphenation, you'll get it for free,
because *we* need it!
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)

chris@umcp-cs.UUCP (Chris Torek) (11/26/85)

Actually, if one were willing to expend the space, it should not
be too difficult to change TeX's hyphenation dictionary code to
use `nodes' rather than `characters', at which point a \hyphenation{...}
could include `\discretionary's as well as simple character lists.
This could automatically handle words that change spelling when
they are hyphenated.  Of course, for a very large set of such words,
this degenerates into a dictionary approach.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

req@warwick.UUCP (Russell Quin) (12/02/85)

A number of people have been discussing the TeX hyphenation algorithm & disag-
reeing about whether it is perfect|acceptable|terrible.

So let's take a look at a book that was set using TeX: Ullman's ``Principles of
Database Systems'' was published by the Computer Science Press.

I agree that there are fewer hyphens in the book than there would have been
had it been set by troff, but that is not the point.

Below is a list of the hyphenations it used, roughly in the order in which they
appear, in the 1st 100 pages.
They vary between being perfectly acceptable and abysmal.  I was surprised by
"entity-rel- ationship" and "MO- THERS", but there are several others where the
pronunciation changes over the line break.

Clearly, TeX is *NOT* the be-all-and-end-all in automatic hyphenation.  If it
had only alerted the (presumably human) users when it had difficulties, they
might have been able to do something.
To be fair, careful proof-reading can help eliminate some of these horrors,
particularly things like 3 consecutive lines being hyphenated, or two
consecutive lines ending in the *same* hyphenated word (broken at the same
point).  Both of these last two problems occur in the book.
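
Both of those problems are mechanical enough that a program could flag them
for the proof-reader.  A minimal sketch in Python, assuming the set text is
available as a list of lines and that a line ending in "-" is a hyphenated
break (it will also flag lines that end in a legitimate dash); the function
name and the threshold of two are my own choices:

```python
def hyphenation_warnings(lines, max_run=2):
    """Flag runs of hyphenated lines and repeated identical breaks.

    Returns (line_number, message) pairs; line numbers start at 1.
    """
    warnings = []
    run = 0                 # consecutive hyphenated lines seen so far
    prev_fragment = None    # hyphenated fragment ending the previous line
    for number, line in enumerate(lines, start=1):
        words = line.rstrip().split()
        fragment = words[-1] if words and words[-1].endswith("-") else None
        if fragment is None:
            run = 0
        else:
            run += 1
            if run > max_run:
                warnings.append((number, f"{run} consecutive hyphenated lines"))
            if fragment == prev_fragment:
                warnings.append((number, f"repeated break '{fragment}'"))
        prev_fragment = fragment
    return warnings


sample = [
    "the rela-",
    "tions and rela-",
    "tions of opera-",
    "tions are fine",
]
for number, message in hyphenation_warnings(sample):
    print(f"line {number}: {message}")
```

On the four sample lines this reports the repeated break on line 2 and the
run of three hyphenated lines on line 3.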

I have included a few comments in the list, and marked 1 word that changes pro-
nunciation in England when it is thusly hyphenated, as it might not do so in
some parts of America.

Now, how many rules did TeX break?  How could it be improved?  (Answers
involving the use of rm(1) will *not* be considered.  We don't run TeX, anyway.)

		- Russell

infor- mation				there- fore	(on P.50)
presum- ably				set- ting
per- son				or- ganization	(line set loose)
respon- sibility			stor- age
descrip- tion				dele- tions
pro- grammers				con- venience
physi- cal				opera- tions
per- sonnel				con- sist
concep- tual				them- selves
lan- guages				occur- rences
manipu- lation				imple- mented
depend- ing				SS_ NO	(a variable name)
manipula- tion				ad- vantage
MO- THERS	(a variable name)	struc- tured
MO- THER_OF	(a variable name)	prefer- ably	(changes pronunciation!)
restric- tions				repre- sented
signif- icance				ex- hibit	(line set loose)
gen- eric				ap- proaches	(line set loose)
undir- ected				Stone- braker	(A proper name)
sys- tems				entity-rel- ationship	(!!!)
map- pings				con- fusion
rela- tions	(repeated on consecutive lines)
con- trived				SUP- PLIERS	(a variable name)
relation- ships
im- plementation
bidirec- tional
par- ticular
com- pany
EM- PLOYEES	(a variable name)
INGREDI- ENT	(a variable name)
respec- tively
opera- tion
follow- ing
com- puting
-- 
		... mcvax!ukc!warwick!req  (req@warwick.UUCP)
		... mcvax!ukc!warwick!frplist (frplist@warwick.UUCP)
friend: someone one seems to be able to tolerate at the moment

inc@fluke.UUCP (Gary Benson) (12/02/85)

> Actually, it seems to me that you are using this hyphenation difficulty
> in the wrong way. Instead of going to great lengths to overcome it, you
> could instead use it as a tool to eliminate the bad and noxious practice
> of hyphenation itself!
> 
> . . . many excellent arguments . . .
>
> Don't cooperate! Instead of making an effort to get the machines to do
> what the people say they want, instead expend that same effort to
> convince the people that they need no longer indulge in the antiquated
> custom of hyphenating at all.

Good Show!! I am the author of one of the (many) articles about how to make
the machine do what everyone assumes is a needed task: breaking up words
into unreadable chunks. It has long been known that "ragged right" text is
more readable, and the people holding us to it are those who are regimented
and think it "looks better". While I agree wholeheartedly with much of your
well-reasoned approach, I also feel it may not be as simple to implement the
demise of an odious practice...

Look at the newspaper business... over the years, they have found that to
compete successfully they need a range of front page stories - something to
entice a buy by persons having a variety of interests. If there is no
hyphenation, does that not preordain column widths greater than 40 or so
characters? And if the columns must be wider, then won't there be fewer
stories? Or at least shorter ones before the "(see page C-23)" message?
It seems like there may be a dilemma in the newspaper business that is
actually best answered by hyphenation, however distasteful and hard to read.



-- 
 Gary Benson  *  John Fluke Mfg. Co.  *  PO Box C9090  *  Everett WA  *  98206
   MS/232-E  = =   {allegra} {uw-beaver} !fluke!inc   = =   (206)356-5367
 _-_-_-_-_-_-_-_-ascii is our god and unix is his profit-_-_-_-_-_-_-_-_-_-_-_ 

cdsm@icdoc.UUCP (Chris Moss) (12/03/85)

There's a fallacy in assuming the hyphenation dictionary is "the standard
100% accurate authority" of course. That is, no two hyphenation dictionaries
I've seen agree with each other. Apart from traditional differences between
British & American practices, editors fail to agree on how the agreed "rules"
(many of which can conflict) apply. 

It seems an ideal area for work on automatic induction. Has anyone done
anything that anyone knows about?

trickey@alice.UucP (Howard Trickey) (12/05/85)

I know for a fact that Ullman's Principles of Databases book was typeset using
TeX78, which DIDN'T use the hyphenation algorithm that has been discussed
on the net.  The Liang algorithm only made it into TeX82.
Here are the hyphenations done on a book that used TeX82 (my thesis):
inter-vention par-ticular pro-grams repre-sentation pro-cedures
represen-tation transfor-mation elim-inated instruc-tions
opti-mizations Sec-tion Ap-pendix

I think Ullman's VLSI book used TeX82, but I won't swear to it.