[comp.sources.misc] v08i066: cz text to PostScript system, part 02 of 14

allbery@uunet.UU.NET (Brandon S. Allbery - comp.sources.misc) (10/01/89)
Posting-number: Volume 8, Issue 66
Submitted-by: howard@dahlbeck.ericsson.se (Howard Gayle)
Archive-name: cz/part02

#! /bin/sh
# This is a shell archive.  Remove anything before this line, then feed it
# into a shell via "sh file" or similar.  To overwrite existing files,
# type "sh file -c".
# The tool that generated this appeared in the comp.sources.unix newsgroup;
# send mail to comp-sources-unix@uunet.uu.net if you want that tool.
# If this archive is complete, you will see the following message at the end:
#		"End of archive 2 (of 14)."
# Contents:  78.tex 78heur.h 8859-1.rc prolog-end.p4
# Wrapped by howard@dahlbeck on Mon Sep 25 07:15:12 1989
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f '78.tex' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'78.tex'\"
else
echo shar: Extracting \"'78.tex'\" \(33673 characters\)
sed "s/^X//" >'78.tex' <<'END_OF_FILE'
X% 78.tex - report on heuristics & how to write them
X%
X% Copyright 1989 Howard Lee Gayle
X% This file is written in the ISO 8859/1 character set.
X%
X% $Header: 78.tex,v 1.8 89/08/31 07:38:27 howard Exp $
X%
X% This program is free software; you can redistribute it and/or modify
X% it under the terms of the GNU General Public License version 1,
X% as published by the Free Software Foundation.
X%
X% This program is distributed in the hope that it will be useful,
X% but WITHOUT ANY WARRANTY; without even the implied warranty of
X% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
X% GNU General Public License for more details.
X%
X% You should have received a copy of the GNU General Public License
X% along with this program; if not, write to the Free Software
X% Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
X\documentstyle[11pt,openbib]{report}
X\input gold-lt
X\input latin-lt
X\input unix-lt
X\global \def \today {$Revision: 1.8 $}
X\raggedright
X\arraycolsep 0pt
X\parskip 1.5ex
X\parindent 2em
X\title{A trigram-based heuristic for distinguishing national
Xversions of ISO~646}
X\author{Copyright \copyright\ 1989 Howard Lee Gayle
X\thanks{
XTN/ETX/T/BG,
XEricsson Telecom~AB,
XS-126~25 Stockholm,
XSweden,
Xhoward@ericsson.se,
Xuunet!ericsson.se!howard,
XPhone: +46~8~719~5565,
XFAX: +46~8~719~9598,
XTelex: 14910~ERIC~S.}}
X\begin{document}
X\maketitle
X\begin{abstract}
XISO~646 has 10~bit combinations reserved for national use.
XWhen converting from ISO~646 to character sets such as
XISO~8859/1, it is necessary to determine which national version
Xcorresponds to each ``overloaded'' bit combination.
XI have developed two trigram-based heuristics.
XOne distinguishes a Danish national version from ASCII,
Xthe other a Swedish national version from ASCII.
XThe Swedish/ASCII heuristic has an error rate below 3\% on real
Xnews articles.
XI have not tested the Danish/ASCII heuristic because of a
Xshortage of test articles.
XThe heuristics are fast,
Xabout 8~kbyte/sec,
Xand make only one pass over the input text.
X
XThis paper explains how the heuristics works, and offers some
Xadvice to those who plan to write similar heuristics for other
Xcombinations of national versions.
X\end{abstract}
X\tableofcontents
X
X\chapter*{License}
XThis program is free software; you can redistribute it and/or modify
Xit under the terms of the GNU General Public License version~1,
Xas published by the Free Software Foundation.
X
XThis program is distributed in the hope that it will be useful,
Xbut WITHOUT ANY WARRANTY; without even the implied warranty of
XMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
XGNU General Public License for more details.
X
XYou should have received a copy of the GNU General Public License
Xalong with this program; if not, write to the Free Software
XFoundation, Inc., 675~Mass~Ave, Cambridge, MA~02139, USA.
X
X\chapter{Introduction}
XInternational Standard ISO~646
X\cite{ISO:646}
Xdefines a 7-bit character set for information processing.
XNot all bit combinations are defined by the standard:
Xten are available for national or application-oriented use.
XAlso, two of the bit combinations 
X(2/3 and 2/4)
Xcan represent one of two alternatives.
XISO~646 defines an International Reference Version (IRV), for
Xuse when there is no requirement to use a national or an
Xapplication-oriented version.
XThe following table shows the differences between IRV, ASCII,
Xthe UK national version,
Xthe standard Swedish national versions
X\cite{SWASCII}
Xfor proper names and for general use,
Xthe character set actually used in Swedish 
X(swnet hierarchy)
Xnews articles,
Xand
Xthe character set actually used in Danish
X(dk hierarchy)
Xnews articles.
X\[
X\begin{tabular}{cccccccc}
XCode       &
X   IRV     &
X   ASCII   &
X   UK      &
X   Swedish &
X   Swedish &
X   Swedish &
X   Danish  \cr
X                  &
X                  &
X                  &
X                  &
X   (proper names) &
X   (general)      &
X   (swnet)        &
X   (dk)           \cr
X\hline                                             
X02/03      &
X   \#      &
X   \#      &
X   \pounds &
X   \#      &
X   \#      &
X   \#      &
X   \#      \cr
X02/04        &
X   $\otimes$ &
X   \$        &
X   \$        &
X   $\otimes$ &
X   $\otimes$ &
X   \$        &
X   \$        \cr
X04/00    &
X   @     &
X   @     &
X   @     &
X   \'{E} &
X   @     &
X   @     &
X   @     \cr
X05/11    &
X   [     &
X   [     &
X   [     &
X   \"{A} &
X   \"{A} &
X   \"{A} &
X   \AE   \cr
X05/12           &
X   $\backslash$ &
X   $\backslash$ &
X   $\backslash$ &
X   \"{O}        &
X   \"{O}        &
X   \"{O}        &
X   \O           \cr
X05/13  &
X   ]   &
X   ]   &
X   ]   &
X   \AA &
X   \AA &
X   \AA &
X   \AA \cr
X05/14        &
X   \verb+^+  &
X   \verb+^+  &
X   \verb+^+  &
X   \"{U}     &
X   \verb+^+  &
X   \verb+^+  &
X   \verb+^+  \cr
X06/00       &
X   \verb+`+ &
X   \verb+`+ &
X   \verb+`+ &
X   \'{e}    &
X   \verb+`+ &
X   \'{e}    &
X   \verb+`+ \cr
X07/11    &
X   \{    &
X   \{    &
X   \{    &
X   \"{a} &
X   \"{a} &
X   \"{a} &
X   \ae   \cr
X07/12    &
X   $|$   &
X   $|$   &
X   $|$   &
X   \"{o} &
X   \"{o} &
X   \"{o} &
X   \o    \cr
X07/13  &
X   \}  &
X   \}  &
X   \}  &
X   \aa &
X   \aa &
X   \aa &
X   \aa \cr
X07/14                    &
X  \raise 1.0ex \hbox{\_} &
X  \verb+~+               &
X  \raise 1.0ex \hbox{\_} &
X  \"{u}                  &
X  \raise 1.0ex \hbox{\_} &
X  \verb+~+               &
X  \verb+~+               \cr
X\end{tabular}
X\]
XThe $\otimes$ glyph represents currency sign.
X
XA problem with having many different versions of ISO~646 is that it is
Xnot always possible for a computer program to determine which national
Xversion was used to encode a given text.
XIn effect, a single bit combination is overloaded to represent
Xtwo or more different characters.
XThis results in the incorrect
Xdisplay of electronic mail messages, news and bulletin-board
Xarticles, formatted documents, \etc.
X
XThe best solution to this problem is the use of a standard for
Xtext communication
Xthat contains all needed characters for all
Xlanguages of interest,
X\eg\ \cite{Xerox}.
XHowever, since ISO~646 is in widespread use, a
Xtemporary solution would also be desirable.
X
X\chapter{Is this a real problem?}
XMany persons deal only with ASCII-encoded English text.
XSome of them may have difficulty imagining living with
Xoverloaded bit combinations.
XIn this chapter I attempt to give a flavor of the problem.
XThe reader who is already convinced that the problem is real
Xcan safely skip this chapter.
X
X\section{ASCII displays}
XOne very simple approach is to choose a single display
Xcharacter for each bit combination, 
X\eg\ ASCII,
Xand then let the human reader figure out what character was
Xactually meant.
XFor example, the Swedish alphabet has three vowels not in the
XEnglish alphabet:
X\aa, \"{a}, and \"{o}.
XSwedes do not regard these as the letters a and o with
Xdiacritical marks,
Xbut as distinct letters.
X(They sort at the end of the alphabet, not with a and o.)
XThe Swedish letters
X\aa, \"{a}, and \"{o}
Xoccur in Swedish text with approximately the same frequencies
Xthat the letters 
Xf, b, and v,
Xrespectively, occur in English text.
XTo give a flavor of what a Swede experiences when reading
XSwedish on an ASCII display, I therefore make the following
Xsubstitutions:
X\[
X\begin{tabular}{cc}
XLetter & Display \cr
X\hline
XB      & [            \cr
XV      & $\backslash$ \cr
XF      & ]            \cr
Xb      & \{           \cr
Xv      & $|$          \cr
Xf      & \}           \cr
X\end{tabular}
X\]
XMy sample text is Auden's poem ``The dead echo''
X\cite{Auden}.
X\begin{verse}
X`O who can e$|$er gaze his \}ill,'\\*
X\hspace*{1em}]armer and \}isherman say,\\*
X`On nati$|$e shore and local hill,\\*
XGrudge aching lim\{ or callus on the hand?\\*
X]athers, grand\}athers stood upon this land,\\*
XAnd here the pilgrims \}rom our loins shall stand.'\\*
X\hspace*{1em}So \}armer and \}isherman say\\*
X\hspace*{1em}In their \}ortunate heyday:\\*
X\hspace*{1em}[ut Death's so\}t answer dri\}ts across\\*
X\hspace*{1em}Empty catch or har$|$est loss\\*
X\hspace*{2em}Or an unlucky May.\\*
X{\it The earth is an oyster with nothing inside it,\\*
X\hspace*{1em}Not to \{e \{orn is the \{est \}or man;\\*
XThe end o\} toil is a \{aili\}\}'s order,\\*
X\hspace*{1em}Throw down the mattock and dance while you can.}
X
X\smallskip
X`O li\}e's too short \}or \}riends who share,'\\*
X\hspace*{1em}Tra$|$ellers think in their hearts,\\*
X`The city's common \{ed, the air,\\*
XThe mountain \{i$|$ouac and the \{athing \{each,\\*
XWhere incidents draw e$|$ery day \}rom each\\*
XMemora\{le gesture and witty speech.'\\*
X\hspace*{1em}So tra$|$ellers think in their hearts,\\*
X\hspace*{1em}Till malice or circumstance parts\\*
X\hspace*{1em}Them \}rom their constant humour:\\*
X\hspace*{1em}And slyly Death's coerci$|$e rumour\\*
X\hspace*{2em}In the silence starts.\\*
X{\it A \}riend is the old old tale o\} Narcissus,\\*
X\hspace*{1em}Not to \{e \{orn is the \{est \}or man;\\*
XAn acti$|$e partner in something disgrace\}ul,\\*
X\hspace*{1em}Change your partner, dance while you can.}
X
X\smallskip
X`O stretch your hands across the sea,'\\*
X\hspace*{1em}The impassioned lo$|$er cries,\\*
X`Stretch them towards your harm and me.\\*
XOur grass is green, and sensual our \{rie\} \{ed,\\*
XThe stream sings at its \}oot, and at its head\\*
XThe mild and $|$egetarian \{easts are \}ed.'\\*
X\hspace*{1em}So the impassioned lo$|$er cries\\*
X\hspace*{1em}Till his storm o\} pleasure dies:\\*
X\hspace*{1em}]rom the \{edpost and the rocks\\*
X\hspace*{1em}Death's enticing echo mocks,\\*
X\hspace*{2em}And his $|$oice replies.\\*
X{\it The greater the lo$|$e, the more \}alse to its o\{ject,\\*
X\hspace*{1em}Not to \{e \{orn is the \{est \}or man;\\*
XA\}ter the kiss comes the impulse to throttle,\\*
X\hspace*{1em}[reak the em\{races, dance while you can.}
X
X\smallskip
X`I see the guilty world \}orgi$|$en,'\\*
X\hspace*{1em}Dreamer and drunkard sing,\\*
X`The ladders let down out o\} hea$|$en,\\*
XThe laurels spring \}rom the martyrs' \{lood,\\*
XThe children skipping where the weepers stood,\\*
XThe lo$|$ers natural and the \{easts all good.'\\*
X\hspace*{1em}So dreamer and drunkard sing\\*
X\hspace*{1em}Till day their so\{riety \{ring:\\*
X\hspace*{1em}Parrotwise with death's reply\\*
X\hspace*{1em}]rom whelping \}ear and nestling lie,\\*
X\hspace*{2em}Woods and their echoes ring.\\*
X{\it The desires o\} the heart are as crooked as corkscrews,\\*
X\hspace*{1em}Not to \{e \{orn is the \{est \}or man;\\*
XThe second-\{est is a \}ormal order,\\*
X\hspace*{1em}The dance's pattern; Dance while you can.\\*
XDance, dance, \}or the \}igure is easy,\\*
X\hspace*{1em}The tune is catching and will not stop;\\*
XDance till the stars come down with the ra\}ters;\\*
X\hspace*{1em}Dance, dance, dance till you drop.}
X\end{verse}
X
X\section{Drop diacritical marks}
XIt has sometimes been suggested that the overloading problem
Xcan be solved simply by dropping diacritical marks,
X\eg\ write \aa\ and \"{a} as a.
XIt is harder to construct an English analogy of this
Xtransformation, but I will try to approximate it by folding
Xu into a and y into i.\footnote{In Swedish,
X\aa\ and \"{a} together have about the same
Xfrequency that u has in English.
XThe vowel a has about the same frequency in Swedish and English.
XThe Swedish vowel \"{o} has about the same frequency as the
XEnglish letter y.
XThe Swedish vowel o has about the same frequency as the English
Xvowel i.}
X
X\newpage
X\begin{verse}
X`O who can ever gaze his fill,'\\*
X\hspace*{1em}Farmer and fisherman sai,\\*
X`On native shore and local hill,\\*
XGradge aching limb or callas on the hand?\\*
XFathers, grandfathers stood apon this land,\\*
XAnd here the pilgrims from oar loins shall stand.'\\*
X\hspace*{1em}So farmer and fisherman sai\\*
X\hspace*{1em}In their fortanate heidai:\\*
X\hspace*{1em}Bat Death's soft answer drifts across\\*
X\hspace*{1em}Empti catch or harvest loss\\*
X\hspace*{2em}Or an anlacki Mai.\\*
X{\it The earth is an oister with nothing inside it,\\*
X\hspace*{1em}Not to be born is the best for man;\\*
XThe end of toil is a bailiff's order,\\*
X\hspace*{1em}Throw down the mattock and dance while ioa can.}
X
X\smallskip
X`O life's too short for friends who share,'\\*
X\hspace*{1em}Travellers think in their hearts,\\*
X`The citi's common bed, the air,\\*
XThe moantain bivoaac and the bathing beach,\\*
XWhere incidents draw everi dai from each\\*
XMemorable gestare and witti speech.'\\*
X\hspace*{1em}So travellers think in their hearts,\\*
X\hspace*{1em}Till malice or circamstance parts\\*
X\hspace*{1em}Them from their constant hamoar:\\*
X\hspace*{1em}And slili Death's coercive ramoar\\*
X\hspace*{2em}In the silence starts.\\*
X{\it A friend is the old old tale of Narcissas,\\*
X\hspace*{1em}Not to be born is the best for man;\\*
XAn active partner in something disgracefal,\\*
X\hspace*{1em}Change ioar partner, dance while ioa can.}
X
X\smallskip
X`O stretch ioar hands across the sea,'\\*
X\hspace*{1em}The impassioned lover cries,\\*
X`Stretch them towards ioar harm and me.\\*
XOar grass is green, and sensaal oar brief bed,\\*
XThe stream sings at its foot, and at its head\\*
XThe mild and vegetarian beasts are fed.'\\*
X\hspace*{1em}So the impassioned lover cries\\*
X\hspace*{1em}Till his storm of pleasare dies:\\*
X\hspace*{1em}From the bedpost and the rocks\\*
X\hspace*{1em}Death's enticing echo mocks,\\*
X\hspace*{2em}And his voice replies.\\*
X{\it The greater the love, the more false to its object,\\*
X\hspace*{1em}Not to be born is the best for man;\\*
XAfter the kiss comes the impalse to throttle,\\*
X\hspace*{1em}Break the embraces, dance while ioa can.}
X
X\smallskip
X`I see the gailti world forgiven,'\\*
X\hspace*{1em}Dreamer and drankard sing,\\*
X`The ladders let down oat of heaven,\\*
XThe laarels spring from the martirs' blood,\\*
XThe children skipping where the weepers stood,\\*
XThe lovers nataral and the beasts all good.'\\*
X\hspace*{1em}So dreamer and drankard sing\\*
X\hspace*{1em}Till dai their sobrieti bring:\\*
X\hspace*{1em}Parrotwise with death's repli\\*
X\hspace*{1em}From whelping fear and nestling lie,\\*
X\hspace*{2em}Woods and their echoes ring.\\*
X{\it The desires of the heart are as crooked as corkscrews,\\*
X\hspace*{1em}Not to be born is the best for man;\\*
XThe second-best is a formal order,\\*
X\hspace*{1em}The dance's pattern; Dance while ioa can.\\*
XDance, dance, for the figare is easi,\\*
X\hspace*{1em}The tane is catching and will not stop;\\*
XDance till the stars come down with the rafters;\\*
X\hspace*{1em}Dance, dance, dance till ioa drop.}
X\end{verse}
X
X\section{National version displays}
XInstead of an ASCII display, one can use a display that always
Xinterprets each bit combination as a character from some other
Xnational version,
Xand again let the human reader figure out what character was
Xactually meant.
XThis works well for pure Swedish text, but is problematical for
XEnglish text, and fails dismally for many programming languages
Xsuch as C{}.
XHere is what a C code fragment looks like on a Swedish display:
X
X\begin{verse}
Xif ((S\_BODY == sect) \"{o}\"{o} ('\"{a}' != B(*lp))) return (NULBSTR);\\
Xfor (++lp; (EOS != B(*lp)) \&\& ('\aa' != B(*lp)); ++lp)\\
X\hspace*{1em}\"{a}\\
X\hspace*{1em}if (!isalnum (B(*lp)) \&\& (NULCSTR == strchr (",. ", B(*lp))))\\
X\hspace*{2em}return (NULBSTR);\\
X\hspace*{1em}if (',' == B(*lp)) comma = TRUE;\\
X\hspace*{1em}if ((' ' == B(*lp)) \&\& (',' != B(lp\"{A}-1\AA))) return (NULBSTR);\\
X\hspace*{1em}\aa\\
Xreturn ((comma \&\& ('\aa' == B(*lp))) ? lp + 1 : NULBSTR);
X\end{verse}
X
X\chapter{A trigram-based heuristic}
XThe problem, then, is, given an overloaded bit combination,
Xdetermine the corresponding character.
XOn any given electronic news group, the number of different
Xnational versions used is small, most often one or two.
XLikewise, most people receive electronic mail encoded in only a
Xsmall number of different national versions, again most often
Xone or two.
XMy heuristic attempts to distinguish between two different
Xnational versions.
XI believe it could be extended beyond two national
Xversions, as I discuss below.
X
XMy heuristic is based on the obvious assumption that each
Xnational version is associated with exactly one 
Xlanguage.\footnote{This is of course false.
XFor example, the same national version can be used for Swedish
Xand Finnish, which are very different languages.
XBut if the problem is to distinguish between ASCII used for
Xtext in
XEnglish, and a Swedish/Finnish national version used for
Xtext in Finnish and Swedish, then for the purposes of the
Xheuristic, Swedish and Finnish could be treated as one language.}
XThe text is divided into words, the language of each word is
Xguessed, the corresponding national version is assumed, and the
Xcharacter set conversion is done.
X
XThe problem is therefore to guess the language of each word.
XThe heuristic breaks each word into three-letter groups called
Xtrigrams.
XThere are also two pseudo-letters: a left parenthesis denotes
Xthe pseudo-letter before the first letter in a word,
Xand a right parenthesis denotes the pseudo-letter after the
Xlast letter in a word.
XThus, each word breaks into as many trigrams as it has letters.
XConceptually,
Xeach trigram is looked up in frequency tables for the two
Xpossible languages.
XThe ratio of the products of trigram frequencies is the
X``score'' for the word.
XScores greater than unity indicate one language and scores less
Xthan unity the other.
XA more detailed discussion of the arithmetic appears below.
X
XFor example, is the word ``heuristic'' English or Swedish?
XHere is the computation.
X(Frequencies are scaled by $2^{31} - 1$.)
X\[
X\begin{tabular}{lrr}
XTrigram & Frequency & Frequency \cr
X        & (Swedish) & (English) \cr
X\hline
X(he     & 2637576   &  2516235  \cr
Xheu     &       1   &    17651  \cr
Xeur     &   62061   &   191570  \cr
Xuri     &   15515   &   301405  \cr
Xris     & 1722182   &   679471  \cr
Xist     & 2591031   &  1986143  \cr
Xsti     &  822303   &  1717530  \cr
Xtic     &   15515   &  2175108  \cr
Xic)     &  589576   &  3390018  \cr
X\end{tabular}
X\]
XThe score for ``heuristic'' is
X\[
X\frac{85241412416465694529483902220428729468105600}{43829555557656947817827179883421809827379869618260000}
X= 0.00000000194483862160 < 1
X\]
Xso it is assumed to be English.
X
X\chapter{Performance}
XI collected 326~news articles from the Swedish swnet news group
Xhierarchy.
XI randomly divided these articles into two groups of
X163~articles each.
XI used one group to compute a Swedish trigram frequency table.
XI computed an English trigram frequency table from news
Xarticles in English taken from other news groups.
XI then tested the heuristic on the second group of 163~articles.
XIt had an error rate of
X2.6\%.
XThere were a total of 3858 overloaded bit combinations.
XIt failed to recognize 40 of them as Swedish letters,
Xand it incorrectly identified 61 as Swedish letters.
X
XI then computed a new Swedish trigram frequency table from all
X326~articles, and continued to tune the heuristic.
XI would thus now expect an even lower error rate on new swnet
Xnews articles.
X
XMy Swedish heuristic took 5.0~CPU seconds to process a file of
X41~379 bytes on a Sun~3/280, giving a speed of about
X8~kilobytes per second.
XFor typical mail message and news article sizes, users should
Xthus notice little or no additional delay.
X
X\chapter{Implementation}
X\section{Trigram encoding}
XTrigram encoding is based on the obvious observation that
Xalphabets based on the Latin alphabet have about 30~letters.
XWith two encodings reserved for the pseudo-letters at the
Xbeginnings and ends of words, this means five bits are needed
Xfor each letter, or 15~bits for a trigram.
XThe 15~bit encoding can be used as an array index,
Xso frequency data about a trigram can be accessed directly,
Xat a maximum cost of one page fault.
X
X\section{Frequency encoding}
XConsider two languages
X$l_1$
Xand
X$l_2$,
Xand
X$n$
Xtrigrams.
XLet
X$f (i, j)$
Xbe an estimate of the frequency of trigram
X$i$
Xin language
X$j$.
XLet
X$c_j$
Xbe a minimum of the nonzero
X$f (i, j)$,
X\ie\ 
X$\forall i: f(i,j) < c_j \Rightarrow f(i,j) = 0$.
XLet
X$c = \max (c_1, c_2)$
Xbe the cutoff frequency:
Xall frequencies less than
X$c$
Xare treated as zero.
XThe reason for the cutoff is that the samples of
X$l_1$
Xand
X$l_2$
Xmay be of different sizes.
XUsually, much more English is available than samples of other
Xlanguages.
XWithout the cutoff, all languages would look more like English,
Xbecause uncommon trigrams would only have nonzero frequencies
Xin the English frequency table.
X
XLet
X\[
XF(i,j) =
X\left\{
X\begin{array}{ll}
Xf(i,j) & \mbox{\hspace{1em} if $f(i,j) \geq c$} \\
X0      & \mbox{\hspace{1em} if $f(i,j) <    c$} \\
X\end{array}
X\right.
X\]
X\ie\ $F$ is $f$ with the cutoff applied.
XNow for each trigram let
X\begin{eqnarray*}
Xd(i)     & = & s \ln \frac{F(i,1)+\epsilon}{F(i,2)+\epsilon} \\
X\epsilon & = & \frac{1}{2^{31} - 1} \\
X\end{eqnarray*}
Xwhere
X$s$
Xis the largest scaling factor that keeps the
X$d$ values in the range
X$[-127, 127]$.
XThe
X$d$
Xvalues are placed in single bytes, indexed by encoded trigram.
XThe ``score'' for a word is the sum of the
X$d$
Xvalues for the trigrams in the word.
XA positive score indicates
X$l_1$;
Xnegative
X$l_2$.
X
X\section{Smoothing}
XTreating each word individually leads to high error rates.
XInstead, a smoothed total score is used.
XLet
X$w_k$
Xbe the score after word
X$k$.
XThen the smoothed score after word
X$k$
Xis
X\[
XT_{k} =
X\left\{
X\begin{array}{ll}
X\alpha T_{k-1} + w_k & \mbox{\hspace{1em} if $w_k >    0$} \\
X\beta  T_{k-1} + w_k & \mbox{\hspace{1em} if $w_k \leq 0$} \\
X\end{array}
X\right.
X\]
Xwhere
X$0 \leq \alpha < 1$
Xis known as the attack factor
Xand
X$0 \leq \beta < 1$
Xis known as the decay factor.
XIf
X$T_k$
Xis positive then word
X$k$
Xis considered to be from
X$l_1$.
X
X\chapter{Hints on writing heuristics}
XI have now written a grand total of two heuristics: one to
Xdistinguish between Danish and US English, and one between
XSwedish and US English.
XThat makes me an expert.
XSeriously, in this chapter I pass along a few hints for those
Xwho want to write other heuristics.
X
XTo write a heuristic:
X\begin{enumerate}
X\item collect sample texts;
X\item select a letter encoding;
X\item create a trigram difference table;
Xand
X\item tune.
X\end{enumerate}
XI discuss each of these steps in turn.
X
X\section{Collecting sample texts}
XIdeally, sample texts should come from the source on which you
Xplan to run your heuristic,
X\eg\ a set of news groups or email to a set of people.
XIn practice it may not be feasible to collect enough text in
Xthis way.
XFor example, I had no problem collecting plenty of Swedish from
Xswnet articles, but there was little English in swnet articles,
Xso I collected English from other news groups.
XThere might have been some fundamental difference between the
XEnglish in swnet articles and the English in articles from
Xother news groups; I simply assumed there was not.
X
XI placed my sample swnet articles in the directory t-se-7.
XEach article was placed in a separate file.
XThe files were named with sequential numbers,
Xusing this awk program:
X\begin{verbatim}
XBEGIN {i = 1;}
X{printf "cp %s %03d\n", $1, i; i+=1;}
X\end{verbatim}
X
XI concatenated each file into a single large file, with a
Xmarker at the end of each article,
Xusing this shell script:
X\begin{verbatim}
Xfor f
Xdo
X   cat $f
X   echo ==========
Xdone
X\end{verbatim}
X
XI removed English headers with this sed script:
X\begin{verbatim}
X/^Path: /d
X/^Newsgroups: /d
X/^Message-ID: /d
X/^Date: /d
X/^Sender: /d
X/^Reply-To: /d
X/^Distribution: /d
X/^Organization: /d
X/^Lines: /d
X/^Followup-To: /d
X/^References: /d
X/^UUCP-Path: /d
X/^Xref: /d
Xs/^From: ..*(//
Xs/^Subject: Re^.: //
Xs/^Subject: Re: //
Xs/^Subject: //
Xs/^Summary: //
Xs/^Keywords: //
Xs/In article <..*(//
Xs/ writes:$//
X\end{verbatim}
X
XThen, in a very tedious process, I edited the result to remove
Xall remaining English words.
X(I used GNU Emacs; query-replace was a big help.)
XI also removed all ASCII graphics using overloaded bit
Xcombinations.
X
XWhat is an English word?
XConsider the (mostly) Swedish sentence
X``Vi k\"{o}r {\UNIX} p\aa\ v{\aa}ra VAXar''
X(We run {\UNIX} on our VAXen).
XI arbitrarily said that ``{\UNIX}'' was not Swedish, but ``VAXar''
Xwas, since ``{\UNIX}'' did not have a Swedish inflection, but
X``VAXar'' did.
XOne could just as legitimately argue that since the word
X``{\UNIX}'' occurs in both English and Swedish texts, it is not
Xespecially diagnostic of either, so it should be treated
Xequally.
XWhatever you decide, you should try to be consistent,
Xespecially if more than one person works on cleaning the sample text.
X
XAcademic linguists may have compiled trigram frequency
Xstatistics on languages of interest to you,
Xor they may at least have large machine-readable sample texts
Xavailable.
XHow well they work in practice depends on how
Xstatistically similar their source is to the source on which
Xyou actually run the heuristic.
XIf you don't have a large enough sample of the real source
Xavailable, they can save a lot of time.
XIt took me about a year to collect my 81~Danish sample articles.
X
XHow big a sample is enough?
XMy Danish/English heuristic was based on about 47~000 Danish
Xtrigrams, and seems to work well.
XBut I wouldn't want to rely on a much smaller sample.
X
X\section {Selecting a letter encoding}
XThere are two official standard Swedish national versions of
XISO~646, but swnet articles do not use either standard.
XSo, you need to determine the
X\defacto\ standard used in your source.
XMy charfreq command will show you the characters actually used.
X
XMy 5-bit letter representation reserves two codes for the
Xpseudo-letters, leaving 30~codes for real letters.
XIf that's enough, you're in luck.
XOtherwise, you need to fold some low-frequency letters
Xtogether.
XMy letterfreq command lists letters by descending frequency.
XFor English, folding q and z together should work.
X
XAfter selecting an encoding, write a .code file.
X
XYou can now break your sample text into a sorted list of words,
Xwith a shell command something like:
X\begin{verbatim}
Xtr '\011\014 \!-@^_~' '\012' < t0 | grep -v '^$' | sort -o t1 -df
X\end{verbatim}
Xwhere t0 contains the input sample text.
XYou need to adjust the arguments to tr to get rid of all
Xcharacters except letters.
X
X\section{Creating a trigram difference table}
XOnce you have a .code file and sample texts, you can run my
X78triFreq command to create trigram frequency tables for each
Xlanguage.
XOr, you can obtain trigram frequencies from some other source.
XThen, run the frequency tables through my 78freq2tt command to
Xproduce a trigram difference table.
X
XYou can now run my 78diff command on the words from your sample
Xtext:
X\begin{verbatim}
Xuniq < t1 | env - 78diff abcd.code abcd | sort -o t2 -nr
X\end{verbatim}
XCheck file t2 for extremes: they can be words in the wrong
Xlanguage that you missed when cleaning the sample text.
X
X\section{Tuning}
XTuning is the highly empirical process of improving your
Xheuristic until it's good enough.
XFirst, you need some way to measure how well the heuristic
Xworks.
XI ran my Swedish/English heuristic on each sample article,
Xgiving mostly-correct ISO~8859/1 files, which I put in the
Xdirectory t-se-ok.
XI then edited them by hand to be entirely correct, another very
Xtedious process.
XThen I wrote a simple test shell script t-se:
X\begin{verbatim}
Xcp /dev/null t-se-log
X(cd t-se-7; ls -1) | xargs env - t-se0
Xawk -f tst.awk t-se-log
X\end{verbatim}
XThis calls the auxiliary shell script t-se0:
X\begin{verbatim}
Xfor f
Xdo
X   78seus < t-se-7/$f > t-se-8/$f
X   cmp -l t-se-8/$f t-se-ok/$f | sed -e "s;^;$f;" >> t-se-log
X   echo $f
Xdone
X\end{verbatim}
XThe tst.awk script counts errors:
X\begin{verbatim}
X{if ($3 < 200) ++n; else ++p;}
XEND {printf "omission: %d  commission: %d  total: %d / 7498 = %f\n",
X     n, p, n + p, (n + p) / 7498;}
X\end{verbatim}
XIt distinguishes errors of commission
X(treating an overloaded bit combination as Swedish when it is
Xreally ASCII)
Xfrom errors of omisson
X(treating an overloaded bit combination as ASCII when it is
Xreally Swedish.)
XThe constant 7498 is the total number of overloaded bit
Xcombinations;
Xthis of course must be changed to the correct number.
XMy charfreq command can be used to count the overloaded bit combinations.
X
XOnce you have a way to measure progress, you're ready for the
Xincredibly boring tuning loop:
Xmake a change, run the test script, and repeat until the error
Xrate is low enough for you.
XThe changes to the heuristic made during tuning can be divided
Xinto two broad classes:
Xparameter adjustment and special cases.
X
X\subsection{Parameter adjustment}
XThe most important parameters are the attack factor
X$\alpha$
Xand the decay factor
X$\beta$,
Xbut there are many others,
Xso it is best to see the code.
XIn principle this is a problem of finding a combination of
Xparameters that minimizes the total number of errors.
XI just adjusted each parameter independently.
X
X\subsection{Special cases}
XAfter the parameters are optimized, the error rate will
Xprobably still not reach zero.
XMy heuristic contains some special cases to help reduce it.
X
X\subsubsection{Exceptional words}
XThere are two short lists for words that are always considered
Xto be in one of the two languages.
XThese exception lists can be used for several purposes.
XThe trigram heuristic is not always correct about single words.
XYou can use my 78diff command to find such words, and put them
Xin the exception list.
X
XYou may also want to put the very most common words in a
Xlanguage on the exception list, if they are written with
Xoverloaded bit combinations.
XThis reduces errors near language transitions.
X
XFinally, some ASCII graphics may be treated as exceptional
X``words,''
X\eg\ \verb+[]+ as a box in which to place a check mark.
X
XThe exception lists are checked against each word using linear
Xsearch, so they should not contain more than about 20~words
Xeach.
XIf they get much longer you should consider switching to some
Xother search scheme,
X\eg\ hashing.
X
X\subsubsection{Signature starts}
XMy heuristics try to identify the signature parts of news
Xarticles and mail messages, since that is where ASCII graphics
Xand UUCP paths are most common.
XThere is a convention that signatures start with a line
Xcontaining exactly two hyphens and nothing else.
XUnfortunately, users have shown incredible imagination in
Xdevising innovative alternatives to this boring convention.
XSo, you may need some special code for detecting signatures.
X
X\subsubsection{UUCP paths}
XMany users write UUCP paths using curly brace notation borrowed
Xfrom the C shell,
X\eg\ \verb+{uunet,mcvax}!ericsson.se!howard+.
XOne user even elided the contents of the braces,
X\eg\ \verb+{...}!foo!baz+.
XMy heuristics attempt to recognize these, but some code changes
Xmay be needed.
X
X\chapter{Speculation}
XThis chapter contains various wild, untested ideas.
X
X\section{Distinguishing among $2+1$ languages}
XImagine a news group where each article is either Danish mixed
Xwith English, or Swedish mixed with English.
XA heuristic could operate in two steps: first distinguish
XDanish/Swedish from English;
Xthen, if the result is Danish/Swedish, distinguish Danish from
XSwedish.
XOne would first combine Danish and Swedish sample texts to
Xproduce a trigram frequency table for the combined
X``language.''
XThis approach would allow much more smoothing when
Xdistinguishing Danish from Swedish than when distinguishing
XDanish/Swedish from English.
X
X\section{Distinguishing among $m>2$ languages}
XIn the general
X$m$-language 
Xcase, I would not use trigram difference tables.
XI would extend the cutoff frequency
X\[
Xc = \min_{0 \leq j \leq m} c_j
X\]
Xthen compute
X$e (i, j)  =  s \ln (S F(i,j) + 1)$
Xwhere
X$S = 2^{31} - 1$
Xand
X$s$
Xis the largest scaling factor that keeps the
X$e$ values in the range
X$[0, 255]$.
XThe
X$e$
Xvalues are placed in single bytes, indexed by encoded trigram.
XIt is thus possible to compute a score for each word in each
Xlanguage.
XThe language with the maximum smoothed score wins.
X
XHow to smoothe?
XA single smoothing parameter for all scores?
XA different parameter for each language?
XI have no idea.
X
X\section{Word length factor}
XWould the heuristic be improved by adding to the score the
Xlength of the word times some factor
X\cite{Gunnel}?
X
X\section{Smoothing}
XThe attack/decay smoothing gives a strictly single-pass
Xheuristic, but it has hysteresis that can cause errors in the
Xfirst few words after a language transition.
XIt might be better to use smoothing that is symmetric around
Xthe word being converted, \eg\ something like
X\[
XT_k = \sum_{i = -5}^5 \alpha^{\left| i \right|} w_{k+i}
X\]
XThis would require a data structure to hold scores of nearby
Xwords, plus special treatment at the beginnings and ends of
Xfiles.
X
XAnother way to think about smoothing is to think of the score
Xfor each word as being a sample of a noisy waveform.
XThe problem is to find the zero-crossings of the waveform,
Xsince these correspond to language transitions.
XThis is like edge detection.
X
X\section{A Bayesian approach}
XMy heuristics do not use explicit base rate information,
X\eg\ the frequency of Swedish words in a given 
Xsource.\footnote{There is some base rate information implicit
Xin the difference between the attack and decay factors.}
XAn alternative approach would be to treat the presence or
Xabsence of every possible trigram in a word as the results of
Xdiagnostic tests for a language.
X
X\section{Neural nets}
XIt would be fun to see how a neural net would do at resolving
Xoverloaded bit combinations.
XThe input could be, say, a 99~by~99 binary matrix, with one row
Xfor each of the 95~graphic ISO~646 bit combinations, plus
Xspace, tab, newline, and formfeed.
XEach column would represent one character position; the center
Xcolumn would represent the position of interest.
XExactly one element in each column would thus be~1.
XOutput could be a single bit, meaning either language~1 or
Xlanguage~2.
X
X\bibliography{b}
X\bibliographystyle{alpha}
X
X\chapter*{Colophon}
XThis report was written in \LaTeX\ version~2.09.
XIt was typeset in 11~point Computer Modern.
X
X\end{document}
END_OF_FILE
if test 33673 -ne `wc -c <'78.tex'`; then
    echo shar: \"'78.tex'\" unpacked with wrong size!
fi
# end of '78.tex'
fi
if test -f '78heur.h' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'78heur.h'\"
else
echo shar: Extracting \"'78heur.h'\" \(8319 characters\)
sed "s/^X//" >'78heur.h' <<'END_OF_FILE'
X/*
X * 78heur - Common code for character set conversion heuristics
X *
X * Copyright 1989 Howard Lee Gayle
X *
X * $Header: 78heur.h,v 1.2 89/08/26 13:26:29 howard Exp $
X *
X * This program is free software; you can redistribute it and/or modify
X * it under the terms of the GNU General Public License version 1,
X * as published by the Free Software Foundation.
X *
X * This program is distributed in the hope that it will be useful,
X * but WITHOUT ANY WARRANTY; without even the implied warranty of
X * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
X * GNU General Public License for more details.
X *
X * You should have received a copy of the GNU General Public License
X * along with this program; if not, write to the Free Software
X * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
X *
X * Prerequisites: 78common.h
X */
X
XPRIVATE char eIntern[] = "Internal error %s";
X
XPRIVATE bStrT BraceP ();
XPRIVATE bStrT EndP ();
XPRIVATE bStrT GrafP ();
XPRIVATE bStrT InArtP ();
XPRIVATE bStrT IPP ();
XPRIVATE bStrT LaTeXP ();
XPRIVATE bStrT PipeP ();
XPRIVATE boolT SigBegP ();
XPRIVATE bStrT UunetP ();
XPRIVATE boolT wordp ();
X
X/* BraceP - test for pair of braces */
X
XPRIVATE bStrT BraceP (lp, sect)
XR1 bStrT    lp;   /* Points to current position in line.*/
X   unsigned sect; /* Section.*/
X
X/* Function:
X *    Look for {a, b, c, ...}, where a, b, c, etc. are strings of
X *    English letters, digits, and dots.  Spaces are optional, but
X *    may only follow commas.  Also match {...} with 3 or more dots.
X * Algorithm:
X *    Linear search.
X * Returns:
X *    On success, a pointer to the byte after the closing '}'.
X *    On failure, NULBSTR.
X * Notes:
X *    1) This is only for header and signature.  It is almost always
X *       a C shell {} construct, often to give a UUCP path.
X */
X{
XR2 boolT comma = FALSE; /* There must be at least one comma.*/
XR3 int   i;             /* Returned by strspn(3).*/
X
Xif ((S_BODY == sect) || ('{' != B(*lp))) return (NULBSTR);
X++lp;
Xi = strspn ((cStrT) lp, ".");
Xif ((i >= 3) && ('}' == B(lp[i]))) return (lp + i + 1);
Xfor (; (EOS != B(*lp)) && ('}' != B(*lp)); ++lp)
X   {
X   if (!isalnum (B(*lp)) && (NULCSTR == strchr (",. ", B(*lp))))
X      return (NULBSTR);
X   if (',' == B(*lp)) comma = TRUE;
X   if ((' ' == B(*lp)) && (',' != B(lp[-1]))) return (NULBSTR);
X   }
Xreturn ((comma && ('}' == B(*lp))) ? lp + 1 : NULBSTR);
X}
X
X/* EndP - test for ending string */
X
XPRIVATE bStrT EndP (lp, sect, es)
XR1 bStrT lp;      /* Points to current position in line.*/
X   unsigned sect; /* Section.*/
X   bStrT es;      /* Does line end with this?*/
X
X/* Function:
X *    Test if rest of body line is given string.
X * Algorithm:
X *    Call bStrEQ().
X * Returns:
X *    On success, a pointer to the EOS that ends the line.
X *    On failure, NULBSTR.
X * Notes:
X *    1) Start one byte before current position to give word checking
X *       a chance to complete.
X */
X{
Xreturn (((S_BODY == sect) && bStrEQ (lp - 1, es)) ? strend (lp) : NULBSTR);
X}
X
X/* GrafP - test for ASCII graphics in signature */
X
XPRIVATE bStrT GrafP (lp, sect)
XR1 bStrT    lp;   /* Points to current position in line.*/
X   unsigned sect; /* Section.*/
X
X/* Function:
X *    Look for common ASCII graphics characters in the signature,
X *    surrounded by white space.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the byte after the graphics character.
X *    On failure, NULBSTR.
X * Notes:
X *    
X */
X{
XR2     int  i;                     /* Returned by strspn(3).*/
Xstatic char ascgraf[] = "-/=\\_|"; /* ASCII graphics.*/
X
Xif ((S_SIG != sect) || !SPCTAB (B(lp[-1]))) return (NULBSTR);
Xi = strspn ((cStrT) lp, ascgraf);
Xlp += i;
Xreturn (((0 != i) && ((EOS == B(*lp)) || SPCTAB (B(*lp)))) ? lp : NULBSTR);
X}
X
X/* InArtP - test for 'In article <...> ... (' */
X
XPRIVATE bStrT InArtP (lp, sect)
XR1 bStrT    lp;   /* Points to current position in line.*/
X   unsigned sect; /* Section.*/
X
X/* Function:
X *    Match start of a quote in a followup.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the first byte in the real name of the
X *    user quoted.  On failure, NULBSTR.
X * Notes:
X *    
X */
X{
Xif (S_BODY != sect) return (NULBSTR);
Xlp = prefix (S("In article <"), lp);
Xif (NULBSTR == lp) return (lp);
Xlp = bStrChr (lp, '(');
Xreturn ((NULBSTR == lp) ? lp : lp + 1);
X}
X
X/* IPP - test for number(s) in brackets */
X
XPRIVATE bStrT IPP (lp)
XR1 bStrT lp; /* Points to current position in line.*/
X
X/* Function:
X *    Match digits with optional decimal points in square brackets.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the byte after the right bracket.
X *    On failure, NULBSTR.
X * Notes:
X *    1) Typically these are IP addresses or array subscripts.
X */
X{
Xif (('[' != B(*lp)) || !isdigit (B(lp[1]))) return (NULBSTR);
Xfor (lp += 2; (EOS != B(*lp)) && (']' != B(*lp)); ++lp)
X   if (!isdigit (B(*lp)) && ('.' != B(*lp))) return (NULBSTR);
Xreturn ((']' == B(*lp)) ? lp + 1 : NULBSTR);
X}
X
X/* LaTeXP - test for LaTeX character with diacritical mark */
X
XPRIVATE bStrT LaTeXP (lp)
XR1 bStrT lp; /* Points to current position in line.*/
X
X/* Function:
X *    Match a LaTeX sequence for one letter with some diacritical mark.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the character after the LaTeX sequence.
X *    On failure, NULBSTR.
X * Notes:
X *    1) LaTeX sequences for diacritical marks are often used in
X *       news articles.
X */
X{
Xstatic char latex[] = "`'^\"~=.uvHtcdb"; /* LaTeX accent characters.*/
X
Xreturn ((('\\' == B(*lp)) && (NULCSTR != strchr (latex, B(lp[1]))) &&
X         ('{' == B(lp[2])) && isalpha (B(lp[3])) && ('}' == B(lp[4])))
X        ? lp + 5 : NULBSTR);
X}
X
X/* PipeP - look for pipe to command in double quotes */
X
XPRIVATE bStrT PipeP (lp, sect)
XR1 bStrT lp;      /* Points to current position in line.*/
X   unsigned sect; /* Section.*/
X
X/* Function:
X *    Look for "| /...".  White space is optional.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the byte after the closing double quote.
X *    On failure, NULBSTR.
X * Notes:
X *    1) Typically, this is a command from a mail alias file.
X */
X{
Xif ((S_BODY != sect) || (NULBSTR == (lp = prefix (S("\"|"), lp))))
X   return (NULBSTR);
Xlp += strspn ((cStrT) lp, " \t");
Xif ('/' != B(*lp)) return (NULBSTR);
Xlp = bStrChr ((lp + 1), '"');
Xreturn ((NULBSTR == lp) ? lp : lp + 1);
X}
X
X/* SigBegP - test for start of signature */
X
XPRIVATE boolT SigBegP (lp)
XR2 bStrT lp; /* Points to line buffer.*/
X
X/* Function:
X *    Look for a line consisting of 2 or more hyphens
X *    followed by optional spaces, or more than 52 = signs,
X *    or some special cases.
X * Algorithm:
X *    
X * Returns:
X *    TRUE iff match.
X * Notes:
X *    1) I really wish everyone followed the '--' convention for start of
X *       signature.
X */
X{
XR1 int i; /* Returned by strspn (3).*/
X
Xif (fixbody) return (FALSE);
Xif (bStrEQ (
XS("+------------------------------------+---------------------------------+"),
X            lp))
X   return (TRUE);
Xi = strspn ((cStrT) lp, "-");
Xif ((i > 1) && (EOS == B(lp[i + strspn ((cStrT) &lp[i], " ")]))) return (TRUE);
Xi = strspn ((cStrT) lp, "=");
Xreturn ((i > 51) && (EOS == B(lp[i])));
X}
X
X/* UunetP - test for {uunet,...} */
X
XPRIVATE bStrT UunetP (lp)
XR2 bStrT lp; /* Points to current position in line.*/
X
X/* Function:
X *    Look for part of a UUCP path starting with {uunet.  The closing }
X *    does *not* have to be on the same line.
X * Algorithm:
X *    
X * Returns:
X *    On success, a pointer to the byte after the closing } if it's on
X *    this line, otherwise a pointer to the EOS at the end of the line.
X *    On failure, NULBSTR.
X * Notes:
X *    1) Domain addresses make things *so* much easier.
X */
X{
XR1 bStrT p = prefix (S("{uunet,"), lp);
X
Xif (NULBSTR == p) return (p);
Xp = bStrChr (p, '}');
Xreturn ((NULBSTR == p) ? strend (lp) : p + 1);
X}
X
X/* wordp - match against a list of words */
X
XPRIVATE boolT wordp (s, e, lp)
XR2 bStrT  s;  /* String to match.*/
XR3 bStrT  e;  /* End of string.*/
XR1 bStrT *lp; /* List of words to match against.*/
X
X/* Function:
X *    Match word s against the words in the list to which lp points.
X * Algorithm:
X *    Linear search.
X * Returns:
X *    TRUE iff match.
X * Notes:
X *    1) This is for *small* lists of exceptional words.
X */
X{
Xfor (; NULBSTR != *lp; ++lp)
X   if (e == prefix (*lp, s)) return (TRUE);
Xreturn (FALSE);
X}
END_OF_FILE
if test 8319 -ne `wc -c <'78heur.h'`; then
    echo shar: \"'78heur.h'\" unpacked with wrong size!
fi
# end of '78heur.h'
fi
if test -f '8859-1.rc' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'8859-1.rc'\"
else
echo shar: Extracting \"'8859-1.rc'\" \(845 characters\)
sed "s/^X//" >'8859-1.rc' <<'END_OF_FILE'
X; 8859-1.rc - Character remappings for ISO 8859/1
X;
X; Copyright 1989 Howard Lee Gayle
X; This file is written in the ISO 8859/1 character set.
X;
X; $Header: 8859-1.rc,v 1.1 89/08/04 16:37:49 howard Exp $
X;
X; This program is free software; you can redistribute it and/or modify
X; it under the terms of the GNU General Public License version 1,
X; as published by the Free Software Foundation.
X;
X; This program is distributed in the hope that it will be useful,
X; but WITHOUT ANY WARRANTY; without even the implied warranty of
X; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
X; GNU General Public License for more details.
X;
X; You should have received a copy of the GNU General Public License
X; along with this program; if not, write to the Free Software
X; Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
X
XCommandFile common
END_OF_FILE
if test 845 -ne `wc -c <'8859-1.rc'`; then
    echo shar: \"'8859-1.rc'\" unpacked with wrong size!
fi
# end of '8859-1.rc'
fi
if test -f 'prolog-end.p4' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'prolog-end.p4'\"
else
echo shar: Extracting \"'prolog-end.p4'\" \(964 characters\)
sed "s/^X//" >'prolog-end.p4' <<'END_OF_FILE'
X% prolog-end.p4 - Common PostScript prolog ending
X%
X% $Header: prolog-end.p4,v 1.1 89/08/04 17:11:50 howard Exp $
X%
X% Copyright   1989 Howard Lee Gayle
X% This file is written in the ISO 8859/1 character set.
X%
X% This program is free software; you can redistribute it and/or modify
X% it under the terms of the GNU General Public License version 1,
X% as published by the Free Software Foundation.
X%
X% This program is distributed in the hope that it will be useful,
X% but WITHOUT ANY WARRANTY; without even the implied warranty of
X% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
X% GNU General Public License for more details.
X%
X% You should have received a copy of the GNU General Public License
X% along with this program; if not, write to the Free Software
X% Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
X
Xinclude(ps-abbrev.m4)
X
X% Stop putting procedures into packed arrays.
X/setpacking where
X   {% if
X   PPOP currpack setpacking
X   } if
END_OF_FILE
if test 964 -ne `wc -c <'prolog-end.p4'`; then
    echo shar: \"'prolog-end.p4'\" unpacked with wrong size!
fi
# end of 'prolog-end.p4'
fi
echo shar: End of archive 2 \(of 14\).
cp /dev/null ark2isdone
MISSING=""
for I in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ; do
    if test ! -f ark${I}isdone ; then
	MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have unpacked all 14 archives.
    rm -f ark[1-9]isdone ark[1-9][0-9]isdone
else
    echo You still need to unpack the following archives:
    echo "        " ${MISSING}
fi
##  End of shell archive.
exit 0