[alt.sources] Spelling utilities in C

gtoal@tharr.UUCP (Graham Toal) (06/28/90)

#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar:	Shell Archiver
#	Run the following text with /bin/sh to create:
#	README #	copyr.h #	dawg.h #	dictlib.h #	grope.h 
cat - << \SHAR_EOF > README

This is a suite of programs for working with words.  I am releasing
it to alt.sources primarily to invite people to test it on as many
platforms as possible (mail me back any bug reports please!) so that
the final code will be completely portable.

Unpack with unshar or sh < part1; sh < part2; sh < part3

Utilities for spelling checking and correction are supplied, but
the ultimate aim is to support all sorts of word-manipulation activities
such as a writers workbench and assorted word games (See PS).

This news posting is in three parts; part1: this file + headers;
part2: utility modules in .c files to be #include'd; part3: main
programs; just CC these and they should work -- no fancy makefiles
or link commands.

The code here doesn't have much of a user interface; I'm rather hoping
that people who pick it up will hook it in to their favourite operating
system as smoothly as they can.  Perhaps someone with the time to
spare (who am I kidding :-( ) might integrate it with ispell for instance.

I've experimented with various interfaces already; I can accept TeX
input and write output suitable for correcting text in either emacs
or uEmacs.  I'm not going to release these though, until I'm happy
with the internal code.  Incidentally, the code will be totally
public domain - meaning that I've no objection to its being used
in commercial projects.  HOWEVER I would appreciate if you didn't do so
until it is officially released (probably through comp.sources.misc)
because I would prefer to define portable file-formats and magic numbers
first.

OK, enough babble; here's how it all works:

1) acquire a list of words for yourself; I have one but at 690K it is
   too big to post. (actually even these sources come to around 120K
   so I hope they make it through OK.  I don't think I have a shar
   that splits, so I've divided into three shars myself.)
   The word list should be sorted by ascii order.

2) cc dawg.c
   I've deliberately made main programs into one source file, by
   including *.c for utility modules; I know this is bad style but
   it helps make compiling and porting easier (no worries about
   long external names for instance, or non-case-sensitive linkers)

3) dawg yourdict dict
   'yourdict' is your wordlist; but the second parameter should
   be 'dict' at least while testing.  This will generate dict.dwg
   which is a compacted and *FAST* data structure for word access.

   Building a dawg[See next section for meaning of name] from a word
   list is a real memory pig; so I've written code which will let you
   generate the dawg in chunks (traded off against a small loss of
   compression density).  (Interestingly enough it isn't a time-pig;
   using a hash-table for node merging gives almost linear time - thanks
   to Richard Hooker for that one).

      If you don't have enough memory (say 2Mb?) you'll get a run-time
   message suggesting that you recompile with cc -DSMALL_MEMORY dawg.c
   (OK, so one day I'll make it select this mode itself at run-time)

   IBM PC's and Mac's will get this mode by default. [See FOOTNOTE]

    If you want to check that the data structure generated is ok,
    3a) cc pdawg.c
    3b) pdawg dict.dwg
        This should decompile the data and print out the
        original word list

4) cc dwgcheck.c
   Compile a Mickey Mouse spelling checker

5) dwgcheck a few wurdz on the comand line ?
   ... and test it.

6) cc tell.c
   Now try the 'advanced' stuff! - soundex correction... (I hates it :-()

7) tell me sume wrongli speld werds
   and test it...

   If you're getting into this stuff, here's another incidental
   hack you might want to try...
    7a)  cc proot.c
    7b)  proot root
         print out all words of form 'root*'
         One day I'll hook this stuff into regexp; but not today :-)

----------

Enough examples for now.  If that all worked you are doing well!
If it didn't, and you have the time, please find out what went wrong;
If you can fix it, mail me the fix please, else mail me a log of
the symptoms, such as compiler error messages.

By the way, I know that the spelling correction uses a tacky algorithm;
don't worry about it -- I'm working on some red hot stuff which will
put soundex to shame (which isn't hard :) )  [Late note; it now works
(apparently well) but I need a phonetic dictionary before it is useful :-(
Anyone got a phonetic dictionary?  Alvey people out there?]

You might be interested in the data structures used; the main one is
a dawg - a 'Directed Acyclic Word Graph' -- it's essentially a
Directed Acyclic Graph implementing an FSM which describes the set of
all words in your lexicon.  The name DAWG comes from Appel's in his paper
on Scrabble (although none of the code or algorithms do).

The second most important data structure is superimposed on this
one; it is a packing algorithm due to Liang (of TeX hyphenation
fame) which allows you to convert an implicitly linked trie
into an indexed trie *without* taking any more space.  Liang is
a real smart cookie & it is a shame this technique of his is
not better known... (it should be up there among binary tree and
linked lists!)

OK, more examples:

8)  cc pack.c
    compile the packing code

9)  pack  dict.dwg dict.pck
    this takes rather a long time; you'll get messages saying '1000 nodes'
    '2000 nodes' etc roughly 1 every 20 seconds +/- 50%.  There shouldn't
    be more than a couple of screenfuls of these :-)  I'll speed it up one
    day.


    If you want to check that this data structure generated is ok too,
    9a) cc ppack.c
    9b) ppack dict.pck
        Just as you did for (3)

10) cc pckcheck.c
    Just like dwgcheck but using the other type of data.

11) pckcheck sume moar possibly incorrectly speld woards
    boring, huh?

12) Tha Tha Thats all folks!

---------------

Unless I've made some major cockup at the last minute, you should
actually have a reasonable chance of getting these working on your
machine, whatever it is. (Dec 20's and EBCDIC machines *possibly*
excepted - attempts welcome!)

As I said, please mail me your bugfixes, problems, suggestions etc.
By the way, the standard of coding isn't exceptionally high; this
implementation was always intended to be a prototype.  A rewrite
of dawg.c and pack.c is definitely high on the priority list...
The actual application programs aren't too bad, but the interfaces
to the various utility routines are liable to change on an hour-by-hour
basis! (I've been trying to make them more consistent -- you should have
seen the last lot ;-) ).

However, the algorithms are all complete now; anyone who has ever needed
to convert a tree to a dag might be interested in the linear-time solution
in dawg.c - most other solutions I've seen have been NlogN or N^2.

If you're interested, there's a file dictlib.h which is a proposed interface
for the next iteration; comments are invited.

Share & Enjoy,

Graham Toal <gtoal@uk.ac.ed>

<FOOTNOTE>
   I'm experimenting in this code with ways of finding out at compile
time various parameters needed for portability.  There's a file
grope.h which does this.  If things all work first time, great;
if not, you may have to change various #define's.  The best way
to do that is to add code to grope.h which identifies your system,
and only make changes of the form:

#ifdef SYS_MYSYSTEM
#undef something
#define something (a different value)
#endif /* my system */

There are instructions for customising in grope.h itself.  My overall
aim is to avoid special makefiles and special command line options.
(A bit ambitious I know, but nice idea if it works...)
(Steve Pemberton's config.c might be useful to you too if you are
hacking grope.h.  It was reposted on usenet recently.)

</FOOTNOTE>

<PS>

Future hacks:
   1) Anagrams
   2) Crosswords
   3) Scrabble
   4) Anything where a text key can be used to lookup another
      piece of text.  This is implemented with the 'dawg_locate_prefix'
      routine; it is effectively a cheap substitute for btrees etc.
      (and better than a hash table because you can *do* things with it!)

      4a) Reverse phone book      1234567=Fred Bloggs
      4b) house style enforcer    stupid=infelicitous
      4c) uk/us spelling advisor  sidewalk=pavement
      4d) shorthand/macro expander   f.y.i.=for your information
      etc etc... (Most of wwb in fact)

   5) Anything you can suggest! (or *implement* :-) )

</PS>

<MEMO>

Archive-name: dawgutils/part[1-3].shar
Reply-to: Graham Toal <gtoal%uk.ac.ed@nsfnet-relay.ac.uk>

1) Upload fixed version of dyntrie.c - remember bug with signed chars
   being used as an index [comment rule in dawg.c?]

2) known design flaw: can't handle strings with 0 in them

3) known bug: dyntrie.c has problems if you enter a string
   which *starts* with 255.  This is due to sending 'end_of_node'
   bit rather hackily.  Can be fixed.

4) Test version to be posted to alt.sources by running on machine
   with signed chars (eg MSDOS)

5) Remove hacky Malloc debugging macros in utils.c -- there might be
   a problem if compiled on MSDOS but not under MICROSOFT C

6) Check that LIB_STRINGS is *not* defined when UX63 is set.

7) tweak the constants in pack.c to speed it up a lot

8) hack the pack.c heuristics into dyntrie.c to speed it up too.

</MEMO>
SHAR_EOF
cat - << \SHAR_EOF > copyr.h
/*
 * Copyright 1989 Graham Toal & Richard Hooker
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose with or without fee is hereby granted provided
 * that the above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation.
 *
 * This program is publicly available, but is NOT in the public domain.  The 
 * difference is that copyrights granting rights for unrestricted use and 
 * redistribution have been placed on the software to identify its 
 * authors.  You are allowed and encouraged to take this software and
 * build commercial products, freeware products, shareware products, and
 * any other kind of product you can think up, with the following restriction:
 *
 * If you redistribute the source to this program, or a derivitive of that
 * source, you must include this copyright notice intact. If the system
 * this source is distributed with is under a stricter license (such as
 * a commercial license, the typical freeware "no commercial use" license,
 * or the FSF copyleft) then you must provide the original source under the
 * original terms.
 *
 * Edinburgh Software Products (E.S.P.) makes no representations about the
 * suitability of this software for any purpose.  It is provided "as is"
 * without express or implied warranty.
 *
 * E.S.P. DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
 * ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
 * E.S.P. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
 * ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
 * IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
 * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *
 * Authors:  Graham Toal & Richard Hooker, Edinburgh Software Products,
 *           6 Pypers Wynd, Prestonpans, East Lothian, Scotland.
 */
SHAR_EOF
cat - << \SHAR_EOF > dawg.h
/**
*
*       DAWG.H
*
*       Header file for Directed Acyclic Word Graph access
*
*       The format of a DAWG node (8-bit arbitrary data) is:
*
*        31                24 23  22  21                                     0
*       +--------------------+---+---+--+-------------------------------------+
*       |      Letter        | W | N |??|            Node pointer             |
*       +--------------------+---+---+--+-------------------------------------+
*
*      where N flags the last edge in a node and W flags an edge as the
*      end of a word. 21 bits are used to store the node pointer, so the
*      dawg can contain up to 262143 edges. (and ?? is reserved - all code
*      generating dawgs should set this bit to 0 for now)
*
*      The root node of the dawg is at address 1 (because node 0 is reserved
*      for the node with no edges).
*
*      **** PACKED tries do other things, still to be documented!
*
**/

#define TRUE (0==0)
#define FALSE (0!=0)

#define DAWG_MAGIC 0x01234567
#define PACK_MAGIC 0x34567890

#define TERMINAL_NODE 0
#define ROOT_NODE 1

#define V_END_OF_WORD   23
#define M_END_OF_WORD   (1L << V_END_OF_WORD)
#define V_END_OF_NODE   22                     /* Bit number of N */
#define M_END_OF_NODE   (1L << V_END_OF_NODE)   /* Can be tested for by <0 */


#define V_FREE          22             /* Overloading this bit, as
                                          packed tries need no end-of-node */
#define M_FREE          (1L << V_FREE)

#define V_LETTER        24
#define M_LETTER        0xFF
#define M_NODE_POINTER  0x1FFFFFL     /* Bit mask for node pointer */

/* max_chars and base_chars are a hangover from the days when this
   was trying to save space, and dawgs were only able to contain
   characters in the range 'a'..'z' 'A'..'Z' (squeezed into 6 bits).
   Now that we're '8-bit clean' we no longer need these.  Later code
   in fact doesn't support the old style; but some procedures still
   subtract 'BASE_CHAR' from the character to normalize it.  Since it
   is now 0 it is a no-op... */
#define MAX_CHARS       256
#define BASE_CHAR       0

/* Enough? */
#define MAX_WORD_LEN    256

#define MAX_LINE        256

typedef long NODE;
typedef long INDEX;

SHAR_EOF
cat - << \SHAR_EOF > dictlib.h
/* This header file does not describe any of the code in the
   accompanying files; it is a rough sketch of the NEXT iteration
   of the spelling/word utilities.  Comments are invited.  Missing
   functionality or bits that won't work should be pointed out
   please!   Send mail to gtoal@ed.ac.uk  (via nsfnet-relay.ac.uk
   from US sites)
   many thanks.

      Graham Toal 23/06/90
 */






typedef long INDEX;
typedef long NODE;
#define ROOTNODE 1L

typedef struct DICT {

  /* Although primitive procedures will be provided for handling the
     basic dawg array, by using this interface applications can
     remain independent of the implementation details of the
     dictionary. */

  /* So far there are three forms of dict; 1: a simple dawg with edges
     as 4-byte integers; each node (a set of branches) is stored
     sequentially.  2: An *indexed* form of the above -- much faster
     to lookup in, but slower to walk through.  Although indexing
     would normally be heavy on space, this uses Liang's packed
     overlapping tries, thus using almost exactly the same space as
     method 1.  3: A form of 1, but with all the stops pulled out
     to get as much bit-level compaction as possible; edges can
     be two bytes instead of 4. (by Keith Bostic) */

  /* This struct contains both access procedures and information;
     most fields will be filled in by our code when the dictionary
     is loaded by load_dict().  The fields may be added to in
     later releases, but they will always be upwards compatible;
     none of this data is saved to disk in this format, so changing
     the struct layout will not cause problems as long as the names
     remain the same. */

  /* Note that some procedures take a user-supplied 'apply' parameter;
     This 'apply' procedure has a 'context' parameter which is merely
     a handle to allow the user to pass data in to his code without
     having to use global variables.  If the user's apply procedure
     returns anything other than 0, it will cause an early exit, with
     the users return code being returned from the calling function. */

  /* It is intended that these are used in an object-oriented way;
     although we will supply an external dict_check(word, dict) procedure, for
     instance, all it will do is call dict->check(word, dict->dawg) */

  NODE *dawg;
                                   /* Root of the dictionary tree */
  char *dict_name,                 /* "smiths" - used in command lines etc. */
       *dict_info,                 /* "Johns Smiths Rhyming Dictionary,\n
                                       (c) Smith & Jones, 1892\n
                                       Whitechapel, London.\n" */
       *dict_fname,                /* "/usr/dict/smiths" */

       *lang_charset;              /* "ISO 8859/1 Latin 1" */

  /* These are filled in by us to make life easier for the applications
     programmer.  We'll supply a default table for simple 7-bit ascii
     dawgs; but this mechanism allows us the option of including all
     the salient points about the character set used in a dictionary
     in the dictionary itself.  Because they are char*'s, we'll probably
     end up sharing the same table between multiple dictionaries --
     so don't write to them. (Although you may replace the _pointer_
     with one to a table of your own. */

  /* Later note: I've just found out about LOCALE.H etc.  This stuff
     may therefore change... */

  int *chartype;
                                    /*  1             whitespace           */
                                    /*  2             punctuation          */
                                    /*  4             blank                */
                                    /*  8             lower case letter    */
                                    /*  16            upper case letter    */
                                    /*  32            (decimal) digit      */
                                    /*  64            underscore           */
                                    /*  128           A-F and a-f          */
                                    /*  256           Ignore words starting
                                                      with this char       */
                                    /*  512           Include in definition
                                                      of a word -- mainly
                                                      alpha but also hyphen
                                                      and apostrophe       */

char   *lower,                     /* personal implementation of tolower() */
       *unaccented,                /* map accented chars to non-accenmted */
       *lexorder;                  /* an arbitrary ordering for sorting:
                                      e-acute would come after e and before f
                                      */

  int  (* check)  (NODE *base, INDEX state, struct DICT *d, char *word);
                     /* Check a word - exact match; return true/false */

  int  (* fuzzy)  (NODE *base, INDEX state, struct DICT *d, char *word,
                   char *value);
     /* Check a word - fuzzy case, accents, hyphens etc; return true/false */
     /* value is filled in with the dictionary's idea of what the
        word should look like. */

  /* Fancier correction procedures may be synthesised out of user code and
     'walk' */
  int  (* correct) (
                char *word,
                NODE *base, INDEX state,
                                /* Normally dict's root, but may be subtree */
                struct DICT *d,   /* lowercase/accent information from here */
                int   maxwords,
                int   order,                /* 0: by alpha
                                               1: by highest score first
                                               2: by lexical ordering
                                            (ignores case, accents, hyphens)
                                             */
                int   (* apply) (
                      char *word,     /* Corrected word */
                            int   score,    /* Normalised to range 0..255 */
                            void *context),
                void  *context);
                                   /* Offer spelling corrections for the
                                      given word.  If maxwords > 0, return
                                      the best <maxwords> words in order
                                      of most-likely first; otherwise
                                      many words may be returned, in
                                      alphabetical order, coupled with
                                      a score: 255 means exact match; 0
                                      means totally different */

                                    /* returns 0 if user returned 0 for
                                       all applications */
  int  (* walk) (
               NODE *base, INDEX state,
               int   (* apply) (char *word, void *context),
               void  *context);
                                   /* Apply the procedure given to every
                                      word in the dictionary tree or subtree,
                                      passing in the user-supplied context */

                                    /* returns 0 if user returned 0 for
                                       all applications */

  int (* lookup) (
              NODE *base, INDEX state,
              char  *key,
              /* Out */
              INDEX values);
                                   /* Return the index of the subtree of the
                                      dictionary which has 'key' as a prefix;
                                      for instance, in a reverse telephone
                                      book, you might find entries:
                                         0874=Anytown
                                         0875=Cockenzie
                                         0875=Port Seton
                                         0875=Prestonpans
                                         0876=Toytown
                                       In this case, if searching for "0875=",
                                       a subtree would be returned containing:
                                         Cockenzie
                                         Port Seton
                                         Prestonpans
                                       This subtree would then be walked
                                       using the 'walk' function and a user-
                                       written 'apply' procedure. */

                                    /* returns 0 if successful */

  /* The following two are performance hints ONLY.  They should not
     affect the correct functioning of a program. */

  int  (* uncache) (struct DICT *d);/* The space a dictionary uses may be
                                       reclaimed by calling this.  If the
                                       dictionary is subsequently accessed,
                                       there *may* be a performance penalty --
                                       for instance the dictionary may be
                                       accessed off disk.  However on a machine
                                       with sufficient memory, the system
                                       may choose to leave the data in ram
                                       until, say, malloc runs out of space.
                                         If there is no convenient mechanism
                                       for organising the demand-recacheing,
                                       this may be a no-op. */

                                    /* returns 0 if successful */

  int  (* recache) (struct DICT *d);/* The data of the dictionary is reloaded
                                       into active memory where it will stay */

                                    /* returns 0 if successful */
  /* May be extended in the future */
} DICT;



/*
                            dict_load

   This is provided as a primitive so that dictionaries can be loaded
in the most efficient way the machine supports.  The space for the dict
is supplied *by* this procedure - not *to* it.  This allows a Mach (or Emas)
implementation to mmap (or Connect) the file in memory - the connection
address being chosen by the OS and outside our control.  It also allows
systems like RISC OS on the Acorn Archimedes to use an optimised whole-
file load call to pull the file off disk into real ram in a single disk
operation.

   Since the data parts of the type 1 & 2 dicts are just a set of 4-byte
integers, it is easy to correct a faulty byte-sex by running down this
array *once at start-up* to reverse the order.  This is easy if the
data is loaded into physical ram.  If it is connected as a file by a VM system
however, the VM system must support copy-on-write; if it only has read-only
mapping there must be two files available -- one of each sex.

   A *possible* scheme on VM systems is to byte-sex-reverse a whole
page the first time that page is touched.  This may be a lot of unneccesary
work - I'd recommend keeping a correct-sex file on-line.

   This code will create the DICT struct, and fill in the various fields,
either from the dictionary file itself or from private knowledge if not available.

*/

int dict_load_file(char *fname, DICT **dict_struct);

                                   /* fname is the literal file name */

                                   /* dict_struct is allocated by us,
                                      not by the user. */

                                   /* returns 0 if successfully loaded */

int dict_load_dict(char *dname, DICT **dict_struct);

                                   /* dname is the dictionary name,
                                      e.g. "words", or "web2" etc.
                                      The corresponding file will be searched
                                      for via a system-dependent path, logical
                                      variable, or fixed place. */

                                   /* dict_struct is allocated by us,
                                      not by the user. */

                                   /* returns 0 if successfully loaded */

 int dict_unload(DICT *d);         /*  Use when totally finished with data */

 /* These are user-level veneers for the internal procedures used above.
    Use them in preference to the above.  Use the above only if
    you are a speed-freak! (but disguise them in macros such as:

  #define  dict_check(dawg, dict, word) dict->check(dict->dawg, dict, word)

   These macros will of course go 'Bang!' if dict ptr is unassigned or NULL.
 */

 /* In the worst case, if a machine has a poor quality compiler which
    doesn't support procedure variables well, these routines could be
    the *acual* implementation rather than just a pointer to it. */

 int  check  (NODE *base, INDEX state, DICT *d, char *word);
 int  fuzzy  (NODE *base, INDEX state, DICT *d, char *word, char *value);
 int  correct (char *word,
               NODE *base, INDEX state,
                           /* Normally dictionary root, but may be subtree */
               DICT *d,          /* lowercase/accent information from here */
               int   maxwords,
               int   order,                /* 0: by alpha
                                              1: by highest score first
                                              2: by lexical ordering
                                            */
               int   apply(char *word,     /* Corrected word */
                           int   score,    /* Normalised to range 0..255 */
                           void *context),
               void  *context);
 int  walk   (NODE *base, INDEX state,     /* Yoyo fanatics will be pleased
                                 to learn that they can 'walk the dawg'... :-) */
              int    apply(char *word, void *context),
              void  *context);
 int lookup (NODE *base, INDEX state,
             char  *key,
             /* Out */
             INDEX values);


/* Now here come the fast internal procedures which the above eventually call */
int dawg_check(NODE *base, INDEX state, char *word);
  /* Standard dawg */
int pack_check(NODE *base, INDEX state, char *word);
  /* Overlapped indexed dawg */
int bsd_check(NODE *base, INDEX state, char *word);
  /* Space-optimised dawg (called bsd because Keith Bostic is working
                   on this format for the next bsd4.4 spelling checker) */
/* Must make the internl procs agree with these... save the macros for later... 
#define dict_check(word,dict) \
          (dict != NULL ? (dict->check(dict->dawg, ROOTNODE, word)) : -1)
*/
int dawg_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
int pack_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
int bsd_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
/*
#define dict_walk(dict, apply, context) \
          (dict != NULL ? (dict->walk(dict->dawg, ROOTNODE, apply, context)) : -1)
*/

int dawg_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
int pack_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
int bsd_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
/* etc... */

/* Get list of possible next characters from this point in the tree,
   put it into user-supplied array branch[256].  Filled with
   NULL if no branch, or an INDEX if it points to more of the tree.
   Words which may terminate on a particular letter have terminate[let] != 0

   The user-supplied terminate array is deliberately a long array to allow
   for expansion; it may be used one day to return arbitrary codes such as
  'Noun' etc...

 */

void dawg_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void pack_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void bsd_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
/* No internal proc yet... add it in the morning. */

/* In fact there will be even simpler procedures which can be used by
   people who prefer to include the whole source of this package rather
   than just the library interface parts.  In applications where utmost
   speed is a neccessity (eg Scrabble(tm)) you could use these, or write
   your own.  But most of us can use this clean interface with unnoticable
   overhead. */
SHAR_EOF
cat - << \SHAR_EOF > grope.h
#ifndef grope_h
#define grope_h 1

#ifdef TESTING_DEFS
/*###################################################################*/
/*
 * This is an exercise to assist in writing portable programs.
 *
 * To use as a header file, eg "grope.h", just leave this file
 * as it is.  To use it to define new entries, rename it as
 * "grope.c" and compile it as "cc -DTESTING_DEFS grope".
 *
 * To customise this test program for your system, I have set it up
 * so that all features are enabled.  If you find that one feature
 * causes a compile-time error, test a suitable set of '#ifdef's for
 * your machine and #undef the values below which are not appropriate.
 *
 * Please do your best to see that your tests are unique, and that
 * you do not undo any tests already implemented.
 *
 * One last request; PLEASE PLEASE PLEASE send me any updates you
 * may make. If you cannot get through to me on email, send me a
 * paper copy (or even better, a floppy copy :-)) to Grahan Toal,
 * c/o 6 Pypers wynd, Prestonpans, East Lothian, Scotland EH32 9AH.
 *
 *   Graham Toal <gtoal@uk.ac.ed>
 * [PS: I can read DOS and RISC OS floppies only]
 */
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
#define KNOWS_LINE 1
#define KNOWS_REGISTER 1
#endif /* TESTING_DEFS */
/*###################################################################*/
#define SYS_ANY 1
#define COMPILER_ANY 1
  /* Don't yet know what machine it is.  Add a test once identified. */
/*--------------------------*/
#ifdef MSDOS
#define SYS_MSDOS 1
#endif
/*--------------------------*/
#ifdef __STDC__
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
  /* If the compiler defines STDC it should have these. We can undef
     them later if the compiler was telling lies! */
#endif
/*--------------------------*/
#ifdef MPU8086
#define SYS_MSDOS 1
  /* Aztec does not #define MSDOS.
     We assume it is Aztec on MSDOS from the MPU8086.
   */
#ifdef __STDC__
#define COMPILER_AZTEC 1
  /* Aztec is known to also define __STDC__ (illegally).  Therefore if
     __STDC__ is not defined, it is probably not Aztec */
#endif
#endif

#ifdef SYS_MSDOS
/*--------------------------*/
#ifdef __STDC__
/*----------------------*/
#define COMPILER_MICROSOFT 1
  /* I assume that the combination of #define MSDOS and #define __STDC__
     without any other collaboration means MICROSOFT.  (Who, incidentally,
     are being naughty by declaring __STDC__) */
#define KNOWS_LINE 1
#else
/*----------------------*/
#ifdef __ZTC__
/*------------------*/
#define COMPILER_ZORTECH 1
#undef STDLIBS
  /* Another system without locale.h */
#define PROTOTYPES 1
#else
/*------------------*/
/* A non-std msdos compiler */
#undef STDLIBS
#undef PROTOTYPES
/*------------------*/
#endif
/*----------------------*/
#endif
/*--------------------------*/
#endif
#ifdef TURBOC
  /* Although Turbo C correctly does not define __STDC__, it has SOME
     standard libraries (but not all - locale.h?) and accepts prototypes. */
#undef STDLIBS
#define PROTOTYPES 1
#define SYS_MSDOS 1
#define COMPILER_TURBOC 1
  /* TURBOC is defined, but has no value.  This allows it to be tested
     with #if as well as #ifdef. */
#endif
/*--------------------------*/
#ifdef COMPILER_MICROSOFT
#undef STDLIBS
  /* Again, like Turbo C, does not know locale.h */
#define PROTOTYPES 1
#endif
/*--------------------------*/
#ifdef SYS_MSDOS
#define MEMMODELS 1
  /* We assume ALL MSDOS machines use memory models */
#endif
/*--------------------------*/
#ifdef UX63
#undef STDLIBS
#undef PROTOTYPES
#define SYS_UX63 1
#define COMPILER_PCC 1
/*#define LIB_STRINGS 1 - apparently not... */
#endif
/*--------------------------*/
#ifdef sun
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SUN 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef THINK_C
#define SYS_MAC 1
#define COMPILER_THINKC 1
#define KNOWS_VOID 1
#endif
/*--------------------------*/
#ifdef sparc
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SPARC 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef ARM
#define SYS_RISCOS 1
  /* I fear Acorn may define 'ARM' on both unix and risc os versions.
     Should find out whether they define others as well, to differentiate. */
#endif
#ifdef __ARM__
#define SYS_RISCOS 1
  /* I fear Acorn may define 'ARM' on both unix and risc os versions.
     Should find out whether they define others as well, to differentiate. */
#endif
/*--------------------------*/
#ifdef SYS_RISCOS
#define COMPILER_NORCROFT 1
#define KNOWS_REGISTER 1
#define KNOWS_LINE 1
#endif
/*--------------------------*/
#ifdef vms
#define SYS_VMS 1
#endif
/*--------------------------*/

/*--------------------------*/
#ifdef SYS_UX63
#undef SYS_ANY
#endif
#ifdef SYS_ARM
#undef SYS_ANY
#endif
#ifdef SYS_MSDOS
#undef SYS_ANY
#endif
#ifdef SYS_SUN
#undef SYS_ANY
#endif
#ifdef SYS_SPARC
#undef SYS_ANY
#endif
#ifdef SYS_RISCOS
#undef SYS_ANY
#endif
#ifdef SYS_MAC
#undef SYS_ANY
#endif
#ifdef SYS_VMS
#undef SYS_ANY
#endif
/*--------------------------*/
#ifdef COMPILER_PCC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_MICROSOFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_TURBOC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_ZORTECH
#undef COMPILER_ANY
#endif
#ifdef COMPILER_AZTEC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_NORCROFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_THINKC
#undef COMPILER_ANY
#endif
/*--------------------------*/
/* ##################################################################### */
#ifdef TESTING_DEFS
#include <stdio.h>
/* ======================================================================= */
#ifdef STDLIBS
              /* If any of these fail, make sure STDLIBS is not
                 defined for your machine. */
#include <stdlib.h>      /* STDLIBS should not be defined. */
#include <time.h>        /* STDLIBS should not be defined. */
#include <locale.h>      /* STDLIBS should not be defined. */
#endif
/* ======================================================================= */
#ifdef KNOWS_VOID
void test() {            /* KNOWS_VOID should not be defined */
  /* Make sure your ifdef's don't #define KNOWS_VOID if this fails */
}
#endif
/* ======================================================================= */
#ifdef KNOWS_REGISTER
int regtest() {
register int i = 0;          /* KNOWS_REGISTER should not be defined */
  /* Make sure your ifdef's don't #define KNOWS_REGISTER if this fails */
  return(i);
}
#endif
/* ======================================================================= */
#ifdef PROTOTYPES
int main(int argc, char **argv)   /* PROTOTYPES should not be defined */
/* If this fails, make sure PROTOTYPES is not defined for your machine. */
#else
int main(argc,argv)
int argc;
char **argv;
#endif
{
/*-------------------------------------------------------------------------*/
  printf("We should know just what the machine is, or 'SYS_ANY':\n");
#ifdef SYS_UX63
  printf("#define SYS_UX63 %d\n", SYS_UX63);
#endif
#ifdef SYS_ARM
  printf("#define SYS_ARM %d\n", SYS_ARM);
#endif
#ifdef SYS_MSDOS
  printf("#define SYS_MSDOS %d\n", SYS_MSDOS);
#endif
#ifdef SYS_SUN
  printf("#define SYS_SUN %d\n", SYS_SUN);
#endif
#ifdef SYS_SPARC
  printf("#define SYS_SPARC %d\n", SYS_SPARC);
#endif
#ifdef SYS_RISCOS
  printf("#define SYS_RISCOS %d\n", SYS_RISCOS);
#endif
#ifdef SYS_MAC
  printf("#define SYS_MAC %d\n", SYS_MAC);
#endif
#ifdef SYS_VMS
  printf("#define SYS_VMS %d\n", SYS_VMS);
#endif
#ifdef SYS_ANY
  printf("#define SYS_ANY %d\n", SYS_ANY);
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine has different memory models or not:\n");
#ifdef MEMMODELS
  printf("#define MEMMODELS %d\n", MEMMODELS);
#else
  printf("#undef MEMMODELS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("One compiler name should be given, or 'COMPILER_ANY':\n");
#ifdef COMPILER_PCC
  printf("#define COMPILER_PCC %d\n", COMPILER_PCC);
#endif
#ifdef COMPILER_MICROSOFT
  printf("#define COMPILER_MICROSOFT %d\n", COMPILER_MICROSOFT);
#endif
#ifdef COMPILER_TURBOC
  printf("#define COMPILER_TURBOC %d\n", COMPILER_TURBOC);
#endif
#ifdef COMPILER_ZORTECH
  printf("#define COMPILER_ZORTECH %d\n", COMPILER_ZORTECH);
#endif
#ifdef COMPILER_AZTEC
  printf("#define COMPILER_AZTEC %d\n", COMPILER_AZTEC);
  /* Can exist on other machines as well as MSDOS */
#endif
#ifdef COMPILER_LATTICE
  /* Currently coming through as 'COMPILER_ANY' - haven't found one to test */
  /* Can exist on other machines as well as MSDOS. Meanwhile pass it in     */
  /* on the command_line?                                                   */
  printf("#define COMPILER_LATTICE %d\n", COMPILER_LATTICE);
#endif
#ifdef COMPILER_GCC
  /* Ditto. Test in appropriate places for each machine. */
  printf("#define COMPILER_GCC %d\n", COMPILER_GCC);
#endif
#ifdef COMPILER_NORCROFT
  printf("#define COMPILER_NORCROFT %d\n", COMPILER_NORCROFT);
#endif
#ifdef COMPILER_THINKC
  printf("#define COMPILER_THINKC %d\n", COMPILER_THINKC);
#endif
#ifdef COMPILER_ANY
  printf("#define COMPILER_ANY %d\n", COMPILER_ANY);
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the compiler accepts ANSI-like libraries or not:\n");
#ifdef STDLIBS
  printf("#define STDLIBS %d\n",STDLIBS);
#else
  printf("#undef STDLIBS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts ANSI prototypes or not:\n");
#ifdef PROTOTYPES
  printf("#define PROTOTYPES %d\n", PROTOTYPES);
#else
  printf("#undef PROTOTYPES\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine needs nonstandard #include <strings.h> or not:\n");
#ifdef LIB_STRINGS
  printf("#define LIB_STRINGS %d\n", LIB_STRINGS);
#else
  printf("#undef LIB_STRINGS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts the 'void' keyword or not:\n");
#ifdef KNOWS_VOID
  printf("#define KNOWS_VOID %d\n", KNOWS_VOID);
#else
  printf("#undef KNOWS_VOID\n");
#endif
  printf("Either the machine accepts the 'register' keyword or not:\n");
/*-------------------------------------------------------------------------*/
  printf("Either the compiler accepts the '__LINE__' directive or not:\n");
#ifdef KNOWS_LINE
  printf("#define KNOWS_LINE %d\n", KNOWS_LINE);
#else
  printf("#undef KNOWS_LINE\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts the 'register' keyword or not:\n");
#ifdef KNOWS_REGISTER
  printf("#define KNOWS_REGISTER %d\n", KNOWS_REGISTER);
#else
  printf("#undef KNOWS_REGISTER\n");
#endif
  /* Note - this is intended to be used only if your compiler accepts
     'register' *AND* compiles code with it correctly!  Some compilers
     show up bugs when register optimisations are attempted! */
/*-------------------------------------------------------------------------*/
  if (argv[argc] != 0) printf("Warning! argv[argc] != NULL !!! (Non ansii)\n");
}
#endif /* TESTING_DEFS */
#endif /* Already seen? */

SHAR_EOF

gtoal@tharr.UUCP (Graham Toal) (06/28/90)

#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar:	Shell Archiver
#	Run the following text with /bin/sh to create:
#	check.c #	correct.c #	dyntrie.c #	init.c #	locate.c #	print.c #	similcmp.c #	soundex.c #	utils.c 
cat - << \SHAR_EOF > check.c
/*

    File:    check.c
    Author:  Graham Toal
    Purpose: Check a word using DAWG or TRIE.
    Functions exported:  dawg_check, pack_check
    Internal function:   dawg_ck, pack_ck

Description:

    call as:     dawg_check(dawg, "word")
                 pack_check(trie, "word")

   Simple spelling-check. Unfortunately the two interfaces for
   DAWGs and TRIEs are different, and even DAWG stuff differs from
   program to program.  The problem is primarily in how to spot
   the end of a search; at the moment it is done when a particular
   pointer points to the root. Unfortunately we enter the data
   structure either at root+1 or at some arbitrary point, so this
   scheme means passing in two pointers.  The next idea is to check
   that the *data* pointed to by the terminating pointer is 0; this
   would be OK but I was hoping to leave the initial word in the
   dawg free for expansion... (dawg contains [0, Node1..NodeN])

   *best* idea would be to terminate on *INDEX* being 0.  This is
   probably also simplest & I'll code it up properly when I'm awake...

*/

static int
#ifdef PROTOTYPES
dawg_ck(NODE PCCRAP *dict, NODE PCCRAP *edge, char *word)
#else
dawg_ck(dict, edge, word)
NODE PCCRAP *dict;
NODE PCCRAP *edge;
char *word;
#endif
{
  if (edge == dict) return(0!=0);
  for (;;) {
    if (*word == (((*edge >> V_LETTER) & M_LETTER))) {
      if (*++word == '\0') {
        return((*edge & M_END_OF_WORD) != 0);
      } else {
        if ((edge = dict+(*edge & M_NODE_POINTER)) == dict) break;
        continue;
      }
    }
    if (((*edge++) & M_END_OF_NODE) != 0) break;
  }
  return(0!=0);
}

int
#ifdef PROTOTYPES
dawg_check(NODE PCCRAP *dict, char *word)
#else
dawg_check(dict, word)
NODE PCCRAP *dict;
char *word;
#endif
{
  return(dawg_ck(dict, dict+ROOT_NODE, word));
}

static int
#ifdef PROTOTYPES
pack_ck(NODE PCCRAP *ptrie, INDEX p, char *s)
#else
pack_ck(ptrie, p, s)
NODE PCCRAP *ptrie;
INDEX p;
char *s;
#endif
{
/* Note: node is passed in as a parameter so that it is in a register -
   otherwise it would take an extra instruction every time it was accessed.
   We trust the compiler to allocate register variables most efficiently --
   when people tinker, it might improve one machine but ruin another...
 */

#define next_letter(s) \
  p = ptrie[p + *s]; \
  if (((p >> V_LETTER) & M_LETTER) != *s++) return(0!=0); \
  if (*s == '\0') return((p >> V_END_OF_NODE) != 0); \
  if ((p &= M_NODE_POINTER) == 0) return(0!=0)

  /* Depending on whether your machine has a cache or not, and whether
     it optimises pipeline fetches, you should decrease or increase the
     number of macro instantiations in the loop below appropriately */

  for (;;) {
    next_letter(s);    next_letter(s);    next_letter(s);    next_letter(s);
  }
}

int
#ifdef PROTOTYPES
pack_check(NODE PCCRAP *ptrie, char *s)
#else
pack_check(ptrie, s)
NODE PCCRAP *ptrie;
char *s;
#endif
{
  return(pack_ck(ptrie, 1L, s));
}
SHAR_EOF
cat - << \SHAR_EOF > correct.c
/*

    File:    correct.c
    Author:  Graham Toal
    Purpose: Correct a word using Soudex and similcmp on DAWG.
    Functions exported:  dawg_correct
    Internal functions:  reccmp, walk_rest_dawg, walk_dawg

Description:
   Reduce word to rough phonetic representation.
   Reduce all words in dawg to same form.  Compare.  This
   results in a large list of plausible (HA!) candidates.

   Then apply 'similcmp' which returns a distance metric representing
   closeness of two words.  Sort by this metric, and return N best
   matches.  If one stands out above others, return it alone.

   I'm working on a *much better* algorithm, but this will do
   for prototyping the interface.

   Final version of this will have callbacks to handle corrected
   words as they are produced.

*/

#include "soundex.c"
#include "similcmp.c"

typedef struct {
  char *possible;
  int score;
} rec;


/* Numbers of parameters exploded during process of removing global state :( */

static void
#ifdef PROTOTYPES
walk_rest_dawg(NODE PCCRAP *dawg, INDEX node, int len,
  char *word, char *original, char *target,
  rec *ans, int *nextfreep)
#else
walk_rest_dawg(dawg, node, len, word, original, target, ans, nextfreep)
  NODE PCCRAP *dawg;
  INDEX node;
  int len;
  char *word;
  char *original;
  char *target;
  rec *ans;
  int *nextfreep;
#endif
#define nextfree (*nextfreep)
{
  NODE PCCRAP *edge;
  long c;

    for (edge = (NODE PCCRAP *)&dawg[node]; ;edge++) {
    c = *edge;            /* Avoid MS C compiler bug :-( */
    c = c >> V_LETTER;
    c = c & M_LETTER;
    word[len] = (char)c;
    if ((*edge & M_END_OF_WORD) != 0) {
      word[len+1] = 0;
      if (strcmp(soundex(word), target) == 0) {
        ans[nextfree].possible = (char *)Malloc(len+1, 1);
        if (ans[nextfree].possible == NULL) {
          fprintf(stderr,"I\'ve no idea - it could be anything...\n");
          exit(0);
        }
        strcpy(ans[nextfree].possible, word);
        ans[nextfree].score = simil(word, original);
        if (nextfree < 511) nextfree++;
      }
    }
    c = *edge & M_NODE_POINTER;
    if (c != 0) {
      walk_rest_dawg(dawg, c, len+1, word, original, target, ans, nextfreep);
    }
    if ((*edge & M_END_OF_NODE) != 0) break; /* End of node */
  }
#undef nextfree
}


static void
#ifdef PROTOTYPES
walk_dawg(
  NODE PCCRAP *dawg, INDEX node, int len,
  char *original, char *target,
  char targets, char *word,
  rec *ans, int *nextfreep)
#else
walk_dawg(dawg, node, len, original, target, targets, word, ans, nextfreep)
  NODE PCCRAP *dawg;
  INDEX node;
  int len;
  char *original;
  char *target;
  char targets;
  char *word;
  rec *ans;
  int *nextfreep;
#endif
#define nextfree (*nextfreep)
{
  NODE PCCRAP *edge;
  long c;

  edge = (NODE PCCRAP *)&dawg[node];
  /* Only search trie starting with initial letter */
  for (;;) {
    c = *edge;            /* Avoid MS C compiler bug :-( */
    c = c >> V_LETTER;
    c = c & M_LETTER;
    word[len] = (char)c;
    if (c == targets) {
      c = *edge & M_NODE_POINTER;
      walk_rest_dawg(dawg, c, len+1, word, original, target, ans, nextfreep);
      return;
    }
    edge++;
  }
#undef nextfree
}

static int
#ifdef PROTOTYPES
reccmp(const void *a, const void *b)
#else
reccmp(a, b)
  void *a;
  void *b;
#endif
{
  return(((rec *)b)->score - ((rec *)a)->score);
}


int
#ifdef PROTOYPES
dawg_correct(NODE PCCRAP *dawg, char *original)
#else
dawg_correct(dawg, original)
NODE PCCRAP *dawg;
char *original;
#endif
{
char word[81];
char target[256];
static rec ans[256];
  /* static because brain-dead MSC can't hack a big stack :-( */
int targets;
int nextfree = 0;
int i, prev=0;

  targets = original[0];
  strcpy(target, soundex(original));
  walk_dawg(
    dawg, (INDEX)ROOT_NODE, 0, original, target, targets, word, ans, &nextfree);
  if (nextfree == 0) return(FALSE);
  qsort(ans, (size_t)nextfree, sizeof(rec), /*&*/reccmp);
  for (i = 0; i < nextfree; i++) {
    if (((prev*100) / (ans[i].score)) > 120) break;
    fprintf(stdout, "%s (%d)\n", ans[i].possible, ans[i].score);
    if (ans[i].score >= 100) break;
    if (i == 5) break;
    prev = ans[i].score;
  }
  return(TRUE);
}
SHAR_EOF
cat - << \SHAR_EOF > dyntrie.c
/* The #define is to repair mangling by BITNET mailers */
#define NOT ~
              /* This should be Tilde (twiddle) -- C unary Not   */
/*

    File:    dyntrie.c
    Author:  Graham Toal
    Purpose: Check a word using DAWG or TRIE.
    Functions exported:  trie_new, trie_add, trie_dump
    Internal functions:  insert_simple recurse_add

Description:

   Builds DAWG-compatible trie in store, incrementally; words do not
   have to be presented in order.  (Note it is NOT a 'packed-trie';
   in our scheme it is closer to a dawg - but without the tail-compression
   precomputed dawgs have).

   Any tries built using this should be used by check_dawg, print_dawg etc.

   Has rather nice property of only claiming enough memory for the
   job; dynamically moves the data structure when it fills!

*/

  /* To avoid global state, I'm putting these useful bits of info
     in the two words before the dawg itself; I *HOPE* that all systems
     allow negative indexing off arrays; I'm not at all sure they do
     though.  However since I moved the base of the dict up by adding
     2 to it, I *should* be able to get the old address by by subtracting
     2 from it, so I'm using the first pair of macros below, not the
     more intuitive second pair.
       I should really do this for ordinary dawgs & tries too.  Indeed
     I should *really* implement the dictlib.h interface, but thats a
     month or two off.  Also I'ld like some feedback first.
   */

#define EXTRAS 2
#define MAX_ENTRIES(dawg) (*(dawg-2))          /* Safe way of aliasing */
#define NEXT_FREE(dawg) (*(dawg-1))


  /* #define MAX_ENTRIES(dawg) dawg[-2] */     /* 'Obvious' alias      */
  /* #define NEXT_FREE(dawg) dawg[-1] */       /* may be buggy         */

#define EMPTY 0
#define INIT_MAX_ENTRIES 512

/* Slop is so that we don't need to be continually checking nextfree
   as being close to max_entries */
#define SLOP 12

/* A debugging macro which rangechecks arrays -- happens when
   utils.c is included with RCHECK defined. Otherwise no overhead. */
#define DICT(idx) dict[RANGECHECK(idx, MAX_ENTRIES(dict))]


  /*
      This quick hack job has special-case code in each function
      for the root node-set.  Its rather tacky; I could clean it
      up and make the root node like all other nodes, but why
      bother; this is only test code anyway...
   */



static INDEX
#ifdef PROTOTYPES
insert_simple(NODE PCCRAP **dictp, char *word)
#else
insert_simple(dictp, word)
NODE PCCRAP **dictp;
char *word;
#endif
#define dict (*dictp)
{
NODE bugfix;
INDEX i; NODE n; int c;

  if (NEXT_FREE(dict) >= MAX_ENTRIES(dict)-SLOP) {
    NODE PCCRAP *moved;
    moved = MALLOC((SIZE_T)MAX_ENTRIES(dict)*2+EXTRAS, sizeof(NODE));
    if (moved != NULL) {
      moved += EXTRAS;
      MAX_ENTRIES(moved) = MAX_ENTRIES(dict);
      NEXT_FREE(moved) = NEXT_FREE(dict);
      /* Should use realloc but appears to be buggy :-( */
      for (i = MAX_ENTRIES(dict); i < MAX_ENTRIES(dict)*2; i++) {
        moved[i] = EMPTY;
      }
      for (i = 0; i < MAX_ENTRIES(dict); i++) moved[i] = DICT(i);
      dict -= EXTRAS;
      FREE(dict);
      dict = moved; MAX_ENTRIES(dict) *= 2;
    } else {
      fprintf(stderr, "Can\'t add any more words again\n");
    }
  }

  c = (*word++)&255;
  if (c == '\0') return(TERMINAL_NODE);
  i = NEXT_FREE(dict)++;
  bugfix = insert_simple(&dict, word);
  n = ((NODE)c << (NODE)V_LETTER) | M_END_OF_NODE | bugfix;
  DICT(i) = n; if (*word == '\0') DICT(i) |= M_END_OF_WORD;
  return(i);
#undef dict
}

static void
#ifdef PROTOTYPES
recurse_add(NODE PCCRAP **dictp, INDEX base, char *word)
#else
recurse_add(dictp, base, word)
NODE PCCRAP **dictp;
INDEX base;
char *word;
#endif
#define dict (*dictp)
{
NODE bugfix;
INDEX i = base, newbase;
NODE unpicked[MAX_CHARS], n;
int c, ch;
int gap, nextslot = 0, j;
  /* First see if first char is already in trie */
  ch = (*word++)&255;
  for (;;) {
    c = (int)(((dict[i] >> V_LETTER) & M_LETTER)&255);
    if (ch == c) {
      newbase = dict[i]&M_NODE_POINTER;
      if (newbase == 0) {
        /* have abc (this is c), adding (abc)defg */
        dict[i] &= (NOT M_NODE_POINTER);
        bugfix = insert_simple(&dict, word);
        dict[i] |= bugfix;
      } else {
        if (*word == '\0') {
          dict[i] |= M_END_OF_WORD;
        } else {
          recurse_add(dictp, newbase, word);
        }
      }
      return;
    }
    if ((dict[i]&M_END_OF_NODE) != 0) break;
    i++;
  }

  /* unpick base */
  for (nextslot = 0; nextslot < MAX_CHARS; nextslot++) {
    unpicked[nextslot] = EMPTY;
  }
  nextslot = 0;

  i = base; j = 0;
  for (;;) {
    if ((j == 0) && (((dict[i] >> V_LETTER) & M_LETTER) > ch)) {
      j = 1; newbase = nextslot++;
    }
    n = dict[i]; dict[i] = EMPTY; i++;
    unpicked[nextslot++] = n & (NOT M_END_OF_NODE);
    if ((n & M_END_OF_NODE) != 0) break;
  }
  if (j == 0) newbase = nextslot++; /* Was it last alphabetically? */
  /* add this char to the node */
  if (*word == '\0') {
    unpicked[newbase] =
      ((NODE)ch << (NODE)V_LETTER) | TERMINAL_NODE | M_END_OF_WORD;
  } else {
    /* and insert the rest of the
       string with insert_simple */
    bugfix = insert_simple(&dict, word);
    unpicked[newbase] =
      ((NODE)ch << (NODE)V_LETTER) | bugfix;
      /* The insert_simple call has to come after above loop, because
         of freeing slots... */
  }
  unpicked[nextslot-1] |= M_END_OF_NODE;

  /* scan for suitably-sized gap */
  /* MAX_CHARS+1 should be first 'real' node base (0, 1..MAX_CHARS reserved) */

  /*
     The particular application this is wanted for doesn't have a large
     number of words to be added dynamically, so the BFI code below which
     finds free holes in the trie space works fine.  However if some bright
     spark cares to keep a freelist *in the holes* then this program
     effectively implements a linear-time sorting algorithm.

     (I know it's not *really*, but think about it before jumping
     down my throat; the N^2 case is very statistically unlikely)
   */

  for (i = (INDEX)MAX_CHARS+1; i < MAX_ENTRIES(dict)-nextslot-SLOP; i++) {
    /*
        Even without the freelist, the sneaky hack in pack.c for
        noting earliest bunch of free holes might well make a
        significant difference... (It was a lowest_search_start
        variable and used a bit of hysteresis)
     */
    gap = TRUE;
    for (j = 0; j < nextslot; j++) {
      if (dict[i+j] != EMPTY) {
        gap = FALSE;
        break;
      }
    }
    if (gap) break;
  }
  if (!gap) {
    NODE PCCRAP *moved;
    moved = MALLOC((SIZE_T)MAX_ENTRIES(dict)*2+EXTRAS, sizeof(NODE));
    if (moved != NULL) {
      moved += EXTRAS;
      MAX_ENTRIES(moved) = MAX_ENTRIES(dict);
      NEXT_FREE(moved) = NEXT_FREE(dict);
      /* Should use realloc but appears to be buggy :-( */
      for (i = MAX_ENTRIES(dict);
           i < MAX_ENTRIES(dict)*2; i++) moved[i] = EMPTY;
      for (i = 0; i < MAX_ENTRIES(dict); i++) moved[i] = DICT(i);
      dict -= EXTRAS;
      FREE(dict);
      dict = moved; /* If your compiler has aliasing bugs they may hit here. */
      MAX_ENTRIES(dict) *= 2;
      /* scan for suitably-sized gap */
      for (i = ((MAX_ENTRIES(dict)/2)-(MAX_CHARS+1));
           i < MAX_ENTRIES(dict)-nextslot; i++) {
        gap = TRUE;
        for (j = 0; j < nextslot; j++) {
          if (dict[i+j] != EMPTY) {
            gap = FALSE;
            break;
          }
        }
        if (gap) break;
      }
    }
    if (!gap) {
      fprintf(stderr, "Can\'t add any more words\n");
      return;
    }
  }
  newbase = i;
  /* reinsert */
  for (j = 0; j < nextslot; j++) {
    dict[newbase+j] = unpicked[j];
  }
  if (newbase+nextslot-1 >= NEXT_FREE(dict)) NEXT_FREE(dict) = newbase+nextslot;
  /* update pointer to the base of this node */
  for (i = 0; i < MAX_ENTRIES(dict); i++) {
    if ((dict[i] & M_NODE_POINTER) == base) {
      dict[i] &= (NOT M_NODE_POINTER);
      dict[i] |= newbase;
      break;            /* (should only be one since trie, not dawg) */
    }
  }
#undef dict
}

void
#ifdef PROTOTYPES
trie_add(NODE PCCRAP **dictp, char *word)
#else
trie_add(dictp, word)
NODE PCCRAP **dictp;
char *word;
#endif
#define dict (*dictp)
{
INDEX next;
INDEX bugfix;
int c;
  /* Root node is pre-reserved MAX_CHARS entries */
  c = (*word++)&255; /* bugfix - remember chars can be signed :-( */
  if (DICT((INDEX)ROOT_NODE+c) == EMPTY) {
    /* I'm allowing the root node to be sparse for speed; it *should*
       be dense like any other node.  May change this later */
    if (*word == '\0') {
      DICT((INDEX)ROOT_NODE+c) =
        ((NODE)c << (NODE)V_LETTER) | M_END_OF_WORD;
    } else {
      bugfix = insert_simple(&dict, word);
      DICT((INDEX)ROOT_NODE+c) =
        ((NODE)c << (NODE)V_LETTER) | bugfix;
    }
  } else {
    if (*word == '\0') {
      DICT((INDEX)ROOT_NODE+c) |= M_END_OF_WORD;
    } else {
      next = DICT((INDEX)ROOT_NODE+c) & M_NODE_POINTER;
      if (next == 0) {
        /* have 'a', adding 'abcd' */
        /* (Should really check the letter for safety...) */
        bugfix = insert_simple(&dict, word);
        DICT((INDEX)ROOT_NODE+c) = ((NODE)c << (NODE)V_LETTER)
          | bugfix | M_END_OF_WORD;
      } else {
        recurse_add(dictp, next, word);
      }
    }
  }
#undef dict
}

int
#ifdef PROTOTYPES
trie_new(NODE PCCRAP **dawgp)
#else
trie_new(dawgp)
NODE PCCRAP **dawgp;
#endif
#define dawg (*dawgp)
{
INDEX i;

  dawg = MALLOC(INIT_MAX_ENTRIES+EXTRAS, sizeof(NODE));
  dawg += EXTRAS;

  MAX_ENTRIES(dawg) = INIT_MAX_ENTRIES; NEXT_FREE(dawg) = MAX_CHARS+1;

  dawg[0] = 0;
  for (i = 1; i <= MAX_CHARS; i++) dawg[i] = 0;
  dawg[MAX_CHARS] |= M_END_OF_NODE;
  /* 0 base, MAX_CHARS chars */

  return(TRUE);
#undef dawg
}

int
#ifdef PROTOTYPES
trie_dump(NODE PCCRAP *dawg, char *filename)
#else
trie_dump(dawg, filename)
NODE PCCRAP *dawg;
char *filename;
#endif
{
INDEX cnt, max;
FILE *dumper;
  max = NEXT_FREE(dawg)*sizeof(NODE);
    /* I *knew* the next_free variable would come in handy :-) */
  dumper = fopen(filename, WRITE_BIN);
  if (dumper == NULL) {
    fprintf(stderr, "Failed to dump trie on file \'%s\'\n", filename);
    return(FALSE);
  }
  putword(max, dumper);
  if ((cnt = putwords(dawg, max, dumper)) != max) {
    fprintf(stderr, "Failed to dump: %ld bytes written instead of %ld\n",
      cnt, max);
    return(FALSE);
  }
  fclose(dumper);
  return(TRUE);
}

#undef MAX_ENTRIES
#undef NEXT_FREE
#undef DICT
#undef SLOP
SHAR_EOF
cat - << \SHAR_EOF > init.c
/*

    File:    init.c
    Authors: Richard Hooker, Graham Toal
    Purpose: Loading of dictionary for spelling checker.
    Functions exported:  dawg_init, pack_init
    Internal functions:  spell_init

Description:

The address of a dictionary (either a PACKed dict or a DAWG) is set up by
this procedure.  This gives us the option of connecting the dict in read-only
(or copy-on-write) virtual memory.  On non-VM machines, we simply allocate a
large buffer into which the relevant part of the dictionary is loaded.

The magic number is used to check the dict type, *and* the machine byte-sex.
If the sex is reversed, even a VM system has to load the data into writable
memory (so that it can reverse it).

*/

/*######################### INTERNAL FUNCTIONS #########################*/


static int
#ifdef PROTOTYPES
spell_init(char *filename, NODE PCCRAP **dictp,
  char *DEFAULT_DICT, long magic_number, INDEX *nedges)
#else
spell_init(filename, dictp, DEFAULT_DICT, magic_number, nedges)
char *filename;
NODE PCCRAP **dictp;
char *DEFAULT_DICT;
long magic_number;
INDEX *nedges;
#endif
#define dict (*dictp)
{
FILE *fp; INDEX count;

  /*  init_dict("") gets default */
  if (*filename == '\0') filename = DEFAULT_DICT;

  /* Open the file and find out the size of the dict -- which
     is stored in the first word.  Later I'll change the dict format
     to have a header, and the header will have to be skipped by
     this module. */

  if ((fp = fopen(filename, "rb")) == NULL) {
    fprintf (stderr, "Can\'t open file \"%s\"\n", filename);
    return(FALSE);
  }
  *nedges = getword(fp);
#ifdef DEBUG
fprintf(stderr, "dawg contains %8lx edges\n", *nedges);
#endif
  /* Allocate precisely enough memory for all edges + 0 at root node. */
  if ((dict = MALLOC((SIZE_T)((*nedges)+1), sizeof(NODE PCCRAP *))) == 0) {
    fprintf (stderr, "Can\'t allocate space for dictionary\n");
    return(FALSE);
  }

  dict[0] = 0; /* Root node set to 0; terminal nodes point to 0. */

  /* Load in the dictionary.  Routine 'getwords' should be efficient */
  count = getwords(&dict[1], (long)(4*(*nedges)), fp);
  if (count != 4*(*nedges)) {
    fprintf(stderr,
      "Failed to read dictionary %s - wanted %ld bytes, got %ld\n",
      filename, 4*(*nedges), count);
    return(FALSE);
  }
  fclose(fp);

  return(TRUE);
#undef dict
}

/*####################### EXPORTED FUNCTIONS #########################*/

int
#ifdef PROTOTYPES
dawg_init(char *filename, NODE PCCRAP **dawgp, INDEX *nedges)
#else
dawg_init(filename, dawgp, nedges)
char *filename;
NODE PCCRAP **dawgp;
INDEX *nedges;
#endif
{
  return(spell_init(filename, dawgp, DEFAULT_DAWG, DAWG_MAGIC, nedges));
}

int
#ifdef PROTOTYPES
pack_init(char *filename, NODE PCCRAP **packp, INDEX *nedges)
#else
pack_init(filename, packp, nedges)
char *filename;
NODE PCCRAP **packp;
INDEX *nedges;
#endif
{
  return(spell_init(filename, packp, DEFAULT_PACK, PACK_MAGIC, nedges));
}

SHAR_EOF
cat - << \SHAR_EOF > locate.c
/*

    File:    locate.c
    Author:  Graham Toal
    Purpose: Find all words beginning with particular prefix.
    Functions exported:  locate_prefix

Description:
   Searches as in spell-check for prefix in dict, but doesn't
   fail if word doesn't terminate at that point.  It returns
   an index into the dict which can be used with print_dawg_prefix
   to display all the words found.  However it is more useful
   than that; text-analysis programs can check that a word matches
   "root*", for instance, when doing stylistic analysis etc.

*/

INDEX
#ifdef PROTOTYPES
dawg_locate_prefix(NODE PCCRAP *dawg, char *word, INDEX edge)
#else
dawg_locate_prefix(dawg, word, edge)
NODE PCCRAP *dawg;
char *word;
INDEX edge;
#endif
{
  for (;;) {
    if (*word == (((dawg[edge] >> V_LETTER) & M_LETTER))) {
      if (*++word == '\0') {
        return(dawg[edge]&M_NODE_POINTER);
      } else {
        if ((edge = (dawg[edge] & M_NODE_POINTER)) == ROOT_NODE) break;
        continue;
      }
    }
    if (((dawg[edge++]) & M_END_OF_NODE) != 0) break;
  }
  /* What to do when none found? -- fail-safe, or some error code...? */
  return(ROOT_NODE);
}
SHAR_EOF
cat - << \SHAR_EOF > print.c
/*

    File:    print.c
    Author:  Graham Toal
    Purpose: Print a packed trie to stderr.
    Functions exported:  dawg_print, pack_print, dawg_print_prefix
    Internal functions:  pack_pr dawg_pr dawg_pr_prefix

Description:
  Pre-order traverse of packed TRIE or DAWG.  Will be modified
  soon to take output file as parameter.  Then sometime after that
  to callback with each string at it is generated.  Meanwhile,
  people requiring custom versions of dawg-walking stuff might
  as well just copy this code and edit it to do what they want.

  The special version print_dawg_prefix takes a NODE from somewhere
  in a dict as a parameter, and prints out only those words which
  contain that node.  You have to locate the node seperately with
  'locate_prefix', and pass the prefix string in so it can be printed.

*/

static void
#ifdef PROTOTYPES
dawg_pr(NODE PCCRAP *dawg, INDEX node, int len)
#else
dawg_pr(dawg, node, len)
NODE PCCRAP *dawg;
INDEX node;
int len;
#endif
{
  static char word[MAX_WORD_LEN];
  NODE PCCRAP *edge;

  for (edge = (NODE PCCRAP *)&dawg[node]; ; edge++) {
  long c;
    c = *edge;           /* Don't rewrite this - its avoiding a MSC bug */
    c = c >> V_LETTER;
    c = c & M_LETTER;
    word[len] = (char)c;
    if ((*edge & M_END_OF_WORD) != 0) {
      word[len+1] = '\0';
      fprintf(stdout, "%s\n", word);
    }
    c = *edge & M_NODE_POINTER;
    if ((*edge & M_NODE_POINTER) != 0)
      dawg_pr (dawg, c, len + 1);
    if ((*edge & M_END_OF_NODE) != 0) break; /* End of node */
  }
}

void
#ifdef PROTOTYPES
dawg_print(NODE PCCRAP *dawg, INDEX node)
#else
dawg_print(dawg, node)
NODE PCCRAP *dawg;
INDEX node;
#endif
{
  dawg_pr(dawg, node, 0);
}

static void
#ifdef PROTOTYPES
pack_pr(NODE PCCRAP *ptrie, INDEX i, int pos)
#else
pack_pr(ptrie, i, pos)
NODE PCCRAP *ptrie;
INDEX i;
int pos;
#endif
{
static char s[MAX_WORD_LEN+1];
int b;
INDEX p;
  for (b = BASE_CHAR; b < BASE_CHAR+MAX_CHARS; b++) {
    if (b != 0) {
      p = ptrie[i+b-BASE_CHAR];
      if (((p>>V_LETTER)&M_LETTER) == b) {
      	s[pos] = b; s[pos+1] = '\0';
        if ((p & M_END_OF_WORD) != 0) fprintf(stderr, "%s\n", s);
        if ((p &= M_NODE_POINTER) != 0) {
          pack_pr(ptrie, p, pos+1);
        }
      }
    }
  }
}


void
#ifdef PROTOTYPES
pack_print(NODE PCCRAP *ptrie, INDEX node)
#else
pack_print(ptrie, node)
NODE PCCRAP *ptrie;
INDEX node;
#endif
{
  pack_pr(ptrie, node, 0);
}


static void
#ifdef PROTOTYPES
dawg_pr_prefix(NODE PCCRAP *dawg, char *prefix, INDEX node, int len)
#else
dawg_pr_prefix(dawg, prefix, node, len)
NODE PCCRAP *dawg;
char *prefix;
INDEX node;
int len;
#endif
{
  NODE PCCRAP *edge;
  static char word[MAX_WORD_LEN];
  long c;

  for (edge = (NODE PCCRAP *)&dawg[node]; ; edge++) {
    /* Don't 'fix' - compiler bugaround for microsoft :-( */
    c = *edge; c = c >> V_LETTER; c = c & M_LETTER;
    word[len] = (char)c;
    if ((*edge & M_END_OF_WORD) != 0) {
      word[len+1] = 0;
      fprintf(stdout, "%s%s\n", prefix, word);
    }
    c = *edge & M_NODE_POINTER;
    if (c != 0) dawg_pr_prefix(dawg, prefix, c, len + 1);
    /* End of node - check repair later - I accidentally nobbled it */
    if ((*edge & M_END_OF_NODE) != 0) break;
  }
}

void
#ifdef PROTOTYPES
dawg_print_prefix(NODE PCCRAP *dawg, char *prefix, INDEX node)
#else
dawg_print_prefix(dawg, prefix, node)
NODE PCCRAP *dawg;
char *prefix;
INDEX node;
#endif
{
  dawg_pr_prefix(dawg, prefix, node, 0);
}
SHAR_EOF
cat - << \SHAR_EOF > similcmp.c
#ifndef similcmp_c
#define similcmp_c 1

#ifdef LIB_STRINGS
#include <strings.h>
#else
#include <string.h>
#endif

/*

    File:    similcmp.c
    Authors: John Ratcliffe, and an anonymous coder.
    Purpose: Better approximate string equality test.
    Functions exported:  simil
    Internal functions:  GCsubstr

Description:
  See below.  I'll replace this one eventually with my own
  algorithm which does sophisticated compound-grapheme analysis
  and returns a degree of phonetic similarity and a probability
  that the transformations made were valid.


  'ghoti==fish' (Match = 90%, Prob = 1%) ;-)
  lauGH = f       match 100% prob 30%
  wOmen = i       match  90% prob 10%
  staTIon = sh    match 100% prob 5%

*/

/* Ratcliff/Obershelp pattern matching
 *
 *      Approximate word matching: takes two words and returns a
 *      similarity score based on co-occurrence of subpatterns.
 *
 *      This code appeared in a letter to the editor in DDJ, 11/88.
 *      The original article on the pattern matching, "Pattern Matching
 *      by Gestalt" by John Ratcliff, appeared in the July 1988
 *      issue (#181) but the algorithm was presented in assembly.  
 *
 *      The 11/88 issue also contained another verison in C which was
 *      a more faithful translation of the original; it has the
 *      advantage of not being recursive.
 *
 *      The algorithm seems to work nicely for a variety of test cases.
 *      The main drawback of the algorithm is the cost of the pairwise
 *      comparisons.  It is significantly more expensive than stemming,
 *      soundex, and the like.  Might consider using this as a second
 *      phase...
 */

int
#ifdef PROTOTYPES
GCsubstr(char *st1, char *end1, char *st2, char *end2)
#else
GCsubstr(st1, end1, st2, end2)
  char *st1;
  char *end1;
  char *st2;
  char *end2;
#endif
{
register char *a1, *a2;
char *b1, *s1, *b2, *s2;
short max, i;

  if (end1 <= st1 || end2 <= st2) return(0);
  if (end1 == st1 + 1 && end2 == st2 + 1) return(0);
                
  max = 0; b1 = end1; b2 = end2;
        
  for (a1 = st1; a1 < b1; a1++) {
    for (a2 = st2; a2 < b2; a2++) {
      if (*a1 == *a2) {
        /* determine length of common substring */
        for (i = 1; a1[i] && (a1[i] == a2[i]); i++) /* do nothing */;
        if (i > max) {
          max = i; s1 = a1; s2 = a2;
          b1 = end1 - max; b2 = end2 - max;
        }
      }
    }
  }
  if (!max) return(0);
  max += GCsubstr(s1 + max, end1, s2 + max, end2);        /* rhs */
  max += GCsubstr(st1, s1, st2, s2);                      /* lhs */
  return(max);
}

int
#ifdef PROTOTYPES
simil(char *s1, char *s2)
#else
simil(s1, s2)
 char *s1;
 char *s2;
#endif
{
int l1 = strlen(s1), l2 = strlen(s2);
 if (strcmp(s1, s2) == 0) return(100); /* exact match end-case */
 return(200 * GCsubstr(s1, s1 + l1, s2, s2 + l2) / (l1 + l2));
}
#endif /* similcmp_c */
SHAR_EOF
cat - << \SHAR_EOF > soundex.c
#ifndef soundex_c
#define soundex_c 1

#ifdef LIB_STRINGS
#include <strings.h>
#else
#include <string.h>
#endif


/* isalpha & toupper; define your own if you don't have ctype.h */
#include <ctype.h>

/*

    File:    soundex.c
    Authors: Jonathan Leffler
    Purpose: Approximate string equality test.
    Functions exported:  soundex
    Internal functions:  nsoundex

Description:

 There are umpteen soundexes around. At least this one is commented...
 (Actually I'd prefer one which understood ph/f and Mac/Mc as special
  cases; short of a proper phonetic alg such as I'm currently developing)
 See below for description:

*/

/*
**      SOUNDEX CODING
**
**      Rules:
**      1.      Retain the first letter; ignore non-alphabetic characters.
**      2.      Replace second and subsequent characters by a group code.
**              Group   Letters
**              1               BFPV
**              2               CGJKSXZ
**              3               DT
**              4               L
**              5               MN
**              6               R
**      3.      Do not repeat digits
**      4.      Truncate or ser-pad to 4-character result.
**
**      Originally formatted with tabstops set at 4 spaces -- you were warned!
**
**      Code by: Jonathan Leffler (john@sphinx.co.uk)
**      This code is shareware -- I wrote it; you can have it for free
**      if you supply it to anyone else who wants it for free.
**
**      BUGS: Assumes 7-ASCII; fails on ISO Latin1
*/

static char lookup[] = {
        '0',    /* A */        '1',    /* B */        '2',    /* C */
        '3',    /* D */        '0',    /* E */        '1',    /* F */
        '2',    /* G */        '0',    /* H */        '0',    /* I */
        '2',    /* J */        '2',    /* K */        '4',    /* L */
        '5',    /* M */        '5',    /* N */        '0',    /* O */
        '1',    /* P */        '0',    /* Q */        '6',    /* R */
        '2',    /* S */        '3',    /* T */        '0',    /* U */
        '1',    /* V */        '0',    /* W */        '2',    /* X */
        '0',    /* Y */        '2'     /* Z */
};

/* Soundex for arbitrary number of characters of information */
#define MAX_SOUNDEX_LEN 10

static char *
#ifdef PROTOTYPES
nsoundex(char *str, int n)
#else
nsoundex(str, n)
char *str;                   /* In: String to be converted */
int n;                       /* In: Number of characters in result string */
#endif
{
  static char buff[MAX_SOUNDEX_LEN];
  register char *s;
  register char *t;
  char c;
  char l;

  if (n <= 0) n = 4;  /* Default */
  if (n > sizeof(buff) - 1)  n = sizeof(buff) - 1;
  t = &buff[0];

  for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++) {
    if (!isalpha(c)) continue;
    c = toupper(c);
    if (t == &buff[0]) {
      l = *t++ = c;
      continue;
    }
    c = lookup[c-'A'];  /* Fails on any alpha not in 'a'..'z' */
    if (c != '0' && c != l) l = *t++ = c;
  }
  while (t < &buff[n]) *t++ = '0'; *t = '\0';
  return(&buff[0]);
}

/* Normal external interface */
char *
#ifdef PROTOTYPES
soundex(char *str)
#else
soundex(str)
char *str;
#endif
{
  return(nsoundex(str, 4));  /* vary this? - try 5 or 6? */
}
#endif /* soundex_c */

SHAR_EOF
cat - << \SHAR_EOF > utils.c
/*

    File:    utils.c
    Author:  Graham Toal
    Purpose: Portability library

Description:

   Most of what differs between operating systems is handled here.
   This module is *incredibly* hacky -- I've just been interested
   in finding the problems so far, not in making the solutions neat.

   The final release of this suite will concentrate on cleaning up
   this file!
*/



/* PENDING: Generic load dict; meanwhile should also put efficient
   msdos file loading into getwords().  See spelldawg for best coding. */

#define SIZE_T size_t
/* MSDos redefines this for huge allocation */

#ifdef SYS_RISCOS
#define DEFAULT_DAWG "run:dict-dwg"
#define DEFAULT_PACK "run:dict-pck"
#else
#ifdef SYS_EMAS
#define DEFAULT_DAWG "dict#dwg"
#define DEFAULT_PACK "dict#pck"
#else
/* Unix, MessDross, etc... */
#define DEFAULT_DAWG "dict.dwg"
#define DEFAULT_PACK "dict.pck"
#endif
#endif


/* Undo any #define's here if they are inappropriate for some system */

#define EXT_CHAR '.'

#define READ_TEXT "r"
#define WRITE_BIN "wb"
#define READ_BIN "rb"


/* system configuration */

#ifdef KNOWS_VOID
#define VOID void
#else
/* As in... void fred(VOID) */
#define void int
#define VOID
#endif

#ifdef SYS_MSDOS
#ifdef COMPILER_ZORTECH
int _stack = 0x3000;
#define PCCRAP far
#else
#ifdef COMPILER_TURBOC
#define PCCRAP far
#else
#define PCCRAP huge
#endif
#endif
#else
#define PCCRAP
#endif

/* Hmph... I don't really want to do realloc too.  Just as well that
   one implmentation is buggy; saves me having to work out what to do :-) */


#ifdef SYS_MSDOS
/* Normally int... */
#undef SIZE_T
#define SIZE_T long
/* (Still to test AZTEC & LATTICE) */
/* Mallocs of > 64K (maybe even > 32K) have to come off the far/huge heap */
#ifdef COMPILER_ZORTECH
#include <dos.h>
#define MALLOC(x,s) (NODE PCCRAP *)farmalloc((x)*(s)) /* Zortech */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) farfree(x)
#define Free(x) free(x)
#else /* else not microsoft */
#ifdef COMPILER_TURBOC
#include <alloc.h>
#define MALLOC(x,s) (NODE PCCRAP *)farmalloc((x)*(s))  /* Turbo */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) farfree(x)
#define Free(x) free(x)
#else /* else not turbo */
#include <malloc.h>
#ifdef NEVER
#define MALLOC(x,s) (NODE PCCRAP *)my_halloc((long)(x),(size_t)(s))
 /* Microsoft, Aztec */
#define Malloc(x,s) my_malloc((x)*(s))
#define FREE(x) my_hfree((void huge *)(x))  /* For some reason MICROSOFT    */
#define Free(x) my_free((void *)x)          /* complains of a type mismatch */
                                         /* without these casts */
#endif
#define MALLOC(x,s) (NODE PCCRAP *)halloc((long)(x),(size_t)(s))
 /* Microsoft, Aztec */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) hfree((void huge *)(x))  /* For some reason MICROSOFT    */
#define Free(x) free((void *)x)          /* complains of a type mismatch */
                                         /* without these casts */
#endif
#endif
#else /* else not SYS_MSDOS */
#ifdef STDLIBS
#include <stdlib.h>  /* for malloc, free & exit */
#define MALLOC(x,s) (NODE PCCRAP *)malloc(((x)*(s)))
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) free(x)
#define Free(x) free(x)
#else
#define MALLOC(x,s) (NODE PCCRAP *)malloc(((x)*(s)))
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) free(x)
#define Free(x) free(x)
#ifndef size_t       /* Doesn't work if size_t was typedef'd */
#define size_t int   /* And if an int isn't big enough we're in trouble! */
#endif
#ifdef PROTOTYPES
extern void *malloc(size_t bytes);
extern void exit(int rc);
#else
extern void *malloc();
extern void exit();
#endif
#endif /* not stdlibs */
#endif /* Not msdos */

#ifdef SYS_RISCOS
#undef EXT_CHAR
#define EXT_CHAR '-'
#endif

#ifdef SYS_EMAS
#undef EXT_CHAR
#define EXT_CHAR '#'
#endif

#ifdef SYS_MAC
#ifdef COMPILER_THINK
#undef WRITE_BIN
#undef READ_BIN
#define WRITE_BIN "w"
#define READ_BIN "r"
#endif
#endif

#ifdef MEMMODELS
#define SMALL_MEMORY 1
#endif

/*
                     Error handling

     At the moment we'll just go for OK / NOT OK.  Later we should
   fix all exit's to use a specific code symbol (eg EXIT_MALLOCFAIL)
   but this has to be done system by system.  Whenever a new one is
   added, a default should be set up (as 0?)
 */

/*#include <errno.h>*/        /* Only later when we want to be more precise! */
#define EXIT_OK       (0)     /* Generic success              */
#define EXIT_ERROR    (1)     /* Generaic failure             */

/* For each system, replace generic errors with system-dependant ones. */
#ifdef vms
/*
 * VMS-specific error status codes.  Approximate Ultrix equivalents.
 */
#include <ssdef.h>
#include <stsdef.h>
#undef  EXIT_OK
#define EXIT_OK     SS$_NORMAL     /* Generic success                  */
#undef  EXIT_ERROR
#define EXIT_ERROR  SS$_NORMAL     /* Don't want random error message! */
#endif

#define DELETE(filename)

#ifdef __STDC__
#undef DELETE
#define DELETE(filename) remove(filename)
#else
#ifdef unix
#undef DELETE
#define DELETE(filename) unlink(filename)
#endif
#endif

#ifdef NEVER

/* these macros caught the fact that on MICROSOFT, the parameters
   being passed in were ints -- and I hadn't been given a warning
   because I had explicitly cast the to size-t.  Hence why I've
   declared SIZE_T as long.  This is all a bit messy. Curse you IBM PCs
 */

void PCCRAP *my_halloc(long n,size_t s) {
char PCCRAP *p;
  p = halloc(n, s);
  fprintf(stderr, "halloc(%8lx*%d) -> %8lx\n", n, s, (long)p);
  return(p);
}

void *my_malloc(size_t s) {
char *p;
  p = malloc(s);
  fprintf(stderr, "malloc(%4x) -> %4x\n", s, (int)p);
  return(p);
}

void my_hfree(void PCCRAP *p) {
  fprintf(stderr, "hfree(%8lx)\n", (long)p);
  hfree(p);
}

void my_free(void *p) {
  fprintf(stderr, "free(%4x)\n", (int)p);
  free(p);
}
#endif


#ifdef RCHECK
#ifndef PROTOTYPES
long toosmall(idx, max, line)
long idx;
long max;
int line;
#else
long toosmall(long idx, long max, int line)
#endif
{
  if (line == 0) {
    fprintf(stderr,
      "RANGE CHECK: %ld should not be less than 0 (max %ld)\n", idx, max);
  } else {
    fprintf(stderr,
      "RANGE CHECK AT LINE %d: %ld should not be less than 0 (max %ld)\n",
      line, idx, max);
  }
  return(0L);
}
#ifndef PROTOTYPES
long toobig(idx, max, line)
long idx;
long max;
int line;
#else
long toobig(long idx, long max, int line)
#endif
{
  if (line == 0) {
    fprintf(stderr,
      "RANGE CHECK: %ld should not be larger than %ld\n", idx, max);
  } else {
    fprintf(stderr,
      "RANGE CHECK AT LINE %d: %ld should not be larger than %ld\n",
      line, idx, max);
  }
  return(max);
}

#ifdef KNOWS_LINE
#define TOOSMALL(idx, max) toosmall((long)idx, (long)max, __LINE__)
#define TOOBIG(idx, max) toobig((long)idx, (long)max, __LINE__)
#else
#define TOOSMALL(idx, max) toosmall((long)idx, (long)max, 0)
#define TOOBIG(idx, max) toobig((long)idx, (long)max, 0)
#endif

#define RANGECHECK(idx,max) \
  (idx < 0 ? (TOOSMALL(idx,max)) : (idx >= max ? (TOOBIG(idx,max)) : idx))
#else
#define RANGECHECK(idx,max) (idx)
#endif

#ifdef PROTOTYPES
long getwords(long PCCRAP *data, long count, FILE *fp)
#else
long getwords(data, count, fp)
long PCCRAP *data;
long count;
FILE *fp;
#endif
{
#ifdef SYS_MSDOS
char PCCRAP *p; int c; long byte_count;
  p = (char PCCRAP *)(&data[0]);
  /* Somewhere I have a version which fread()s into a 16K near
     buffer, then copies it out; use that technique here - *MUCH*
     faster */
  for (byte_count = 0; byte_count < count; byte_count++) {
    c = fgetc(fp);
    if (c == -1 || ferror(fp)) {
      printf ("Dictionary read error - %ld wanted - %ld got\n",
        count, byte_count)/*, exit(0)*/;
      break;
    }
    *p++ = (c & 255);
  }
  return(count);
#else
  return((long)fread(&data[0], (size_t)1, (size_t)count, fp));
#endif
}

#ifdef PROTOTYPES
long putwords(long PCCRAP *data, long count, FILE *fp)
#else
long putwords(data, count, fp)
long PCCRAP *data;
long count;
FILE *fp;
#endif
{
#ifdef SYS_MSDOS
extern int _NEAR _CDECL errno;
long i; char PCCRAP *p;
  p = (char PCCRAP *)&data[0];
  if (fp == NULL) {
    fprintf(stderr, "putwords: no file?\n");
    exit(0);
  }
  for (i = 0L; i < count; i++) {
  /* Somewhere I have a version which copies to a 16K near
     buffer, then frwrite()s it out; use that technique here - *MUCH*
     faster */
    int rc = fputc(*p++, fp);
    if (ferror(fp)) {
      fprintf(stderr, "rc = %d\n", rc);
      perror("dawg");
      fprintf (stderr, "Error writing to output file\n");
      exit(0);
    }
  }
  return(count);
#else
  return(fwrite(&data[0], (size_t)1, (size_t)count, fp));
#endif
}

static long PCCRAP *WTMP = NULL;
/* A bit hacky having this single 4 bytes in heap space, but it makes
   things a little more consistent.  it'll all go away eventually
   anyway... */

#ifdef PROTOTYPES
void putword(long w, FILE *fp)
#else
void putword(w, fp)
long w;
FILE *fp;
#endif
{
  if (WTMP == NULL) {
    WTMP = (long PCCRAP *)MALLOC(1,sizeof(long));
  }
  *WTMP = w;
  if (putwords(WTMP, (long)sizeof(long), fp) != sizeof(long)) {
    /* (using putwords avoids confusion over bytesex) */
    fprintf(stderr, "Data write error - putword\n");
  }
}

#ifdef PROTYPES
long getword(FILE *fp)
#else
long getword(fp)
FILE *fp;
#endif
{
  if (WTMP == NULL) {
    WTMP = (long PCCRAP *)MALLOC(1,sizeof(long));
  }
  if (getwords(WTMP, (long)sizeof(long), fp) != sizeof(long)) {
    fprintf(stderr, "Data read error - getword\n");
  }
  return(*WTMP);
}
SHAR_EOF

gtoal@tharr.UUCP (Graham Toal) (06/28/90)

#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar:	Shell Archiver
#	Run the following text with /bin/sh to create:
#	dawg.c #	dwgcheck.c #	pack.c #	pckcheck.c #	pdawg.c #	ppack.c #	proot.c #	sample.c #	tell.c 
cat - << \SHAR_EOF > dawg.c
/* The first three #define's are to repair mangling by BITNET mailers */
#define EOR ^
              /* This should be Caret (hat, up arrow) -- C Ex-or */
#define MOD %
              /* This should be Percent -- C Modulus             */
#define NOT ~
              /* This should be Tilde (twiddle) -- C unary Not   */
#include "copyr.h"
/*****************************************************************************/
/*                                                                           */
/*      FILE : DAWG.C                                                        */
/*                                                                           */
/*      Convert an alphabetically-ordered dictionary file in text format     */
/*      (one word per line) into a Directed Acyclic Word Graph. The command  */
/*      syntax is                                                            */
/*                                                                           */
/*              DAWG <text file (inc .ext)> <output file (no ext)>           */
/*                                                                           */
/*      The first 4 bytes of the output file form a 24-bit number containing */
/*      the number of edges in the dawg. The rest of the file contains the   */
/*      dawg itself (see "The World's Fastest Scrabble Program" by Appel     */
/*      and Jacobson for a description of a dawg).                           */
/*                                                                           */
/*****************************************************************************/

#include "grope.h"
#ifndef grope_h
/* If you don't have grope.h, create an empty one.
   These will do for a basic system: */
#undef KNOWS_VOID
#undef STDLIBS
#undef PROTOTYPES
#define SYS_ANY 1
#define COMPILER_ANY 1
#define SMALL_MEMORY 1
                       /*  To be defined if you have to generate
                           the data structure in bits. This will
                           certainly be true for any non-trivial
                           dictionary on MSDOS, or most home
                           micros with 1Mb Ram or under. */
#endif
/* Portability notes:

   0) KISS! Keep It Simple, Stupid!

   1) No typedef's
   2) No bitfields
   3) No structs
   4) No #if defined()
   5) No complex #if's
   6) No procedure variables
   7) Don't trust tolower() as some libs don't range check
   8) Stay character-set independent (EBCDIC should work?)
   9) Assume sizeof(long) >= 32 bits, but sizeof(int) just >= 16
  10) Note use of ABS in preference to unsigned longs.
  11) Assume 8-bit char set
  12) Don't use k&r reserved words for variables, even if not
      in ANSI.  Such as 'entry'.
  13) Unix people; no sys/ or machine/ files please unless under a
      suitable #ifdef, and generic code supplied for everyone else...
  14) this is byte-sex independant at the moment. KEEP IT THAT WAY!
      Don't assume dawgs are stored in a fixed form, such as so-called
      'net-order'.
  15) Since C doesn't range-check arrays, we'll do it. If possible,
      leave the running systems with range checking on if you can
      afford it!
  16) No nested include files.
  17) Don't pull in any include files twice. (*Kills* the Atari!)
  18) Don't use 'register' -- trust the compiler; the compiler is your friend.
 */
/*#define RCHECK 1*/      /* We want range-checking of array accesses */

#include <stdio.h>

#ifdef LIB_STRINGS
#include <strings.h>    /* Some unixes, Mac? */
#else
#include <string.h>
#endif

#ifdef SYS_MAC
/* To compile with THINK C 4.0, place the files dawg.h grope.h,
   dawg.c and utils.c in a folder.  Create a project that contains
   dawg.c and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
/* Assume a 1Mb mac for the moment. Someday I'll add dynamic RAM sizing */
#define SMALL_MEMORY 1
#endif

#include "dawg.h"   /* main system constants */

#include "utils.c"  /* utils.c pulls in header file for malloc/free & exit */

#define ABS(x) ((x) < 0 ? -(x) : (x))

/**
*  The following constant, HASH_TABLE_SIZE, can be changed according to
*  dictionary size.  The two values shown will be fine for both large
*  and small systems.  It MUST be prime.
**/

#ifdef SMALL_MEMORY
#define HASH_TABLE_SIZE 30011

#ifdef SYS_MAC
/* boy, you guys must be *really* tight on space...
   are you sure you can handle a reasonable size of dictionary with
   such a small table? Bump this back up as soon as you get everything
   else working... 
   (I was given this info by a Mac site while they were first porting
   this stuff; maybe now it works on macs we can put the buffer size
   back up to something reasonable)
 */
#undef HASH_TABLE_SIZE
#define HASH_TABLE_SIZE 17389
#endif

#else
/* pick one about 20% larger than needed */
#define HASH_TABLE_SIZE 240007

/* Suitable prime numbers if you want to tune it:
     30011
    150001  <-- probably a good compromise. OK for dicts to about 1Mb text
    200003
    220009
    240007

   If you try any others, for goodness sake CHECK THAT THEY ARE PRIME!!!

 */
#endif
#define MAX_EDGES (HASH_TABLE_SIZE-1)

static FILE *fpin, *fpout;     /* Input/output files */
static char current_word[(MAX_LINE+1)];  /* The last word read from fpin */
static int first_diff, save_first_diff;
                                /* The position of the first letter at which */
                                /* current_word differs from the previous    */
                                /* word read                                 */
static NODE PCCRAP *hash_table;
static int first_time = TRUE;
static int this_char = '?', lastch;
static NODE PCCRAP *dawg;

static int FILE_ENDED = FALSE;  /* I'm having real problems getting
 merged dawgs to work on the PC; this is yet another filthy hack. Sorry. */

static INDEX nwords, nnodes, total_edges;

#ifdef SMALL_MEMORY
#define MAX_SUBDAWG 30000 /* Big enough for largest dict,
                             even on small systems */
static NODE PCCRAP *merge;

static INDEX dawgsize[256];
static INDEX dawgstart[256];
static INDEX dawgentry[256];

static int nextslot;  /* Dawgs are packed in sequentially - not in their
                         'proper' indexed position */
#endif
/*
                Shorthand macros for array accesses with checking
     If RCHECK isn't defined, these don't contribute any overhead. I suggest
     leaving RCHECK on except for production code.
 */

#define EDGE_LIST(idx) \
  edge_list[RANGECHECK(idx, MAX_CHARS)]
#define CURRENT_WORD(idx) \
  current_word[RANGECHECK(idx, MAX_LINE+1)]
#define DAWG(idx) \
  dawg[RANGECHECK(idx, MAX_EDGES)]
#define HASH_TABLE(idx) \
  hash_table[RANGECHECK(idx, HASH_TABLE_SIZE)]

/*
                Forward references
 */

#ifdef PROTOTYPES
static INDEX build_node(int depth);
static INDEX add_node(NODE  *edge_list, INDEX nedges);
static void read_next_word(VOID);
static INDEX compute_hashcode(NODE *edge_list, INDEX nedges);
static void report_size(VOID);
#else
static INDEX build_node();
static INDEX add_node();
static void read_next_word();
static INDEX compute_hashcode();
static void report_size();
#endif

#ifdef SMALL_MEMORY

#ifdef PROTOTYPES
static void merge_dawg (char *filename);
#else
static void merge_dawg ();
#endif

#endif

/**
*       main
*
*       Program entry point
**/

long words; /* dirty communication variable -- the multi-pass stuff
               was hacked in at the last minute. */

int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
  INDEX i;
  char fname[128];

#ifdef SYS_MAC
  argc = ccommand(&argv);
#endif

  if (argc != 3) {
    fprintf(stderr,
      "usage: dawg dictfile.ext dawgfile\n");
    exit(EXIT_ERROR);
  }

  /**
  *  Allocate the memory for the dawg
  **/

  if ((dawg = MALLOC(MAX_EDGES, sizeof(NODE PCCRAP *))) == NULL) {
    fprintf(stderr, "Can\'t allocate dictionary memory\n");
#ifndef SMALL_MEMORY
    fprintf(stderr, "-- try recompiling with -DSMALL_MEMORY\n");
#endif
    exit(EXIT_ERROR);
  }
  for (i = 0; i < (INDEX)MAX_EDGES; i++) dawg[i] = 0L;
  /**
  *  Allocate the hash table.
  *  Fill it with zeroes later just before use.  Don't trust calloc etc.
  *  - not portable enough.  Anyway, in the multi-pass version we don't
  *  want to continually free/claim...
  **/

  if ((hash_table =
      MALLOC((HASH_TABLE_SIZE+1), sizeof(NODE))) == NULL) {
    fprintf(stderr, "Can\'t allocate memory for hash table\n");
    exit(EXIT_ERROR);
  }
  /**
  *  Open the input/output files
  **/

  fpin = fopen(argv[1], READ_TEXT);
  if (fpin == NULL) {
    fprintf(stderr, "Can\'t open text file \"%s\"\n", argv[1]);
    /* Could print out error string but it's not portable... */
    exit(EXIT_ERROR);
  }

  /**
  *  Read the first word from the dictionary
  **/

  first_time = TRUE;
  nwords = 0;
  current_word[0] = 0;
  read_next_word();
  lastch = *current_word;
  /**
  *  Initialise the counters, taking account of the root node (which is
  *  a special case)
  **/

  nnodes = 1; total_edges = MAX_CHARS;

  /**
  *  Build the dawg and report the outcome
  **/

  /* Now, in the dim & distant past, this code supported the concept
     of a restricted character set - ie a..z & A..Z were packed into 6 bits;
     this caused awful problems in the loop below, where we had to try to
     keep the loop-control variable and the character code in synch; nowadays
     chars are 8 bits or else, so I'm starting to tidy up the places
     where these hacks were necessary... */
     
#ifdef SMALL_MEMORY
  for (this_char = 0; this_char < MAX_CHARS; this_char++) {
  if (FILE_ENDED) break; /* Don't waste time in this loop... */
#endif
  /* Explicitly initialise hash table to all zeros */
  {INDEX a; for (a = 0; a <= HASH_TABLE_SIZE; a++) hash_table[a] = 0;}
  words = 0;
  (void) build_node(0);
#ifdef SMALL_MEMORY
#ifdef DEBUG
  fprintf(stderr,
    "Char %d done. %ld Words, %ld Nodes\n", *current_word, words, total_edges);
#endif
  if (total_edges /* words */ == 0) continue;
#endif

  /**
  *  Save the dawg to file
  **/
#ifdef SMALL_MEMORY
  sprintf(fname, "%s%c%d", argv[2], EXT_CHAR, lastch);
#else
  sprintf(fname, "%s%cdwg", argv[2], EXT_CHAR);
#endif
  fpout = fopen(fname, WRITE_BIN);
  if (fpout == NULL) {
    fprintf(stderr, "Can\'t open output file \"%s\"\n", fname);
    exit(EXIT_ERROR);
  }
#ifdef DEBUG
  fprintf(stderr, "Writing to %s\n", fname);
#endif

  *dawg = total_edges;
  total_edges = sizeof(NODE) * (total_edges + 1); /* Convert to byte count */

  {
    long cnt;
    if ((cnt = putwords(dawg, (long)total_edges, fpout)) != total_edges) {
      fprintf(stderr, "%ld bytes written instead of %ld\n.", cnt, total_edges);
      exit(EXIT_ERROR);
    }
  }
  fclose(fpout);

  /**
  *  Read the first word from the dictionary
  **/

  first_diff = save_first_diff;
  first_time = FALSE;
  nwords = 0;
  /**
  *  Initialise the counters, taking account of the root node (which is
  *  a special case)
  **/

  nnodes = 1; total_edges = MAX_CHARS;

  lastch = *current_word;
  /**
  *  Build the dawg and report the outcome
  **/

#ifdef SMALL_MEMORY
  }
#endif
  fclose(fpin);
  fprintf(stderr, "Dawg generated\n");
#ifdef SMALL_MEMORY
  merge_dawg(argv[2]);
#endif
  exit(EXIT_OK);
}

/**
*       BUILD_NODE
*
*       Recursively build the next node and all its sub-nodes
**/

static INDEX
#ifdef PROTOTYPES
build_node(int depth)
#else
build_node(depth)
int depth;
#endif
{
  INDEX nedges = 0;
  INDEX i;
  NODE *edge_list;

  edge_list = NULL;
  if (CURRENT_WORD(depth) == 0) {
    /**
    *  End of word reached. If the next word isn't a continuation of the
    *  current one, then we've reached the bottom of the recursion tree.
    **/

    read_next_word();
    if (first_diff < depth) return(0);
  }

  edge_list = (NODE *)malloc(MAX_CHARS*sizeof(NODE));
                     /* Note this is a 'near' array */
  if (edge_list == NULL) {
    fprintf(stderr, "Stack full (depth %d)\n", depth);
    exit(EXIT_ERROR);
  }
  for (i = 0; i < MAX_CHARS; i++) EDGE_LIST(i) = 0L;

  /**
  *  Loop through all the sub-nodes until a word is read which can't
  *  be reached via this node
  **/

  do
    {
    /* Construct the edge. Letter.... */
    EDGE_LIST(nedges) = (NODE) (((NODE)CURRENT_WORD(depth)))
                        << (NODE)V_LETTER;
    /* ....end-of-word flag.... */
    if (CURRENT_WORD(depth+1L) == 0) EDGE_LIST(nedges) |= M_END_OF_WORD;
    /* ....and node pointer. */
    EDGE_LIST(nedges) |= build_node(depth+1); nedges++;
                                                   /* (don't ++ in a macro) */
    } while (first_diff == depth);

  if (first_diff > depth) {
    fprintf(stderr, "Internal error -- first_diff = %d, depth = %d\n",
      first_diff, depth);
    exit(EXIT_ERROR);
  }

  EDGE_LIST(nedges-1) |= M_END_OF_NODE;
                                  /* Flag the last edge in the node */

  /**
  *  Add the node to the dawg
  **/

  if (depth) {
    NODE result = add_node(edge_list, nedges);
    free(edge_list);
    return(result);
  }

  /**
  *  depth is zero, so the root node (as a special case) goes at the start
  **/

  edge_list[MAX_CHARS-1] |= M_END_OF_NODE;      /* For backward searches */
  for (i = 0; i < MAX_CHARS; i++)
    {
    dawg[i+1] = edge_list[i];
    }
  free(edge_list);
  return(0);
}

/**
*       ADD_NODE
*
*       Add a node to the dawg if it isn't already there. A hash table is
*       used to speed up the search for an identical node.
**/

static INDEX
#ifdef PROTOTYPES
add_node(NODE *edge_list, INDEX nedges)
#else
add_node(edge_list, nedges)
NODE *edge_list;
INDEX nedges;
#endif
{
  NODE hash_entry;
  INDEX inc;
  INDEX a, first_a;
  INDEX i;

  /**
  *  Look for an identical node. A quadratic probing algorithm is used
  *  to traverse the hash table.
  **/

  first_a = compute_hashcode(edge_list, nedges);
  first_a = ABS(first_a) MOD HASH_TABLE_SIZE;
  a = first_a;
  inc = 9;

  for (;;)
    {
    hash_entry = HASH_TABLE(a) & M_NODE_POINTER;

    if (hash_entry == 0) break;   /* Node not found, so add it to the dawg */

    for (i = 0; i < nedges; i++)
      if (DAWG((hash_entry+i) MOD HASH_TABLE_SIZE) != EDGE_LIST(i)) break;

/* On the 1.6M dictionary, this gave a rangecheck with < 0. Now fixed
   I think - it was a problem with MOD giving unexpected results. */

      if (i == nedges) {
        return(hash_entry);        /* Node found */
      }
      /**
      *  Node not found here. Look in the next spot
      **/

      a += inc;
      inc += 8;
      if (a >= HASH_TABLE_SIZE) a -= HASH_TABLE_SIZE;
      if (inc >= HASH_TABLE_SIZE) inc -= HASH_TABLE_SIZE;
      if (a == first_a) {
        fprintf(stderr, "Hash table full\n");
        exit(EXIT_ERROR);
      }
    }

  /**
  *  Add the node to the dawg
  **/

  if (total_edges + nedges >= MAX_EDGES) {
    fprintf(stderr,
      "Error -- dictionary full - total edges = %ld\n", total_edges);
    exit(EXIT_ERROR);
  }

  HASH_TABLE(a) |= total_edges + 1;
  nnodes++;
  for (i = 0; i < nedges; i++) {
    DAWG((total_edges + 1 + i) MOD HASH_TABLE_SIZE) = EDGE_LIST(i);
  }
  total_edges += nedges;
  return(total_edges - nedges + 1);
}

/**
*       READ_NEXT_WORD
*
*       Read the next word from the input file, setting first_diff accordingly
**/

static void
#ifdef PROTOTYPES
read_next_word(VOID)
#else
read_next_word()
#endif
{
  /* This stuff imposes the limitation of not allowing '\0' in a word;
     not yet a problem but the dawg structure itself could probably cope
     if the feature were wanted. (Maybe with a little teweaking)       */
  char linebuff[(MAX_LINE+1)];
  int length;
  for (;;)
    {
    int next = 0, c;
    for (;;) {
      c = fgetc(fpin);
      if (FILE_ENDED || c == EOF || ferror(fpin) || feof(fpin)) {
        /* for some reason, we always get a blank line at the end of files */
        current_word[0] = 0;
        first_diff = -1;
        linebuff[next] = '\0';
        FILE_ENDED = TRUE;
        return;
      }
      c &= 255;
      if (next == 0 && c == '\n') continue; /* skip blank lines... */
      linebuff[next++] = c;
      if (c == '\n') break;
    }
    linebuff[next] = '\0';

    words++;

    length = strlen(linebuff);

    if (linebuff[length-1] == '\n') linebuff[length-1] = '\0';
    if (linebuff[length] == '\n') linebuff[length] = '\0';

    if (length < 2 || length > MAX_LINE-1)
      {
      fprintf(stderr, "\n%s - invalid length\n", linebuff);
      continue;    /* Previously exit()ed, but now ignore so that
                      test sites without my pddict can use /usr/dict/words */
      }
    break;
    }
  for (length = 0; current_word[length] == linebuff[length]; length++) {
    /* Get common part of string to check order */
  }
  if (current_word[length] > linebuff[length]) {
    fprintf(stderr, "Error -- %s (word out of sequence)\n", linebuff);
    exit(EXIT_ERROR);
  }
  first_diff = length;

  nwords++;
  strcpy(current_word, linebuff);

  if ((nwords > 1) && (!(nwords & 0x3FF))) report_size();
#ifdef SMALL_MEMORY
  if (current_word[0] != lastch) {
    save_first_diff = first_diff;
    first_diff = -1;
    report_size();
  }
#else
  this_char = current_word[0]; /* for diagnostics... */
#endif
}
/**
*       COMPUTE_HASHCODE
*
*       Compute the hash code for a node
**/

static INDEX
#ifdef PROTOTYPES
compute_hashcode(NODE *edge_list, INDEX nedges)
#else
compute_hashcode(edge_list, nedges)
NODE *edge_list;
INDEX nedges;
#endif
{
  INDEX i;
  INDEX res = 0L;

  for (i = 0; i < nedges; i++)
    res = ((res << 1) | (res >> 31)) EOR EDGE_LIST(i);
  return(res);
}

/**
*       REPORT_SIZE
*
*       Report the current size etc
**/

static void
#ifdef PROTOTYPES
report_size(VOID)
#else
report_size()
#endif
{

  if (first_time)
    {
    fprintf(stderr, "      Words    Nodes    Edges    Bytes    BPW\n");
    fprintf(stderr, "      -----    -----    -----    -----    ---\n");
    first_time = FALSE;
    }
  if (*current_word) fprintf(stderr, "%c:", *current_word);
  else fprintf(stderr, "Total:");
  
  /* THE FOLLOWING IS RATHER GRATUITOUS USE OF FLOATING POINT - REMOVE
     IT AND REPLACE WITH INTEGER ARITHMETIC * 100 */

  /* (hey - I already did this in the copy I sent to Richard; so how
      come its missing? Oh no, not again: I've got out of synch and
      used an old copy, haven't I? :-(   ) */ 

  fprintf(stderr, "  %7ld  %7ld  %7ld  %7ld  %5.2f\n",
    nwords, nnodes, total_edges, sizeof(NODE PCCRAP *)*(total_edges+1),
      (float)(sizeof(NODE PCCRAP *)*(total_edges+1))/(float)nwords);
}

#ifdef SMALL_MEMORY
static void
#ifdef PROTOTYPES
merge_dawg (char *filename)
#else
merge_dawg (filename)
char *filename;
#endif
{
  FILE *fp, *outfile;
  NODE data, edge;
  INDEX nedges, nextfree, i, dentry;
  INDEX count, lastnode;
  int dictch, padding;
  char fname[128];

  nedges = (INDEX)MAX_SUBDAWG;
  if ((merge = MALLOC((SIZE_T)nedges, sizeof(INDEX))) == 0) {
     fprintf(stderr, "Memory allocation error -- %ld wanted\n",
       nedges*sizeof(INDEX));
     exit(EXIT_ERROR);
  }

  nextfree = MAX_CHARS; /* 0 is special, 1..26 are root nodes for a..z */
  nextslot = 0;
  for (dictch = 0; dictch < MAX_CHARS; dictch++) {
    /***
    *   Open the file and find out the size of the dawg
    ***/
    sprintf(fname, "%s%c%d", filename, EXT_CHAR, dictch);
    if ((fp = fopen(fname, READ_BIN)) == NULL) {
      continue;
    }
    nedges = getword(fp);
    dawgstart[nextslot] = nextfree;
    dawgsize[nextslot] = nedges-MAX_CHARS;

    /* the first entry is (erroneously) the pointer to the chunk */
    dentry = getword(fp);
    dawgentry[nextslot] = dentry - MAX_CHARS + nextfree;

    nextfree += nedges - MAX_CHARS;
    nextslot++;

    fclose(fp);
  }

/* Now output total edges, and starts[a..z] */
/* Then set nextfree to start of each block  */
  sprintf(fname, "%s%cdwg", filename, EXT_CHAR);
  outfile = fopen(fname, WRITE_BIN);
  if (outfile == NULL) {
    fprintf(stderr, "Cannot open output dawg file %s\n", fname);
    exit(EXIT_ERROR);
  }
  putword(nextfree, outfile);
  nextfree = 1; nextslot = 0; padding = 0;
  lastnode = MAX_CHARS-1;
  for (;;) {
    if (dawgentry[lastnode] != 0) break; /* Leave with 'lastnode' set */
    lastnode -= 1;
  }
  for (dictch = 0; dictch < MAX_CHARS; dictch++) {
    NODE edge = dawgentry[nextslot];
    if (edge == 0) {
      padding++;
      continue;
    }
    if (dictch == lastnode) {
      edge |= M_END_OF_NODE;
    } else {
      edge &= (NOT M_END_OF_NODE);
    }
    putword(edge, outfile);
    nextfree++; nextslot++;
  }
  nextfree += padding;
  while (padding > 0) {
    putword(0L, outfile); padding -= 1;
  }
  /* nextslot = 0; ???? This one not used? */
  for (dictch = 0; dictch < MAX_CHARS; dictch++) {
    sprintf(fname, "%s%c%d", filename, EXT_CHAR, dictch);
    if ((fp = fopen(fname, READ_BIN)) == NULL) {
      continue;
    }

    nedges = getword(fp);

    for (i = 0; i < MAX_CHARS; i++) {
      (void) getword(fp);
      /* Skip root pointers */
    }

    nedges -= MAX_CHARS;
    count = getwords(&merge[1], (long)(nedges*sizeof(NODE)), fp);
    if (count != nedges*sizeof(NODE)) {
      fprintf(stderr, "Dictionary read error - %ld wanted - %ld got\n",
        nedges*sizeof(NODE), count);
      exit(EXIT_ERROR);
    }
    fclose(fp);

    DELETE(fname);    /* On floppy systems, we can almost get away with
                         little more space than the final files would take! */

    /* relocate all nodes */
    for (i = 1; i <= nedges; i++) {
      data = merge[i];
      edge = data & M_NODE_POINTER;
      if (edge > MAX_CHARS) {
        data &= (NOT M_NODE_POINTER);
        data |= edge - MAX_CHARS - 1 + nextfree;
        merge[i] = data;
      }
      putword(merge[i], outfile);
    }
    nextfree += nedges;
    /*  nextslot++;   this one not used??? */
  }
  fclose(outfile);
  FREE(merge);
}
#endif
SHAR_EOF
cat - << \SHAR_EOF > dwgcheck.c
/*

    File:          dwgcheck.c
    Author:        Graham Toal
    Purpose:       check correct spelling using dict.dwg
    Creation date: 22/06/90
    Lastedit:      23/06/90 01:10:12

    Description:

       This can be expanded to be like the unix 'spell' command.
    This demo file simply checks words passed on the command line.
    note how it remembers words so that the second time it sees
    them, it will call them correct.  This is how you should
    implement the 'ignore' feature of an interactive checker.

    This code uses the dawg data structure, which I recommend
    you stick with; however for *extremely* fast checking in
    special circumstances you can use the 'pack' versions of
    check() et al.

*/


/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"

/*#define RCHECK*/     /* Enable for internal range checks... */

#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* eg, for isalpha() */

/* Spelling library utilities */
#include "init.c"      /* Loading dicts */
#include "dyntrie.c"   /* Creating dicts at run-time */
#include "print.c"     /* Printing dicts */
#include "check.c"     /* Checking words */
#include "locate.c"    /* Finding words by their stems */

#include "soundex.c"   /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c"  /* Closeness-metric for correction */
#include "correct.c"   /* Hack code to attempt error correction (uses above) */

/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
   files in a folder.  Then create a project which contains this main.c
   and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif

/* Your own header files go here, for instance:
               #include "readword.h"
 */



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;      /* precompiled dictionary (dict.dwg) */
INDEX edges;            /* size of above */

NODE PCCRAP *userdict;  /* run-time dictionary, added to dynamically */

char *word;
int each;

#ifdef SYS_MAC
  argc = ccommand(&argv);  /* Mac users have my sympathy */
#endif

  /* Your program goes here... */
  if (argc == 1) {
    fprintf(stderr, "usage: %s possibly mispeled wurdz\n", argv[0]);
    exit(EXIT_ERROR);
  }
  if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);

  if (!trie_new(&userdict)) exit(EXIT_ERROR);

  for (each = 1; each < argc; each++) {
    word = argv[each];
    fprintf(stderr, "* %s:\n", word);
    if (dawg_check(dawg, word) || dawg_check(userdict, word)) {
      fprintf(stderr, "Correct\n");
    } else {
      fprintf(stderr, "Wrong\n");
      (void)trie_add(&userdict, word);
    }
    if (each+1 != argc) fprintf(stderr, "\n");
  }

  exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > pack.c
/* The line below is to protect against Bitnet mailer damage */
#define MOD %
             /* This should be a Percent (for C Modulus) */ 

/* Blech. This file is a mess. Make it the first rewrite... */
#include "copyr.h"
/*****************************************************************************/
/*                                                                           */
/*      FILE : PACK.C                                                        */
/*                                                                           */
/*      Re-pack a dawg/trie data structure using Franklin Liang's            */
/*      algorithm for sparse matrix storage.  Final structure allows         */
/*      direct indexing into trie nodes, so is exceedingly fast at checking. */
/*      Unfortunately the trade off is that any algorithms which walk        */
/*      the data structure and generate words will take much longer          */
/*                                                                           */
/*              PACK <dawg file (inc .ext)> <pack file (inc .ext)>           */
/*                                                                           */
/*****************************************************************************/

/* Pending:

   see what closest gap between dawgptr + freenode is, and see whether we
   can save space by overlapping input & output arrays with a window between
   them.  Should get almost 50% of memory back.  Also, because of hacking
   around a bug some time back, I'm using an extra array (new_loc) for
   relocation of pointers, when in fact I could (and have in the past)
   managed to relocate them in situ with not much extra overhead.
      As I said, it needs a rewrite...

 */

/* Note: I tried one implementation of this which used bitsets to test
   whether two nodes were compatible.  In fact, it wasn't sufficiently
   faster to justify the extra space it used for the arrays of flags.
   Now I check for compatibility on the fly with lots of comparisons.
   I do however have a seperate byte array to flag whether a trie
   is based at any address.  There's probably a way of removing this.
*/

#include "grope.h"
#ifndef grope_h
/* If you don't have grope.h, create an empty one.
   These will do for a basic system: */
#undef KNOWS_VOID
#undef STDLIBS
#undef PROTOTYPES
#define SYS_ANY 1
#define COMPILER_ANY 1
#define SMALL_MEMORY 1 /*  To be defined if you have to generate
                           the data structure in bits. This will
                           certainly be true for any non-trivial
                           dictionary on MSDOS, or most home
                           micros with 1Mb Ram or under. */
#endif

#include <stdio.h>

/*#define RCHECK*/  /* Turn this back on if you have any problems. */

#include "dawg.h"
#include "utils.c"
#include "init.c"
#include "print.c"

#ifdef SYS_MAC
/* To compile with THINK C 4.0, place the files dawg.h grope.h,
   pack.c and utils.c in a folder.  Create a project that contains
   pack.c and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
/* Assume a 1Mb mac for the moment. Someday I'll add dynamic RAM sizing */
#define SMALL_MEMORY
#endif

#define TRUE (0==0)
#define FALSE (0!=0)
#define MAX_WORD_LENGTH 32

/* These two magic numbers alter a very hacky heuristic employed in
   the packing algorithm.  Tweaking them judiciously ought to speed
   it up significantly (at the expense of a slightly sparser packing */
#define DENSE_LOWER 100
#define DENSE_UPPER 200

/*###########################################################################*/
INDEX ptrie_size = 0;
static NODE PCCRAP *ptrie;
#ifdef RCHECK
/* can't use the standard range_check macro because these are slightly
   non-standard. */
#define PTRIE(x) ptrie[RANGECHECK((x), ptrie_size)]
#define DATA(x) (((x) >> 12) == 0xf2f1 ? toosmall((x), 0, __LINE__) : (x))
#else
/* so supply an explicit base case */
#define PTRIE(x) ptrie[x]
#define DATA(x) (x)
#endif
static char PCCRAP *trie_at;  /* save some time by caching this info --
                          previously it was a function called on each test */
static INDEX freenode, lowest_base = 1, highest_base = -1;
static int debug = FALSE;

int
#ifdef PROTOTYPES
compatible(NODE *node, INDEX i) /* Can a node be overlaid here? */
#else
compatible(node, i)
NODE *node;
INDEX i;
/* Can a node be overlaid here? */
#endif
{
int c;
 for (c = 0; c < MAX_CHARS; c++) {
   if ((node[c]&M_FREE) == 0) { /* Something to slot in... */
    if ((PTRIE(i+c) & M_FREE) == 0) return(FALSE); /* but no slot for it */
   }
 }
 return(TRUE);
}

INDEX
#ifdef PROTOTYPES
find_slot(NODE *node)
#else
find_slot(node)
NODE *node;
#endif
{               /* Try each position until we can overlay this node */
INDEX i;
  for (i = lowest_base; i < freenode; i++) {
    if ((!trie_at[i]) && (compatible(node, i))) {
       /* nothing here already */
                         /* and this one will fit */
      return(i);
    }
  }
  fprintf(stderr, "Should never fail to find a slot!\n");
                     /* because the output array is larger than the input... */
  exit(5);
  /* NOT REACHED */
  return(0);
}

static int changes;

INDEX
#ifdef PROTOTYPES
pack(NODE *node)
#else
pack(node)
NODE *node;
#endif
{
int c;
INDEX slot;
NODE value;

  slot = find_slot(node);  /* slot is also returned as the result,
                              to be used in relocation */
  /* Place node at slot */
  changes = 0;
  for (c = 0; c < MAX_CHARS; c++) {
    value = node[c];
    if ((value&M_FREE) == 0) { /* Something to fit? */
      if (((value>>V_LETTER)&M_LETTER) == (INDEX)c+BASE_CHAR) {
                  /* Just a consistency check - could safely be removed */
        PTRIE(slot+c) = DATA(value);
        changes++;
      }
    }
  }
  /* The hack heuristics below keep a N^2 operation down to linear time! */
  if ((slot == lowest_base) || (slot >= lowest_base+DENSE_LOWER)) {
    /* Heuristic is: we increase the initial search position if the
       array is packed solid up to this point or we're finding it *very*
       hard to find suitable slots */
    int found = FALSE;
    for (;;) {
      INDEX c;
      while (trie_at[lowest_base]) lowest_base++;
      for (c = lowest_base; c < lowest_base+MAX_CHARS; c++) {
        if ((PTRIE(c)&M_FREE) != 0) {found = TRUE; break;}
      }
      if (found && (slot < lowest_base+DENSE_UPPER)) break;
                  /* ^ Skip hard slots to fill */
      lowest_base++; /* although nothing is based here, the next N slots
                      are filled with data from the last N tries. */

      /* Actually I'm not sure if 256 sequential trees each with the
         same element used would actually block the next 256 slots
         without a trie_at[] flag being set for them.  However it
         does no harm to believe this... I must formally check this
         someday once all the other stuff is in order. */
    }
  }

  if (slot > highest_base) highest_base = slot;
     /* This is info for when we try to overlap input & output
        arrays, -- with the output sitting some large number of
        bytes lower down than the input. */
  trie_at[slot] = TRUE;
  return(slot);
}
/*###########################################################################*/

static NODE PCCRAP *dawg;
static INDEX PCCRAP *new_loc;
static INDEX nedges;

NODE this_node[MAX_CHARS];

INDEX
#ifdef PROTOTYPES
take_node(INDEX ptr)
#else
take_node(ptr)
INDEX ptr;
#endif
{
NODE data;
INDEX edge;
int let;
int endsword;
int endsnode;
char ch;
int changes = 0;
  for (let = 0; let < MAX_CHARS; let++) this_node[let] = M_FREE;
  for (;;) {
    data = dawg[ptr++];
    if (data == 0) {
      return(-1);
    } else {
      endsword = ((data & M_END_OF_WORD) != 0);
      endsnode = ((data & M_END_OF_NODE) != 0);
      edge = data & M_NODE_POINTER;
      let = (int) ((data >> V_LETTER) & M_LETTER);
      ch = let + BASE_CHAR;
  
      this_node[let] = edge & M_NODE_POINTER;
      if (endsword) this_node[let] |= M_END_OF_WORD;
      this_node[let] |= (NODE)ch<<V_LETTER;
  
      changes++;
      if (endsnode) break;
    }
  }
  if (changes != 0) {
    return(ptr);
  } else {
    fprintf(stderr, "Fatal error\n");
    return(0);
  }
}

NODE
#ifdef PROTOTYPES
redirect_node(NODE ptr)
#else
redirect_node(ptr)
NODE ptr;
#endif
{
NODE data;
INDEX edge;
int endsword;
char ch;
  data = DATA(PTRIE(ptr));

  endsword = ((data & M_END_OF_WORD) != 0);
  edge = data & M_NODE_POINTER;
  ch = (int) ((data >> V_LETTER) & M_LETTER);

  /*edge = dawg[edge] & M_NODE_POINTER;*/
  edge = new_loc[edge];

  ptr = edge;
  ptr |= (NODE)ch<<V_LETTER;
  if (endsword) ptr |= M_END_OF_WORD;

  return(ptr);
}

int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
INDEX dawgptr = 1, trieptr, newdawgptr, i, nodes = 0;
FILE *triefile;

#ifdef SYS_MAC
  argc = ccommand(&argv);
#endif

  if (argc != 3) {
    fprintf(stderr, "pack: wrong parameters - should be infile outfile\n");
    exit(EXIT_ERROR);
  }

  if (!dawg_init((argc >= 2 ? argv[1]: ""), &dawg, &nedges)) exit(EXIT_ERROR);
  if (debug) dawg_print(dawg, (INDEX)ROOT_NODE); /* assume small test file! */

  freenode = ((nedges*16)/15)+(4*MAX_CHARS);
                               /* Minimal slop for pathological packing? */
  ptrie_size = freenode;
  ptrie = MALLOC((SIZE_T)freenode, sizeof(NODE));
  if (ptrie == NULL) {
    fprintf(stderr, "Cannot claim %ld bytes for ptrie[]\n",
            sizeof(NODE)*freenode);
    exit(EXIT_ERROR);
  }
  new_loc = MALLOC((SIZE_T)freenode, sizeof(NODE));
  if (new_loc == NULL) {
    fprintf(stderr, "Cannot claim %ld bytes for new_loc[]\n",
            sizeof(NODE)*freenode);
    exit(EXIT_ERROR);
  }
  trie_at = (char PCCRAP *)MALLOC((SIZE_T)freenode, sizeof(char));
  if (trie_at == NULL) {
    fprintf(stderr, "Cannot claim %ld bytes for trie_at[]\n", freenode);
    exit(EXIT_ERROR);
  }
  for (i = 0; i < freenode; i++) {
    ptrie[i] = M_FREE; new_loc[i] = 0; trie_at[i] = FALSE;
  }

  dawg[0] = 0; /* 1st entry is never looked at, and maps to itself */

  dawgptr = 1;

/* Relocate initial node at 1 seperately */

    newdawgptr = take_node(dawgptr /* 1 */);
    trieptr = pack(this_node);
    /*dawg[dawgptr] = trieptr;*/
      /* What the hell does this do??? I've forgotten!!! - oh yes, this
         was relocating in situ, without new_loc... */
    new_loc[dawgptr] = trieptr;
    dawgptr = MAX_CHARS+1;

{INDEX maxdiff = 0, diff;
  for (;;) {
    if (highest_base > dawgptr) {
      diff = highest_base - dawgptr;
      if (diff > maxdiff) maxdiff = diff;
    }
    newdawgptr = take_node(dawgptr);
    if (newdawgptr == -1) {
      dawgptr++;
      continue;
    }
    trieptr = pack(this_node);
    /*dawg[dawgptr] = trieptr;*/
    new_loc[dawgptr] = trieptr;
    dawgptr = newdawgptr;
    if (dawgptr > nedges) {
      break;  /* AHA!!! I was doing this in the
                 wrong order & lost last entry! *AND* had '>=' for '>' */
    }
    nodes++;
    if ((nodes MOD 1000) == 0) fprintf(stderr, "%ld nodes\n", nodes);
  }
  if (debug) fprintf(stderr, "wavefront gets up to %ld\n", maxdiff);
}
  if (debug) fprintf(stderr, "All packed - used %ld nodes\n", highest_base);
  for (trieptr = 1; trieptr < freenode; trieptr++) {
    /*
      extract edge from ptrie[trieptr],
      look it up via dawg[edge], slot it back in
      (while preserving other bits)
     */
    PTRIE(trieptr) = redirect_node(trieptr);
  }
  /* Finally, remember to bump up highest_base in case last node is only
     one edge and 25 others are missing! */
  if (debug) fprintf(stderr, "Redirected...\n");

  triefile = fopen((argc >= 3 ? argv[2] : DEFAULT_PACK), WRITE_BIN);
  if (triefile == NULL) {
    fprintf(stderr, "Cannot write to packed trie file\n");
    exit(1);
  }
  if (debug) fprintf(stderr, "File opened...\n");

  ptrie[0] = highest_base+MAX_CHARS-1;/* File size in words
                                     (-1 because doesn't include itself)  */
  if (debug) fprintf(stderr, "Dumping... (0..%ld)\n", highest_base+MAX_CHARS-1);
  for (i = 0; i < highest_base+MAX_CHARS; i++) {/* dump packed DAG */
  NODE n;
    n = DATA(PTRIE(i));
    putword(n, triefile);
    if (ferror(triefile)) {
      fprintf(stderr, "*** TROUBLE: Writing DAG -- disk full?\n");
      fclose(triefile);
      exit(1);
    }
  }
  if (debug) fprintf(stderr, "Dumped...\n");
  fclose(triefile);
  if (debug) fprintf(stderr, "Done.\n");
}
SHAR_EOF
cat - << \SHAR_EOF > pckcheck.c
/*

    File:          pckcheck.c
    Author:        Graham Toal
    Purpose:       check correct spelling using dict.pck
    Creation date: 22/06/90
    Lastedit:      23/06/90 01:15:39

    Description:

       This can be expanded to be like the unix 'spell' command.
    This demo file simply checks words passed on the command line.
    note how it remembers words so that the second time it sees
    them, it will call them correct.  This is how you should
    implement the 'ignore' feature of an interactive checker.

    This version used the fast 'packed' data structure.  This is
    approximately 3 times faster, but not all utilities support
    the packed versions.  Also utilities which walk the trie are
    considerably slower (say by a factor of 100) -- so chose when
    to used 'dawg' and when to use 'pack'ed versions carefully!

*/


/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"

/*#define RCHECK*/     /* Enable for internal range checks... */

#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* eg, for isalpha() */

/* Spelling library utilities */
#include "init.c"      /* Loading dicts */
#include "dyntrie.c"   /* Creating dicts at run-time */
#include "print.c"     /* Printing dicts */
#include "check.c"     /* Checking words */
#include "locate.c"    /* Finding words by their stems */

#include "soundex.c"   /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c"  /* Closeness-metric for correction */
#include "correct.c"   /* Hack code to attempt error correction (uses above) */

/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
   files in a folder.  Then create a project which contains this main.c
   and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif

/* Your own header files go here, for instance:
               #include "readword.h"
 */



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *ptrie;     /* precompiled dictionary (dict.pck) */
INDEX edges;            /* size of above */

NODE PCCRAP *userdict;  /* run-time dictionary, added to dynamically */

char *word;
int each;

#ifdef SYS_MAC
  argc = ccommand(&argv);  /* Mac users have my sympathy */
#endif

  /* Your program goes here... */
  if (argc == 1) {
    fprintf(stderr, "usage: %s possibly mispeled wurdz\n", argv[0]);
    exit(EXIT_ERROR);
  }

  if (!pack_init("", &ptrie, &edges)) exit(EXIT_ERROR);

  if (!trie_new(&userdict)) exit(EXIT_ERROR);

  for (each = 1; each < argc; each++) {
    word = argv[each];
    fprintf(stderr, "* %s:\n", word);
    if (pack_check(ptrie, word) || dawg_check(userdict, word)) {
      fprintf(stderr, "Correct\n");
    } else {
      fprintf(stderr, "Wrong\n");
      (void)trie_add(&userdict, word);
    }
    if (each+1 != argc) fprintf(stderr, "\n");
  }

  exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > pdawg.c
/* Manadatory headers */
/*#define RCHECK*/
/*#define DEBUG*/
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* for isalpha() */

/* Spelling library utilities */
#include "init.c"
#include "print.c"
/*
#include "check.c"
#include "locate.c"
#include "similcmp.c"
#include "soundex.c"
#include "correct.c"
*/
/*#include "dyntrie.c"*/

int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX nedges;
  if (!dawg_init((argc == 1 ? "" : argv[1]), &dawg, &nedges)) exit(EXIT_ERROR);
  dawg_print(dawg, (INDEX)ROOT_NODE);
  fprintf(stderr, "Finished printing dawg\n");
  exit(0);
}
SHAR_EOF
cat - << \SHAR_EOF > ppack.c
/* Manadatory headers */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* for isalpha() */

/* Spelling library utilities */
#include "init.c"
#include "print.c"
/*
#include "check.c"
#include "locate.c"
#include "similcmp.c"
#include "soundex.c"
#include "correct.c"
*/
/*#include "dyntrie.c"*/



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *ptrie;
INDEX nedges;
  if (!pack_init((argc == 1 ? "": argv[1]), &ptrie, &nedges)) exit(EXIT_ERROR);
  pack_print(ptrie, (INDEX)ROOT_NODE);
  exit(0);
}
SHAR_EOF
cat - << \SHAR_EOF > proot.c
/*

    File:          proot.c
    Author:        Graham Toal
    Purpose:       find all words starting with 'root'
    Creation date: 22/06/90
    Lastedit:      22:24:32

    Description:
      some spelling programs remove characters from the end of
    wrongly spelt words one by one until the resulting root is
    found to be a prefix of other words in the dictionary.  This
    works because the assumption is made that the word was in
    fact correct, but was an inflected form of a word in the
    dictionary which had not been stored.
*/


/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* eg, for isalpha() */

/* Spelling library utilities */
#include "init.c"        /* Loading dicts */
/*#include "dyntrie.c"*/ /* Creating dicts at run-time */
#include "print.c"       /* Printing dicts */
/*#include "check.c"*/   /* Checking words */
#include "locate.c"      /* Finding words by their stems */

/*#include "soundex.c"*/ /* Soundex algorithm for word-likeness comparison */
/*#include "similcmp.c"*//* Closeness-metric for correction */
/*#include "correct.c"*/ /* Code to attempt error correction (uses above) */

/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
   files in a folder.  Then create a project which contains this main.c
   and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif

/* Your own header files go here, for instance:
               #include "readword.h"
 */



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX root;
INDEX edges;
int each;

#ifdef SYS_MAC
  argc = ccommand(&argv);  /* Mac users have my sympathy */
#endif

  /* Your program goes here... */
  if (argc == 1) {
    fprintf(stderr, "usage: %s part\n", argv[0]);
    exit(EXIT_ERROR);
  }
  if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);
  for (each = 1; each < argc; each++) {
    fprintf(stderr, "* %s:\n", argv[each]);
    root = dawg_locate_prefix(dawg, argv[each], ROOT_NODE);
    if (root != ROOT_NODE) {
      dawg_print_prefix(dawg, argv[each], root);
    } else {
      fprintf(stderr, "(none found)\n");
    }
    if (each+1 != argc) fprintf(stderr, "\n");
  }

  exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > sample.c
/*
   Dummy program for writing word utilities.
   If you stick to the style shown here, your programs will
   remain portable across a vast range of machines.
   Skeleton file by Graham Toal.  Remove this header when you've
   added your own...

   (If you want to see some worked examples, try
    tell.c  dwgcheck.c  pckcheck.c)

*/

/*

    File:          
    Author:        
    Purpose:       
    Creation date: 
    Lastedit:      

    Description:

*/


/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* eg, for isalpha() */

/* Spelling library utilities */
#include "init.c"      /* Loading dicts */
#include "dyntrie.c"   /* Creating dicts at run-time */
#include "print.c"     /* Printing dicts */
#include "check.c"     /* Checking words */
#include "locate.c"    /* Finding words by their stems */

#include "soundex.c"   /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c"  /* Closeness-metric for correction */
#include "correct.c"   /* Hack code to attempt error correction (uses above) */

/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
   files in a folder.  Then create a project which contains this main.c
   and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif

/* Your own header files go here, for instance:
               #include "readword.h"
 */



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
#ifdef SYS_MAC
  argc = ccommand(&argv);  /* Mac users have my sympathy */
#endif

  /* Your program goes here... */

  exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > tell.c
/*

    File:          tell.c
    Author:        Graham Toal
    Purpose:       offer correct spelling
    Creation date: 22/06/90
    Lastedit:      22:24:32

    Description:

       Like the unix 'spelltell' command.

*/


/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"

/* Headers here as needed on per-program basis */
#include <ctype.h>  /* eg, for isalpha() */

/* Spelling library utilities */
#include "init.c"      /* Loading dicts */
#include "dyntrie.c"   /* Creating dicts at run-time */
#include "print.c"     /* Printing dicts */
#include "check.c"     /* Checking words */
#include "locate.c"    /* Finding words by their stems */

#include "soundex.c"   /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c"  /* Closeness-metric for correction */
#include "correct.c"   /* Hack code to attempt error correction (uses above) */

/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
   files in a folder.  Then create a project which contains this main.c
   and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif

/* Your own header files go here, for instance:
               #include "readword.h"
 */



int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX edges;
int each;

#ifdef SYS_MAC
  argc = ccommand(&argv);  /* Mac users have my sympathy */
#endif

  /* Your program goes here... */
  if (argc == 1) {
    fprintf(stderr, "usage: %s mispeled wurdz\n", argv[0]);
    exit(EXIT_ERROR);
  }
  if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);
  for (each = 1; each < argc; each++) {
    fprintf(stderr, "* %s:\n", argv[each]);
    if (!dawg_correct(dawg, argv[each])) {
      fprintf(stderr, "(none found)\n");
    }
    if (each+1 != argc) fprintf(stderr, "\n");
  }

  exit(EXIT_OK);
}
SHAR_EOF