gtoal@tharr.UUCP (Graham Toal) (06/28/90)
#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar: Shell Archiver
# Run the following text with /bin/sh to create:
# README # copyr.h # dawg.h # dictlib.h # grope.h
cat - << \SHAR_EOF > README
This is a suite of programs for working with words. I am releasing
it to alt.sources primarily to invite people to test it on as many
platforms as possible (mail me back any bug reports please!) so that
the final code will be completely portable.
Unpack with unshar or sh < part1; sh < part2; sh < part3
Utilities for spelling checking and correction are supplied, but
the ultimate aim is to support all sorts of word-manipulation activities
such as a writers workbench and assorted word games (See PS).
This news posting is in three parts; part1: this file + headers;
part2: utility modules in .c files to be #include'd; part3: main
programs; just CC these and they should work -- no fancy makefiles
or link commands.
The code here doesn't have much of a user interface; I'm rather hoping
that people who pick it up will hook it in to their favourite operating
system as smoothly as they can. Perhaps someone with the time to
spare (who am I kidding :-( ) might integrate it with ispell for instance.
I've experimented with various interfaces already; I can accept TeX
input and write output suitable for correcting text in either emacs
or uEmacs. I'm not going to release these though, until I'm happy
with the internal code. Incidentally, the code will be totally
public domain - meaning that I've no objection to its being used
in commercial projects. HOWEVER I would appreciate if you didn't do so
until it is officially released (probably through comp.sources.misc)
because I would prefer to define portable file-formats and magic numbers
first.
OK, enough babble; here's how it all works:
1) acquire a list of words for yourself; I have one but at 690K it is
too big to post. (actually even these sources come to around 120K
so I hope they make it through OK. I don't think I have a shar
that splits, so I've divided into three shars myself.)
The word list should be sorted by ascii order.
2) cc dawg.c
I've deliberately made main programs into one source file, by
including *.c for utility modules; I know this is bad style but
it helps make compiling and porting easier (no worries about
long external names for instance, or non-case-sensitive linkers)
3) dawg yourdict dict
'yourdict' is your wordlist; but the second parameter should
be 'dict' at least while testing. This will generate dict.dwg
which is a compacted and *FAST* data structure for word access.
Building a dawg[See next section for meaning of name] from a word
list is a real memory pig; so I've written code which will let you
generate the dawg in chunks (traded off against a small loss of
compression density). (Interestingly enough it isn't a time-pig;
using a hash-table for node merging gives almost linear time - thanks
to Richard Hooker for that one).
If you don't have enough memory (say 2Mb?) you'll get a run-time
message suggesting that you recompile with cc -DSMALL_MEMORY dawg.c
(OK, so one day I'll make it select this mode itself at run-time)
IBM PC's and Mac's will get this mode by default. [See FOOTNOTE]
If you want to check that the data structure generated is ok,
3a) cc pdawg.c
3b) pdawg dict.dwg
This should decompile the data and print out the
original word list
4) cc dwgcheck.c
Compile a Mickey Mouse spelling checker
5) dwgcheck a few wurdz on the comand line ?
... and test it.
6) cc tell.c
Now try the 'advanced' stuff! - soundex correction... (I hates it :-()
7) tell me sume wrongli speld werds
and test it...
If you're getting into this stuff, here's another incidental
hack you might want to try...
7a) cc proot.c
7b) proot root
print out all words of form 'root*'
One day I'll hook this stuff into regexp; but not today :-)
----------
Enough examples for now. If that all worked you are doing well!
If it didn't, and you have the time, please find out what went wrong;
If you can fix it, mail me the fix please, else mail me a log of
the symptoms, such as compiler error messages.
By the way, I know that the spelling correction uses a tacky algorithm;
don't worry about it -- I'm working on some red hot stuff which will
put soundex to shame (which isn't hard :) ) [Late note; it now works
(apparently well) but I need a phonetic dictionary before it is useful :-(
Anyone got a phonetic dictionary? Alvey people out there?]
You might be interested in the data structures used; the main one is
a dawg - a 'Directed Acyclic Word Graph' -- it's essentially a
Directed Acyclic Graph implementing an FSM which describes the set of
all words in your lexicon. The name DAWG comes from Appel's in his paper
on Scrabble (although none of the code or algorithms do).
The second most important data structure is superimposed on this
one; it is a packing algorithm due to Liang (of TeX hyphenation
fame) which allows you to convert an implicitly linked trie
into an indexed trie *without* taking any more space. Liang is
a real smart cookie & it is a shame this technique of his is
not better known... (it should be up there among binary tree and
linked lists!)
OK, more examples:
8) cc pack.c
compile the packing code
9) pack dict.dwg dict.pck
this takes rather a long time; you'll get messages saying '1000 nodes'
'2000 nodes' etc roughly 1 every 20 seconds +/- 50%. There shouldn't
be more than a couple of screenfuls of these :-) I'll speed it up one
day.
If you want to check that this data structure generated is ok too,
9a) cc ppack.c
9b) ppack dict.pck
Just as you did for (3)
10) cc pckcheck.c
Just like dwgcheck but using the other type of data.
11) pckcheck sume moar possibly incorrectly speld woards
boring, huh?
12) Tha Tha Thats all folks!
---------------
Unless I've made some major cockup at the last minute, you should
actually have a reasonable chance of getting these working on your
machine, whatever it is. (Dec 20's and EBCDIC machines *possibly*
excepted - attempts welcome!)
As I said, please mail me your bugfixes, problems, suggestions etc.
By the way, the standard of coding isn't exceptionally high; this
implementation was always intended to be a prototype. A rewrite
of dawg.c and pack.c is definitely high on the priority list...
The actual application programs aren't too bad, but the interfaces
to the various utility routines are liable to change on an hour-by-hour
basis! (I've been trying to make them more consistent -- you should have
seen the last lot ;-) ).
However, the algorithms are all complete now; anyone who has ever needed
to convert a tree to a dag might be interested in the linear-time solution
in dawg.c - most other solutions I've seen have been NlogN or N^2.
If you're interested, there's a file dictlib.h which is a proposed interface
for the next iteration; comments are invited.
Share & Enjoy,
Graham Toal <gtoal@uk.ac.ed>
<FOOTNOTE>
I'm experimenting in this code with ways of finding out at compile
time various parameters needed for portability. There's a file
grope.h which does this. If things all work first time, great;
if not, you may have to change various #define's. The best way
to do that is to add code to grope.h which identifies your system,
and only make changes of the form:
#ifdef SYS_MYSYSTEM
#undef something
#define something (a different value)
#endif /* my system */
There are instructions for customising in grope.h itself. My overall
aim is to avoid special makefiles and special command line options.
(A bit ambitious I know, but nice idea if it works...)
(Steve Pemberton's config.c might be useful to you too if you are
hacking grope.h. It was reposted on usenet recently.)
</FOOTNOTE>
<PS>
Future hacks:
1) Anagrams
2) Crosswords
3) Scrabble
4) Anything where a text key can be used to lookup another
piece of text. This is implemented with the 'dawg_locate_prefix'
routine; it is effectively a cheap substitute for btrees etc.
(and better than a hash table because you can *do* things with it!)
4a) Reverse phone book 1234567=Fred Bloggs
4b) house style enforcer stupid=infelicitous
4c) uk/us spelling advisor sidewalk=pavement
4d) shorthand/macro expander f.y.i.=for your information
etc etc... (Most of wwb in fact)
5) Anything you can suggest! (or *implement* :-) )
</PS>
<MEMO>
Archive-name: dawgutils/part[1-3].shar
Reply-to: Graham Toal <gtoal%uk.ac.ed@nsfnet-relay.ac.uk>
1) Upload fixed version of dyntrie.c - remember bug with signed chars
being used as an index [comment rule in dawg.c?]
2) known design flaw: can't handle strings with 0 in them
3) known bug: dyntrie.c has problems if you enter a string
which *starts* with 255. This is due to sending 'end_of_node'
bit rather hackily. Can be fixed.
4) Test version to be posted to alt.sources by running on machine
with signed chars (eg MSDOS)
5) Remove hacky Malloc debugging macros in utils.c -- there might be
a problem if compiled on MSDOS but not under MICROSOFT C
6) Check that LIB_STRINGS is *not* defined when UX63 is set.
7) tweak the constants in pack.c to speed it up a lot
8) hack the pack.c heuristics into dyntrie.c to speed it up too.
</MEMO>
SHAR_EOF
cat - << \SHAR_EOF > copyr.h
/*
* Copyright 1989 Graham Toal & Richard Hooker
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose with or without fee is hereby granted provided
* that the above copyright notice appear in all copies and that both that
* copyright notice and this permission notice appear in supporting
* documentation.
*
* This program is publicly available, but is NOT in the public domain. The
* difference is that copyrights granting rights for unrestricted use and
* redistribution have been placed on the software to identify its
* authors. You are allowed and encouraged to take this software and
* build commercial products, freeware products, shareware products, and
* any other kind of product you can think up, with the following restriction:
*
* If you redistribute the source to this program, or a derivitive of that
* source, you must include this copyright notice intact. If the system
* this source is distributed with is under a stricter license (such as
* a commercial license, the typical freeware "no commercial use" license,
* or the FSF copyleft) then you must provide the original source under the
* original terms.
*
* Edinburgh Software Products (E.S.P.) makes no representations about the
* suitability of this software for any purpose. It is provided "as is"
* without express or implied warranty.
*
* E.S.P. DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
* ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
* E.S.P. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
* ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
* IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
* OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*
* Authors: Graham Toal & Richard Hooker, Edinburgh Software Products,
* 6 Pypers Wynd, Prestonpans, East Lothian, Scotland.
*/
SHAR_EOF
cat - << \SHAR_EOF > dawg.h
/**
*
* DAWG.H
*
* Header file for Directed Acyclic Word Graph access
*
* The format of a DAWG node (8-bit arbitrary data) is:
*
* 31 24 23 22 21 0
* +--------------------+---+---+--+-------------------------------------+
* | Letter | W | N |??| Node pointer |
* +--------------------+---+---+--+-------------------------------------+
*
* where N flags the last edge in a node and W flags an edge as the
* end of a word. 21 bits are used to store the node pointer, so the
* dawg can contain up to 262143 edges. (and ?? is reserved - all code
* generating dawgs should set this bit to 0 for now)
*
* The root node of the dawg is at address 1 (because node 0 is reserved
* for the node with no edges).
*
* **** PACKED tries do other things, still to be documented!
*
**/
#define TRUE (0==0)
#define FALSE (0!=0)
#define DAWG_MAGIC 0x01234567
#define PACK_MAGIC 0x34567890
#define TERMINAL_NODE 0
#define ROOT_NODE 1
#define V_END_OF_WORD 23
#define M_END_OF_WORD (1L << V_END_OF_WORD)
#define V_END_OF_NODE 22 /* Bit number of N */
#define M_END_OF_NODE (1L << V_END_OF_NODE) /* Can be tested for by <0 */
#define V_FREE 22 /* Overloading this bit, as
packed tries need no end-of-node */
#define M_FREE (1L << V_FREE)
#define V_LETTER 24
#define M_LETTER 0xFF
#define M_NODE_POINTER 0x1FFFFFL /* Bit mask for node pointer */
/* max_chars and base_chars are a hangover from the days when this
was trying to save space, and dawgs were only able to contain
characters in the range 'a'..'z' 'A'..'Z' (squeezed into 6 bits).
Now that we're '8-bit clean' we no longer need these. Later code
in fact doesn't support the old style; but some procedures still
subtract 'BASE_CHAR' from the character to normalize it. Since it
is now 0 it is a no-op... */
#define MAX_CHARS 256
#define BASE_CHAR 0
/* Enough? */
#define MAX_WORD_LEN 256
#define MAX_LINE 256
typedef long NODE;
typedef long INDEX;
SHAR_EOF
cat - << \SHAR_EOF > dictlib.h
/* This header file does not describe any of the code in the
accompanying files; it is a rough sketch of the NEXT iteration
of the spelling/word utilities. Comments are invited. Missing
functionality or bits that won't work should be pointed out
please! Send mail to gtoal@ed.ac.uk (via nsfnet-relay.ac.uk
from US sites)
many thanks.
Graham Toal 23/06/90
*/
typedef long INDEX;
typedef long NODE;
#define ROOTNODE 1L
typedef struct DICT {
/* Although primitive procedures will be provided for handling the
basic dawg array, by using this interface applications can
remain independent of the implementation details of the
dictionary. */
/* So far there are three forms of dict; 1: a simple dawg with edges
as 4-byte integers; each node (a set of branches) is stored
sequentially. 2: An *indexed* form of the above -- much faster
to lookup in, but slower to walk through. Although indexing
would normally be heavy on space, this uses Liang's packed
overlapping tries, thus using almost exactly the same space as
method 1. 3: A form of 1, but with all the stops pulled out
to get as much bit-level compaction as possible; edges can
be two bytes instead of 4. (by Keith Bostic) */
/* This struct contains both access procedures and information;
most fields will be filled in by our code when the dictionary
is loaded by load_dict(). The fields may be added to in
later releases, but they will always be upwards compatible;
none of this data is saved to disk in this format, so changing
the struct layout will not cause problems as long as the names
remain the same. */
/* Note that some procedures take a user-supplied 'apply' parameter;
This 'apply' procedure has a 'context' parameter which is merely
a handle to allow the user to pass data in to his code without
having to use global variables. If the user's apply procedure
returns anything other than 0, it will cause an early exit, with
the users return code being returned from the calling function. */
/* It is intended that these are used in an object-oriented way;
although we will supply an external dict_check(word, dict) procedure, for
instance, all it will do is call dict->check(word, dict->dawg) */
NODE *dawg;
/* Root of the dictionary tree */
char *dict_name, /* "smiths" - used in command lines etc. */
*dict_info, /* "Johns Smiths Rhyming Dictionary,\n
(c) Smith & Jones, 1892\n
Whitechapel, London.\n" */
*dict_fname, /* "/usr/dict/smiths" */
*lang_charset; /* "ISO 8859/1 Latin 1" */
/* These are filled in by us to make life easier for the applications
programmer. We'll supply a default table for simple 7-bit ascii
dawgs; but this mechanism allows us the option of including all
the salient points about the character set used in a dictionary
in the dictionary itself. Because they are char*'s, we'll probably
end up sharing the same table between multiple dictionaries --
so don't write to them. (Although you may replace the _pointer_
with one to a table of your own. */
/* Later note: I've just found out about LOCALE.H etc. This stuff
may therefore change... */
int *chartype;
/* 1 whitespace */
/* 2 punctuation */
/* 4 blank */
/* 8 lower case letter */
/* 16 upper case letter */
/* 32 (decimal) digit */
/* 64 underscore */
/* 128 A-F and a-f */
/* 256 Ignore words starting
with this char */
/* 512 Include in definition
of a word -- mainly
alpha but also hyphen
and apostrophe */
char *lower, /* personal implementation of tolower() */
*unaccented, /* map accented chars to non-accenmted */
*lexorder; /* an arbitrary ordering for sorting:
e-acute would come after e and before f
*/
int (* check) (NODE *base, INDEX state, struct DICT *d, char *word);
/* Check a word - exact match; return true/false */
int (* fuzzy) (NODE *base, INDEX state, struct DICT *d, char *word,
char *value);
/* Check a word - fuzzy case, accents, hyphens etc; return true/false */
/* value is filled in with the dictionary's idea of what the
word should look like. */
/* Fancier correction procedures may be synthesised out of user code and
'walk' */
int (* correct) (
char *word,
NODE *base, INDEX state,
/* Normally dict's root, but may be subtree */
struct DICT *d, /* lowercase/accent information from here */
int maxwords,
int order, /* 0: by alpha
1: by highest score first
2: by lexical ordering
(ignores case, accents, hyphens)
*/
int (* apply) (
char *word, /* Corrected word */
int score, /* Normalised to range 0..255 */
void *context),
void *context);
/* Offer spelling corrections for the
given word. If maxwords > 0, return
the best <maxwords> words in order
of most-likely first; otherwise
many words may be returned, in
alphabetical order, coupled with
a score: 255 means exact match; 0
means totally different */
/* returns 0 if user returned 0 for
all applications */
int (* walk) (
NODE *base, INDEX state,
int (* apply) (char *word, void *context),
void *context);
/* Apply the procedure given to every
word in the dictionary tree or subtree,
passing in the user-supplied context */
/* returns 0 if user returned 0 for
all applications */
int (* lookup) (
NODE *base, INDEX state,
char *key,
/* Out */
INDEX values);
/* Return the index of the subtree of the
dictionary which has 'key' as a prefix;
for instance, in a reverse telephone
book, you might find entries:
0874=Anytown
0875=Cockenzie
0875=Port Seton
0875=Prestonpans
0876=Toytown
In this case, if searching for "0875=",
a subtree would be returned containing:
Cockenzie
Port Seton
Prestonpans
This subtree would then be walked
using the 'walk' function and a user-
written 'apply' procedure. */
/* returns 0 if successful */
/* The following two are performance hints ONLY. They should not
affect the correct functioning of a program. */
int (* uncache) (struct DICT *d);/* The space a dictionary uses may be
reclaimed by calling this. If the
dictionary is subsequently accessed,
there *may* be a performance penalty --
for instance the dictionary may be
accessed off disk. However on a machine
with sufficient memory, the system
may choose to leave the data in ram
until, say, malloc runs out of space.
If there is no convenient mechanism
for organising the demand-recacheing,
this may be a no-op. */
/* returns 0 if successful */
int (* recache) (struct DICT *d);/* The data of the dictionary is reloaded
into active memory where it will stay */
/* returns 0 if successful */
/* May be extended in the future */
} DICT;
/*
dict_load
This is provided as a primitive so that dictionaries can be loaded
in the most efficient way the machine supports. The space for the dict
is supplied *by* this procedure - not *to* it. This allows a Mach (or Emas)
implementation to mmap (or Connect) the file in memory - the connection
address being chosen by the OS and outside our control. It also allows
systems like RISC OS on the Acorn Archimedes to use an optimised whole-
file load call to pull the file off disk into real ram in a single disk
operation.
Since the data parts of the type 1 & 2 dicts are just a set of 4-byte
integers, it is easy to correct a faulty byte-sex by running down this
array *once at start-up* to reverse the order. This is easy if the
data is loaded into physical ram. If it is connected as a file by a VM system
however, the VM system must support copy-on-write; if it only has read-only
mapping there must be two files available -- one of each sex.
A *possible* scheme on VM systems is to byte-sex-reverse a whole
page the first time that page is touched. This may be a lot of unneccesary
work - I'd recommend keeping a correct-sex file on-line.
This code will create the DICT struct, and fill in the various fields,
either from the dictionary file itself or from private knowledge if not available.
*/
int dict_load_file(char *fname, DICT **dict_struct);
/* fname is the literal file name */
/* dict_struct is allocated by us,
not by the user. */
/* returns 0 if successfully loaded */
int dict_load_dict(char *dname, DICT **dict_struct);
/* dname is the dictionary name,
e.g. "words", or "web2" etc.
The corresponding file will be searched
for via a system-dependent path, logical
variable, or fixed place. */
/* dict_struct is allocated by us,
not by the user. */
/* returns 0 if successfully loaded */
int dict_unload(DICT *d); /* Use when totally finished with data */
/* These are user-level veneers for the internal procedures used above.
Use them in preference to the above. Use the above only if
you are a speed-freak! (but disguise them in macros such as:
#define dict_check(dawg, dict, word) dict->check(dict->dawg, dict, word)
These macros will of course go 'Bang!' if dict ptr is unassigned or NULL.
*/
/* In the worst case, if a machine has a poor quality compiler which
doesn't support procedure variables well, these routines could be
the *acual* implementation rather than just a pointer to it. */
int check (NODE *base, INDEX state, DICT *d, char *word);
int fuzzy (NODE *base, INDEX state, DICT *d, char *word, char *value);
int correct (char *word,
NODE *base, INDEX state,
/* Normally dictionary root, but may be subtree */
DICT *d, /* lowercase/accent information from here */
int maxwords,
int order, /* 0: by alpha
1: by highest score first
2: by lexical ordering
*/
int apply(char *word, /* Corrected word */
int score, /* Normalised to range 0..255 */
void *context),
void *context);
int walk (NODE *base, INDEX state, /* Yoyo fanatics will be pleased
to learn that they can 'walk the dawg'... :-) */
int apply(char *word, void *context),
void *context);
int lookup (NODE *base, INDEX state,
char *key,
/* Out */
INDEX values);
/* Now here come the fast internal procedures which the above eventually call */
int dawg_check(NODE *base, INDEX state, char *word);
/* Standard dawg */
int pack_check(NODE *base, INDEX state, char *word);
/* Overlapped indexed dawg */
int bsd_check(NODE *base, INDEX state, char *word);
/* Space-optimised dawg (called bsd because Keith Bostic is working
on this format for the next bsd4.4 spelling checker) */
/* Must make the internl procs agree with these... save the macros for later...
#define dict_check(word,dict) \
(dict != NULL ? (dict->check(dict->dawg, ROOTNODE, word)) : -1)
*/
int dawg_walk(NODE *base, INDEX state,
int apply(char *word, void *context),
void *context);
int pack_walk(NODE *base, INDEX state,
int apply(char *word, void *context),
void *context);
int bsd_walk(NODE *base, INDEX state,
int apply(char *word, void *context),
void *context);
/*
#define dict_walk(dict, apply, context) \
(dict != NULL ? (dict->walk(dict->dawg, ROOTNODE, apply, context)) : -1)
*/
int dawg_lookup(NODE *base, INDEX state, char *key, /* Out */ INDEX values);
int pack_lookup(NODE *base, INDEX state, char *key, /* Out */ INDEX values);
int bsd_lookup(NODE *base, INDEX state, char *key, /* Out */ INDEX values);
/* etc... */
/* Get list of possible next characters from this point in the tree,
put it into user-supplied array branch[256]. Filled with
NULL if no branch, or an INDEX if it points to more of the tree.
Words which may terminate on a particular letter have terminate[let] != 0
The user-supplied terminate array is deliberately a long array to allow
for expansion; it may be used one day to return arbitrary codes such as
'Noun' etc...
*/
void dawg_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void pack_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void bsd_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
/* No internal proc yet... add it in the morning. */
/* In fact there will be even simpler procedures which can be used by
people who prefer to include the whole source of this package rather
than just the library interface parts. In applications where utmost
speed is a neccessity (eg Scrabble(tm)) you could use these, or write
your own. But most of us can use this clean interface with unnoticable
overhead. */
SHAR_EOF
cat - << \SHAR_EOF > grope.h
#ifndef grope_h
#define grope_h 1
#ifdef TESTING_DEFS
/*###################################################################*/
/*
* This is an exercise to assist in writing portable programs.
*
* To use as a header file, eg "grope.h", just leave this file
* as it is. To use it to define new entries, rename it as
* "grope.c" and compile it as "cc -DTESTING_DEFS grope".
*
* To customise this test program for your system, I have set it up
* so that all features are enabled. If you find that one feature
* causes a compile-time error, test a suitable set of '#ifdef's for
* your machine and #undef the values below which are not appropriate.
*
* Please do your best to see that your tests are unique, and that
* you do not undo any tests already implemented.
*
* One last request; PLEASE PLEASE PLEASE send me any updates you
* may make. If you cannot get through to me on email, send me a
* paper copy (or even better, a floppy copy :-)) to Grahan Toal,
* c/o 6 Pypers wynd, Prestonpans, East Lothian, Scotland EH32 9AH.
*
* Graham Toal <gtoal@uk.ac.ed>
* [PS: I can read DOS and RISC OS floppies only]
*/
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
#define KNOWS_LINE 1
#define KNOWS_REGISTER 1
#endif /* TESTING_DEFS */
/*###################################################################*/
#define SYS_ANY 1
#define COMPILER_ANY 1
/* Don't yet know what machine it is. Add a test once identified. */
/*--------------------------*/
#ifdef MSDOS
#define SYS_MSDOS 1
#endif
/*--------------------------*/
#ifdef __STDC__
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
/* If the compiler defines STDC it should have these. We can undef
them later if the compiler was telling lies! */
#endif
/*--------------------------*/
#ifdef MPU8086
#define SYS_MSDOS 1
/* Aztec does not #define MSDOS.
We assume it is Aztec on MSDOS from the MPU8086.
*/
#ifdef __STDC__
#define COMPILER_AZTEC 1
/* Aztec is known to also define __STDC__ (illegally). Therefore if
__STDC__ is not defined, it is probably not Aztec */
#endif
#endif
#ifdef SYS_MSDOS
/*--------------------------*/
#ifdef __STDC__
/*----------------------*/
#define COMPILER_MICROSOFT 1
/* I assume that the combination of #define MSDOS and #define __STDC__
without any other collaboration means MICROSOFT. (Who, incidentally,
are being naughty by declaring __STDC__) */
#define KNOWS_LINE 1
#else
/*----------------------*/
#ifdef __ZTC__
/*------------------*/
#define COMPILER_ZORTECH 1
#undef STDLIBS
/* Another system without locale.h */
#define PROTOTYPES 1
#else
/*------------------*/
/* A non-std msdos compiler */
#undef STDLIBS
#undef PROTOTYPES
/*------------------*/
#endif
/*----------------------*/
#endif
/*--------------------------*/
#endif
#ifdef TURBOC
/* Although Turbo C correctly does not define __STDC__, it has SOME
standard libraries (but not all - locale.h?) and accepts prototypes. */
#undef STDLIBS
#define PROTOTYPES 1
#define SYS_MSDOS 1
#define COMPILER_TURBOC 1
/* TURBOC is defined, but has no value. This allows it to be tested
with #if as well as #ifdef. */
#endif
/*--------------------------*/
#ifdef COMPILER_MICROSOFT
#undef STDLIBS
/* Again, like Turbo C, does not know locale.h */
#define PROTOTYPES 1
#endif
/*--------------------------*/
#ifdef SYS_MSDOS
#define MEMMODELS 1
/* We assume ALL MSDOS machines use memory models */
#endif
/*--------------------------*/
#ifdef UX63
#undef STDLIBS
#undef PROTOTYPES
#define SYS_UX63 1
#define COMPILER_PCC 1
/*#define LIB_STRINGS 1 - apparently not... */
#endif
/*--------------------------*/
#ifdef sun
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SUN 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef THINK_C
#define SYS_MAC 1
#define COMPILER_THINKC 1
#define KNOWS_VOID 1
#endif
/*--------------------------*/
#ifdef sparc
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SPARC 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef ARM
#define SYS_RISCOS 1
/* I fear Acorn may define 'ARM' on both unix and risc os versions.
Should find out whether they define others as well, to differentiate. */
#endif
#ifdef __ARM__
#define SYS_RISCOS 1
/* I fear Acorn may define 'ARM' on both unix and risc os versions.
Should find out whether they define others as well, to differentiate. */
#endif
/*--------------------------*/
#ifdef SYS_RISCOS
#define COMPILER_NORCROFT 1
#define KNOWS_REGISTER 1
#define KNOWS_LINE 1
#endif
/*--------------------------*/
#ifdef vms
#define SYS_VMS 1
#endif
/*--------------------------*/
/*--------------------------*/
#ifdef SYS_UX63
#undef SYS_ANY
#endif
#ifdef SYS_ARM
#undef SYS_ANY
#endif
#ifdef SYS_MSDOS
#undef SYS_ANY
#endif
#ifdef SYS_SUN
#undef SYS_ANY
#endif
#ifdef SYS_SPARC
#undef SYS_ANY
#endif
#ifdef SYS_RISCOS
#undef SYS_ANY
#endif
#ifdef SYS_MAC
#undef SYS_ANY
#endif
#ifdef SYS_VMS
#undef SYS_ANY
#endif
/*--------------------------*/
#ifdef COMPILER_PCC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_MICROSOFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_TURBOC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_ZORTECH
#undef COMPILER_ANY
#endif
#ifdef COMPILER_AZTEC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_NORCROFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_THINKC
#undef COMPILER_ANY
#endif
/*--------------------------*/
/* ##################################################################### */
#ifdef TESTING_DEFS
#include <stdio.h>
/* ======================================================================= */
#ifdef STDLIBS
/* If any of these fail, make sure STDLIBS is not
defined for your machine. */
#include <stdlib.h> /* STDLIBS should not be defined. */
#include <time.h> /* STDLIBS should not be defined. */
#include <locale.h> /* STDLIBS should not be defined. */
#endif
/* ======================================================================= */
#ifdef KNOWS_VOID
void test() { /* KNOWS_VOID should not be defined */
/* Make sure your ifdef's don't #define KNOWS_VOID if this fails */
}
#endif
/* ======================================================================= */
#ifdef KNOWS_REGISTER
int regtest() {
register int i = 0; /* KNOWS_REGISTER should not be defined */
/* Make sure your ifdef's don't #define KNOWS_REGISTER if this fails */
return(i);
}
#endif
/* ======================================================================= */
#ifdef PROTOTYPES
int main(int argc, char **argv) /* PROTOTYPES should not be defined */
/* If this fails, make sure PROTOTYPES is not defined for your machine. */
#else
int main(argc,argv)
int argc;
char **argv;
#endif
{
/*-------------------------------------------------------------------------*/
printf("We should know just what the machine is, or 'SYS_ANY':\n");
#ifdef SYS_UX63
printf("#define SYS_UX63 %d\n", SYS_UX63);
#endif
#ifdef SYS_ARM
printf("#define SYS_ARM %d\n", SYS_ARM);
#endif
#ifdef SYS_MSDOS
printf("#define SYS_MSDOS %d\n", SYS_MSDOS);
#endif
#ifdef SYS_SUN
printf("#define SYS_SUN %d\n", SYS_SUN);
#endif
#ifdef SYS_SPARC
printf("#define SYS_SPARC %d\n", SYS_SPARC);
#endif
#ifdef SYS_RISCOS
printf("#define SYS_RISCOS %d\n", SYS_RISCOS);
#endif
#ifdef SYS_MAC
printf("#define SYS_MAC %d\n", SYS_MAC);
#endif
#ifdef SYS_VMS
printf("#define SYS_VMS %d\n", SYS_VMS);
#endif
#ifdef SYS_ANY
printf("#define SYS_ANY %d\n", SYS_ANY);
#endif
/*-------------------------------------------------------------------------*/
printf("Either the machine has different memory models or not:\n");
#ifdef MEMMODELS
printf("#define MEMMODELS %d\n", MEMMODELS);
#else
printf("#undef MEMMODELS\n");
#endif
/*-------------------------------------------------------------------------*/
printf("One compiler name should be given, or 'COMPILER_ANY':\n");
#ifdef COMPILER_PCC
printf("#define COMPILER_PCC %d\n", COMPILER_PCC);
#endif
#ifdef COMPILER_MICROSOFT
printf("#define COMPILER_MICROSOFT %d\n", COMPILER_MICROSOFT);
#endif
#ifdef COMPILER_TURBOC
printf("#define COMPILER_TURBOC %d\n", COMPILER_TURBOC);
#endif
#ifdef COMPILER_ZORTECH
printf("#define COMPILER_ZORTECH %d\n", COMPILER_ZORTECH);
#endif
#ifdef COMPILER_AZTEC
printf("#define COMPILER_AZTEC %d\n", COMPILER_AZTEC);
/* Can exist on other machines as well as MSDOS */
#endif
#ifdef COMPILER_LATTICE
/* Currently coming through as 'COMPILER_ANY' - haven't found one to test */
/* Can exist on other machines as well as MSDOS. Meanwhile pass it in */
/* on the command_line? */
printf("#define COMPILER_LATTICE %d\n", COMPILER_LATTICE);
#endif
#ifdef COMPILER_GCC
/* Ditto. Test in appropriate places for each machine. */
printf("#define COMPILER_GCC %d\n", COMPILER_GCC);
#endif
#ifdef COMPILER_NORCROFT
printf("#define COMPILER_NORCROFT %d\n", COMPILER_NORCROFT);
#endif
#ifdef COMPILER_THINKC
printf("#define COMPILER_THINKC %d\n", COMPILER_THINKC);
#endif
#ifdef COMPILER_ANY
printf("#define COMPILER_ANY %d\n", COMPILER_ANY);
#endif
/*-------------------------------------------------------------------------*/
printf("Either the compiler accepts ANSI-like libraries or not:\n");
#ifdef STDLIBS
printf("#define STDLIBS %d\n",STDLIBS);
#else
printf("#undef STDLIBS\n");
#endif
/*-------------------------------------------------------------------------*/
printf("Either the machine accepts ANSI prototypes or not:\n");
#ifdef PROTOTYPES
printf("#define PROTOTYPES %d\n", PROTOTYPES);
#else
printf("#undef PROTOTYPES\n");
#endif
/*-------------------------------------------------------------------------*/
printf("Either the machine needs nonstandard #include <strings.h> or not:\n");
#ifdef LIB_STRINGS
printf("#define LIB_STRINGS %d\n", LIB_STRINGS);
#else
printf("#undef LIB_STRINGS\n");
#endif
/*-------------------------------------------------------------------------*/
printf("Either the machine accepts the 'void' keyword or not:\n");
#ifdef KNOWS_VOID
printf("#define KNOWS_VOID %d\n", KNOWS_VOID);
#else
printf("#undef KNOWS_VOID\n");
#endif
printf("Either the machine accepts the 'register' keyword or not:\n");
/*-------------------------------------------------------------------------*/
printf("Either the compiler accepts the '__LINE__' directive or not:\n");
#ifdef KNOWS_LINE
printf("#define KNOWS_LINE %d\n", KNOWS_LINE);
#else
printf("#undef KNOWS_LINE\n");
#endif
/*-------------------------------------------------------------------------*/
printf("Either the machine accepts the 'register' keyword or not:\n");
#ifdef KNOWS_REGISTER
printf("#define KNOWS_REGISTER %d\n", KNOWS_REGISTER);
#else
printf("#undef KNOWS_REGISTER\n");
#endif
/* Note - this is intended to be used only if your compiler accepts
'register' *AND* compiles code with it correctly! Some compilers
show up bugs when register optimisations are attempted! */
/*-------------------------------------------------------------------------*/
if (argv[argc] != 0) printf("Warning! argv[argc] != NULL !!! (Non ansii)\n");
}
#endif /* TESTING_DEFS */
#endif /* Already seen? */
SHAR_EOFgtoal@tharr.UUCP (Graham Toal) (06/28/90)
#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar: Shell Archiver
# Run the following text with /bin/sh to create:
# check.c # correct.c # dyntrie.c # init.c # locate.c # print.c # similcmp.c # soundex.c # utils.c
cat - << \SHAR_EOF > check.c
/*
File: check.c
Author: Graham Toal
Purpose: Check a word using DAWG or TRIE.
Functions exported: dawg_check, pack_check
Internal function: dawg_ck, pack_ck
Description:
call as: dawg_check(dawg, "word")
pack_check(trie, "word")
Simple spelling-check. Unfortunately the two interfaces for
DAWGs and TRIEs are different, and even DAWG stuff differs from
program to program. The problem is primarily in how to spot
the end of a search; at the moment it is done when a particular
pointer points to the root. Unfortunately we enter the data
structure either at root+1 or at some arbitrary point, so this
scheme means passing in two pointers. The next idea is to check
that the *data* pointed to by the terminating pointer is 0; this
would be OK but I was hoping to leave the initial word in the
dawg free for expansion... (dawg contains [0, Node1..NodeN])
*best* idea would be to terminate on *INDEX* being 0. This is
probably also simplest & I'll code it up properly when I'm awake...
*/
static int
#ifdef PROTOTYPES
dawg_ck(NODE PCCRAP *dict, NODE PCCRAP *edge, char *word)
#else
dawg_ck(dict, edge, word)
NODE PCCRAP *dict;
NODE PCCRAP *edge;
char *word;
#endif
{
if (edge == dict) return(0!=0);
for (;;) {
if (*word == (((*edge >> V_LETTER) & M_LETTER))) {
if (*++word == '\0') {
return((*edge & M_END_OF_WORD) != 0);
} else {
if ((edge = dict+(*edge & M_NODE_POINTER)) == dict) break;
continue;
}
}
if (((*edge++) & M_END_OF_NODE) != 0) break;
}
return(0!=0);
}
int
#ifdef PROTOTYPES
dawg_check(NODE PCCRAP *dict, char *word)
#else
dawg_check(dict, word)
NODE PCCRAP *dict;
char *word;
#endif
{
return(dawg_ck(dict, dict+ROOT_NODE, word));
}
static int
#ifdef PROTOTYPES
pack_ck(NODE PCCRAP *ptrie, INDEX p, char *s)
#else
pack_ck(ptrie, p, s)
NODE PCCRAP *ptrie;
INDEX p;
char *s;
#endif
{
/* Note: node is passed in as a parameter so that it is in a register -
otherwise it would take an extra instruction every time it was accessed.
We trust the compiler to allocate register variables most efficiently --
when people tinker, it might improve one machine but ruin another...
*/
#define next_letter(s) \
p = ptrie[p + *s]; \
if (((p >> V_LETTER) & M_LETTER) != *s++) return(0!=0); \
if (*s == '\0') return((p >> V_END_OF_NODE) != 0); \
if ((p &= M_NODE_POINTER) == 0) return(0!=0)
/* Depending on whether your machine has a cache or not, and whether
it optimises pipeline fetches, you should decrease or increase the
number of macro instantiations in the loop below appropriately */
for (;;) {
next_letter(s); next_letter(s); next_letter(s); next_letter(s);
}
}
int
#ifdef PROTOTYPES
pack_check(NODE PCCRAP *ptrie, char *s)
#else
pack_check(ptrie, s)
NODE PCCRAP *ptrie;
char *s;
#endif
{
return(pack_ck(ptrie, 1L, s));
}
SHAR_EOF
cat - << \SHAR_EOF > correct.c
/*
File: correct.c
Author: Graham Toal
Purpose: Correct a word using Soudex and similcmp on DAWG.
Functions exported: dawg_correct
Internal functions: reccmp, walk_rest_dawg, walk_dawg
Description:
Reduce word to rough phonetic representation.
Reduce all words in dawg to same form. Compare. This
results in a large list of plausible (HA!) candidates.
Then apply 'similcmp' which returns a distance metric representing
closeness of two words. Sort by this metric, and return N best
matches. If one stands out above others, return it alone.
I'm working on a *much better* algorithm, but this will do
for prototyping the interface.
Final version of this will have callbacks to handle corrected
words as they are produced.
*/
#include "soundex.c"
#include "similcmp.c"
typedef struct {
char *possible;
int score;
} rec;
/* Numbers of parameters exploded during process of removing global state :( */
static void
#ifdef PROTOTYPES
walk_rest_dawg(NODE PCCRAP *dawg, INDEX node, int len,
char *word, char *original, char *target,
rec *ans, int *nextfreep)
#else
walk_rest_dawg(dawg, node, len, word, original, target, ans, nextfreep)
NODE PCCRAP *dawg;
INDEX node;
int len;
char *word;
char *original;
char *target;
rec *ans;
int *nextfreep;
#endif
#define nextfree (*nextfreep)
{
NODE PCCRAP *edge;
long c;
for (edge = (NODE PCCRAP *)&dawg[node]; ;edge++) {
c = *edge; /* Avoid MS C compiler bug :-( */
c = c >> V_LETTER;
c = c & M_LETTER;
word[len] = (char)c;
if ((*edge & M_END_OF_WORD) != 0) {
word[len+1] = 0;
if (strcmp(soundex(word), target) == 0) {
ans[nextfree].possible = (char *)Malloc(len+1, 1);
if (ans[nextfree].possible == NULL) {
fprintf(stderr,"I\'ve no idea - it could be anything...\n");
exit(0);
}
strcpy(ans[nextfree].possible, word);
ans[nextfree].score = simil(word, original);
if (nextfree < 511) nextfree++;
}
}
c = *edge & M_NODE_POINTER;
if (c != 0) {
walk_rest_dawg(dawg, c, len+1, word, original, target, ans, nextfreep);
}
if ((*edge & M_END_OF_NODE) != 0) break; /* End of node */
}
#undef nextfree
}
static void
#ifdef PROTOTYPES
walk_dawg(
NODE PCCRAP *dawg, INDEX node, int len,
char *original, char *target,
char targets, char *word,
rec *ans, int *nextfreep)
#else
walk_dawg(dawg, node, len, original, target, targets, word, ans, nextfreep)
NODE PCCRAP *dawg;
INDEX node;
int len;
char *original;
char *target;
char targets;
char *word;
rec *ans;
int *nextfreep;
#endif
#define nextfree (*nextfreep)
{
NODE PCCRAP *edge;
long c;
edge = (NODE PCCRAP *)&dawg[node];
/* Only search trie starting with initial letter */
for (;;) {
c = *edge; /* Avoid MS C compiler bug :-( */
c = c >> V_LETTER;
c = c & M_LETTER;
word[len] = (char)c;
if (c == targets) {
c = *edge & M_NODE_POINTER;
walk_rest_dawg(dawg, c, len+1, word, original, target, ans, nextfreep);
return;
}
edge++;
}
#undef nextfree
}
static int
#ifdef PROTOTYPES
reccmp(const void *a, const void *b)
#else
reccmp(a, b)
void *a;
void *b;
#endif
{
return(((rec *)b)->score - ((rec *)a)->score);
}
int
#ifdef PROTOYPES
dawg_correct(NODE PCCRAP *dawg, char *original)
#else
dawg_correct(dawg, original)
NODE PCCRAP *dawg;
char *original;
#endif
{
char word[81];
char target[256];
static rec ans[256];
/* static because brain-dead MSC can't hack a big stack :-( */
int targets;
int nextfree = 0;
int i, prev=0;
targets = original[0];
strcpy(target, soundex(original));
walk_dawg(
dawg, (INDEX)ROOT_NODE, 0, original, target, targets, word, ans, &nextfree);
if (nextfree == 0) return(FALSE);
qsort(ans, (size_t)nextfree, sizeof(rec), /*&*/reccmp);
for (i = 0; i < nextfree; i++) {
if (((prev*100) / (ans[i].score)) > 120) break;
fprintf(stdout, "%s (%d)\n", ans[i].possible, ans[i].score);
if (ans[i].score >= 100) break;
if (i == 5) break;
prev = ans[i].score;
}
return(TRUE);
}
SHAR_EOF
cat - << \SHAR_EOF > dyntrie.c
/* The #define is to repair mangling by BITNET mailers */
#define NOT ~
/* This should be Tilde (twiddle) -- C unary Not */
/*
File: dyntrie.c
Author: Graham Toal
Purpose: Check a word using DAWG or TRIE.
Functions exported: trie_new, trie_add, trie_dump
Internal functions: insert_simple recurse_add
Description:
Builds DAWG-compatible trie in store, incrementally; words do not
have to be presented in order. (Note it is NOT a 'packed-trie';
in our scheme it is closer to a dawg - but without the tail-compression
precomputed dawgs have).
Any tries built using this should be used by check_dawg, print_dawg etc.
Has rather nice property of only claiming enough memory for the
job; dynamically moves the data structure when it fills!
*/
/* To avoid global state, I'm putting these useful bits of info
in the two words before the dawg itself; I *HOPE* that all systems
allow negative indexing off arrays; I'm not at all sure they do
though. However since I moved the base of the dict up by adding
2 to it, I *should* be able to get the old address by by subtracting
2 from it, so I'm using the first pair of macros below, not the
more intuitive second pair.
I should really do this for ordinary dawgs & tries too. Indeed
I should *really* implement the dictlib.h interface, but thats a
month or two off. Also I'ld like some feedback first.
*/
#define EXTRAS 2
#define MAX_ENTRIES(dawg) (*(dawg-2)) /* Safe way of aliasing */
#define NEXT_FREE(dawg) (*(dawg-1))
/* #define MAX_ENTRIES(dawg) dawg[-2] */ /* 'Obvious' alias */
/* #define NEXT_FREE(dawg) dawg[-1] */ /* may be buggy */
#define EMPTY 0
#define INIT_MAX_ENTRIES 512
/* Slop is so that we don't need to be continually checking nextfree
as being close to max_entries */
#define SLOP 12
/* A debugging macro which rangechecks arrays -- happens when
utils.c is included with RCHECK defined. Otherwise no overhead. */
#define DICT(idx) dict[RANGECHECK(idx, MAX_ENTRIES(dict))]
/*
This quick hack job has special-case code in each function
for the root node-set. Its rather tacky; I could clean it
up and make the root node like all other nodes, but why
bother; this is only test code anyway...
*/
static INDEX
#ifdef PROTOTYPES
insert_simple(NODE PCCRAP **dictp, char *word)
#else
insert_simple(dictp, word)
NODE PCCRAP **dictp;
char *word;
#endif
#define dict (*dictp)
{
NODE bugfix;
INDEX i; NODE n; int c;
if (NEXT_FREE(dict) >= MAX_ENTRIES(dict)-SLOP) {
NODE PCCRAP *moved;
moved = MALLOC((SIZE_T)MAX_ENTRIES(dict)*2+EXTRAS, sizeof(NODE));
if (moved != NULL) {
moved += EXTRAS;
MAX_ENTRIES(moved) = MAX_ENTRIES(dict);
NEXT_FREE(moved) = NEXT_FREE(dict);
/* Should use realloc but appears to be buggy :-( */
for (i = MAX_ENTRIES(dict); i < MAX_ENTRIES(dict)*2; i++) {
moved[i] = EMPTY;
}
for (i = 0; i < MAX_ENTRIES(dict); i++) moved[i] = DICT(i);
dict -= EXTRAS;
FREE(dict);
dict = moved; MAX_ENTRIES(dict) *= 2;
} else {
fprintf(stderr, "Can\'t add any more words again\n");
}
}
c = (*word++)&255;
if (c == '\0') return(TERMINAL_NODE);
i = NEXT_FREE(dict)++;
bugfix = insert_simple(&dict, word);
n = ((NODE)c << (NODE)V_LETTER) | M_END_OF_NODE | bugfix;
DICT(i) = n; if (*word == '\0') DICT(i) |= M_END_OF_WORD;
return(i);
#undef dict
}
static void
#ifdef PROTOTYPES
recurse_add(NODE PCCRAP **dictp, INDEX base, char *word)
#else
recurse_add(dictp, base, word)
NODE PCCRAP **dictp;
INDEX base;
char *word;
#endif
#define dict (*dictp)
{
NODE bugfix;
INDEX i = base, newbase;
NODE unpicked[MAX_CHARS], n;
int c, ch;
int gap, nextslot = 0, j;
/* First see if first char is already in trie */
ch = (*word++)&255;
for (;;) {
c = (int)(((dict[i] >> V_LETTER) & M_LETTER)&255);
if (ch == c) {
newbase = dict[i]&M_NODE_POINTER;
if (newbase == 0) {
/* have abc (this is c), adding (abc)defg */
dict[i] &= (NOT M_NODE_POINTER);
bugfix = insert_simple(&dict, word);
dict[i] |= bugfix;
} else {
if (*word == '\0') {
dict[i] |= M_END_OF_WORD;
} else {
recurse_add(dictp, newbase, word);
}
}
return;
}
if ((dict[i]&M_END_OF_NODE) != 0) break;
i++;
}
/* unpick base */
for (nextslot = 0; nextslot < MAX_CHARS; nextslot++) {
unpicked[nextslot] = EMPTY;
}
nextslot = 0;
i = base; j = 0;
for (;;) {
if ((j == 0) && (((dict[i] >> V_LETTER) & M_LETTER) > ch)) {
j = 1; newbase = nextslot++;
}
n = dict[i]; dict[i] = EMPTY; i++;
unpicked[nextslot++] = n & (NOT M_END_OF_NODE);
if ((n & M_END_OF_NODE) != 0) break;
}
if (j == 0) newbase = nextslot++; /* Was it last alphabetically? */
/* add this char to the node */
if (*word == '\0') {
unpicked[newbase] =
((NODE)ch << (NODE)V_LETTER) | TERMINAL_NODE | M_END_OF_WORD;
} else {
/* and insert the rest of the
string with insert_simple */
bugfix = insert_simple(&dict, word);
unpicked[newbase] =
((NODE)ch << (NODE)V_LETTER) | bugfix;
/* The insert_simple call has to come after above loop, because
of freeing slots... */
}
unpicked[nextslot-1] |= M_END_OF_NODE;
/* scan for suitably-sized gap */
/* MAX_CHARS+1 should be first 'real' node base (0, 1..MAX_CHARS reserved) */
/*
The particular application this is wanted for doesn't have a large
number of words to be added dynamically, so the BFI code below which
finds free holes in the trie space works fine. However if some bright
spark cares to keep a freelist *in the holes* then this program
effectively implements a linear-time sorting algorithm.
(I know it's not *really*, but think about it before jumping
down my throat; the N^2 case is very statistically unlikely)
*/
for (i = (INDEX)MAX_CHARS+1; i < MAX_ENTRIES(dict)-nextslot-SLOP; i++) {
/*
Even without the freelist, the sneaky hack in pack.c for
noting earliest bunch of free holes might well make a
significant difference... (It was a lowest_search_start
variable and used a bit of hysteresis)
*/
gap = TRUE;
for (j = 0; j < nextslot; j++) {
if (dict[i+j] != EMPTY) {
gap = FALSE;
break;
}
}
if (gap) break;
}
if (!gap) {
NODE PCCRAP *moved;
moved = MALLOC((SIZE_T)MAX_ENTRIES(dict)*2+EXTRAS, sizeof(NODE));
if (moved != NULL) {
moved += EXTRAS;
MAX_ENTRIES(moved) = MAX_ENTRIES(dict);
NEXT_FREE(moved) = NEXT_FREE(dict);
/* Should use realloc but appears to be buggy :-( */
for (i = MAX_ENTRIES(dict);
i < MAX_ENTRIES(dict)*2; i++) moved[i] = EMPTY;
for (i = 0; i < MAX_ENTRIES(dict); i++) moved[i] = DICT(i);
dict -= EXTRAS;
FREE(dict);
dict = moved; /* If your compiler has aliasing bugs they may hit here. */
MAX_ENTRIES(dict) *= 2;
/* scan for suitably-sized gap */
for (i = ((MAX_ENTRIES(dict)/2)-(MAX_CHARS+1));
i < MAX_ENTRIES(dict)-nextslot; i++) {
gap = TRUE;
for (j = 0; j < nextslot; j++) {
if (dict[i+j] != EMPTY) {
gap = FALSE;
break;
}
}
if (gap) break;
}
}
if (!gap) {
fprintf(stderr, "Can\'t add any more words\n");
return;
}
}
newbase = i;
/* reinsert */
for (j = 0; j < nextslot; j++) {
dict[newbase+j] = unpicked[j];
}
if (newbase+nextslot-1 >= NEXT_FREE(dict)) NEXT_FREE(dict) = newbase+nextslot;
/* update pointer to the base of this node */
for (i = 0; i < MAX_ENTRIES(dict); i++) {
if ((dict[i] & M_NODE_POINTER) == base) {
dict[i] &= (NOT M_NODE_POINTER);
dict[i] |= newbase;
break; /* (should only be one since trie, not dawg) */
}
}
#undef dict
}
void
#ifdef PROTOTYPES
trie_add(NODE PCCRAP **dictp, char *word)
#else
trie_add(dictp, word)
NODE PCCRAP **dictp;
char *word;
#endif
#define dict (*dictp)
{
INDEX next;
INDEX bugfix;
int c;
/* Root node is pre-reserved MAX_CHARS entries */
c = (*word++)&255; /* bugfix - remember chars can be signed :-( */
if (DICT((INDEX)ROOT_NODE+c) == EMPTY) {
/* I'm allowing the root node to be sparse for speed; it *should*
be dense like any other node. May change this later */
if (*word == '\0') {
DICT((INDEX)ROOT_NODE+c) =
((NODE)c << (NODE)V_LETTER) | M_END_OF_WORD;
} else {
bugfix = insert_simple(&dict, word);
DICT((INDEX)ROOT_NODE+c) =
((NODE)c << (NODE)V_LETTER) | bugfix;
}
} else {
if (*word == '\0') {
DICT((INDEX)ROOT_NODE+c) |= M_END_OF_WORD;
} else {
next = DICT((INDEX)ROOT_NODE+c) & M_NODE_POINTER;
if (next == 0) {
/* have 'a', adding 'abcd' */
/* (Should really check the letter for safety...) */
bugfix = insert_simple(&dict, word);
DICT((INDEX)ROOT_NODE+c) = ((NODE)c << (NODE)V_LETTER)
| bugfix | M_END_OF_WORD;
} else {
recurse_add(dictp, next, word);
}
}
}
#undef dict
}
int
#ifdef PROTOTYPES
trie_new(NODE PCCRAP **dawgp)
#else
trie_new(dawgp)
NODE PCCRAP **dawgp;
#endif
#define dawg (*dawgp)
{
INDEX i;
dawg = MALLOC(INIT_MAX_ENTRIES+EXTRAS, sizeof(NODE));
dawg += EXTRAS;
MAX_ENTRIES(dawg) = INIT_MAX_ENTRIES; NEXT_FREE(dawg) = MAX_CHARS+1;
dawg[0] = 0;
for (i = 1; i <= MAX_CHARS; i++) dawg[i] = 0;
dawg[MAX_CHARS] |= M_END_OF_NODE;
/* 0 base, MAX_CHARS chars */
return(TRUE);
#undef dawg
}
int
#ifdef PROTOTYPES
trie_dump(NODE PCCRAP *dawg, char *filename)
#else
trie_dump(dawg, filename)
NODE PCCRAP *dawg;
char *filename;
#endif
{
INDEX cnt, max;
FILE *dumper;
max = NEXT_FREE(dawg)*sizeof(NODE);
/* I *knew* the next_free variable would come in handy :-) */
dumper = fopen(filename, WRITE_BIN);
if (dumper == NULL) {
fprintf(stderr, "Failed to dump trie on file \'%s\'\n", filename);
return(FALSE);
}
putword(max, dumper);
if ((cnt = putwords(dawg, max, dumper)) != max) {
fprintf(stderr, "Failed to dump: %ld bytes written instead of %ld\n",
cnt, max);
return(FALSE);
}
fclose(dumper);
return(TRUE);
}
#undef MAX_ENTRIES
#undef NEXT_FREE
#undef DICT
#undef SLOP
SHAR_EOF
cat - << \SHAR_EOF > init.c
/*
File: init.c
Authors: Richard Hooker, Graham Toal
Purpose: Loading of dictionary for spelling checker.
Functions exported: dawg_init, pack_init
Internal functions: spell_init
Description:
The address of a dictionary (either a PACKed dict or a DAWG) is set up by
this procedure. This gives us the option of connecting the dict in read-only
(or copy-on-write) virtual memory. On non-VM machines, we simply allocate a
large buffer into which the relevant part of the dictionary is loaded.
The magic number is used to check the dict type, *and* the machine byte-sex.
If the sex is reversed, even a VM system has to load the data into writable
memory (so that it can reverse it).
*/
/*######################### INTERNAL FUNCTIONS #########################*/
static int
#ifdef PROTOTYPES
spell_init(char *filename, NODE PCCRAP **dictp,
char *DEFAULT_DICT, long magic_number, INDEX *nedges)
#else
spell_init(filename, dictp, DEFAULT_DICT, magic_number, nedges)
char *filename;
NODE PCCRAP **dictp;
char *DEFAULT_DICT;
long magic_number;
INDEX *nedges;
#endif
#define dict (*dictp)
{
FILE *fp; INDEX count;
/* init_dict("") gets default */
if (*filename == '\0') filename = DEFAULT_DICT;
/* Open the file and find out the size of the dict -- which
is stored in the first word. Later I'll change the dict format
to have a header, and the header will have to be skipped by
this module. */
if ((fp = fopen(filename, "rb")) == NULL) {
fprintf (stderr, "Can\'t open file \"%s\"\n", filename);
return(FALSE);
}
*nedges = getword(fp);
#ifdef DEBUG
fprintf(stderr, "dawg contains %8lx edges\n", *nedges);
#endif
/* Allocate precisely enough memory for all edges + 0 at root node. */
if ((dict = MALLOC((SIZE_T)((*nedges)+1), sizeof(NODE PCCRAP *))) == 0) {
fprintf (stderr, "Can\'t allocate space for dictionary\n");
return(FALSE);
}
dict[0] = 0; /* Root node set to 0; terminal nodes point to 0. */
/* Load in the dictionary. Routine 'getwords' should be efficient */
count = getwords(&dict[1], (long)(4*(*nedges)), fp);
if (count != 4*(*nedges)) {
fprintf(stderr,
"Failed to read dictionary %s - wanted %ld bytes, got %ld\n",
filename, 4*(*nedges), count);
return(FALSE);
}
fclose(fp);
return(TRUE);
#undef dict
}
/*####################### EXPORTED FUNCTIONS #########################*/
int
#ifdef PROTOTYPES
dawg_init(char *filename, NODE PCCRAP **dawgp, INDEX *nedges)
#else
dawg_init(filename, dawgp, nedges)
char *filename;
NODE PCCRAP **dawgp;
INDEX *nedges;
#endif
{
return(spell_init(filename, dawgp, DEFAULT_DAWG, DAWG_MAGIC, nedges));
}
int
#ifdef PROTOTYPES
pack_init(char *filename, NODE PCCRAP **packp, INDEX *nedges)
#else
pack_init(filename, packp, nedges)
char *filename;
NODE PCCRAP **packp;
INDEX *nedges;
#endif
{
return(spell_init(filename, packp, DEFAULT_PACK, PACK_MAGIC, nedges));
}
SHAR_EOF
cat - << \SHAR_EOF > locate.c
/*
File: locate.c
Author: Graham Toal
Purpose: Find all words beginning with particular prefix.
Functions exported: locate_prefix
Description:
Searches as in spell-check for prefix in dict, but doesn't
fail if word doesn't terminate at that point. It returns
an index into the dict which can be used with print_dawg_prefix
to display all the words found. However it is more useful
than that; text-analysis programs can check that a word matches
"root*", for instance, when doing stylistic analysis etc.
*/
INDEX
#ifdef PROTOTYPES
dawg_locate_prefix(NODE PCCRAP *dawg, char *word, INDEX edge)
#else
dawg_locate_prefix(dawg, word, edge)
NODE PCCRAP *dawg;
char *word;
INDEX edge;
#endif
{
for (;;) {
if (*word == (((dawg[edge] >> V_LETTER) & M_LETTER))) {
if (*++word == '\0') {
return(dawg[edge]&M_NODE_POINTER);
} else {
if ((edge = (dawg[edge] & M_NODE_POINTER)) == ROOT_NODE) break;
continue;
}
}
if (((dawg[edge++]) & M_END_OF_NODE) != 0) break;
}
/* What to do when none found? -- fail-safe, or some error code...? */
return(ROOT_NODE);
}
SHAR_EOF
cat - << \SHAR_EOF > print.c
/*
File: print.c
Author: Graham Toal
Purpose: Print a packed trie to stderr.
Functions exported: dawg_print, pack_print, dawg_print_prefix
Internal functions: pack_pr dawg_pr dawg_pr_prefix
Description:
Pre-order traverse of packed TRIE or DAWG. Will be modified
soon to take output file as parameter. Then sometime after that
to callback with each string at it is generated. Meanwhile,
people requiring custom versions of dawg-walking stuff might
as well just copy this code and edit it to do what they want.
The special version print_dawg_prefix takes a NODE from somewhere
in a dict as a parameter, and prints out only those words which
contain that node. You have to locate the node seperately with
'locate_prefix', and pass the prefix string in so it can be printed.
*/
static void
#ifdef PROTOTYPES
dawg_pr(NODE PCCRAP *dawg, INDEX node, int len)
#else
dawg_pr(dawg, node, len)
NODE PCCRAP *dawg;
INDEX node;
int len;
#endif
{
static char word[MAX_WORD_LEN];
NODE PCCRAP *edge;
for (edge = (NODE PCCRAP *)&dawg[node]; ; edge++) {
long c;
c = *edge; /* Don't rewrite this - its avoiding a MSC bug */
c = c >> V_LETTER;
c = c & M_LETTER;
word[len] = (char)c;
if ((*edge & M_END_OF_WORD) != 0) {
word[len+1] = '\0';
fprintf(stdout, "%s\n", word);
}
c = *edge & M_NODE_POINTER;
if ((*edge & M_NODE_POINTER) != 0)
dawg_pr (dawg, c, len + 1);
if ((*edge & M_END_OF_NODE) != 0) break; /* End of node */
}
}
void
#ifdef PROTOTYPES
dawg_print(NODE PCCRAP *dawg, INDEX node)
#else
dawg_print(dawg, node)
NODE PCCRAP *dawg;
INDEX node;
#endif
{
dawg_pr(dawg, node, 0);
}
static void
#ifdef PROTOTYPES
pack_pr(NODE PCCRAP *ptrie, INDEX i, int pos)
#else
pack_pr(ptrie, i, pos)
NODE PCCRAP *ptrie;
INDEX i;
int pos;
#endif
{
static char s[MAX_WORD_LEN+1];
int b;
INDEX p;
for (b = BASE_CHAR; b < BASE_CHAR+MAX_CHARS; b++) {
if (b != 0) {
p = ptrie[i+b-BASE_CHAR];
if (((p>>V_LETTER)&M_LETTER) == b) {
s[pos] = b; s[pos+1] = '\0';
if ((p & M_END_OF_WORD) != 0) fprintf(stderr, "%s\n", s);
if ((p &= M_NODE_POINTER) != 0) {
pack_pr(ptrie, p, pos+1);
}
}
}
}
}
void
#ifdef PROTOTYPES
pack_print(NODE PCCRAP *ptrie, INDEX node)
#else
pack_print(ptrie, node)
NODE PCCRAP *ptrie;
INDEX node;
#endif
{
pack_pr(ptrie, node, 0);
}
static void
#ifdef PROTOTYPES
dawg_pr_prefix(NODE PCCRAP *dawg, char *prefix, INDEX node, int len)
#else
dawg_pr_prefix(dawg, prefix, node, len)
NODE PCCRAP *dawg;
char *prefix;
INDEX node;
int len;
#endif
{
NODE PCCRAP *edge;
static char word[MAX_WORD_LEN];
long c;
for (edge = (NODE PCCRAP *)&dawg[node]; ; edge++) {
/* Don't 'fix' - compiler bugaround for microsoft :-( */
c = *edge; c = c >> V_LETTER; c = c & M_LETTER;
word[len] = (char)c;
if ((*edge & M_END_OF_WORD) != 0) {
word[len+1] = 0;
fprintf(stdout, "%s%s\n", prefix, word);
}
c = *edge & M_NODE_POINTER;
if (c != 0) dawg_pr_prefix(dawg, prefix, c, len + 1);
/* End of node - check repair later - I accidentally nobbled it */
if ((*edge & M_END_OF_NODE) != 0) break;
}
}
void
#ifdef PROTOTYPES
dawg_print_prefix(NODE PCCRAP *dawg, char *prefix, INDEX node)
#else
dawg_print_prefix(dawg, prefix, node)
NODE PCCRAP *dawg;
char *prefix;
INDEX node;
#endif
{
dawg_pr_prefix(dawg, prefix, node, 0);
}
SHAR_EOF
cat - << \SHAR_EOF > similcmp.c
#ifndef similcmp_c
#define similcmp_c 1
#ifdef LIB_STRINGS
#include <strings.h>
#else
#include <string.h>
#endif
/*
File: similcmp.c
Authors: John Ratcliffe, and an anonymous coder.
Purpose: Better approximate string equality test.
Functions exported: simil
Internal functions: GCsubstr
Description:
See below. I'll replace this one eventually with my own
algorithm which does sophisticated compound-grapheme analysis
and returns a degree of phonetic similarity and a probability
that the transformations made were valid.
'ghoti==fish' (Match = 90%, Prob = 1%) ;-)
lauGH = f match 100% prob 30%
wOmen = i match 90% prob 10%
staTIon = sh match 100% prob 5%
*/
/* Ratcliff/Obershelp pattern matching
*
* Approximate word matching: takes two words and returns a
* similarity score based on co-occurrence of subpatterns.
*
* This code appeared in a letter to the editor in DDJ, 11/88.
* The original article on the pattern matching, "Pattern Matching
* by Gestalt" by John Ratcliff, appeared in the July 1988
* issue (#181) but the algorithm was presented in assembly.
*
* The 11/88 issue also contained another verison in C which was
* a more faithful translation of the original; it has the
* advantage of not being recursive.
*
* The algorithm seems to work nicely for a variety of test cases.
* The main drawback of the algorithm is the cost of the pairwise
* comparisons. It is significantly more expensive than stemming,
* soundex, and the like. Might consider using this as a second
* phase...
*/
int
#ifdef PROTOTYPES
GCsubstr(char *st1, char *end1, char *st2, char *end2)
#else
GCsubstr(st1, end1, st2, end2)
char *st1;
char *end1;
char *st2;
char *end2;
#endif
{
register char *a1, *a2;
char *b1, *s1, *b2, *s2;
short max, i;
if (end1 <= st1 || end2 <= st2) return(0);
if (end1 == st1 + 1 && end2 == st2 + 1) return(0);
max = 0; b1 = end1; b2 = end2;
for (a1 = st1; a1 < b1; a1++) {
for (a2 = st2; a2 < b2; a2++) {
if (*a1 == *a2) {
/* determine length of common substring */
for (i = 1; a1[i] && (a1[i] == a2[i]); i++) /* do nothing */;
if (i > max) {
max = i; s1 = a1; s2 = a2;
b1 = end1 - max; b2 = end2 - max;
}
}
}
}
if (!max) return(0);
max += GCsubstr(s1 + max, end1, s2 + max, end2); /* rhs */
max += GCsubstr(st1, s1, st2, s2); /* lhs */
return(max);
}
int
#ifdef PROTOTYPES
simil(char *s1, char *s2)
#else
simil(s1, s2)
char *s1;
char *s2;
#endif
{
int l1 = strlen(s1), l2 = strlen(s2);
if (strcmp(s1, s2) == 0) return(100); /* exact match end-case */
return(200 * GCsubstr(s1, s1 + l1, s2, s2 + l2) / (l1 + l2));
}
#endif /* similcmp_c */
SHAR_EOF
cat - << \SHAR_EOF > soundex.c
#ifndef soundex_c
#define soundex_c 1
#ifdef LIB_STRINGS
#include <strings.h>
#else
#include <string.h>
#endif
/* isalpha & toupper; define your own if you don't have ctype.h */
#include <ctype.h>
/*
File: soundex.c
Authors: Jonathan Leffler
Purpose: Approximate string equality test.
Functions exported: soundex
Internal functions: nsoundex
Description:
There are umpteen soundexes around. At least this one is commented...
(Actually I'd prefer one which understood ph/f and Mac/Mc as special
cases; short of a proper phonetic alg such as I'm currently developing)
See below for description:
*/
/*
** SOUNDEX CODING
**
** Rules:
** 1. Retain the first letter; ignore non-alphabetic characters.
** 2. Replace second and subsequent characters by a group code.
** Group Letters
** 1 BFPV
** 2 CGJKSXZ
** 3 DT
** 4 L
** 5 MN
** 6 R
** 3. Do not repeat digits
** 4. Truncate or ser-pad to 4-character result.
**
** Originally formatted with tabstops set at 4 spaces -- you were warned!
**
** Code by: Jonathan Leffler (john@sphinx.co.uk)
** This code is shareware -- I wrote it; you can have it for free
** if you supply it to anyone else who wants it for free.
**
** BUGS: Assumes 7-ASCII; fails on ISO Latin1
*/
static char lookup[] = {
'0', /* A */ '1', /* B */ '2', /* C */
'3', /* D */ '0', /* E */ '1', /* F */
'2', /* G */ '0', /* H */ '0', /* I */
'2', /* J */ '2', /* K */ '4', /* L */
'5', /* M */ '5', /* N */ '0', /* O */
'1', /* P */ '0', /* Q */ '6', /* R */
'2', /* S */ '3', /* T */ '0', /* U */
'1', /* V */ '0', /* W */ '2', /* X */
'0', /* Y */ '2' /* Z */
};
/* Soundex for arbitrary number of characters of information */
#define MAX_SOUNDEX_LEN 10
static char *
#ifdef PROTOTYPES
nsoundex(char *str, int n)
#else
nsoundex(str, n)
char *str; /* In: String to be converted */
int n; /* In: Number of characters in result string */
#endif
{
static char buff[MAX_SOUNDEX_LEN];
register char *s;
register char *t;
char c;
char l;
if (n <= 0) n = 4; /* Default */
if (n > sizeof(buff) - 1) n = sizeof(buff) - 1;
t = &buff[0];
for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++) {
if (!isalpha(c)) continue;
c = toupper(c);
if (t == &buff[0]) {
l = *t++ = c;
continue;
}
c = lookup[c-'A']; /* Fails on any alpha not in 'a'..'z' */
if (c != '0' && c != l) l = *t++ = c;
}
while (t < &buff[n]) *t++ = '0'; *t = '\0';
return(&buff[0]);
}
/* Normal external interface */
char *
#ifdef PROTOTYPES
soundex(char *str)
#else
soundex(str)
char *str;
#endif
{
return(nsoundex(str, 4)); /* vary this? - try 5 or 6? */
}
#endif /* soundex_c */
SHAR_EOF
cat - << \SHAR_EOF > utils.c
/*
File: utils.c
Author: Graham Toal
Purpose: Portability library
Description:
Most of what differs between operating systems is handled here.
This module is *incredibly* hacky -- I've just been interested
in finding the problems so far, not in making the solutions neat.
The final release of this suite will concentrate on cleaning up
this file!
*/
/* PENDING: Generic load dict; meanwhile should also put efficient
msdos file loading into getwords(). See spelldawg for best coding. */
#define SIZE_T size_t
/* MSDos redefines this for huge allocation */
#ifdef SYS_RISCOS
#define DEFAULT_DAWG "run:dict-dwg"
#define DEFAULT_PACK "run:dict-pck"
#else
#ifdef SYS_EMAS
#define DEFAULT_DAWG "dict#dwg"
#define DEFAULT_PACK "dict#pck"
#else
/* Unix, MessDross, etc... */
#define DEFAULT_DAWG "dict.dwg"
#define DEFAULT_PACK "dict.pck"
#endif
#endif
/* Undo any #define's here if they are inappropriate for some system */
#define EXT_CHAR '.'
#define READ_TEXT "r"
#define WRITE_BIN "wb"
#define READ_BIN "rb"
/* system configuration */
#ifdef KNOWS_VOID
#define VOID void
#else
/* As in... void fred(VOID) */
#define void int
#define VOID
#endif
#ifdef SYS_MSDOS
#ifdef COMPILER_ZORTECH
int _stack = 0x3000;
#define PCCRAP far
#else
#ifdef COMPILER_TURBOC
#define PCCRAP far
#else
#define PCCRAP huge
#endif
#endif
#else
#define PCCRAP
#endif
/* Hmph... I don't really want to do realloc too. Just as well that
one implmentation is buggy; saves me having to work out what to do :-) */
#ifdef SYS_MSDOS
/* Normally int... */
#undef SIZE_T
#define SIZE_T long
/* (Still to test AZTEC & LATTICE) */
/* Mallocs of > 64K (maybe even > 32K) have to come off the far/huge heap */
#ifdef COMPILER_ZORTECH
#include <dos.h>
#define MALLOC(x,s) (NODE PCCRAP *)farmalloc((x)*(s)) /* Zortech */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) farfree(x)
#define Free(x) free(x)
#else /* else not microsoft */
#ifdef COMPILER_TURBOC
#include <alloc.h>
#define MALLOC(x,s) (NODE PCCRAP *)farmalloc((x)*(s)) /* Turbo */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) farfree(x)
#define Free(x) free(x)
#else /* else not turbo */
#include <malloc.h>
#ifdef NEVER
#define MALLOC(x,s) (NODE PCCRAP *)my_halloc((long)(x),(size_t)(s))
/* Microsoft, Aztec */
#define Malloc(x,s) my_malloc((x)*(s))
#define FREE(x) my_hfree((void huge *)(x)) /* For some reason MICROSOFT */
#define Free(x) my_free((void *)x) /* complains of a type mismatch */
/* without these casts */
#endif
#define MALLOC(x,s) (NODE PCCRAP *)halloc((long)(x),(size_t)(s))
/* Microsoft, Aztec */
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) hfree((void huge *)(x)) /* For some reason MICROSOFT */
#define Free(x) free((void *)x) /* complains of a type mismatch */
/* without these casts */
#endif
#endif
#else /* else not SYS_MSDOS */
#ifdef STDLIBS
#include <stdlib.h> /* for malloc, free & exit */
#define MALLOC(x,s) (NODE PCCRAP *)malloc(((x)*(s)))
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) free(x)
#define Free(x) free(x)
#else
#define MALLOC(x,s) (NODE PCCRAP *)malloc(((x)*(s)))
#define Malloc(x,s) malloc((x)*(s))
#define FREE(x) free(x)
#define Free(x) free(x)
#ifndef size_t /* Doesn't work if size_t was typedef'd */
#define size_t int /* And if an int isn't big enough we're in trouble! */
#endif
#ifdef PROTOTYPES
extern void *malloc(size_t bytes);
extern void exit(int rc);
#else
extern void *malloc();
extern void exit();
#endif
#endif /* not stdlibs */
#endif /* Not msdos */
#ifdef SYS_RISCOS
#undef EXT_CHAR
#define EXT_CHAR '-'
#endif
#ifdef SYS_EMAS
#undef EXT_CHAR
#define EXT_CHAR '#'
#endif
#ifdef SYS_MAC
#ifdef COMPILER_THINK
#undef WRITE_BIN
#undef READ_BIN
#define WRITE_BIN "w"
#define READ_BIN "r"
#endif
#endif
#ifdef MEMMODELS
#define SMALL_MEMORY 1
#endif
/*
Error handling
At the moment we'll just go for OK / NOT OK. Later we should
fix all exit's to use a specific code symbol (eg EXIT_MALLOCFAIL)
but this has to be done system by system. Whenever a new one is
added, a default should be set up (as 0?)
*/
/*#include <errno.h>*/ /* Only later when we want to be more precise! */
#define EXIT_OK (0) /* Generic success */
#define EXIT_ERROR (1) /* Generaic failure */
/* For each system, replace generic errors with system-dependant ones. */
#ifdef vms
/*
* VMS-specific error status codes. Approximate Ultrix equivalents.
*/
#include <ssdef.h>
#include <stsdef.h>
#undef EXIT_OK
#define EXIT_OK SS$_NORMAL /* Generic success */
#undef EXIT_ERROR
#define EXIT_ERROR SS$_NORMAL /* Don't want random error message! */
#endif
#define DELETE(filename)
#ifdef __STDC__
#undef DELETE
#define DELETE(filename) remove(filename)
#else
#ifdef unix
#undef DELETE
#define DELETE(filename) unlink(filename)
#endif
#endif
#ifdef NEVER
/* these macros caught the fact that on MICROSOFT, the parameters
being passed in were ints -- and I hadn't been given a warning
because I had explicitly cast the to size-t. Hence why I've
declared SIZE_T as long. This is all a bit messy. Curse you IBM PCs
*/
void PCCRAP *my_halloc(long n,size_t s) {
char PCCRAP *p;
p = halloc(n, s);
fprintf(stderr, "halloc(%8lx*%d) -> %8lx\n", n, s, (long)p);
return(p);
}
void *my_malloc(size_t s) {
char *p;
p = malloc(s);
fprintf(stderr, "malloc(%4x) -> %4x\n", s, (int)p);
return(p);
}
void my_hfree(void PCCRAP *p) {
fprintf(stderr, "hfree(%8lx)\n", (long)p);
hfree(p);
}
void my_free(void *p) {
fprintf(stderr, "free(%4x)\n", (int)p);
free(p);
}
#endif
#ifdef RCHECK
#ifndef PROTOTYPES
long toosmall(idx, max, line)
long idx;
long max;
int line;
#else
long toosmall(long idx, long max, int line)
#endif
{
if (line == 0) {
fprintf(stderr,
"RANGE CHECK: %ld should not be less than 0 (max %ld)\n", idx, max);
} else {
fprintf(stderr,
"RANGE CHECK AT LINE %d: %ld should not be less than 0 (max %ld)\n",
line, idx, max);
}
return(0L);
}
#ifndef PROTOTYPES
long toobig(idx, max, line)
long idx;
long max;
int line;
#else
long toobig(long idx, long max, int line)
#endif
{
if (line == 0) {
fprintf(stderr,
"RANGE CHECK: %ld should not be larger than %ld\n", idx, max);
} else {
fprintf(stderr,
"RANGE CHECK AT LINE %d: %ld should not be larger than %ld\n",
line, idx, max);
}
return(max);
}
#ifdef KNOWS_LINE
#define TOOSMALL(idx, max) toosmall((long)idx, (long)max, __LINE__)
#define TOOBIG(idx, max) toobig((long)idx, (long)max, __LINE__)
#else
#define TOOSMALL(idx, max) toosmall((long)idx, (long)max, 0)
#define TOOBIG(idx, max) toobig((long)idx, (long)max, 0)
#endif
#define RANGECHECK(idx,max) \
(idx < 0 ? (TOOSMALL(idx,max)) : (idx >= max ? (TOOBIG(idx,max)) : idx))
#else
#define RANGECHECK(idx,max) (idx)
#endif
#ifdef PROTOTYPES
long getwords(long PCCRAP *data, long count, FILE *fp)
#else
long getwords(data, count, fp)
long PCCRAP *data;
long count;
FILE *fp;
#endif
{
#ifdef SYS_MSDOS
char PCCRAP *p; int c; long byte_count;
p = (char PCCRAP *)(&data[0]);
/* Somewhere I have a version which fread()s into a 16K near
buffer, then copies it out; use that technique here - *MUCH*
faster */
for (byte_count = 0; byte_count < count; byte_count++) {
c = fgetc(fp);
if (c == -1 || ferror(fp)) {
printf ("Dictionary read error - %ld wanted - %ld got\n",
count, byte_count)/*, exit(0)*/;
break;
}
*p++ = (c & 255);
}
return(count);
#else
return((long)fread(&data[0], (size_t)1, (size_t)count, fp));
#endif
}
#ifdef PROTOTYPES
long putwords(long PCCRAP *data, long count, FILE *fp)
#else
long putwords(data, count, fp)
long PCCRAP *data;
long count;
FILE *fp;
#endif
{
#ifdef SYS_MSDOS
extern int _NEAR _CDECL errno;
long i; char PCCRAP *p;
p = (char PCCRAP *)&data[0];
if (fp == NULL) {
fprintf(stderr, "putwords: no file?\n");
exit(0);
}
for (i = 0L; i < count; i++) {
/* Somewhere I have a version which copies to a 16K near
buffer, then frwrite()s it out; use that technique here - *MUCH*
faster */
int rc = fputc(*p++, fp);
if (ferror(fp)) {
fprintf(stderr, "rc = %d\n", rc);
perror("dawg");
fprintf (stderr, "Error writing to output file\n");
exit(0);
}
}
return(count);
#else
return(fwrite(&data[0], (size_t)1, (size_t)count, fp));
#endif
}
static long PCCRAP *WTMP = NULL;
/* A bit hacky having this single 4 bytes in heap space, but it makes
things a little more consistent. it'll all go away eventually
anyway... */
#ifdef PROTOTYPES
void putword(long w, FILE *fp)
#else
void putword(w, fp)
long w;
FILE *fp;
#endif
{
if (WTMP == NULL) {
WTMP = (long PCCRAP *)MALLOC(1,sizeof(long));
}
*WTMP = w;
if (putwords(WTMP, (long)sizeof(long), fp) != sizeof(long)) {
/* (using putwords avoids confusion over bytesex) */
fprintf(stderr, "Data write error - putword\n");
}
}
#ifdef PROTYPES
long getword(FILE *fp)
#else
long getword(fp)
FILE *fp;
#endif
{
if (WTMP == NULL) {
WTMP = (long PCCRAP *)MALLOC(1,sizeof(long));
}
if (getwords(WTMP, (long)sizeof(long), fp) != sizeof(long)) {
fprintf(stderr, "Data read error - getword\n");
}
return(*WTMP);
}
SHAR_EOFgtoal@tharr.UUCP (Graham Toal) (06/28/90)
#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar: Shell Archiver
# Run the following text with /bin/sh to create:
# dawg.c # dwgcheck.c # pack.c # pckcheck.c # pdawg.c # ppack.c # proot.c # sample.c # tell.c
cat - << \SHAR_EOF > dawg.c
/* The first three #define's are to repair mangling by BITNET mailers */
#define EOR ^
/* This should be Caret (hat, up arrow) -- C Ex-or */
#define MOD %
/* This should be Percent -- C Modulus */
#define NOT ~
/* This should be Tilde (twiddle) -- C unary Not */
#include "copyr.h"
/*****************************************************************************/
/* */
/* FILE : DAWG.C */
/* */
/* Convert an alphabetically-ordered dictionary file in text format */
/* (one word per line) into a Directed Acyclic Word Graph. The command */
/* syntax is */
/* */
/* DAWG <text file (inc .ext)> <output file (no ext)> */
/* */
/* The first 4 bytes of the output file form a 24-bit number containing */
/* the number of edges in the dawg. The rest of the file contains the */
/* dawg itself (see "The World's Fastest Scrabble Program" by Appel */
/* and Jacobson for a description of a dawg). */
/* */
/*****************************************************************************/
#include "grope.h"
#ifndef grope_h
/* If you don't have grope.h, create an empty one.
These will do for a basic system: */
#undef KNOWS_VOID
#undef STDLIBS
#undef PROTOTYPES
#define SYS_ANY 1
#define COMPILER_ANY 1
#define SMALL_MEMORY 1
/* To be defined if you have to generate
the data structure in bits. This will
certainly be true for any non-trivial
dictionary on MSDOS, or most home
micros with 1Mb Ram or under. */
#endif
/* Portability notes:
0) KISS! Keep It Simple, Stupid!
1) No typedef's
2) No bitfields
3) No structs
4) No #if defined()
5) No complex #if's
6) No procedure variables
7) Don't trust tolower() as some libs don't range check
8) Stay character-set independent (EBCDIC should work?)
9) Assume sizeof(long) >= 32 bits, but sizeof(int) just >= 16
10) Note use of ABS in preference to unsigned longs.
11) Assume 8-bit char set
12) Don't use k&r reserved words for variables, even if not
in ANSI. Such as 'entry'.
13) Unix people; no sys/ or machine/ files please unless under a
suitable #ifdef, and generic code supplied for everyone else...
14) this is byte-sex independant at the moment. KEEP IT THAT WAY!
Don't assume dawgs are stored in a fixed form, such as so-called
'net-order'.
15) Since C doesn't range-check arrays, we'll do it. If possible,
leave the running systems with range checking on if you can
afford it!
16) No nested include files.
17) Don't pull in any include files twice. (*Kills* the Atari!)
18) Don't use 'register' -- trust the compiler; the compiler is your friend.
*/
/*#define RCHECK 1*/ /* We want range-checking of array accesses */
#include <stdio.h>
#ifdef LIB_STRINGS
#include <strings.h> /* Some unixes, Mac? */
#else
#include <string.h>
#endif
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place the files dawg.h grope.h,
dawg.c and utils.c in a folder. Create a project that contains
dawg.c and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
/* Assume a 1Mb mac for the moment. Someday I'll add dynamic RAM sizing */
#define SMALL_MEMORY 1
#endif
#include "dawg.h" /* main system constants */
#include "utils.c" /* utils.c pulls in header file for malloc/free & exit */
#define ABS(x) ((x) < 0 ? -(x) : (x))
/**
* The following constant, HASH_TABLE_SIZE, can be changed according to
* dictionary size. The two values shown will be fine for both large
* and small systems. It MUST be prime.
**/
#ifdef SMALL_MEMORY
#define HASH_TABLE_SIZE 30011
#ifdef SYS_MAC
/* boy, you guys must be *really* tight on space...
are you sure you can handle a reasonable size of dictionary with
such a small table? Bump this back up as soon as you get everything
else working...
(I was given this info by a Mac site while they were first porting
this stuff; maybe now it works on macs we can put the buffer size
back up to something reasonable)
*/
#undef HASH_TABLE_SIZE
#define HASH_TABLE_SIZE 17389
#endif
#else
/* pick one about 20% larger than needed */
#define HASH_TABLE_SIZE 240007
/* Suitable prime numbers if you want to tune it:
30011
150001 <-- probably a good compromise. OK for dicts to about 1Mb text
200003
220009
240007
If you try any others, for goodness sake CHECK THAT THEY ARE PRIME!!!
*/
#endif
#define MAX_EDGES (HASH_TABLE_SIZE-1)
static FILE *fpin, *fpout; /* Input/output files */
static char current_word[(MAX_LINE+1)]; /* The last word read from fpin */
static int first_diff, save_first_diff;
/* The position of the first letter at which */
/* current_word differs from the previous */
/* word read */
static NODE PCCRAP *hash_table;
static int first_time = TRUE;
static int this_char = '?', lastch;
static NODE PCCRAP *dawg;
static int FILE_ENDED = FALSE; /* I'm having real problems getting
merged dawgs to work on the PC; this is yet another filthy hack. Sorry. */
static INDEX nwords, nnodes, total_edges;
#ifdef SMALL_MEMORY
#define MAX_SUBDAWG 30000 /* Big enough for largest dict,
even on small systems */
static NODE PCCRAP *merge;
static INDEX dawgsize[256];
static INDEX dawgstart[256];
static INDEX dawgentry[256];
static int nextslot; /* Dawgs are packed in sequentially - not in their
'proper' indexed position */
#endif
/*
Shorthand macros for array accesses with checking
If RCHECK isn't defined, these don't contribute any overhead. I suggest
leaving RCHECK on except for production code.
*/
#define EDGE_LIST(idx) \
edge_list[RANGECHECK(idx, MAX_CHARS)]
#define CURRENT_WORD(idx) \
current_word[RANGECHECK(idx, MAX_LINE+1)]
#define DAWG(idx) \
dawg[RANGECHECK(idx, MAX_EDGES)]
#define HASH_TABLE(idx) \
hash_table[RANGECHECK(idx, HASH_TABLE_SIZE)]
/*
Forward references
*/
#ifdef PROTOTYPES
static INDEX build_node(int depth);
static INDEX add_node(NODE *edge_list, INDEX nedges);
static void read_next_word(VOID);
static INDEX compute_hashcode(NODE *edge_list, INDEX nedges);
static void report_size(VOID);
#else
static INDEX build_node();
static INDEX add_node();
static void read_next_word();
static INDEX compute_hashcode();
static void report_size();
#endif
#ifdef SMALL_MEMORY
#ifdef PROTOTYPES
static void merge_dawg (char *filename);
#else
static void merge_dawg ();
#endif
#endif
/**
* main
*
* Program entry point
**/
long words; /* dirty communication variable -- the multi-pass stuff
was hacked in at the last minute. */
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
INDEX i;
char fname[128];
#ifdef SYS_MAC
argc = ccommand(&argv);
#endif
if (argc != 3) {
fprintf(stderr,
"usage: dawg dictfile.ext dawgfile\n");
exit(EXIT_ERROR);
}
/**
* Allocate the memory for the dawg
**/
if ((dawg = MALLOC(MAX_EDGES, sizeof(NODE PCCRAP *))) == NULL) {
fprintf(stderr, "Can\'t allocate dictionary memory\n");
#ifndef SMALL_MEMORY
fprintf(stderr, "-- try recompiling with -DSMALL_MEMORY\n");
#endif
exit(EXIT_ERROR);
}
for (i = 0; i < (INDEX)MAX_EDGES; i++) dawg[i] = 0L;
/**
* Allocate the hash table.
* Fill it with zeroes later just before use. Don't trust calloc etc.
* - not portable enough. Anyway, in the multi-pass version we don't
* want to continually free/claim...
**/
if ((hash_table =
MALLOC((HASH_TABLE_SIZE+1), sizeof(NODE))) == NULL) {
fprintf(stderr, "Can\'t allocate memory for hash table\n");
exit(EXIT_ERROR);
}
/**
* Open the input/output files
**/
fpin = fopen(argv[1], READ_TEXT);
if (fpin == NULL) {
fprintf(stderr, "Can\'t open text file \"%s\"\n", argv[1]);
/* Could print out error string but it's not portable... */
exit(EXIT_ERROR);
}
/**
* Read the first word from the dictionary
**/
first_time = TRUE;
nwords = 0;
current_word[0] = 0;
read_next_word();
lastch = *current_word;
/**
* Initialise the counters, taking account of the root node (which is
* a special case)
**/
nnodes = 1; total_edges = MAX_CHARS;
/**
* Build the dawg and report the outcome
**/
/* Now, in the dim & distant past, this code supported the concept
of a restricted character set - ie a..z & A..Z were packed into 6 bits;
this caused awful problems in the loop below, where we had to try to
keep the loop-control variable and the character code in synch; nowadays
chars are 8 bits or else, so I'm starting to tidy up the places
where these hacks were necessary... */
#ifdef SMALL_MEMORY
for (this_char = 0; this_char < MAX_CHARS; this_char++) {
if (FILE_ENDED) break; /* Don't waste time in this loop... */
#endif
/* Explicitly initialise hash table to all zeros */
{INDEX a; for (a = 0; a <= HASH_TABLE_SIZE; a++) hash_table[a] = 0;}
words = 0;
(void) build_node(0);
#ifdef SMALL_MEMORY
#ifdef DEBUG
fprintf(stderr,
"Char %d done. %ld Words, %ld Nodes\n", *current_word, words, total_edges);
#endif
if (total_edges /* words */ == 0) continue;
#endif
/**
* Save the dawg to file
**/
#ifdef SMALL_MEMORY
sprintf(fname, "%s%c%d", argv[2], EXT_CHAR, lastch);
#else
sprintf(fname, "%s%cdwg", argv[2], EXT_CHAR);
#endif
fpout = fopen(fname, WRITE_BIN);
if (fpout == NULL) {
fprintf(stderr, "Can\'t open output file \"%s\"\n", fname);
exit(EXIT_ERROR);
}
#ifdef DEBUG
fprintf(stderr, "Writing to %s\n", fname);
#endif
*dawg = total_edges;
total_edges = sizeof(NODE) * (total_edges + 1); /* Convert to byte count */
{
long cnt;
if ((cnt = putwords(dawg, (long)total_edges, fpout)) != total_edges) {
fprintf(stderr, "%ld bytes written instead of %ld\n.", cnt, total_edges);
exit(EXIT_ERROR);
}
}
fclose(fpout);
/**
* Read the first word from the dictionary
**/
first_diff = save_first_diff;
first_time = FALSE;
nwords = 0;
/**
* Initialise the counters, taking account of the root node (which is
* a special case)
**/
nnodes = 1; total_edges = MAX_CHARS;
lastch = *current_word;
/**
* Build the dawg and report the outcome
**/
#ifdef SMALL_MEMORY
}
#endif
fclose(fpin);
fprintf(stderr, "Dawg generated\n");
#ifdef SMALL_MEMORY
merge_dawg(argv[2]);
#endif
exit(EXIT_OK);
}
/**
* BUILD_NODE
*
* Recursively build the next node and all its sub-nodes
**/
static INDEX
#ifdef PROTOTYPES
build_node(int depth)
#else
build_node(depth)
int depth;
#endif
{
INDEX nedges = 0;
INDEX i;
NODE *edge_list;
edge_list = NULL;
if (CURRENT_WORD(depth) == 0) {
/**
* End of word reached. If the next word isn't a continuation of the
* current one, then we've reached the bottom of the recursion tree.
**/
read_next_word();
if (first_diff < depth) return(0);
}
edge_list = (NODE *)malloc(MAX_CHARS*sizeof(NODE));
/* Note this is a 'near' array */
if (edge_list == NULL) {
fprintf(stderr, "Stack full (depth %d)\n", depth);
exit(EXIT_ERROR);
}
for (i = 0; i < MAX_CHARS; i++) EDGE_LIST(i) = 0L;
/**
* Loop through all the sub-nodes until a word is read which can't
* be reached via this node
**/
do
{
/* Construct the edge. Letter.... */
EDGE_LIST(nedges) = (NODE) (((NODE)CURRENT_WORD(depth)))
<< (NODE)V_LETTER;
/* ....end-of-word flag.... */
if (CURRENT_WORD(depth+1L) == 0) EDGE_LIST(nedges) |= M_END_OF_WORD;
/* ....and node pointer. */
EDGE_LIST(nedges) |= build_node(depth+1); nedges++;
/* (don't ++ in a macro) */
} while (first_diff == depth);
if (first_diff > depth) {
fprintf(stderr, "Internal error -- first_diff = %d, depth = %d\n",
first_diff, depth);
exit(EXIT_ERROR);
}
EDGE_LIST(nedges-1) |= M_END_OF_NODE;
/* Flag the last edge in the node */
/**
* Add the node to the dawg
**/
if (depth) {
NODE result = add_node(edge_list, nedges);
free(edge_list);
return(result);
}
/**
* depth is zero, so the root node (as a special case) goes at the start
**/
edge_list[MAX_CHARS-1] |= M_END_OF_NODE; /* For backward searches */
for (i = 0; i < MAX_CHARS; i++)
{
dawg[i+1] = edge_list[i];
}
free(edge_list);
return(0);
}
/**
* ADD_NODE
*
* Add a node to the dawg if it isn't already there. A hash table is
* used to speed up the search for an identical node.
**/
static INDEX
#ifdef PROTOTYPES
add_node(NODE *edge_list, INDEX nedges)
#else
add_node(edge_list, nedges)
NODE *edge_list;
INDEX nedges;
#endif
{
NODE hash_entry;
INDEX inc;
INDEX a, first_a;
INDEX i;
/**
* Look for an identical node. A quadratic probing algorithm is used
* to traverse the hash table.
**/
first_a = compute_hashcode(edge_list, nedges);
first_a = ABS(first_a) MOD HASH_TABLE_SIZE;
a = first_a;
inc = 9;
for (;;)
{
hash_entry = HASH_TABLE(a) & M_NODE_POINTER;
if (hash_entry == 0) break; /* Node not found, so add it to the dawg */
for (i = 0; i < nedges; i++)
if (DAWG((hash_entry+i) MOD HASH_TABLE_SIZE) != EDGE_LIST(i)) break;
/* On the 1.6M dictionary, this gave a rangecheck with < 0. Now fixed
I think - it was a problem with MOD giving unexpected results. */
if (i == nedges) {
return(hash_entry); /* Node found */
}
/**
* Node not found here. Look in the next spot
**/
a += inc;
inc += 8;
if (a >= HASH_TABLE_SIZE) a -= HASH_TABLE_SIZE;
if (inc >= HASH_TABLE_SIZE) inc -= HASH_TABLE_SIZE;
if (a == first_a) {
fprintf(stderr, "Hash table full\n");
exit(EXIT_ERROR);
}
}
/**
* Add the node to the dawg
**/
if (total_edges + nedges >= MAX_EDGES) {
fprintf(stderr,
"Error -- dictionary full - total edges = %ld\n", total_edges);
exit(EXIT_ERROR);
}
HASH_TABLE(a) |= total_edges + 1;
nnodes++;
for (i = 0; i < nedges; i++) {
DAWG((total_edges + 1 + i) MOD HASH_TABLE_SIZE) = EDGE_LIST(i);
}
total_edges += nedges;
return(total_edges - nedges + 1);
}
/**
* READ_NEXT_WORD
*
* Read the next word from the input file, setting first_diff accordingly
**/
static void
#ifdef PROTOTYPES
read_next_word(VOID)
#else
read_next_word()
#endif
{
/* This stuff imposes the limitation of not allowing '\0' in a word;
not yet a problem but the dawg structure itself could probably cope
if the feature were wanted. (Maybe with a little teweaking) */
char linebuff[(MAX_LINE+1)];
int length;
for (;;)
{
int next = 0, c;
for (;;) {
c = fgetc(fpin);
if (FILE_ENDED || c == EOF || ferror(fpin) || feof(fpin)) {
/* for some reason, we always get a blank line at the end of files */
current_word[0] = 0;
first_diff = -1;
linebuff[next] = '\0';
FILE_ENDED = TRUE;
return;
}
c &= 255;
if (next == 0 && c == '\n') continue; /* skip blank lines... */
linebuff[next++] = c;
if (c == '\n') break;
}
linebuff[next] = '\0';
words++;
length = strlen(linebuff);
if (linebuff[length-1] == '\n') linebuff[length-1] = '\0';
if (linebuff[length] == '\n') linebuff[length] = '\0';
if (length < 2 || length > MAX_LINE-1)
{
fprintf(stderr, "\n%s - invalid length\n", linebuff);
continue; /* Previously exit()ed, but now ignore so that
test sites without my pddict can use /usr/dict/words */
}
break;
}
for (length = 0; current_word[length] == linebuff[length]; length++) {
/* Get common part of string to check order */
}
if (current_word[length] > linebuff[length]) {
fprintf(stderr, "Error -- %s (word out of sequence)\n", linebuff);
exit(EXIT_ERROR);
}
first_diff = length;
nwords++;
strcpy(current_word, linebuff);
if ((nwords > 1) && (!(nwords & 0x3FF))) report_size();
#ifdef SMALL_MEMORY
if (current_word[0] != lastch) {
save_first_diff = first_diff;
first_diff = -1;
report_size();
}
#else
this_char = current_word[0]; /* for diagnostics... */
#endif
}
/**
* COMPUTE_HASHCODE
*
* Compute the hash code for a node
**/
static INDEX
#ifdef PROTOTYPES
compute_hashcode(NODE *edge_list, INDEX nedges)
#else
compute_hashcode(edge_list, nedges)
NODE *edge_list;
INDEX nedges;
#endif
{
INDEX i;
INDEX res = 0L;
for (i = 0; i < nedges; i++)
res = ((res << 1) | (res >> 31)) EOR EDGE_LIST(i);
return(res);
}
/**
* REPORT_SIZE
*
* Report the current size etc
**/
static void
#ifdef PROTOTYPES
report_size(VOID)
#else
report_size()
#endif
{
if (first_time)
{
fprintf(stderr, " Words Nodes Edges Bytes BPW\n");
fprintf(stderr, " ----- ----- ----- ----- ---\n");
first_time = FALSE;
}
if (*current_word) fprintf(stderr, "%c:", *current_word);
else fprintf(stderr, "Total:");
/* THE FOLLOWING IS RATHER GRATUITOUS USE OF FLOATING POINT - REMOVE
IT AND REPLACE WITH INTEGER ARITHMETIC * 100 */
/* (hey - I already did this in the copy I sent to Richard; so how
come its missing? Oh no, not again: I've got out of synch and
used an old copy, haven't I? :-( ) */
fprintf(stderr, " %7ld %7ld %7ld %7ld %5.2f\n",
nwords, nnodes, total_edges, sizeof(NODE PCCRAP *)*(total_edges+1),
(float)(sizeof(NODE PCCRAP *)*(total_edges+1))/(float)nwords);
}
#ifdef SMALL_MEMORY
static void
#ifdef PROTOTYPES
merge_dawg (char *filename)
#else
merge_dawg (filename)
char *filename;
#endif
{
FILE *fp, *outfile;
NODE data, edge;
INDEX nedges, nextfree, i, dentry;
INDEX count, lastnode;
int dictch, padding;
char fname[128];
nedges = (INDEX)MAX_SUBDAWG;
if ((merge = MALLOC((SIZE_T)nedges, sizeof(INDEX))) == 0) {
fprintf(stderr, "Memory allocation error -- %ld wanted\n",
nedges*sizeof(INDEX));
exit(EXIT_ERROR);
}
nextfree = MAX_CHARS; /* 0 is special, 1..26 are root nodes for a..z */
nextslot = 0;
for (dictch = 0; dictch < MAX_CHARS; dictch++) {
/***
* Open the file and find out the size of the dawg
***/
sprintf(fname, "%s%c%d", filename, EXT_CHAR, dictch);
if ((fp = fopen(fname, READ_BIN)) == NULL) {
continue;
}
nedges = getword(fp);
dawgstart[nextslot] = nextfree;
dawgsize[nextslot] = nedges-MAX_CHARS;
/* the first entry is (erroneously) the pointer to the chunk */
dentry = getword(fp);
dawgentry[nextslot] = dentry - MAX_CHARS + nextfree;
nextfree += nedges - MAX_CHARS;
nextslot++;
fclose(fp);
}
/* Now output total edges, and starts[a..z] */
/* Then set nextfree to start of each block */
sprintf(fname, "%s%cdwg", filename, EXT_CHAR);
outfile = fopen(fname, WRITE_BIN);
if (outfile == NULL) {
fprintf(stderr, "Cannot open output dawg file %s\n", fname);
exit(EXIT_ERROR);
}
putword(nextfree, outfile);
nextfree = 1; nextslot = 0; padding = 0;
lastnode = MAX_CHARS-1;
for (;;) {
if (dawgentry[lastnode] != 0) break; /* Leave with 'lastnode' set */
lastnode -= 1;
}
for (dictch = 0; dictch < MAX_CHARS; dictch++) {
NODE edge = dawgentry[nextslot];
if (edge == 0) {
padding++;
continue;
}
if (dictch == lastnode) {
edge |= M_END_OF_NODE;
} else {
edge &= (NOT M_END_OF_NODE);
}
putword(edge, outfile);
nextfree++; nextslot++;
}
nextfree += padding;
while (padding > 0) {
putword(0L, outfile); padding -= 1;
}
/* nextslot = 0; ???? This one not used? */
for (dictch = 0; dictch < MAX_CHARS; dictch++) {
sprintf(fname, "%s%c%d", filename, EXT_CHAR, dictch);
if ((fp = fopen(fname, READ_BIN)) == NULL) {
continue;
}
nedges = getword(fp);
for (i = 0; i < MAX_CHARS; i++) {
(void) getword(fp);
/* Skip root pointers */
}
nedges -= MAX_CHARS;
count = getwords(&merge[1], (long)(nedges*sizeof(NODE)), fp);
if (count != nedges*sizeof(NODE)) {
fprintf(stderr, "Dictionary read error - %ld wanted - %ld got\n",
nedges*sizeof(NODE), count);
exit(EXIT_ERROR);
}
fclose(fp);
DELETE(fname); /* On floppy systems, we can almost get away with
little more space than the final files would take! */
/* relocate all nodes */
for (i = 1; i <= nedges; i++) {
data = merge[i];
edge = data & M_NODE_POINTER;
if (edge > MAX_CHARS) {
data &= (NOT M_NODE_POINTER);
data |= edge - MAX_CHARS - 1 + nextfree;
merge[i] = data;
}
putword(merge[i], outfile);
}
nextfree += nedges;
/* nextslot++; this one not used??? */
}
fclose(outfile);
FREE(merge);
}
#endif
SHAR_EOF
cat - << \SHAR_EOF > dwgcheck.c
/*
File: dwgcheck.c
Author: Graham Toal
Purpose: check correct spelling using dict.dwg
Creation date: 22/06/90
Lastedit: 23/06/90 01:10:12
Description:
This can be expanded to be like the unix 'spell' command.
This demo file simply checks words passed on the command line.
note how it remembers words so that the second time it sees
them, it will call them correct. This is how you should
implement the 'ignore' feature of an interactive checker.
This code uses the dawg data structure, which I recommend
you stick with; however for *extremely* fast checking in
special circumstances you can use the 'pack' versions of
check() et al.
*/
/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
/*#define RCHECK*/ /* Enable for internal range checks... */
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* eg, for isalpha() */
/* Spelling library utilities */
#include "init.c" /* Loading dicts */
#include "dyntrie.c" /* Creating dicts at run-time */
#include "print.c" /* Printing dicts */
#include "check.c" /* Checking words */
#include "locate.c" /* Finding words by their stems */
#include "soundex.c" /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c" /* Closeness-metric for correction */
#include "correct.c" /* Hack code to attempt error correction (uses above) */
/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
files in a folder. Then create a project which contains this main.c
and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif
/* Your own header files go here, for instance:
#include "readword.h"
*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg; /* precompiled dictionary (dict.dwg) */
INDEX edges; /* size of above */
NODE PCCRAP *userdict; /* run-time dictionary, added to dynamically */
char *word;
int each;
#ifdef SYS_MAC
argc = ccommand(&argv); /* Mac users have my sympathy */
#endif
/* Your program goes here... */
if (argc == 1) {
fprintf(stderr, "usage: %s possibly mispeled wurdz\n", argv[0]);
exit(EXIT_ERROR);
}
if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);
if (!trie_new(&userdict)) exit(EXIT_ERROR);
for (each = 1; each < argc; each++) {
word = argv[each];
fprintf(stderr, "* %s:\n", word);
if (dawg_check(dawg, word) || dawg_check(userdict, word)) {
fprintf(stderr, "Correct\n");
} else {
fprintf(stderr, "Wrong\n");
(void)trie_add(&userdict, word);
}
if (each+1 != argc) fprintf(stderr, "\n");
}
exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > pack.c
/* The line below is to protect against Bitnet mailer damage */
#define MOD %
/* This should be a Percent (for C Modulus) */
/* Blech. This file is a mess. Make it the first rewrite... */
#include "copyr.h"
/*****************************************************************************/
/* */
/* FILE : PACK.C */
/* */
/* Re-pack a dawg/trie data structure using Franklin Liang's */
/* algorithm for sparse matrix storage. Final structure allows */
/* direct indexing into trie nodes, so is exceedingly fast at checking. */
/* Unfortunately the trade off is that any algorithms which walk */
/* the data structure and generate words will take much longer */
/* */
/* PACK <dawg file (inc .ext)> <pack file (inc .ext)> */
/* */
/*****************************************************************************/
/* Pending:
see what closest gap between dawgptr + freenode is, and see whether we
can save space by overlapping input & output arrays with a window between
them. Should get almost 50% of memory back. Also, because of hacking
around a bug some time back, I'm using an extra array (new_loc) for
relocation of pointers, when in fact I could (and have in the past)
managed to relocate them in situ with not much extra overhead.
As I said, it needs a rewrite...
*/
/* Note: I tried one implementation of this which used bitsets to test
whether two nodes were compatible. In fact, it wasn't sufficiently
faster to justify the extra space it used for the arrays of flags.
Now I check for compatibility on the fly with lots of comparisons.
I do however have a seperate byte array to flag whether a trie
is based at any address. There's probably a way of removing this.
*/
#include "grope.h"
#ifndef grope_h
/* If you don't have grope.h, create an empty one.
These will do for a basic system: */
#undef KNOWS_VOID
#undef STDLIBS
#undef PROTOTYPES
#define SYS_ANY 1
#define COMPILER_ANY 1
#define SMALL_MEMORY 1 /* To be defined if you have to generate
the data structure in bits. This will
certainly be true for any non-trivial
dictionary on MSDOS, or most home
micros with 1Mb Ram or under. */
#endif
#include <stdio.h>
/*#define RCHECK*/ /* Turn this back on if you have any problems. */
#include "dawg.h"
#include "utils.c"
#include "init.c"
#include "print.c"
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place the files dawg.h grope.h,
pack.c and utils.c in a folder. Create a project that contains
pack.c and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
/* Assume a 1Mb mac for the moment. Someday I'll add dynamic RAM sizing */
#define SMALL_MEMORY
#endif
#define TRUE (0==0)
#define FALSE (0!=0)
#define MAX_WORD_LENGTH 32
/* These two magic numbers alter a very hacky heuristic employed in
the packing algorithm. Tweaking them judiciously ought to speed
it up significantly (at the expense of a slightly sparser packing */
#define DENSE_LOWER 100
#define DENSE_UPPER 200
/*###########################################################################*/
INDEX ptrie_size = 0;
static NODE PCCRAP *ptrie;
#ifdef RCHECK
/* can't use the standard range_check macro because these are slightly
non-standard. */
#define PTRIE(x) ptrie[RANGECHECK((x), ptrie_size)]
#define DATA(x) (((x) >> 12) == 0xf2f1 ? toosmall((x), 0, __LINE__) : (x))
#else
/* so supply an explicit base case */
#define PTRIE(x) ptrie[x]
#define DATA(x) (x)
#endif
static char PCCRAP *trie_at; /* save some time by caching this info --
previously it was a function called on each test */
static INDEX freenode, lowest_base = 1, highest_base = -1;
static int debug = FALSE;
int
#ifdef PROTOTYPES
compatible(NODE *node, INDEX i) /* Can a node be overlaid here? */
#else
compatible(node, i)
NODE *node;
INDEX i;
/* Can a node be overlaid here? */
#endif
{
int c;
for (c = 0; c < MAX_CHARS; c++) {
if ((node[c]&M_FREE) == 0) { /* Something to slot in... */
if ((PTRIE(i+c) & M_FREE) == 0) return(FALSE); /* but no slot for it */
}
}
return(TRUE);
}
INDEX
#ifdef PROTOTYPES
find_slot(NODE *node)
#else
find_slot(node)
NODE *node;
#endif
{ /* Try each position until we can overlay this node */
INDEX i;
for (i = lowest_base; i < freenode; i++) {
if ((!trie_at[i]) && (compatible(node, i))) {
/* nothing here already */
/* and this one will fit */
return(i);
}
}
fprintf(stderr, "Should never fail to find a slot!\n");
/* because the output array is larger than the input... */
exit(5);
/* NOT REACHED */
return(0);
}
static int changes;
INDEX
#ifdef PROTOTYPES
pack(NODE *node)
#else
pack(node)
NODE *node;
#endif
{
int c;
INDEX slot;
NODE value;
slot = find_slot(node); /* slot is also returned as the result,
to be used in relocation */
/* Place node at slot */
changes = 0;
for (c = 0; c < MAX_CHARS; c++) {
value = node[c];
if ((value&M_FREE) == 0) { /* Something to fit? */
if (((value>>V_LETTER)&M_LETTER) == (INDEX)c+BASE_CHAR) {
/* Just a consistency check - could safely be removed */
PTRIE(slot+c) = DATA(value);
changes++;
}
}
}
/* The hack heuristics below keep a N^2 operation down to linear time! */
if ((slot == lowest_base) || (slot >= lowest_base+DENSE_LOWER)) {
/* Heuristic is: we increase the initial search position if the
array is packed solid up to this point or we're finding it *very*
hard to find suitable slots */
int found = FALSE;
for (;;) {
INDEX c;
while (trie_at[lowest_base]) lowest_base++;
for (c = lowest_base; c < lowest_base+MAX_CHARS; c++) {
if ((PTRIE(c)&M_FREE) != 0) {found = TRUE; break;}
}
if (found && (slot < lowest_base+DENSE_UPPER)) break;
/* ^ Skip hard slots to fill */
lowest_base++; /* although nothing is based here, the next N slots
are filled with data from the last N tries. */
/* Actually I'm not sure if 256 sequential trees each with the
same element used would actually block the next 256 slots
without a trie_at[] flag being set for them. However it
does no harm to believe this... I must formally check this
someday once all the other stuff is in order. */
}
}
if (slot > highest_base) highest_base = slot;
/* This is info for when we try to overlap input & output
arrays, -- with the output sitting some large number of
bytes lower down than the input. */
trie_at[slot] = TRUE;
return(slot);
}
/*###########################################################################*/
static NODE PCCRAP *dawg;
static INDEX PCCRAP *new_loc;
static INDEX nedges;
NODE this_node[MAX_CHARS];
INDEX
#ifdef PROTOTYPES
take_node(INDEX ptr)
#else
take_node(ptr)
INDEX ptr;
#endif
{
NODE data;
INDEX edge;
int let;
int endsword;
int endsnode;
char ch;
int changes = 0;
for (let = 0; let < MAX_CHARS; let++) this_node[let] = M_FREE;
for (;;) {
data = dawg[ptr++];
if (data == 0) {
return(-1);
} else {
endsword = ((data & M_END_OF_WORD) != 0);
endsnode = ((data & M_END_OF_NODE) != 0);
edge = data & M_NODE_POINTER;
let = (int) ((data >> V_LETTER) & M_LETTER);
ch = let + BASE_CHAR;
this_node[let] = edge & M_NODE_POINTER;
if (endsword) this_node[let] |= M_END_OF_WORD;
this_node[let] |= (NODE)ch<<V_LETTER;
changes++;
if (endsnode) break;
}
}
if (changes != 0) {
return(ptr);
} else {
fprintf(stderr, "Fatal error\n");
return(0);
}
}
NODE
#ifdef PROTOTYPES
redirect_node(NODE ptr)
#else
redirect_node(ptr)
NODE ptr;
#endif
{
NODE data;
INDEX edge;
int endsword;
char ch;
data = DATA(PTRIE(ptr));
endsword = ((data & M_END_OF_WORD) != 0);
edge = data & M_NODE_POINTER;
ch = (int) ((data >> V_LETTER) & M_LETTER);
/*edge = dawg[edge] & M_NODE_POINTER;*/
edge = new_loc[edge];
ptr = edge;
ptr |= (NODE)ch<<V_LETTER;
if (endsword) ptr |= M_END_OF_WORD;
return(ptr);
}
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
INDEX dawgptr = 1, trieptr, newdawgptr, i, nodes = 0;
FILE *triefile;
#ifdef SYS_MAC
argc = ccommand(&argv);
#endif
if (argc != 3) {
fprintf(stderr, "pack: wrong parameters - should be infile outfile\n");
exit(EXIT_ERROR);
}
if (!dawg_init((argc >= 2 ? argv[1]: ""), &dawg, &nedges)) exit(EXIT_ERROR);
if (debug) dawg_print(dawg, (INDEX)ROOT_NODE); /* assume small test file! */
freenode = ((nedges*16)/15)+(4*MAX_CHARS);
/* Minimal slop for pathological packing? */
ptrie_size = freenode;
ptrie = MALLOC((SIZE_T)freenode, sizeof(NODE));
if (ptrie == NULL) {
fprintf(stderr, "Cannot claim %ld bytes for ptrie[]\n",
sizeof(NODE)*freenode);
exit(EXIT_ERROR);
}
new_loc = MALLOC((SIZE_T)freenode, sizeof(NODE));
if (new_loc == NULL) {
fprintf(stderr, "Cannot claim %ld bytes for new_loc[]\n",
sizeof(NODE)*freenode);
exit(EXIT_ERROR);
}
trie_at = (char PCCRAP *)MALLOC((SIZE_T)freenode, sizeof(char));
if (trie_at == NULL) {
fprintf(stderr, "Cannot claim %ld bytes for trie_at[]\n", freenode);
exit(EXIT_ERROR);
}
for (i = 0; i < freenode; i++) {
ptrie[i] = M_FREE; new_loc[i] = 0; trie_at[i] = FALSE;
}
dawg[0] = 0; /* 1st entry is never looked at, and maps to itself */
dawgptr = 1;
/* Relocate initial node at 1 seperately */
newdawgptr = take_node(dawgptr /* 1 */);
trieptr = pack(this_node);
/*dawg[dawgptr] = trieptr;*/
/* What the hell does this do??? I've forgotten!!! - oh yes, this
was relocating in situ, without new_loc... */
new_loc[dawgptr] = trieptr;
dawgptr = MAX_CHARS+1;
{INDEX maxdiff = 0, diff;
for (;;) {
if (highest_base > dawgptr) {
diff = highest_base - dawgptr;
if (diff > maxdiff) maxdiff = diff;
}
newdawgptr = take_node(dawgptr);
if (newdawgptr == -1) {
dawgptr++;
continue;
}
trieptr = pack(this_node);
/*dawg[dawgptr] = trieptr;*/
new_loc[dawgptr] = trieptr;
dawgptr = newdawgptr;
if (dawgptr > nedges) {
break; /* AHA!!! I was doing this in the
wrong order & lost last entry! *AND* had '>=' for '>' */
}
nodes++;
if ((nodes MOD 1000) == 0) fprintf(stderr, "%ld nodes\n", nodes);
}
if (debug) fprintf(stderr, "wavefront gets up to %ld\n", maxdiff);
}
if (debug) fprintf(stderr, "All packed - used %ld nodes\n", highest_base);
for (trieptr = 1; trieptr < freenode; trieptr++) {
/*
extract edge from ptrie[trieptr],
look it up via dawg[edge], slot it back in
(while preserving other bits)
*/
PTRIE(trieptr) = redirect_node(trieptr);
}
/* Finally, remember to bump up highest_base in case last node is only
one edge and 25 others are missing! */
if (debug) fprintf(stderr, "Redirected...\n");
triefile = fopen((argc >= 3 ? argv[2] : DEFAULT_PACK), WRITE_BIN);
if (triefile == NULL) {
fprintf(stderr, "Cannot write to packed trie file\n");
exit(1);
}
if (debug) fprintf(stderr, "File opened...\n");
ptrie[0] = highest_base+MAX_CHARS-1;/* File size in words
(-1 because doesn't include itself) */
if (debug) fprintf(stderr, "Dumping... (0..%ld)\n", highest_base+MAX_CHARS-1);
for (i = 0; i < highest_base+MAX_CHARS; i++) {/* dump packed DAG */
NODE n;
n = DATA(PTRIE(i));
putword(n, triefile);
if (ferror(triefile)) {
fprintf(stderr, "*** TROUBLE: Writing DAG -- disk full?\n");
fclose(triefile);
exit(1);
}
}
if (debug) fprintf(stderr, "Dumped...\n");
fclose(triefile);
if (debug) fprintf(stderr, "Done.\n");
}
SHAR_EOF
cat - << \SHAR_EOF > pckcheck.c
/*
File: pckcheck.c
Author: Graham Toal
Purpose: check correct spelling using dict.pck
Creation date: 22/06/90
Lastedit: 23/06/90 01:15:39
Description:
This can be expanded to be like the unix 'spell' command.
This demo file simply checks words passed on the command line.
note how it remembers words so that the second time it sees
them, it will call them correct. This is how you should
implement the 'ignore' feature of an interactive checker.
This version used the fast 'packed' data structure. This is
approximately 3 times faster, but not all utilities support
the packed versions. Also utilities which walk the trie are
considerably slower (say by a factor of 100) -- so chose when
to used 'dawg' and when to use 'pack'ed versions carefully!
*/
/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
/*#define RCHECK*/ /* Enable for internal range checks... */
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* eg, for isalpha() */
/* Spelling library utilities */
#include "init.c" /* Loading dicts */
#include "dyntrie.c" /* Creating dicts at run-time */
#include "print.c" /* Printing dicts */
#include "check.c" /* Checking words */
#include "locate.c" /* Finding words by their stems */
#include "soundex.c" /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c" /* Closeness-metric for correction */
#include "correct.c" /* Hack code to attempt error correction (uses above) */
/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
files in a folder. Then create a project which contains this main.c
and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif
/* Your own header files go here, for instance:
#include "readword.h"
*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *ptrie; /* precompiled dictionary (dict.pck) */
INDEX edges; /* size of above */
NODE PCCRAP *userdict; /* run-time dictionary, added to dynamically */
char *word;
int each;
#ifdef SYS_MAC
argc = ccommand(&argv); /* Mac users have my sympathy */
#endif
/* Your program goes here... */
if (argc == 1) {
fprintf(stderr, "usage: %s possibly mispeled wurdz\n", argv[0]);
exit(EXIT_ERROR);
}
if (!pack_init("", &ptrie, &edges)) exit(EXIT_ERROR);
if (!trie_new(&userdict)) exit(EXIT_ERROR);
for (each = 1; each < argc; each++) {
word = argv[each];
fprintf(stderr, "* %s:\n", word);
if (pack_check(ptrie, word) || dawg_check(userdict, word)) {
fprintf(stderr, "Correct\n");
} else {
fprintf(stderr, "Wrong\n");
(void)trie_add(&userdict, word);
}
if (each+1 != argc) fprintf(stderr, "\n");
}
exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > pdawg.c
/* Manadatory headers */
/*#define RCHECK*/
/*#define DEBUG*/
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* for isalpha() */
/* Spelling library utilities */
#include "init.c"
#include "print.c"
/*
#include "check.c"
#include "locate.c"
#include "similcmp.c"
#include "soundex.c"
#include "correct.c"
*/
/*#include "dyntrie.c"*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX nedges;
if (!dawg_init((argc == 1 ? "" : argv[1]), &dawg, &nedges)) exit(EXIT_ERROR);
dawg_print(dawg, (INDEX)ROOT_NODE);
fprintf(stderr, "Finished printing dawg\n");
exit(0);
}
SHAR_EOF
cat - << \SHAR_EOF > ppack.c
/* Manadatory headers */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* for isalpha() */
/* Spelling library utilities */
#include "init.c"
#include "print.c"
/*
#include "check.c"
#include "locate.c"
#include "similcmp.c"
#include "soundex.c"
#include "correct.c"
*/
/*#include "dyntrie.c"*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *ptrie;
INDEX nedges;
if (!pack_init((argc == 1 ? "": argv[1]), &ptrie, &nedges)) exit(EXIT_ERROR);
pack_print(ptrie, (INDEX)ROOT_NODE);
exit(0);
}
SHAR_EOF
cat - << \SHAR_EOF > proot.c
/*
File: proot.c
Author: Graham Toal
Purpose: find all words starting with 'root'
Creation date: 22/06/90
Lastedit: 22:24:32
Description:
some spelling programs remove characters from the end of
wrongly spelt words one by one until the resulting root is
found to be a prefix of other words in the dictionary. This
works because the assumption is made that the word was in
fact correct, but was an inflected form of a word in the
dictionary which had not been stored.
*/
/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* eg, for isalpha() */
/* Spelling library utilities */
#include "init.c" /* Loading dicts */
/*#include "dyntrie.c"*/ /* Creating dicts at run-time */
#include "print.c" /* Printing dicts */
/*#include "check.c"*/ /* Checking words */
#include "locate.c" /* Finding words by their stems */
/*#include "soundex.c"*/ /* Soundex algorithm for word-likeness comparison */
/*#include "similcmp.c"*//* Closeness-metric for correction */
/*#include "correct.c"*/ /* Code to attempt error correction (uses above) */
/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
files in a folder. Then create a project which contains this main.c
and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif
/* Your own header files go here, for instance:
#include "readword.h"
*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX root;
INDEX edges;
int each;
#ifdef SYS_MAC
argc = ccommand(&argv); /* Mac users have my sympathy */
#endif
/* Your program goes here... */
if (argc == 1) {
fprintf(stderr, "usage: %s part\n", argv[0]);
exit(EXIT_ERROR);
}
if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);
for (each = 1; each < argc; each++) {
fprintf(stderr, "* %s:\n", argv[each]);
root = dawg_locate_prefix(dawg, argv[each], ROOT_NODE);
if (root != ROOT_NODE) {
dawg_print_prefix(dawg, argv[each], root);
} else {
fprintf(stderr, "(none found)\n");
}
if (each+1 != argc) fprintf(stderr, "\n");
}
exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > sample.c
/*
Dummy program for writing word utilities.
If you stick to the style shown here, your programs will
remain portable across a vast range of machines.
Skeleton file by Graham Toal. Remove this header when you've
added your own...
(If you want to see some worked examples, try
tell.c dwgcheck.c pckcheck.c)
*/
/*
File:
Author:
Purpose:
Creation date:
Lastedit:
Description:
*/
/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* eg, for isalpha() */
/* Spelling library utilities */
#include "init.c" /* Loading dicts */
#include "dyntrie.c" /* Creating dicts at run-time */
#include "print.c" /* Printing dicts */
#include "check.c" /* Checking words */
#include "locate.c" /* Finding words by their stems */
#include "soundex.c" /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c" /* Closeness-metric for correction */
#include "correct.c" /* Hack code to attempt error correction (uses above) */
/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
files in a folder. Then create a project which contains this main.c
and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif
/* Your own header files go here, for instance:
#include "readword.h"
*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
#ifdef SYS_MAC
argc = ccommand(&argv); /* Mac users have my sympathy */
#endif
/* Your program goes here... */
exit(EXIT_OK);
}
SHAR_EOF
cat - << \SHAR_EOF > tell.c
/*
File: tell.c
Author: Graham Toal
Purpose: offer correct spelling
Creation date: 22/06/90
Lastedit: 22:24:32
Description:
Like the unix 'spelltell' command.
*/
/* Manadatory header files */
#include <stdio.h>
#include "dawg.h"
#include "grope.h"
#include "utils.c"
/* Headers here as needed on per-program basis */
#include <ctype.h> /* eg, for isalpha() */
/* Spelling library utilities */
#include "init.c" /* Loading dicts */
#include "dyntrie.c" /* Creating dicts at run-time */
#include "print.c" /* Printing dicts */
#include "check.c" /* Checking words */
#include "locate.c" /* Finding words by their stems */
#include "soundex.c" /* Soundex algorithm for word-likeness comparison */
#include "similcmp.c" /* Closeness-metric for correction */
#include "correct.c" /* Hack code to attempt error correction (uses above) */
/* Let's be friendly to these guys... */
#ifdef SYS_MAC
/* To compile with THINK C 4.0, place all the relevant .h and .c
files in a folder. Then create a project which contains this main.c
and the libraries unix and ANSI.
*/
#include <unix.h>
#include <stdlib.h>
#include <console.h>
#endif
/* Your own header files go here, for instance:
#include "readword.h"
*/
int
#ifdef PROTOTYPES
main(int argc, char **argv)
#else
main(argc, argv)
int argc;
char **argv;
#endif
{
NODE PCCRAP *dawg;
INDEX edges;
int each;
#ifdef SYS_MAC
argc = ccommand(&argv); /* Mac users have my sympathy */
#endif
/* Your program goes here... */
if (argc == 1) {
fprintf(stderr, "usage: %s mispeled wurdz\n", argv[0]);
exit(EXIT_ERROR);
}
if (!dawg_init("", &dawg, &edges)) exit(EXIT_ERROR);
for (each = 1; each < argc; each++) {
fprintf(stderr, "* %s:\n", argv[each]);
if (!dawg_correct(dawg, argv[each])) {
fprintf(stderr, "(none found)\n");
}
if (each+1 != argc) fprintf(stderr, "\n");
}
exit(EXIT_OK);
}
SHAR_EOF