brachman@ubc-cs.UUCP (01/31/87)
The following is version 1.3 of sp, a soundex-based dictionary search program that uses the dbm routines. Included are Makefiles, man pages, and an Mlisp interface. I submitted this stuff to mod.sources in early December but I have not heard from the moderator after repeated attempts to find out what's going on. Please read the README file carefully. ----- Barry Brachman | {ihnp4!alberta,uw-beaver}!ubc-vision!ubc-cs!brachman Dept. of Computer Science| brachman@cs.ubc.cdn Univ. of British Columbia| brachman%ubc.csnet@csnet-relay.arpa Vancouver, B.C. V6T 1W5 | brachman@ubc.csnet (604) 228-4327 ----- CUT HERE ----- CUT HERE ----- CUT HERE ----- CUT HERE ----- CUT HERE : This is a shar archive. Extract with sh, not csh. : This archive ends with exit, so do not worry about trailing junk. echo 'Extracting README' sed 's/^X//' > README << '+ END-OF-FILE README' X XHere are a pair of programs that might be of some use to those who have Xtrouble with spelling. X XThe first program, sp, accepts your tentative or approximate Xspelling of a word as input and produces a list of words. XIf the correct spelling of the word appears in one of the dictionaries used, Xit is likely that it appears in the output list. XNote that this is different from the UNIX 'spell' command that Xtells you which words in a document do not appear in the dictionary. X XThe second program, mksp, lets you maintain your own dictionary of troublesome Xwords. X X===== XTo run sp you'll need: X - the Unix dbm routines, old or new (4.3BSD) X XNot required, but very useful: X - the source to the old dbm routines if you don't have the new ones X or your dbm routines don't have dbmclose() (check your man page for X dbm(3X) to see if you've got dbmclose()) X - /usr/dict/words plus any other large list of words you might have X X===== XThe sp distribution consists of: X X README This file X Makefile Makefile for use with old dbm routines and source X Makefile.newdbm Makefile for use with new (4.3BSD) dbm routines, dbm X libraries containing dbmclose(), and old dbm routines X if you don't have dbm source X calcsoundex.c Program to calculate soundex codes X dbm.bug A bug report for the old dbm routines X dbm.diffs Diffs to be applied to old dbm routines X dbmstuff.c Interface to old and new dbm routines X misc.c Miscellaneous support routines X mksp.c Program to maintain dictionaries X sp.1 Man page for sp/mksp/calcsoundex X sp.9 Man page for EMACS interface to sp X sp.c Program to search dictionaries X sp.h Header file X sp.ml Mlisp code for EMACS interface to sp X X===== XI apologize for the complexity of the following guide. It is due to the Xpossibility of 4 different dbm configurations: 4.3 style dbm, Sun style dbm Xwith the dbmclose() routine, "old" (4.2BSD/V7) dbm with source and without Xsource. X X1. The program assumes that a char is 8 bits and an int is at least 16 bits. X I've avoided using shorts. X X2. Note the following if you are using the old dbm routines that *don't* have X dbmclose(): X The "old" dbm routines that don't have dbmclose() don't work properly if you X do more than one dbminit(). If you have source code, you can apply the X diffs so that multiple dbminit() calls will work, allowing X multiple dictionaries to be used by sp, although you can still only access X one dbm at a time. If you do not have source then you can still use X sp/mksp except you must change MAXDICT (in sp.h) to 1 and edit X Makefile.newdbm as indicated there. You will only be able to use one X dictionary. I'm including a bug report that came off the net for the old X dbm routines. This bug patch has been included in dbm.diffs but is X surrounded by #ifdef BUGFIX. X X If you're applying the patches to the old dbm code, make a copy of dbm.c X and dbm.h. Apply the patches by: X patch < dbm.diffs X or by hand (Larry Wall's patch program is in the mod.sources archive). X X3. Note the following if you are using the old dbm routines that *do* have X dbmclose() (e.g., Sun 2 and Sun 3): X Edit Makefile.newdbm and uncomment the two lines indicated. Make using X Makefile.newdbm (see below). X X4. Check sp.h and adjust for local conditions. You might also edit sp.1 X to reflect your local configuration. X X5. I've tried to make it easy to change the key used for retrieving from X the dbm. The routines to make and disassemble a key are in misc.c. X I want to keep the key as small as possible since dictionaries tend to X be rather large. I've used a vector of unsigned chars for the key because X I didn't want to have to deal with various lengths of shorts and ints on X different hardware. X X6. If you are using the "new" dbm routines (e.g., those in 4.3BSD that allow X multiple simultaneously open dbm's), if you have dbmclose(), or if you have X the old dbm routines without the dbm source then do: X make -f Makefile.newdbm X otherwise do: X make X X Then move sp, mksp, and calcsoundex to a public directory. Copy sp.1 to X where you keep man pages for such programs (you might also link mksp.1 and X calcsoundex.1 to sp.1). X X7. If you are using Gosling EMACS, copy sp.ml (the MLISP interface to sp) into X a public EMACS library. I haven't tried to convert sp.ml to work with X gnuemacs. Put the documentation (sp.9) where appropriate on your system X (you may need to edit the FILES section). X X8. You should create a public library using /usr/dict/words, e.g.: X mksp -a -v /usr/public/lib/sp.dict < /usr/dict/words X The path of this dictionary should appear in DEFAULT_SPPATH (sp.h). Users X should be made aware of the public version so they don't make their own copy. X X9. Set tabs to 4 when you edit and print the source files. X X10. dbm doesn't seem to work between a Sun and VAX across NFS. Too bad. X (It does work between Sun's.) X Use rsh with the dictionary list on the command line. X X11. The programs have been tested on Sun 3/160 (4.2BSD 3.0), VAX 750 (4.3BSD), X using both the new and old dbm routines. X X12. I have a dictionary of 35K words (350Kb) that do not appear in X /usr/dict/words. The only way I have of circulating it is on a X double-sided Atari ST or Mac disk (single-sided if ARC'ed). If you are X interested send me a message. Perhaps it could be archived somewhere X (any volunteers?). X X13. Reference: Knuth, D.E. The Art of Computer Programming, Volume 3/Sorting X and Searching, 1973, pp.391-392. X X14. If you find any bugs please notify me rather than posting to the net. X XEnjoy! X X----- XBarry Brachman | {ihnp4!alberta,uw-beaver}!ubc-vision!ubc-cs!brachman XDept. of Computer Science| brachman@cs.ubc.cdn XUniv. of British Columbia| brachman%ubc.csnet@csnet-relay.arpa XVancouver, B.C. V6T 1W5 | brachman@ubc.csnet X(604) 228-4327 X + END-OF-FILE README chmod 'u=rw,g=r,o=r' 'README' echo ' -rw-r--r-- 1 brachman 5841 Jan 19 16:24 README (as sent)' echo -n ' ' /bin/ls -l README echo 'Extracting Makefile' sed 's/^X//' > Makefile << '+ END-OF-FILE Makefile' X X# Makefile for systems using modified dbm routines X XCFLAGS=-O X Xall: sp mksp X Xsp: sp.o dbmstuff.o dbm.o misc.o X cc ${CFLAGS} -o sp sp.o dbmstuff.o misc.o dbm.o X Xmksp: mksp.o dbmstuff.o dbm.o misc.o X cc ${CFLAGS} -o mksp mksp.o dbmstuff.o misc.o dbm.o X Xcalcsoundex: calcsoundex.c X cc ${CFLAGS} -o calcsoundex calcsoundex.c X X# define BUGFIX if you want the fix included Xdbm.o: dbm.h dbm.c X cc -c ${CFLAGS} dbm.c X X.c.o: X cc -c ${CFLAGS} $? X Xclean: X rm -f sp.o mksp.o dbmstuff.o dbm.o X + END-OF-FILE Makefile chmod 'u=rw,g=r,o=r' 'Makefile' echo ' -rw-r--r-- 1 brachman 482 Dec 11 22:34 Makefile (as sent)' echo -n ' ' /bin/ls -l Makefile echo 'Extracting dbm.diffs' sed 's/^X//' > dbm.diffs << '+ END-OF-FILE dbm.diffs' XIndex: dbm.c X*** dbm.orig.c Thu Nov 27 17:45:38 1986 X--- dbm.c Thu Nov 27 18:13:11 1986 X*************** X*** 6,16 **** X--- 6,21 ---- X #include <sys/types.h> X #include <sys/stat.h> X X+ static long dbm_access_oldb; X+ static getbit_oldb; X+ X dbminit(file) X char *file; X { X struct stat statb; X X+ dbm_access_oldb = -1; X+ getbit_oldb = -1; X dbrdonly = 0; X strcpy(pagbuf, file); X strcat(pagbuf, ".pag"); X*************** X*** 27,36 **** X dirf = open(pagbuf, 0); X dbrdonly = 1; X } X! if(pagf < 0 || dirf < 0) { X! printf("cannot open database %s\n", file); X return(-1); X- } X fstat(dirf, &statb); X maxbno = statb.st_size*BYTESIZ-1; X return(0); X--- 32,39 ---- X dirf = open(pagbuf, 0); X dbrdonly = 1; X } X! if(pagf < 0 || dirf < 0) X return(-1); X fstat(dirf, &statb); X maxbno = statb.st_size*BYTESIZ-1; X return(0); X*************** X*** 130,136 **** X--- 133,143 ---- X return (0); X X split: X+ #ifdef BUGFIX X+ if(key.dsize+dat.dsize+3*sizeof(short) >= PBLKSIZ) { X+ #else X if(key.dsize+dat.dsize+2*sizeof(short) >= PBLKSIZ) { X+ #endif X printf("entry too big\n"); X return (-1); X } X*************** X*** 226,232 **** X dbm_access(hash) X long hash; X { X! static long oldb = -1; X X for(hmask=0;; hmask=(hmask<<1)+1) { X blkno = hash & hmask; X--- 233,239 ---- X dbm_access(hash) X long hash; X { X! /*** static long oldb = -1; ***/ X X for(hmask=0;; hmask=(hmask<<1)+1) { X blkno = hash & hmask; X*************** X*** 234,245 **** X if(getbit() == 0) X break; X } X! if(blkno != oldb) { X clrbuf(pagbuf, PBLKSIZ); X lseek(pagf, blkno*PBLKSIZ, 0); X read(pagf, pagbuf, PBLKSIZ); X chkblk(pagbuf); X! oldb = blkno; X } X } X X--- 241,252 ---- X if(getbit() == 0) X break; X } X! if(blkno != dbm_access_oldb) { X clrbuf(pagbuf, PBLKSIZ); X lseek(pagf, blkno*PBLKSIZ, 0); X read(pagf, pagbuf, PBLKSIZ); X chkblk(pagbuf); X! dbm_access_oldb = blkno; X } X } X X*************** X*** 247,253 **** X { X long bn; X register b, i, n; X! static oldb = -1; X X if(bitno > maxbno) X return(0); X--- 254,260 ---- X { X long bn; X register b, i, n; X! /*** static oldb = -1; ***/ X X if(bitno > maxbno) X return(0); X*************** X*** 255,265 **** X bn = bitno / BYTESIZ; X i = bn % DBLKSIZ; X b = bn / DBLKSIZ; X! if(b != oldb) { X clrbuf(dirbuf, DBLKSIZ); X lseek(dirf, (long)b*DBLKSIZ, 0); X read(dirf, dirbuf, DBLKSIZ); X! oldb = b; X } X if(dirbuf[i] & (1<<n)) X return(1); X--- 262,272 ---- X bn = bitno / BYTESIZ; X i = bn % DBLKSIZ; X b = bn / DBLKSIZ; X! if(b != getbit_oldb) { X clrbuf(dirbuf, DBLKSIZ); X lseek(dirf, (long)b*DBLKSIZ, 0); X read(dirf, dirbuf, DBLKSIZ); X! getbit_oldb = b; X } X if(dirbuf[i] & (1<<n)) X return(1); + END-OF-FILE dbm.diffs chmod 'u=rw,g=r,o=r' 'dbm.diffs' echo ' -rw-r--r-- 1 brachman 2748 Dec 9 17:20 dbm.diffs (as sent)' echo -n ' ' /bin/ls -l dbm.diffs echo 'Extracting dbmstuff.c' sed 's/^X//' > dbmstuff.c << '+ END-OF-FILE dbmstuff.c' X/* dbmstuff.c */ X X/* X * Interface to old and new dbm routines X */ X X#include <stdio.h> X X#ifndef NEWDBM X X#include <dbm.h> X X/*ARGSUSED*/ XDBMINIT(path, flags) Xchar *path; Xint flags; X{ X X return(dbminit(path)); X} X XDBMCLOSE() X{ X X#ifdef HAS_CLOSE X dbmclose(); X#else X close(3); /* free up the file descriptors */ X close(4); X#endif X} X Xdatum XFETCH(key) Xdatum key; X{ X datum fetch(); X X return(fetch(key)); X} X Xdatum XFIRSTKEY() X{ X X return(firstkey()); X} X Xdatum XNEXTKEY(key) Xdatum key; X{ X return(nextkey(key)); X} X XSTORE(key, content) Xdatum key, content; X{ X X return(store(key, content)); X} X XREPLACE(key, content) Xdatum key, content; X{ X X if (delete(key) == -1) X return(-1); X return(store(key, content)); X} X XDELETE(key) Xdatum key; X{ X X return(delete(key)); X} X X#endif !NEWDBM X X#ifdef NEWDBM X X#include <ndbm.h> X Xstatic DBM *current_db = (DBM *) NULL; X XDBMINIT(path, flags) Xchar *path; Xint flags; X{ X X current_db = dbm_open(path, flags, 0); X if (current_db == (DBM *) NULL) X return(-1); X return(0); X} X XDBMCLOSE() X{ X X if (current_db != (DBM *) NULL) { X dbm_close(current_db); X current_db = (DBM *) NULL; X } X} X Xdatum XFETCH(key) Xdatum key; X{ X X return(dbm_fetch(current_db, key)); X} X Xdatum XFIRSTKEY() X{ X X return(dbm_firstkey(current_db)); X} X X/*ARGSUSED*/ Xdatum XNEXTKEY(key) Xdatum key; X{ X X return(dbm_nextkey(current_db)); X} X XREPLACE(key, content) Xdatum key, content; X{ X X return(dbm_store(current_db, key, content, DBM_REPLACE)); X} X XSTORE(key, content) Xdatum key, content; X{ X X return(dbm_store(current_db, key, content, DBM_INSERT)); X} X XDELETE(key) Xdatum key; X{ X X return(dbm_delete(current_db, key)); X} X X#endif NEWDBM + END-OF-FILE dbmstuff.c chmod 'u=rw,g=r,o=r' 'dbmstuff.c' echo ' -rw-r--r-- 1 brachman 1595 Dec 9 20:53 dbmstuff.c (as sent)' echo -n ' ' /bin/ls -l dbmstuff.c echo 'Extracting calcsoundex.c' sed 's/^X//' > calcsoundex.c << '+ END-OF-FILE calcsoundex.c' X/* vi: set tabstop=4 : */ X X/* X * calcsoundex - calculate soundex codes X * X * Permission is given to copy or distribute this program provided you X * do not remove this header or make money off of the program. X * X * Please send comments and suggestions to: X * Barry Brachman X * Dept. of Computer Science X * Univ. of British Columbia X * Vancouver, B.C. V6T 1W5 X * X * .. {ihnp4!alberta, uw-beaver}!ubc-vision!ubc-cs!brachman X * brachman@cs.ubc.cdn X * brachman%ubc.csnet@csnet-relay.arpa X * brachman@ubc.csnet X */ X X#include <stdio.h> X#include <ctype.h> X X#include "sp.h" X Xchar word[MAXWORDLEN + 2]; X Xchar soundex_code_map[26] = { X/*** A B C D E F G H I J K L M N O P ***/ X 0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 4, 5, 5, 0, 1, X X/*** Q R S T U V W X Y Z ***/ X 2, 6, 2, 3, 0, 1, 0, 2, 0, 2 X}; X Xmain(argc, argv) Xint argc; Xchar **argv; X{ X register int c, i, soundex_length, digit_part, previous_code; X int ch, len, vflag; X short soundex; X char *gets(); X X vflag = 0; X if (argc > 2 || (argc == 2 && strcmp(argv[1], "-v"))) { X fprintf(stderr, "Usage: calcsoundex [-v]\n"); X exit(1); X } X if (argc > 1) X vflag = 1; X X while (fgets(word, sizeof(word), stdin) != (char *) NULL) { X len = strlen(word); X if (word[len - 1] != '\n') { X fprintf(stderr, "calcsoundex: Word too long: %s", word); X while ((ch = getchar()) != '\n') /* flush rest of line */ X putc(ch, stderr); X putc('\n', stderr); X continue; X } X word[--len] = '\0'; X if (len > MAXWORDLEN) { X fprintf(stderr, "calcsoundex: Word too long: %s\n", word); X continue; X } X X for (i = 0; word[i] != '\0'; i++) { X if (isupper(word[i])) X word[i] = tolower(word[i]); X } X if (!isalpha(word[0])) X continue; X X digit_part = 0; X soundex_length = 0; X previous_code = soundex_code_map[word[0] - 'a']; X for (i = 1; word[i] != '\0' && soundex_length < 3; i++) { X if (!isalpha(word[i])) X continue; X c = soundex_code_map[word[i] - 'a']; X if (c == 0 || previous_code == c) { X previous_code = c; X continue; X } X digit_part = digit_part * 10 + c; X previous_code = c; X soundex_length++; X } X while (soundex_length++ < 3) X digit_part *= 10; X soundex = digit_part << 5 + word[0] - 'a'; X printf("%c", word[0]); X if (digit_part < 100) X putchar('0'); X if (digit_part < 10) X putchar('0'); X if (digit_part == 0) X putchar('0'); X else X printf("%d", digit_part); X if (vflag) X printf(" %s", word); X putchar('\n'); X } X putchar('\n'); X exit(0); X} X + END-OF-FILE calcsoundex.c chmod 'u=rw,g=r,o=r' 'calcsoundex.c' echo ' -rw-r--r-- 1 brachman 2455 Dec 11 17:19 calcsoundex.c (as sent)' echo -n ' ' /bin/ls -l calcsoundex.c echo 'Extracting misc.c' sed 's/^X//' > misc.c << '+ END-OF-FILE misc.c' X/* misc.c */ X X/* vi: set tabstop=4 : */ X X#include <ctype.h> X#include <stdio.h> X X#include "sp.h" X X/* X * Special character map that determines what the second character of a word X * can be; see sp.h X * May be expanded to contain up to 12 entries plus the terminating entry X * Must end with an entry of two null bytes X */ Xstruct spchar_map spchar_map[] = { X '\'', QUOTE_CHAR, X '&', AMPER_CHAR, X '.', PERIOD_CHAR, X ' ', SPACE_CHAR, X '\0', '\0' X}; X Xmk_key(key, soundex, count) Xkey_t *key; Xint soundex; Xint count; X{ X X key[0] = soundex & 0377; X key[1] = ((soundex & 037400) >> 8) | ((count & 03) << 6); X key[2] = (count & 01774) >> 2; X#ifdef DEBUG X if (ex_soundex(key) != soundex) X fprintf(stderr, "mk_key: soundex failed\n"); X if (ex_count(key) != count) X fprintf(stderr, "mk_key: count failed\n"); X#endif DEBUG X} X Xex_soundex(key) Xkey_t *key; X{ X register int soundex; X X soundex = key[0] & 0377; X soundex |= (key[1] & 077) << 8; X return(soundex); X} X Xex_count(key) Xkey_t *key; X{ X register int count; X X count = (key[1] & 0300) >> 6; X count |= ((key[2] & 0377) << 2); X return(count); X} X X/* Xex_char(key) Xkey_t *key; X{ X int ch; X X ch = (key[1] & 076) >> 1; X return(ch + 'a'); X} X*/ X X/* X * Unpack a word given the retrieved word of length len and its soundex X * Extract the first letter from the soundex code X * If the length is 1 and if it is marked as a single character word X * then the marked character will be overlaid with a null X * otherwise a null will be appended to the string X * Adjust for upper case leading character if necessary X * Return address of the copy X */ Xchar * Xmk_word(p, len, s) Xchar *p; Xint len, s; X{ X register char *q, ch; X static char word[MAXWORDLEN + 2]; X X q = word; X if (len == 1 && (*p & SINGLE_CHAR)) { X *(q + 1) = '\0'; X len = 0; X } X else X *(q + len + 1) = '\0'; X X /* X * Extract the first character from the soundex and X * adjust case X */ X if (*p & UPPER_CHAR) X ch = (s & 037) + 'A'; X else X ch = (s & 037) + 'a'; X *q++ = ch; X X if (len != 0) { /* if more than one char adjust second char */ X ch = *p & MASK_CHAR; X if (ch < 26) X ch += 'a'; X else if (ch < 52) X ch = ch - 26 + 'A'; X else if ((ch = fromspchar(ch)) == '\0') { X fprintf(stderr, "Bogus second char in mk_word\n"); X exit(1); X } X *q++ = ch; X p++; X len--; X } X X while (len-- > 0) X *q++ = *p++; X return(word); X} X X/* X * Convert the second character of a word to a special character code X * Return null if there is no mapping X */ Xtospchar(ch) Xchar ch; X{ X register struct spchar_map *m; X X for (m = spchar_map; m->spchar != '\0'; m++) X if (ch == m->spchar) X break; X return(m->code); X} X X/* X * Convert from the special character code to the ASCII code X * Return null if there is no mapping X */ Xfromspchar(ch) Xchar ch; X{ X register struct spchar_map *m; X X for (m = spchar_map; m->spchar != '\0'; m++) X if (ch == m->code) X break; X return(m->spchar); X} X X/* X * Compare two strings, independent of case, given their lengths X */ X/* Xstrnmatch(str1, len1, str2, len2) Xchar *str1, *str2; Xint len1, len2; X{ X register char ch1, ch2; X X if (len1 != len2) X return(0); X while (len1-- > 0) { X ch1 = *str1++; X ch2 = *str2++; X if (ch1 != ch2) { X if (isupper(ch1)) X ch1 = tolower(ch1); X if (isupper(ch2)) X ch2 = tolower(ch2); X if (ch1 != ch2) X return(0); X } X } X return(1); X} X*/ X X/* X * Compare two strings, independent of case X */ Xstrmatch(p, q) Xchar *p, *q; X{ X register char ch1, ch2; X X while (1) { X ch1 = *p++; X ch2 = *q++; X if (ch1 == '\0' || ch2 == '\0') X break; X if (ch1 != ch2) { X if (isupper(ch1)) X ch1 = tolower(ch1); X if (isupper(ch2)) X ch2 = tolower(ch2); X if (ch1 != ch2) X break; X } X } X return(ch1 - ch2); X} X + END-OF-FILE misc.c chmod 'u=rw,g=r,o=r' 'misc.c' echo ' -rw-r--r-- 1 brachman 3643 Dec 11 22:33 misc.c (as sent)' echo -n ' ' /bin/ls -l misc.c echo 'Extracting sp.h' sed 's/^X//' > sp.h << '+ END-OF-FILE sp.h' X/* sp.h */ X/* vi: set tabstop=4 : */ X X/* X * A deleted dbm entry is denoted by a dsize of zero X */ X#define IS_DELETED(C) (C.dsize == 0) X X/* X * Because the soundex code (part of the key) includes the first character of X * the word, we don't need to store the first character again with the content. X * To do this we treat the first byte of the content stored in the dbm X * specially: we rip off the two high order bits of the first byte of X * the content and therefore have to restrict the value of the second X * character of the word. We use 'a' == 0, 'z' == 25, 'A' == 26, 'Z' == 51. X * See spchar_map[] (misc.c) for the mapping of codes 52 through 63. X * This behaviour is isolated in tospchar() and fromspchar(). X * If spchar_map is changed you should change the man page too. X * X * The word can be reconstructed by extracting the first character of the word X * from the soundex code and then looking at the first byte of the content. X * If the UPPER_CHAR bit is on in the first byte of the content then the first X * character of the word should be upper case. X * The length of the content reflects the actual number of bytes stored in the X * dbm. Words that have been deleted from the dbm are stored with a length of X * zero. Because of this, words of length 1 are treated differently: they are X * stored with a length of 1 and with the SINGLE_CHAR bit set. Words with X * original length > 1 will have (length - 1) bytes stored in the content. X * Clear? X */ X#define IS_VALID(w) (isalpha(*w) && (*(w+1) == '\0' || isalpha(*(w+1)) \ X || tospchar(*(w+1)) != '\0')) X#define UPPER_CHAR 0200 /* 1st char of word is upper case */ X#define SINGLE_CHAR 0100 /* single char word */ X#define MASK_CHAR 0077 /* mask out the indicator bits */ X#define QUOTE_CHAR 0064 /* (52) code for single quote */ X#define AMPER_CHAR 0065 /* (53) for ampersand */ X#define PERIOD_CHAR 0066 /* (54) for period */ X#define SPACE_CHAR 0067 /* (55) for blank */ X X/* X * Map for first byte of dbm content (special characters) X * Terminated by a null entry X */ Xstruct spchar_map { X char spchar; X char code; X}; X X#define MAXDICT 10 /* Max number of dictionaries to use */ X#define MAXWORDLEN 50 /* Max word length */ X#define MAXWORDS 400 /* Max number of words in one sp query */ X#define WORDSPACE 20480 /* Max space used words for one sp query */ X X/* X * This is the default path used by sp to find dictionaries X * Adjust for local conditions X */ X#define DEFAULT_SPPATH "/usr/public/lib/sp/soundex:/usr/public/lib/sp/aux.soundex" X X/* X * The following must be the maximum value containable in the count part of X * a key. X * It must be always be less than: (the maximum positive value that can be X * contained in an int) - 1 X * This value imposes a limit on the number of words in a dictionary having the X * same soundex code. For /usr/dict/words (~25K words), a count of 255 is X * sufficient. Larger dictionaries will need more. In any case you can X * always just make another dictionary and split up your words. X * You might want to adjust MAXWORDS and WORDSPACE (above) to reflect MAXCOUNT X * if you've got plenty of memory. X */ X#define MAXCOUNT 1023 /* 2^10 - 1 */ X X/* X * The key used by dbm looks like this: X * X * <10 bits> <5 bits> <9 bits> X * counter first char soundex X * X * A soundex value is treated as a base 7 number (maximum is 666, base 7). X */ X#define KEYSIZE 3 /* in bytes */ Xtypedef unsigned char key_t; X X#define BAD_WORD -1 /* This must be an illegal Soundex */ X#define NO_MATCH 0 X#define MATCHED 1 X Xextern char soundex_code_map[]; X + END-OF-FILE sp.h chmod 'u=rw,g=r,o=r' 'sp.h' echo ' -rw-r--r-- 1 brachman 3561 Dec 12 14:52 sp.h (as sent)' echo -n ' ' /bin/ls -l sp.h echo 'Extracting sp.c' sed 's/^X//' > sp.c << '+ END-OF-FILE sp.c' X/* vi: set tabstop=4 : */ X X/* X * Version 1.3 December 1986 X * X * sp - spell word X * X * Usage: sp [-f dictionary-list] [-eavc] [word ...] X * X * Compute the Soundex code for each word on the command line X * (or each word on the standard input) and compare against a X * dictionary X * X * The soundex dictionary list may be specified on the command line X * The environment variable SPPATH may be set to a list of colon X * separated pathnames of soundex dictionaries. X * If a command line dictionary-list (a colon separated list of pathnames) is X * given in addition to the SPPATH variable, all dictionaries are used. X * X * To reduce the size of the word list, certain heuristics are used: X * the -a option causes all words matched to be printed X * The output is alphabetically sorted and indicators are printed X * beside each word: X * X == exact match X * ! == close match X * * == near match X * ' ' == matched X * X * Note that the maximum number of colliding words is MAXCOUNT due to the X * data structure used. X * X * Permission is given to copy or distribute this program provided you X * do not remove this header or make money off of the program. X * X * Please send comments and suggestions to: X * X * Barry Brachman X * Dept. of Computer Science X * Univ. of British Columbia X * Vancouver, B.C. V6T 1W5 X * X * .. {ihnp4!alberta, uw-beaver}!ubc-vision!ubc-cs!brachman X * brachman@cs.ubc.cdn X * brachman%ubc.csnet@csnet-relay.arpa X * brachman@ubc.csnet X */ X X#include <sys/types.h> X#include <sys/file.h> X#include <ctype.h> X#include <stdio.h> X X#ifdef NEWDBM X#include <ndbm.h> X#else !NEWDBM X#include <dbm.h> X#endif NEWDBM X X#include "sp.h" X X#define streq(X, Y) (!strcmp(X, Y)) X#define range(S) ((strlen(S) + 4) / 5) X X#define USAGE "Usage: sp [-f dictionary-list] [-eavc] [word ...]" X Xchar word[MAXWORDLEN + 2]; X Xdatum FETCH(); X Xchar *fileptr[MAXDICT + 1]; /* Up to MAXDICT dictionaries + sentinel */ Xint dict_ptr = 0; X Xchar *wordptr[MAXWORDS], *wordlistptr; Xchar wordlist[WORDSPACE]; Xint nmatched; X X/* X * Soundex codes X * The program depends upon the numbers zero through six being used X * but this can easily be changed X */ Xchar soundex_code_map[26] = { X/*** A B C D E F G H I J K L M N O P ***/ X 0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 4, 5, 5, 0, 1, X X/*** Q R S T U V W X Y Z ***/ X 2, 6, 2, 3, 0, 1, 0, 2, 0, 2 X}; X Xint aflag, cflag, eflag, vflag; X Xmain(argc, argv) Xint argc; Xchar **argv; X{ X register int fflag, i; X register char *p; X char *getenv(); X X argc--; argv++; X fileptr[0] = (char *) NULL; X while (argc > 0 && argv[0][0] == '-') { X fflag = 0; /* to break out of following loop... */ X for (i = 1; argv[0][i] != '\0' && fflag == 0; i++) { X switch (argv[0][i]) { X case 'a': X aflag = 1; X break; X case 'c': X cflag = 1; X break; X case 'e': X eflag = 1; X break; X case 'f': X if (argc == 1) { X fprintf(stderr, "%s\n", USAGE); X exit(1); X } X mkfilelist(argv[1]); X argc--; X argv++; X fflag = 1; /* break out of loop */ X break; X case 'v': X vflag = 1; X break; X default: X fprintf(stderr, "%s\n", USAGE); X exit(1); X } X } X argc--, argv++; X } X X if ((p = getenv("SPPATH")) != (char *) NULL) X mkfilelist(p); X if (fileptr[0] == (char *) NULL) X mkfilelist(DEFAULT_SPPATH); X if (vflag) { X printf("Using dictionaries:\n"); X for (i = 0; fileptr[i] != (char *) NULL; i++) X if (strlen(fileptr[i]) > 0) X printf("\t%s\n", fileptr[i]); X } X if (argc) { X for (i = 0; i < argc; i++) { X if (!eflag) X printf("%s:\n", argv[i]); X apply(argv[i]); X if (!eflag) X printf("\n"); X } X } X else { X int ch, len; X X while (1) { X printf("Word? "); X if (fgets(word, sizeof(word), stdin) == (char *) NULL) { X printf("\n"); X break; X } X len = strlen(word); X if (word[len - 1] != '\n') { X fprintf(stderr, "sp: Word too long: %s", word); X while ((ch = getchar()) != '\n') /* flush rest of line */ X putc(ch, stderr); X putc('\n', stderr); X continue; X } X word[--len] = '\0'; X if (len > MAXWORDLEN) { X fprintf(stderr, "sp: Word too long: %s\n", word); X continue; X } X X apply(word); X if (!eflag) X printf("\n"); X } X } X} X X/* X * Apply the Soundex search for a word to each dictionary in turn X * Note that 'DBMINIT' opens both the '.dir' and the '.pag' files X * and we must close them to avoid running out of file descriptors X * X * This routine gets called each time a word is looked up and therefore X * the dbm files may be repeatedly opened and closed. Since the vast majority X * of the time this program is invoked for just a single word it doesn't seem X * worthwhile to do the right thing by saving file descriptors/DBM pointers. X * There probably won't be more than two dictionaries in use anyway. X */ Xapply(word) Xchar *word; X{ X register int code, i, nodicts; X X nmatched = 0; X wordlistptr = wordlist; X if ((code = soundex(word, 3)) == BAD_WORD) X return; X nodicts = 1; X for (i = 0; fileptr[i] != (char *) NULL; i++) { X if (strlen(fileptr[i]) == 0) X continue; X if (DBMINIT(fileptr[i], O_RDONLY) != -1) { X proc(code); X nodicts = 0; X } X DBMCLOSE(); X } X if (nodicts) { X fprintf(stderr, "sp: Can't open any dictionaries\n"); X exit(1); X } X if (vflag && !eflag && nmatched == 0) X printf("%s: no match\n", word); X else X choose(word); X} X X/* X * Look the word up in the current dictionary X * and save all the matches X * Note that only three digits are of the Soundex code are stored X * in a dictionary X */ Xproc(soundex) Xint soundex; X{ X register int c, len; X datum dbm_key, dbm_content; X key_t *key, keyvec[KEYSIZE]; X char *mk_word(), *p; X X key = keyvec; X dbm_key.dptr = (char *) key; X dbm_key.dsize = KEYSIZE; X c = 0; X while (1) { X mk_key(key, soundex, c); X dbm_content = FETCH(dbm_key); X X if (dbm_content.dptr == 0) X break; X X if (IS_DELETED(dbm_content)) { X if (++c > MAXCOUNT) { X fprintf(stderr, "sp: entry count overflow\n"); X exit(1); X } X continue; X } X X if (nmatched == MAXWORDS) { X fprintf(stderr, "sp: Too many matches\n"); X exit(1); X } X X p = mk_word(dbm_content.dptr, dbm_content.dsize, soundex); X len = strlen(p); X if (wordlistptr + len >= &wordlist[WORDSPACE]) { X fprintf(stderr, "sp: Out of space for words\n"); X exit(1); X } X strncpy(wordlistptr, p, len); X wordlistptr[len] = '\0'; X wordptr[nmatched++] = wordlistptr; X wordlistptr += len + 1; X if (++c > MAXCOUNT) { X fprintf(stderr, "sp: entry count overflow\n"); X exit(1); X } X } X} X X/* X * Select and print those words which we consider X * to have matched 'word' X */ Xchoose(word) Xregister char *word; X{ X register int c, code, i, len, mcount, wordlen; X register char *p; X int compar(); X X code = soundex(word, 4); X qsort(wordptr, nmatched, sizeof(char *), compar); X c = range(word); X wordlen = strlen(word); X mcount = 0; X for (i = 0; i < nmatched; i++) { X p = wordptr[i]; X if (strmatch(word, p) == 0) { X printf("X"); X if (eflag) { X printf(" %s\n", word); X return; X } X } X else if (eflag) X continue; X else if (soundex(p, 4) == code) X printf("!"); X else if (aflag && X (wordlen < (len = strlen(p)) - c || len > wordlen + c)) X printf(" "); X else if (!cflag) X printf("*"); X else X continue; X printf("%3d. %s\n", mcount + 1, p); X mcount++; X } X if (vflag) X printf("(%d total matches)\n", nmatched); X} X X/* X * Compute an 'n' digit Soundex code for 'word' X * See mksp.c X */ Xsoundex(word, n) Xregister char *word; Xint n; X{ X register int c, digit_part, previous_code, soundex_length; X register char *p, *w; X char wcopy[MAXWORDLEN + 2]; X X strcpy(wcopy, word); X p = w = wcopy; X while (*p != '\0') { X if (isupper(*p)) X *p = tolower(*p); X p++; X } X if (!isalpha(*w)) { X fprintf(stderr, "sp: Improper word: %s\n", word); X return(BAD_WORD); X } X digit_part = 0; X soundex_length = 0; X previous_code = soundex_code_map[*w - 'a']; X for (p = w + 1; *p != '\0' && soundex_length < n; p++) { X if (!isalpha(*p)) X continue; X c = soundex_code_map[*p - 'a']; X if (c == 0 || previous_code == c) { X previous_code = c; X continue; X } X digit_part = digit_part * 7 + c; X previous_code = c; X soundex_length++; X } X while (soundex_length++ < n) X digit_part *= 7; X return((digit_part << 5) + *w - 'a'); X} X X/* X * Process a path string (environment variable SPPATH, DEFAULT_SPPATH, or an X * arg) by separating the pathnames into strings pointed to by elements X * of 'fileptr' X * End of list indicated by fileptr entry of NULL X * X * No attempt made to ignore duplicate pathnames X */ Xmkfilelist(p) Xregister char *p; X{ X register int len; X register char *path, *start; X char *malloc(); X X while (*p != '\0' && dict_ptr < MAXDICT) { X start = p; X while (*p != ':' && *p != '\0') X p++; X if (start == p && *p == ':') { /* colon with nothing else */ X p++; X continue; X } X len = p - start; X path = (char *) malloc((unsigned) (len + 1)); X if (path == (char *) NULL) { X fprintf(stderr, "sp: Out of dictionary space\n"); X exit(1); X } X strncpy(path, start, len); X path[len] = '\0'; X fileptr[dict_ptr++] = path; X } X fileptr[dict_ptr] = (char *) NULL; X} X Xcompar(p, q) Xchar **p, **q; X{ X X return(strmatch(*p, *q)); X/* return(strcmp(*p, *q)); */ /* use if you prefer case sensitive */ X} X + END-OF-FILE sp.c chmod 'u=rw,g=r,o=r' 'sp.c' echo ' -rw-r--r-- 1 brachman 9128 Dec 12 14:52 sp.c (as sent)' echo -n ' ' /bin/ls -l sp.c echo 'Extracting mksp.c' sed 's/^X//' > mksp.c << '+ END-OF-FILE mksp.c' X/* vi: set tabstop=4 : */ X X/* X * mksp - make soundex dictionary X * Version 1.3, December 1986 X * X * If <soundexfile.{dir, pag}> do not exist, try to create them X * If they do exist and are not empty, then words will be added X * from the standard input X * Only when words are being added to an existing database are duplicate words X * ignored X * Valid words (words beginning with an alphabetic) are stored as given but X * comparisons for duplicates is case independent. X * Non-alphabetic characters are ignored in computing the soundex X * X * Permission is given to copy or distribute this program provided you X * do not remove this header or make money off of the program. X * X * Please send comments and suggestions to: X * Barry Brachman X * Dept. of Computer Science X * Univ. of British Columbia X * Vancouver, B.C. V6T 1W5 X * X * .. {ihnp4!alberta, uw-beaver}!ubc-vision!ubc-cs!brachman X * brachman@cs.ubc.cdn X * brachman%ubc.csnet@csnet-relay.arpa X * brachman@ubc.csnet X */ X X#include <ctype.h> X#include <errno.h> X#include <sys/types.h> X#include <sys/file.h> X#include <sys/stat.h> X#include <stdio.h> X X#ifdef NEWDBM X#include <ndbm.h> X#include <sys/file.h> X#else !NEWDBM X#include <dbm.h> X#endif NEWDBM X X#include "sp.h" X X#define NEW_DICT 0 X#define OLD_DICT 1 X X#define VFREQ 1000 /* frequency for verbose messages */ X X#define streq(X, Y) (!strcmp(X, Y)) X#define strneq(X, Y, N) (!strncmp(X, Y, N)) X X#define USAGE "Usage: mksp -tad [-v[#]] <soundexfile>" X Xint map[26][667]; X Xdatum FETCH(), FIRSTKEY(), NEXTKEY(); X Xkey_t keyvec[KEYSIZE]; Xkey_t *key = keyvec; X X/* X * Soundex codes X * The program depends upon the numbers zero through six being used X * but this can easily be changed X */ Xchar soundex_code_map[26] = { X/*** A B C D E F G H I J K L M N O P ***/ X 0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 4, 5, 5, 0, 1, X X/*** Q R S T U V W X Y Z ***/ X 2, 6, 2, 3, 0, 1, 0, 2, 0, 2 X}; X Xint digit_part; X Xint aflag, dflag, tflag, vflag; X Xchar *mk_word(); X Xmain(argc, argv) Xint argc; Xchar **argv; X{ X register int i; X char *file; X X if (argc != 3 && argc != 4) { X fprintf(stderr, "%s\n", USAGE); X exit(1); X } X aflag = dflag = tflag = vflag = 0; X file = (char *) NULL; X X for (i = 1; i < argc; i++) { X if (streq(argv[i], "-a")) X aflag = 1; X else if (streq(argv[i], "-d")) X dflag = 1; X else if (streq(argv[i], "-t")) X tflag = 1; X else if (strneq(argv[i], "-v", 2)) { X if (isdigit(argv[i][2])) X vflag = atoi(&argv[i][2]); X else X vflag = 1; X } X else if (file == (char *) NULL) X file = argv[i]; X else { X fprintf(stderr, "%s\n", USAGE); X exit(1); X } X } X X if (file == (char *) NULL || (tflag + aflag + dflag) != 1) { X fprintf(stderr, "%s\n", USAGE); X exit(1); X } X X if (aflag) { X addwords(file); X if (vflag > 1) { X register int j, total; X int m, max; X X fprintf(stderr, "Counters:\n"); X for (i = 0; i < 26; i++) { X total = max = map[i][0]; X for (j = 1; j < 667; j++) { X total += (m = map[i][j]); X if (m > max) X max = m; X } X if (max > 0) X fprintf(stderr, "%c: max %d total %d\n", 'a'+i, max, total); X } X } X } X else if (dflag) X deletewords(file); X else if (tflag) X prcontents(file); X exit(0); X} X X/* X * Add words read from stdin to the database X * The key is the 3 digit soundex code for the word plus a disambiguating X * counter. Different counter values are used for words with the same soundex X * code. The maximum counter value is MAXCOUNT. If the counter overflows then X * we lose, but given at least 8 bits this seems unlikely. X */ Xaddwords(name) Xchar *name; X{ X register int c, count, delete, duplicate, i, len; X register int *p; X int ch, s, status; X char wcopy[MAXWORDLEN + 2], word[MAXWORDLEN + 2]; X datum dbm_key, dbm_content; X X status = setup(name); X if (DBMINIT(name, O_RDWR) == -1) { X fprintf(stderr, "mksp: Can't initialize\n"); X exit(1); X } X X for (i = 0; i < 26; i++) X for (c = 0; c < 667; c++) X map[i][c] = 0; X X dbm_key.dptr = (char *) key; X dbm_key.dsize = KEYSIZE; X X count = 0; X X while (fgets(word, sizeof(word), stdin) != (char *) NULL) { X len = strlen(word); X if (word[len - 1] != '\n') { X fprintf(stderr, "mksp: Word too long: %s", word); X while ((ch = getchar()) != '\n') /* flush rest of line */ X putc(ch, stderr); X putc('\n', stderr); X continue; X } X word[--len] = '\0'; X if (len > MAXWORDLEN) { X fprintf(stderr, "mksp: Word too long: %s\n", word); X continue; X } X X if ((s = soundex(word, 3)) == BAD_WORD) { X if (vflag) X fprintf(stderr, "Ignoring bad word: %s\n", word); X continue; X } X ch = (isupper(word[0]) ? tolower(word[0]) : word[0]) - 'a'; X p = &(map[ch][digit_part]); X X /* X * If words are being added to an old dictionary, X * check for duplication and watch for a deleted entry X * The reason for only checking for duplicates in old dictionaries is X * that usually when you're creating a new dictionary the words are X * already sorted and unique and the creation of a large dictionary is X * slow enough already. X */ X duplicate = 0; X delete = -1; /* an 'impossible' counter */ X if (status == OLD_DICT) { X c = 0; X while (1) { X mk_key(key, s, c); X dbm_content = FETCH(dbm_key); X if (dbm_content.dptr == 0) X break; X X if (!IS_DELETED(dbm_content)) { X char *str; X X str = mk_word(dbm_content.dptr, dbm_content.dsize, s); X if (strmatch(word, str) == 0) { X duplicate = 1; X if (vflag) X fprintf(stderr, "duplicate: %s\n", word); X break; X } X } X else if (delete < 0) /* choose delete nearest front */ X delete = c; X X if (++c > MAXCOUNT) { X fprintf(stderr, "mksp: Counter overflow\n"); X fprintf(stderr, "soundex: %c%d\n", ch+'a', b10(digit_part)); X exit(1); X } X } X if (duplicate) X continue; X *p = c; X } X if (*p > MAXCOUNT) { X fprintf(stderr, "mksp: Counter overflow\n"); X fprintf(stderr, "soundex: %c%d\n", ch+'a', b10(digit_part)); X exit(1); X } X mk_key(key, s, *p); X *p = *p + 1; X strcpy(wcopy, word); X if (len == 1) { X if (isupper(wcopy[0])) X wcopy[0] |= UPPER_CHAR; X wcopy[0] |= SINGLE_CHAR; X dbm_content.dptr = wcopy; X } X else { X if (isupper(wcopy[1])) X wcopy[1] = wcopy[1] - 'A' + 26; X else if (islower(wcopy[1])) X wcopy[1] = wcopy[1] - 'a'; X else if ((wcopy[1] = tospchar(wcopy[1])) == '\0') { X fprintf(stderr, "Bogus second char: can't happen!\n"); X exit(1); X } X if (isupper(wcopy[0])) X wcopy[1] = wcopy[1] | UPPER_CHAR; X dbm_content.dptr = wcopy + 1; X len--; X } X dbm_content.dsize = len; /* null not stored */ X if (delete < 0) { X if (STORE(dbm_key, dbm_content) == -1) { X fprintf(stderr, "mksp: Can't store\n"); X exit(1); X } X } X else { X if (vflag) X fprintf(stderr, "reusing: %s\n", word); X mk_key(key, s, delete); X if (REPLACE(dbm_key, dbm_content) == -1) { X fprintf(stderr, "mksp: Can't replace\n"); X exit(1); X } X } X count++; X if (vflag > 1) X fprintf(stderr, "%5d: %s(%d)\n", count, word, ex_count(key)); X if (vflag && (count % VFREQ) == 0) X fprintf(stderr, "%5d: %s\n", count, word); X } X if (vflag) X fprintf(stderr, "%d words\n", count); X DBMCLOSE(); X} X X/* X * Print out everything X */ Xprcontents(name) Xchar *name; X{ X register int s; X datum dbm_key, dbm_content; X X if (DBMINIT(name, O_RDONLY) == -1) X exit(1); X X dbm_key = FIRSTKEY(); X while (dbm_key.dptr != NULL) { X dbm_content = FETCH(dbm_key); X if (dbm_content.dptr == 0) X break; /* ??? */ X X if (vflag) X printf("%3d. ", ex_count((key_t *) dbm_key.dptr)); X if (IS_DELETED(dbm_content)) { X if (vflag) X printf("(deleted)\n"); X } X else { X s = ex_soundex((key_t *) dbm_key.dptr); X printf("%s\n", mk_word(dbm_content.dptr, dbm_content.dsize, s)); X } X dbm_key = NEXTKEY(dbm_key); X } X DBMCLOSE(); X} X X/* X * When words are deleted they must be marked as such rather than deleted X * using DELETE(). This is because the sequence of counters must remain X * continuous. If DELETE() is used then any entries with the same soundex X * but with a larger counter value would not be accessible. This approach X * does cost some extra space but if an addition is made to the chain then X * a deleted counter slot will be reused. Also, the storage used by the word X * should be made available to dbm. This could be improved somewhat X * by actually using DELETE() on the last entry of the chain. X */ Xdeletewords(name) Xchar *name; X{ X register int c, ch, len, s; X register char *p; X char word[MAXWORDLEN + 2]; X datum dbm_key, dbm_content; X X if (DBMINIT(name, O_RDWR) == -1) X exit(1); X X while (fgets(word, sizeof(word), stdin) != (char *) NULL) { X len = strlen(word); X if (word[len - 1] != '\n') { X fprintf(stderr, "mksp: Word too long: %s", word); X while ((ch = getchar()) != '\n') /* flush rest of line */ X putc(ch, stderr); X putc('\n', stderr); X continue; X } X word[--len] = '\0'; X if (len > MAXWORDLEN) { X fprintf(stderr, "mksp: Word too long: %s\n", word); X continue; X } X X if ((s = soundex(word, 3)) == BAD_WORD) { X if (vflag) X fprintf(stderr, "Bad word: %s\n", word); X continue; X } X X c = 0; X while (1) { X dbm_key.dptr = (char *) key; X dbm_key.dsize = KEYSIZE; X mk_key(key, s, c); X dbm_content = FETCH(dbm_key); X if (dbm_content.dptr == NULL) { X if (vflag) X fprintf(stderr, "Not found: %s\n", word); X break; X } X X if (!IS_DELETED(dbm_content)) { X p = mk_word(dbm_content.dptr, dbm_content.dsize, s); X if (strmatch(word, p) == 0) { X /* X * Aside: X * Since dptr points to static storage it must be reset X * if we want to retain the old content (content.dptr=word) X * This took a while to determine... X * Anyhow, since there is no need to store the old word X * we free up the space X */ X dbm_content.dptr = ""; X dbm_content.dsize = 0; X if (REPLACE(dbm_key, dbm_content) == -1) X fprintf(stderr, "mksp: delete of '%s' failed\n", word); X else if (vflag) { X if (vflag > 1) X fprintf(stderr, "%d. %s ", c, p); X fprintf(stderr, "deleted\n"); X } X break; X } X else if (vflag > 1) X fprintf(stderr, "%d. %s\n", c, p); X } X else if (vflag > 1) X fprintf(stderr, "%d. (deleted)\n", c); X X if (++c > MAXCOUNT) { X ch = isupper(word[0]) ? tolower(word[0]) : word[0]; X fprintf(stderr, "mksp: Counter overflow\n"); X fprintf(stderr, "soundex: %c%d\n", ch, b10(digit_part)); X exit(1); X } X } X } X DBMCLOSE(); X} X X/* X * Setup the dictionary files if necessary X */ Xsetup(name) Xchar *name; X{ X register int s1, s2; X X s1 = check_dict(name, ".dir"); X s2 = check_dict(name, ".pag"); X if (s1 == NEW_DICT && s2 == NEW_DICT) X return(NEW_DICT); X return(OLD_DICT); X} X X/* X * Check if a dictionary file exists: X * - if not, try to create it X * - if so, see if it is empty X * Return NEW_DICT if an empty file exists, X * OLD_DICT if a non-empty file exists X * Default mode for new files is 0666 X */ Xcheck_dict(name, ext) Xchar *name, *ext; X{ X register int len, s; X char *filename; X struct stat statbuf; X char *malloc(); X extern int errno; X X len = strlen(name) + strlen(ext) + 1; X filename = (char *) malloc((unsigned) len); X if (filename == (char *) NULL) { X fprintf(stderr, "mksp: Can't malloc '%s.%s'\n", name, ext); X exit(1); X } X sprintf(filename, "%s%s", name, ext); X if (stat(filename, &statbuf) == -1) { X if (errno != ENOENT) { X perror("mksp"); X exit(1); X } X if (creat(filename, 0666) == -1) { X perror("mksp"); X exit(1); X } X s = NEW_DICT; X } X else { X if (statbuf.st_size == 0) X s = NEW_DICT; X else X s = OLD_DICT; X } X return(s); X} X X/* X * Compute an 'n' digit Soundex code for 'word' X * As a side effect, leave the digit part of the soundex in digit_part X * X * Since the soundex can be considered a base 7 number, if 'n' is: X * 3 require 9 (10 if base 10) bits for digits X * 4 require 12 (13) bits X * 5 require 15 (17) bits X * 6 require 17 (20) bits X * X * The three slightly different versions of this routine should be coalesced. X */ Xsoundex(word, n) Xregister char *word; Xint n; X{ X register int c, soundex_length, previous_code; X register char *p, *w; X char wcopy[MAXWORDLEN + 2]; X X if (!IS_VALID(word)) X return(-1); X X strcpy(wcopy, word); X p = w = wcopy; X X while (*p != '\0') { X if (isupper(*p)) X *p = tolower(*p); X p++; X } X X digit_part = 0; X soundex_length = 0; X previous_code = soundex_code_map[*w - 'a']; X for (p = w + 1; *p != '\0' && soundex_length < n; p++) { X if (!isalpha(*p)) X continue; X c = soundex_code_map[*p - 'a']; X if (c == 0 || previous_code == c) { X previous_code = c; X continue; X } X digit_part = digit_part * 7 + c; X previous_code = c; X soundex_length++; X } X while (soundex_length++ < n) X digit_part *= 7; X return((digit_part << 5) + *w - 'a'); X} X Xb10(n) Xint n; X{ X register int b10, s; X X for (b10 = 0, s = 1; n != 0; n /= 7) { X b10 += (n % 7) * s; X s *= 10; X } X return(b10); X} X + END-OF-FILE mksp.c chmod 'u=rw,g=r,o=r' 'mksp.c' echo ' -rw-r--r-- 1 brachman 12797 Dec 11 21:56 mksp.c (as sent)' echo -n ' ' /bin/ls -l mksp.c echo 'Extracting dbm.bug' sed 's/^X//' > dbm.bug << '+ END-OF-FILE dbm.bug' XArticle 770 of net.bugs.4bsd: XPath: ubc-cs!ubc-ean!alberta!ihnp4!mhuxn!mhuxr!ulysses!allegra!mit-eddie!genrad!panda!talcott!harvard!seismo!elsie!ado XFrom: ado@elsie.UUCP (Arthur David Olson) XSubject: 4.?bsd dbm's store(k,c) dies if (i=k.dsize+c.dsize)==1018||i==1019--FIX XDate: Wed, 10-Apr-85 09:02:36 PST XDate-Received: Thu, 11-Apr-85 13:03:49 PST XOrganization: NIH-LEC, Bethesda, MD X XIndex: lib/libdbm/dbm.c Fix X XDescription: X 4.?bsd dbm's store function misbehaves if the sum of the key data X size and content data size is either 1018 or 1019. X XRepeat-By: X Compile this program with the "dbm" library: X X typedef struct { X char * dptr; X int dsize; X } datum; X X char buf[1024]; X X main(argc, argv) X int argc; X char * argv[]; X { X int result; X datum key; X datum content; X X key.dptr = content.dptr = buf; X key.dsize = atoi(argv[1]); X content.dsize = 0; X creat("fake.dir", 0600); X creat("fake.pag", 0600); X dbminit("fake"); X result = store(key, content); X printf("%d\n", result); X } X X Then run the program. If you use commands such as X a.out 0 X a.out 1 X ... X a.out 1017 X things go swimmingly. If you use commands such as X a.out 1019 X a.out 1020 X ... X an error message is (correctly) produced. But if you use either X the command X a.out 1018 X or X a.out 1019 X things go wild. X XFix: X As usual, the trade secret status of the code involved precludes a X clearer posting. The fix is to change one line in "dbm.c"; it X causes an error message to be produced in the 1018/1019 cases: X X #ifdef OLDVERSION X if(key.dsize+dat.dsize+2*sizeof(short) >= PBLKSIZ) { X #else X if(key.dsize+dat.dsize+3*sizeof(short) >= PBLKSIZ) { X #endif X-- XBugs is a Warner Brothers trademark X-- X UUCP: ..decvax!seismo!elsie!ado ARPA: elsie!ado@seismo.ARPA X DEC, VAX and Elsie are Digital Equipment and Borden trademarks X X + END-OF-FILE dbm.bug chmod 'u=rw,g=r,o=r' 'dbm.bug' echo ' -rw-r--r-- 1 brachman 1839 Dec 1 10:50 dbm.bug (as sent)' echo -n ' ' /bin/ls -l dbm.bug echo 'Extracting Makefile.newdbm' sed 's/^X//' > Makefile.newdbm << '+ END-OF-FILE Makefile.newdbm' X X# Makefile for systems with the new dbm routines (e.g., 4.3BSD systems), X# those with a dbm library containing dbmclose() (e.g., Sun 2 and Sun 3), X# and those with the old dbm without source X XCFLAGS=-O -DNEWDBM XLIB= X X# If you are using the old dbm routines with dbmclose(), uncomment X# the following two lines (otherwise comment them out) XCFLAGS=-O -DHAS_CLOSE XLIB=-ldbm X X# If you are using the old dbm routines without dbmclose(), uncomment X# the following two lines (otherwise comment them out) X# CFLAGS=-O X# LIB=-ldbm X Xall: sp mksp X Xsp: sp.o dbmstuff.o misc.o sp.h X cc ${CFLAGS} -o sp sp.o dbmstuff.o misc.o ${LIB} X Xmksp: mksp.o dbmstuff.o sp.h X cc ${CFLAGS} -o mksp mksp.o dbmstuff.o misc.o ${LIB} X Xcalcsoundex: calcsoundex.c X cc ${CFLAGS} -o calcsoundex calcsoundex.c X X.c.o: X cc -c ${CFLAGS} $? X Xclean: X rm -f sp.o mksp.o dbmstuff.o dbm.o X + END-OF-FILE Makefile.newdbm chmod 'u=rw,g=r,o=r' 'Makefile.newdbm' echo ' -rw-r--r-- 1 brachman 846 Dec 12 14:52 Makefile.newdbm (as sent)' echo -n ' ' /bin/ls -l Makefile.newdbm echo 'Extracting sp.ml' sed 's/^X//' > sp.ml << '+ END-OF-FILE sp.ml' X; X; Written by Donald Acton and Barry Brachman with help from Marc Majka X; March 11/85 X; X; Dept. of Computer Science X; University of British Columbia X; X X(defun X (sp spword X (setq spword (get-tty-string "Word: " )) X (do-sp spword) X ) X) X X(defun ; bjb X (sp-from-buffer spword X (setq spword (get-next-word)) X (message (concat "Word: " spword)) X (sit-for 0) X (do-sp spword) X ) X) X X(defun X (do-sp curr found tmp X (setq curr (current-buffer-name)) X (pop-to-buffer "sp") X (setq needs-checkpointing 0) X (erase-buffer) X (set-mark) X (filter-region (concat "sp " spword)) X (beginning-of-file) X (setq found " *not found*") X (setq tmp case-fold-search) X (setq case-fold-search 1) X (error-occurred (re-search-forward (concat "\. " spword "$")) X (beginning-of-line) X (line-to-top-of-window) X (setq found " *found*")) X (setq case-fold-search tmp) X (setq mode-line-format " %b - %m %p") X (setq mode-string (concat spword found)) X (pop-to-buffer curr) X (novalue))) X X; X; Return the word the cursor is pointing at X; or the word immediately to the left of the X; cursor if it is between words X; X; April 15/85 - bjb X; X(defun X (get-next-word original-dot rb spword X (save-excursion X (setq original-dot (dot)) X (backward-word) (forward-word) (setq rb (dot)) X (if (> original-dot rb) X (progn (forward-word) (backward-word)) X (backward-word)) X (set-mark) X (forward-word) X (setq spword (region-to-string)) X ) X spword X ) X) X + END-OF-FILE sp.ml chmod 'u=rw,g=r,o=r' 'sp.ml' echo ' -rw-r--r-- 1 brachman 1613 Dec 8 15:06 sp.ml (as sent)' echo -n ' ' /bin/ls -l sp.ml echo 'Extracting sp.1' sed 's/^X//' > sp.1 << '+ END-OF-FILE sp.1' X.TH SP 1-LOCAL "12 December 1986" X.UC 4 X.SH NAME Xsp \- give possible spellings X.br Xmksp \- maintain sp dictionaries X.br Xcalcsoundex \- calculate soundex values X.SH SYNOPSIS X.B sp X[ X.B -vace X] [ X.B -f dictionary-list X] [ X.B word ... X] X.br X.B mksp X[ X.B -adt X] [ X.B -v# X] X.B dictionary X.br X.B calcsoundex X[ X-v X] X.SH DESCRIPTION X.I Sp Xtakes one or more words as input and for each word prints Xa list of possible spellings. XIf the words are not given on the command line, the program prompts and reads Xfrom the standard input. XYou must know the first letter of the word. XUpper case is mapped to lower case. XWords must start with an alphabetic character, but any subsequent Xcharacters need not be alphabetic (but see the limitation below on words that Xmay appear in the dictionary). XBlanks are allowed within a word. X.PP XUp to ten dictionaries previously created by X.I mksp Xmay be specified by a command line argument and an environment variable. XThe name of a dictionary is specified by a pathname, Xnot including the suffix (.dir or .pag). XA list of dictionaries consists of one or more colon separated dictionary Xnames. XThe environment variable SPPATH may be set to a dictionary list. XIf a command line dictionary list is given in addition to the SPPATH variable, Xall dictionaries are used. XIf no dictionaries are specified the program looks for default dictionaries. X.PP XTo reduce the size of the word list, certain heuristics are used. XNormally, all words X.I sp Xconsiders to be a "satisfactory" match are printed. XThe \fB-c\fR option causes only close Xmatches or an exact match to be printed. XThe \fB-e\fR option only prints an exact match. XThe \fB-a\fR option causes all words matched to be printed. XThe output is sorted alphabetically and indicators are printed beside each Xword: X.sp 2 X.in +0.5i X.nf X.na X X == exact match X.br X ! == close match X.br X * == good match X.br X ' ' == matched X.in -0.5i X.fi X.ad X.sp 2 XDuplicated words are not removed from the listing. X.PP XIf the \fB-v\fR flag is given, X.I sp Xbecomes verbose. X.PP X.I Mksp Xis a program to maintain dictionaries for use with X.I sp. XThe \fB-a\fR option is used to create a new dictionary or to add Xwords to an existing dictionary. XThe words to be put in the dictionary are read from the standard input, Xone per line. XThe \fB-v\fR flag (which may immediately be followed by an optional number) Xcauses some information to be printed as words are processed. XA non-flag argument to X.I mksp Xis assumed to be the prefix of the name of the dictionary files. XThe dictionary consists of two files, one with a ".dir" suffix and one with Xa ".pag" suffix (see \fBdbm\fR(3X)). XIf these files do not exist, X.I mksp Xwill create both. XThe words need not be sorted. XThere should not be duplicates in the word list when creating a new dictionary Xbut when words are added to an existing dictionary, duplicates are ignored. XUpper case letters are stored in the dictionary but X.I mksp Xmaps upper case to lower case when checking for duplicate words. X.I Sp Xis case insensitive when searching the database. XThe first character of a word must be alphabetic and Xthe second character (if present) must be either alphabetic, a single quote, Xan ampersand, a period, or a blank. XThere is no restriction on the third and subsequent characters. X.PP XThe \fB-d\fR option is used to delete words from the specified dictionary. XThe words are read from the standard input. XIf a word is not found in the dictionary, no message is printed. XThe comparison is case insensitive. X.PP XThe \fB-t\fR option prints the contents of the specified dictionary. XThe words are not sorted. X.PP X.I Calcsoundex Xreads words from the standard input (one per line) and prints the soundex Xcode corresponding to each word on stdout. XWith the \fB-v\fR flag, X.I calcsoundex Xalso echoes each word. X.SH EXAMPLE X.in +0.5i X.na X.nf X% mksp -a mydictionary Xaardvark Xprecipitation X<more words> Xzyzz X<^D> X% sp -c -f mydictionary:herdictionary XWord? propogate X! 1. perfect X! 2. perfectible X<more words> X! 7. propagate X! 8. proposition X<more words> XWord? <^D> X.fi X.ad X.in -0.5i X.SH FILES X.nf X.na X<dictionary.pag>, <dictionary.dir> dictionary data base X/usr/public/lib/sp/{soundex,aux.soundex} default dictionaries X.fi X.ad X.SH LIMITATIONS XNo more than 10 dictionaries may be specified. XThe maximum length of a word is 50 characters. XThe program can return up to 400 matches taking up a maximum of 20480 bytes. XThe limitation on what the second character of a word can be is due to Xthe algorithm used to compress the dictionaries; there is some room in the Xdata structure for a few other characters to become valid. XDbm doesn't work between a Sun and VAX across NFS so you can't share a Xdictionary between a VAX and a Sun. X.PP XThere is a limit on the number of words having the same soundex code that can Xappear in a single dictionary. XThis value should be at least 256 (on a VAX/Sun it is 1024), but it depends Xon how the program has been configured locally. XIf you come up against this limitation you can split your dictionary; e.g., Xextract every second word from the big dictionary to make a new dictionary, Xthen delete the words from the big dictionary. XThe following pipeline can be used to determine the number of times each Xsoundex code appears in a list of words (one per line): X.sp 2 X.ti +0.5i Xcalcsoundex | sort | uniq -c | sort -r -0n X.SH "SEE ALSO" Xspell(1), uniq(1), dbm(3X), ndbm(3) X.br XDonald E. Knuth, The Art of Computer Programming, Vol. 3, XSorting and Searching, Addison-Wesley, pp. 391-392, 1973. X.SH AUTHOR XBarry Brachman X.br XDept. of Computer Science X.br XUniversity of British Columbia X.SH BUGS XYou may not agree on what constitutes a match. XYou are likely to have to create your own dictionary as the UNIX Xdictionary is far from complete. In particular, the suffixes Xhave been removed from most words. XThe limitations mentioned above are arbitrary. XThe limitation on the second character of a word is disgusting. X + END-OF-FILE sp.1 chmod 'u=rw,g=r,o=r' 'sp.1' echo ' -rw-r--r-- 1 brachman 5937 Dec 12 21:08 sp.1 (as sent)' echo -n ' ' /bin/ls -l sp.1 echo 'Extracting sp.9' sed 's/^X//' > sp.9 << '+ END-OF-FILE sp.9' X.TH SP 9-EMACS "8-Aug-1985" X.UC X.SH NAME Xsp, sp-from-buffer \- give possible spellings X.br Xget-next-word \- return word at cursor X.SH SYNOPSIS X.B (sp) X.br X.B (sp-from-buffer) X.br X.B (get-next-word) X.SH DESCRIPTION X.I Sp Xprompts for a word and then invokes the X.B sp Xprogram with the word as an argument. XThe output of the X.B sp Xprogram, a list of words, is placed in a buffer called "sp". XIf the word is found, then the cursor is positioned at the word. XIf the word isn't found, the cursor is left at the start of the buffer. XIn either case the mode line of the buffer indicates whether the word Xwas found. X.sp 2 X.B Sp-from-buffer Xis the same as X.B sp Xexcept that the user is not prompted for the word; the word under or Ximmediately to the left of the cursor is used. X.sp 2 X.B Get-next-word Xreturns the word the cursor is pointing at or the word immediately Xto the left of the cursor if it is between words. X.SH BUFFERS USED Xsp \- result X.SH FILES X/usr/local/lib/emacs/sp.ml X.br X/usr/local/sp X.SH SEE ALSO X.B sp(1-LOCAL) X.SH AUTHOR XBarry Brachman X.br XDept. of Computer Science X.br XUniversity of British Columbia + END-OF-FILE sp.9 chmod 'u=rw,g=r,o=r' 'sp.9' echo ' -rw-r--r-- 1 brachman 1112 Dec 11 21:43 sp.9 (as sent)' echo -n ' ' /bin/ls -l sp.9 exit 0