bill@twwells.com (T. William Wells) (10/25/89)
I have been beating my head against the wall, looking for speed, *speed*, and MORE SPEED in this program I'm working on. At only 800 WPM (on my 16MHz '386) it is not exactly a speed demon. Especially since it has to be run on large lists of words, none of which are smaller than 80,000 words. I'm not going to say exactly what the code is for (confidentiality and all that), but the immediate problem I'm trying to solve is storing all the substrings of the input words, up to some run time specified length limit, so that I can get to them *very* quickly. I also don't have lots of memory. Well, actually, I do, but I'm spending it on a *much* bigger data structure: I'm storing information on *pairs* of these substrings.... Originally, I had a trie (urk, what a word!), but that proved too slow. What I have now, and am presenting here, is a variation on it. Here's what I did. I still have a tree, of sorts, but the only pointers in it are parent pointers. The tree is embedded in a hash table. The elements of the hash table contain a parent pointer and a character. To find a string, hash the string and check each element of the table from that hash point, looking up the parent pointers to find the other characters in the string. This code, BTW, takes 60% of the time of the old trie routine. Since it, now, takes a third of my program's time, you can see that I'm still likely to profit by speeding it up. You might be wondering why I don't just store the strings in the table and be done with it. Well, first, storing the strings means that each hash entry either needs some dynamic allocation (expensive!) or a fixed string area, imposing a fixed upper bound on the string length (undesirable). Moreoever, the nature of the problem is such that, if a string is stored in the table, so is every prefix of the string. And the rest of the program depends on being able to find a prefix of a string very quickly. This code is a bit rough, since I'm likely to tinker further with it to get the speed I need, and it shows some of its ancestry, but I thought some people might be interested. Also, I've just picked it out of the program; that means it ought to work, but, on the other hand, I could have picked it incorrectly. And, maybe, we'll get to talk about some *C*, instead of the fever dreams that have been infesting this group lately. :-( Speaking of which, if anyone has suggestions for speeding this up, besides loop unrolling which is what I'm going to try next, I'd appreciate hearing about them. Better algorithms would especially be appreciated. And, oh yes, assembly is not an option. It has to run on my '386 and three different Sun platforms. And whatever else the company might buy next year. #define ARCSIZE 16411 /* size of hash table for arcs, prime */ typedef struct ARC { struct ARC *ar_parent; /* parent of this arc */ short ar_label; /* character labeling the arc */ } ARC; ARC Arc[ARCSIZE + 1]; /* hash table for arcs */ long Arc_call; /* number of calls to add to the arcs */ long Arc_cnt; /* number of arcs assigned */ long Arc_miss; /* number of skipped arc entries */ /* This routine stores a string in the arc table. It returns the arc of the last character in the string. */ ARC * store_string(str, ep) char *str; /* start of the string to add */ char *ep; /* end of the string */ { register ARC *ap1; register ARC *ap2; register char *ptr; unsigned long hash; /* A null string is represented by a null pointer. */ if (str == ep) { return (0); } /* Compute the hash code for the string. */ hash = 0; for (ptr = str; ptr < ep; ++ptr) { hash += (hash << 5) + *ptr + 1; } /* Search till we have a null entry or have found a match. We scan first from the hash position to the end of the table; note the use of the nul character at the end of the table as a sentinel. */ ++Arc_call; for (ap1 = &Arc[hash % ARCSIZE]; ap1->ar_label; ++ap1) { ap2 = ap1; ptr = ep; while (ap2->ar_label == *--ptr) { if (!(ap2 = ap2->ar_parent)) { if (ptr == str) { return (ap1); } break; } else if (ptr == str) { break; } } ++Arc_miss; } /* If we hit the sentinel, scan from the start of the table. */ if (ap1 == Arc + ARCSIZE) { for (ap1 = Arc; ap1->ar_label; ++ap1) { ap2 = ap1; ptr = ep; while (ap2->ar_label == *--ptr) { if (!(ap2 = ap2->ar_parent)) { if (ptr == str) { return (ap1); } break; } else if (ptr == str) { break; } } ++Arc_miss; } } /* If we got here, the string is not in the table. We also have an empty entry which we can use. So, we use that and then call ourselves recursively to put the prefix of this string into the table. Note the kludge with the two assignments of ar_label. The point is to assign the entry but to make it have an illegal value so that the matching code doesn't accidentally succeed. */ if (Arc_cnt++ == ARCSIZE) { fflush(stdout); fprintf(stderr, "arc table overflow\n"); exit(1); } ap1->ar_label = -1; ap1->ar_parent = store_string(str, --ep); ap1->ar_label = *ep; return (ap1); } --- Bill { uunet | novavax | ankh | sunvice } !twwells!bill bill@twwells.com
hutch%citron.cs.clemson.edu@hubcap.clemson.edu (David Hutchens) (10/25/89)
From article <1989Oct25.091519.10718@twwells.com>, by bill@twwells.com (T. William Wells): > I have been beating my head against the wall, looking for speed, > *speed*, and MORE SPEED in this program I'm working on. At only > 800 WPM (on my 16MHz '386) it is not exactly a speed demon. > Especially since it has to be run on large lists of words, none of > which are smaller than 80,000 words. > > Bill { uunet | novavax | ankh | sunvice } !twwells!bill > bill@twwells.com First, forget the tree, and the hash. What you need is a Finite State Automaton. The basic idea for this can be found in the May 1988 issue of CACM in an article on an implementation of SCRABBLE(tm). I've implemented a version of this and I can build the FSA in under a minute (for about 50000 words) on a sun-3 and under 5 minutes on a 80286. If possible, you should build the FSA for the words in a separate processing step since it does take a bit of memory. You'll need about a Meg. The FSA, by its very nature allows you to get at prefix strings very easily (that is quite useful in SCRABBLE). David Hutchens hutch@hubcap.clemson.edu PS: Since you are doing something confidential, I assume it is for *business*. I'll be glad to discuss selling you the right to incorporate my program.
bill@twwells.com (T. William Wells) (10/30/89)
In article <6871@hubcap.clemson.edu> hutch%citron.cs.clemson.edu@hubcap.clemson.edu writes:
: From article <1989Oct25.091519.10718@twwells.com>, by bill@twwells.com (T. William Wells):
: > I have been beating my head against the wall, looking for speed,
: > *speed*, and MORE SPEED in this program I'm working on. At only
: > 800 WPM (on my 16MHz '386) it is not exactly a speed demon.
: > Especially since it has to be run on large lists of words, none of
: > which are smaller than 80,000 words.
: >
: > Bill { uunet | novavax | ankh | sunvice } !twwells!bill
: > bill@twwells.com
:
: First, forget the tree, and the hash. What you need is a Finite State
: Automaton. The basic idea for this can be found in the May 1988
: issue of CACM in an article on an implementation of SCRABBLE(tm).
My first implementation (a trie) is exactly what an FSA would
generate as output. The major cost in it is searching each level
of the tree. Knuth shows a way around that; it has been
implemented in the patgen program that came (comes?) with TeX.
There was also that Programming Pearls article in CACM. (Sorry,
exact references are not handy.)
However, those have the drawback that a test of each level of the
tree is a bit more complex than in mine. Also, the code to insert
into those trees is nontrivial. And slow. Yes, I've used these
algorithms.
: I've implemented a version of this and I can build the FSA in under
: a minute (for about 50000 words) on a sun-3 and under 5 minutes on
: a 80286.
In a recent test, my current implementation has the routine
called 454,094 times, 19,194 of which actually add new strings to
the table. It takes 24 seconds on my '386, which is about .9 VAX
MIPS. I think that beats your FSA.
BTW, for those interested, the hashing line in my original code
should be replaced with:
hash += (hash << 6) + *ptr;
The hit rate goes down by a factor of 6!
: If possible, you should build the FSA for the words in
: a separate processing step since it does take a bit of memory.
Not possible, since the strings have to be generated on the fly.
To do it otherwise would require generating a *huge* temp file.
: PS: Since you are doing something confidential, I assume it is
: for *business*.
Not a good assumption, since it could be research someone doesn't
want yet bruited about, or for the government, etc., but in this
case, you are right.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com