[comp.lang.c] Storing substrings: some real C code in comp.lang.c!

bill@twwells.com (T. William Wells) (10/25/89)

I have been beating my head against the wall, looking for speed,
*speed*, and MORE SPEED in this program I'm working on. At only
800 WPM (on my 16MHz '386) it is not exactly a speed demon.
Especially since it has to be run on large lists of words, none of
which are smaller than 80,000 words.

I'm not going to say exactly what the code is for (confidentiality
and all that), but the immediate problem I'm trying to solve is
storing all the substrings of the input words, up to some run time
specified length limit, so that I can get to them *very* quickly.
I also don't have lots of memory. Well, actually, I do, but I'm
spending it on a *much* bigger data structure: I'm storing
information on *pairs* of these substrings....

Originally, I had a trie (urk, what a word!), but that proved too
slow. What I have now, and am presenting here, is a variation on
it.

Here's what I did. I still have a tree, of sorts, but the only
pointers in it are parent pointers. The tree is embedded in a hash
table. The elements of the hash table contain a parent pointer and
a character. To find a string, hash the string and check each
element of the table from that hash point, looking up the parent
pointers to find the other characters in the string.

This code, BTW, takes 60% of the time of the old trie routine.
Since it, now, takes a third of my program's time, you can see
that I'm still likely to profit by speeding it up.

You might be wondering why I don't just store the strings in the
table and be done with it. Well, first, storing the strings means
that each hash entry either needs some dynamic allocation
(expensive!) or a fixed string area, imposing a fixed upper bound
on the string length (undesirable). Moreoever, the nature of the
problem is such that, if a string is stored in the table, so is
every prefix of the string. And the rest of the program depends on
being able to find a prefix of a string very quickly.

This code is a bit rough, since I'm likely to tinker further with
it to get the speed I need, and it shows some of its ancestry, but
I thought some people might be interested. Also, I've just picked
it out of the program; that means it ought to work, but, on the
other hand, I could have picked it incorrectly.

And, maybe, we'll get to talk about some *C*, instead of the fever
dreams that have been infesting this group lately. :-(

Speaking of which, if anyone has suggestions for speeding this up,
besides loop unrolling which is what I'm going to try next, I'd
appreciate hearing about them. Better algorithms would especially
be appreciated. And, oh yes, assembly is not an option. It has to
run on my '386 and three different Sun platforms. And whatever
else the company might buy next year.

#define ARCSIZE         16411   /* size of hash table for arcs, prime */

typedef struct ARC {
	struct ARC *ar_parent;  /* parent of this arc */
	short   ar_label;       /* character labeling the arc */
} ARC;

ARC     Arc[ARCSIZE + 1];       /* hash table for arcs */
long    Arc_call;               /* number of calls to add to the arcs */
long    Arc_cnt;                /* number of arcs assigned */
long    Arc_miss;               /* number of skipped arc entries */

/* This routine stores a string in the arc table. It returns the arc of the
   last character in the string. */

ARC *
store_string(str, ep)
char    *str;                   /* start of the string to add */
char    *ep;                    /* end of the string */
{
	register ARC *ap1;
	register ARC *ap2;
	register char *ptr;
	unsigned long hash;

	/* A null string is represented by a null pointer. */

	if (str == ep) {
		return (0);
	}
	/* Compute the hash code for the string. */

	hash = 0;
	for (ptr = str; ptr < ep; ++ptr) {
		hash += (hash << 5) + *ptr + 1;
	}
	/* Search till we have a null entry or have found a match. We scan
	   first from the hash position to the end of the table; note the use
	   of the nul character at the end of the table as a sentinel. */

	++Arc_call;
	for (ap1 = &Arc[hash % ARCSIZE]; ap1->ar_label; ++ap1) {
		ap2 = ap1;
		ptr = ep;
		while (ap2->ar_label == *--ptr) {
			if (!(ap2 = ap2->ar_parent)) {
				if (ptr == str) {
					return (ap1);
				}
				break;
			} else if (ptr == str) {
				break;
			}
		}
		++Arc_miss;
	}
	/* If we hit the sentinel, scan from the start of the table. */

	if (ap1 == Arc + ARCSIZE) {
		for (ap1 = Arc; ap1->ar_label; ++ap1) {
			ap2 = ap1;
			ptr = ep;
			while (ap2->ar_label == *--ptr) {
				if (!(ap2 = ap2->ar_parent)) {
					if (ptr == str) {
						return (ap1);
					}
					break;
				} else if (ptr == str) {
					break;
				}
			}
			++Arc_miss;
		}
	}
	/* If we got here, the string is not in the table. We also have an
	   empty entry which we can use. So, we use that and then call
	   ourselves recursively to put the prefix of this string into the
	   table. Note the kludge with the two assignments of ar_label. The
	   point is to assign the entry but to make it have an illegal value
	   so that the matching code doesn't accidentally succeed. */

	if (Arc_cnt++ == ARCSIZE) {
		fflush(stdout);
		fprintf(stderr, "arc table overflow\n");
		exit(1);
	}
	ap1->ar_label = -1;
	ap1->ar_parent = store_string(str, --ep);
	ap1->ar_label = *ep;
	return (ap1);
}

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

hutch%citron.cs.clemson.edu@hubcap.clemson.edu (David Hutchens) (10/25/89)

From article <1989Oct25.091519.10718@twwells.com>, by bill@twwells.com (T. William Wells):
> I have been beating my head against the wall, looking for speed,
> *speed*, and MORE SPEED in this program I'm working on. At only
> 800 WPM (on my 16MHz '386) it is not exactly a speed demon.
> Especially since it has to be run on large lists of words, none of
> which are smaller than 80,000 words.
>
> Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
> bill@twwells.com

First, forget the tree, and the hash.  What you need is a Finite State
Automaton.  The basic idea for this can be found in the May 1988
issue of CACM in an article on an implementation of SCRABBLE(tm).

I've implemented a version of this and I can build the FSA in under
a minute (for about 50000 words) on a sun-3 and under 5 minutes on
a 80286.  If possible, you should build the FSA for the words in
a separate processing step since it does take a bit of memory.
You'll need about a Meg.

The FSA, by its very nature allows you to get at prefix strings
very easily (that is quite useful in SCRABBLE).

	David Hutchens
	hutch@hubcap.clemson.edu

PS:  Since you are doing something confidential, I assume it is
for *business*.  I'll be glad to discuss selling you the right
to incorporate my program.

bill@twwells.com (T. William Wells) (10/30/89)

In article <6871@hubcap.clemson.edu> hutch%citron.cs.clemson.edu@hubcap.clemson.edu writes:
: From article <1989Oct25.091519.10718@twwells.com>, by bill@twwells.com (T. William Wells):
: > I have been beating my head against the wall, looking for speed,
: > *speed*, and MORE SPEED in this program I'm working on. At only
: > 800 WPM (on my 16MHz '386) it is not exactly a speed demon.
: > Especially since it has to be run on large lists of words, none of
: > which are smaller than 80,000 words.
: >
: > Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
: > bill@twwells.com
:
: First, forget the tree, and the hash.  What you need is a Finite State
: Automaton.  The basic idea for this can be found in the May 1988
: issue of CACM in an article on an implementation of SCRABBLE(tm).

My first implementation (a trie) is exactly what an FSA would
generate as output. The major cost in it is searching each level
of the tree. Knuth shows a way around that; it has been
implemented in the patgen program that came (comes?) with TeX.
There was also that Programming Pearls article in CACM. (Sorry,
exact references are not handy.)

However, those have the drawback that a test of each level of the
tree is a bit more complex than in mine. Also, the code to insert
into those trees is nontrivial. And slow. Yes, I've used these
algorithms.

: I've implemented a version of this and I can build the FSA in under
: a minute (for about 50000 words) on a sun-3 and under 5 minutes on
: a 80286.

In a recent test, my current implementation has the routine
called 454,094 times, 19,194 of which actually add new strings to
the table. It takes 24 seconds on my '386, which is about .9 VAX
MIPS. I think that beats your FSA.

BTW, for those interested, the hashing line in my original code
should be replaced with:

		hash += (hash << 6) + *ptr;

The hit rate goes down by a factor of 6!

:           If possible, you should build the FSA for the words in
: a separate processing step since it does take a bit of memory.

Not possible, since the strings have to be generated on the fly.
To do it otherwise would require generating a *huge* temp file.

: PS:  Since you are doing something confidential, I assume it is
: for *business*.

Not a good assumption, since it could be research someone doesn't
want yet bruited about, or for the government, etc., but in this
case, you are right.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com