OKeefe.R.A.%EDXA%UCL-CS@sri-unix.UUCP (01/28/84)
From: O'Keefe HPS (on ERCC DEC-10) <OKeefe.R.A.%EDXA@UCL-CS> MoRe: Sets and Hashing. 1. I forgot to mention two VERY important representations of sets I use all the time. (f) the union of two other sets (g) the difference of two other sets The two other sets can be represented any old how, and need not have the same representation. It is not uncommon to design an algorithm with 3 set variables: EveryThing, StillToDo, AlreadyDone. What we want to do is transfer elements from StillToDo to AlreadyDone *fast*, and to test *fast* whether StillToDo is empty. A useful approach is to make StillToDo a list (perhaps an ordered list, what setof returns is *ideal* for this sort of work), and to make AlreadyDone some sort of binary tree. Then EveryThing is StillToDo U AlreadyDone. This technique of designing a program using lots of variables with exactly what you want in them, and then finding a representation in terms of another set of variables where many operations of interest are vacuous is one I got from some of Dijkstra's papers. 2. I pointed out that comparison can be made cheap. I should also have admitted that hashing can be made similarly cheap, at a cost of one word per functor, holding the hash code of the functor. Of course IF the garbage collector promises not to move functor blocks then you can use the address, but that's a stupid thing for a garbage collector to promise. (DEC-10 Prolog is the only Prolog system known to me which currently has its own garbage collector, though Prolog-X is getting one and NIP will have one before it is released. Poplog and LM-Prolog have garbage collectors, but they inherit them from their implementation languages. Though PopLog's garbage collector has come to know a fair bit about Prolog.) You can use the address in C-Prolog, because it hasn't got a garbage collector. Having an extra word per functor would let the garbage collector collect *everything*, and it would also give us some choice about what the hash code for a functor was. We could let the hash code of the Nth functor to be created be N, as in the AI package PEARL. We could calculate it from the name of the symbol of the functor and the arity of the functor, using any hashing function that takes our fancy. (C-Prolog just adds up all the characters in the name. I thought this was dreadful, and tested it and several other string hashing functions on the entire C-Prolog system, and just plain adding up the characters was the best method. Sigh.) Or we could take the Nth element of a pseudorandom sequence as the code of the Nth functor. Calculating the hash code from the name might make it easier for the garbage collector to rebuild the symbol table, and even change its size. 3. We can get around the necessity for terms to be ground before they can reliably be hashed by specifying a level below which the hasher is not to look. So the built-in hash function should look like term_hash(Term, Depth, NBuckets, HashCode) returning a HashCode between 1 and NBuckets (since Prolog vectors are indexed from 1). It should be an instantiation error if the term has a variable at or above the given depth. Depth 0 would just use the value of a number or the hash code associated with a term's principal functor. The question which remains is "what is a good way of calculating a hash code for a compound term given the hash code of its principal functor and the sequence of hash codes of its arguments"? I'm inclined to suggest combined-hash([F,A1,...,An]) = F + A1/2 + ... + An/2^n which is pretty easy to calculate, but I haven't tried it yet. The reason for making NBuckets a parameter of the hashing function is to let the implementation use full-precision machine arithmetic internally. Does this seem worth having? If enough people say yes, it'll be provided in C-Prolog v1.4d.EDAI.