[net.lang.prolog] More on Sets and Hashing

OKeefe.R.A.%EDXA%UCL-CS@sri-unix.UUCP (01/28/84)

From:  O'Keefe HPS (on ERCC DEC-10) <OKeefe.R.A.%EDXA@UCL-CS>

MoRe: Sets and Hashing.

1.  I forgot to mention two VERY important representations of
sets I use all the time.

    (f) the union of two other sets
    (g) the difference of two other sets

The two other sets can be represented any old how, and need not
have the same representation.  It is not uncommon to design an
algorithm with 3 set variables: EveryThing, StillToDo, AlreadyDone.
What we want to do is transfer elements from StillToDo to AlreadyDone
*fast*, and to test *fast* whether StillToDo is empty.  A useful
approach is to make StillToDo a list (perhaps an ordered list, what
setof returns is *ideal* for this sort of work), and to make
AlreadyDone some sort  of binary tree.  Then EveryThing is StillToDo
U AlreadyDone.  This technique of designing a program using lots of
variables with exactly what you want in them, and then finding a
representation in terms of another set of variables where many
operations of interest are vacuous is one I got from some of
Dijkstra's papers.

2.  I pointed out that comparison can be made cheap.  I should also
have admitted that hashing can be made similarly cheap, at a cost of
one word per functor, holding the hash code of the functor.  Of
course IF the garbage collector promises not to move functor blocks
then you can use the address, but that's a stupid thing for a garbage
collector to promise.  (DEC-10 Prolog is the only Prolog system known
to me which currently has its own garbage collector, though Prolog-X
is getting one and NIP will have one before it is released.  Poplog
and LM-Prolog have garbage collectors, but they inherit them from
their implementation languages.  Though PopLog's garbage collector
has come to know a fair bit about Prolog.)  You can use the address
in C-Prolog, because it hasn't got a garbage collector.  Having an
extra word per functor would let the garbage collector collect
*everything*, and it would also give us some choice about what the
hash code for a functor was.  We could let the hash code of the Nth
functor to be created be N, as in the AI package PEARL.  We could
calculate it from the name of the symbol of the functor and the
arity of the functor, using any hashing function that takes our
fancy.  (C-Prolog just adds up all the characters in the
name.  I thought this was dreadful, and tested it and several other
string hashing functions on the entire C-Prolog system, and just
plain adding up the characters was the best method.  Sigh.)  Or
we could take the Nth element of a pseudorandom sequence as the
code of the Nth functor.  Calculating the hash code from the name
might make it easier for the garbage collector to rebuild the symbol
table, and even change its size.

3.  We can get around the necessity for terms to be ground before
they can reliably be hashed by specifying a level below which the
hasher is not to look.  So the built-in hash function should look
like
        term_hash(Term, Depth, NBuckets, HashCode)

returning a HashCode between 1 and NBuckets (since Prolog vectors
are indexed from 1).  It should be an instantiation error if the
term has a variable at or above the given depth.  Depth 0 would
just use the value of a number or the hash code associated with
a term's principal functor.  The question which remains is "what
is a good way of calculating a hash code for a compound term
given the hash code of its principal functor and the sequence of
hash codes of its arguments"?  I'm inclined to suggest

        combined-hash([F,A1,...,An]) = F + A1/2 + ... + An/2^n

which is pretty easy to calculate, but I haven't tried it yet.
The reason for making NBuckets a parameter of the hashing
function is to let the implementation use full-precision machine
arithmetic internally.

Does this seem worth having?  If enough people say yes, it'll be
provided in C-Prolog v1.4d.EDAI.