[net.lang.prolog] Sets in Prolog

OKeefe.R.A.%EDXA%UCL-CS@sri-unix.UUCP (01/27/84)
From:        "OKeefe.R.A." <OKeefe.R.A.%EDXA@UCL-CS>

Re: Shimon Cohen's note on sets, V2 #5 Prolog Digest.

1.  He expressed concern at "the inefficient implementation of sets
    in Prolog".  This is like being concerned about the anti-clerical
    attitude of the present King of France.  There *is* none. I do 
    not know about LM-Prolog, but none of {DEC-10 Prolog, C-Prolog, 
    UNSW Prolog, MU Prolog, PopLog, Prolog-X, NIP} offers you a 
    Prolog data-type called "set" or, for that matter, "bag".

2.  There is a very good reason for this.  In various programs, I 
    have used the following representations of sets:

    a) unordered lists (supported by <PROLOG>SETUTL.PL)
       -- this may be the library file whose inefficiency is worrisome

    b) ordered lists with no duplicates (supported by <PROLOG>ORDSET.PL)
       -- except for membership tests, these routines are as efficient
       -- as anything you can do with hash tables.  Trouble is, they
       -- only really work with "sufficiently instantiated" elements,
       -- but then hash tables have precisely the same limitation (and
       -- you need an unusually co-operative garbage collector if you
       -- are not to be restricted to ground elements only)

    c) ordered binary trees (I haven't submitted this to the library at
	SU-SCORE yet, but <PROLOG>ASSOC.PL comes close)
       -- element testing takes O(lg N) time compared with O(N) for (a)
       -- or (b), all the other operations take O(N) time as in (b),(a)
       -- of course takes O(N^2) time.  The union intersection ... code
       -- is rather more complex than in (b)

    d) vectors of var/nonvar (not submitted as I think it's hacky)
       -- this applies to subsets of [1..N] (N can be up to 100 in
       -- obsolete versions of C-Prolog, up to 200 in current versions,
       -- up to several thousand in DEC-10 Prolog, any number in NIP
       -- or PopLog), an element is in the set if arg(X,Set,A),nonvar(A)
       -- or absent from it if var(A).  I have used this for colouring
       -- various kinds of graphs, where in the nonvar(A) case A is the
       -- colour assigned.

    e) bit-vectors using the /\ \/ / << and >> arithmetic operators.(If
	you think I'm going to submit this you must be joking, it's
	TRIVIAL, like so many things in Prolog if you look for a simple
	solution, but you have to beware of different word-lengths)
    	-- this again applies to subsets of [1..N], but for much larger
	-- N.  An integer is 18 bits in DEC-10 Prolog, 24 bits in 
           Prolog-X
	-- and NIP, 30 bits in PopLog(?) and TIP. Note that I am talking
	-- about a *vector* of integers, not a list.  Lists can be used
	-- too, but I've never had occasion to work with sets of 
           integers
	-- where I didn't know N before I wanted to construct the set.

    f) a base set, represented as a table of elements (either vector 
       form or binary tree, see <PROLOG>TREES.PL), and sets of integers
	representing subsets of the base set in (d) or (e) form
    I have also used (a), nested (b), (d), and (f)/(d) for graphs.

    In various programs, I have used the following representations of
    bags:

    A) unordered lists (there is no library file to support this)
    B) ordered sequences of <element>-<count> pairs (<PROLOG>BAGUTL.PL)
    C) binary trees of <element>-<count> pairs (no library file for 
       this but it would be trivial to adapt <PROLOG>ASSOC.PL)
    D) for bags of 1..N, vectors of counts
	-- adding an element requires copying the vector, but see
	-- <PROLOG>ARRAYS.PL for constant-time updatable arrays in 
           Prolog
	-- so adding an element to a bag takes constant time
    E) a base set and a bag of integers in (A) (B) (C) or (D) form.

    **** Not *ONE* of these representations is best for all purposes.

3.  Hash tables are not best for all purposes either.  If you spend 
    a lot of time working on entire sets, or if you are interested 
    in enumerating elements rather than testing for a single element,
    ordered lists (use compare/3, {@<,@>=,@>,@=<,==,\==}/2) are hard
    to beat.  They do need the elements to be sufficiently instantiated
    for you to tell them apart (basically, when making a set from a 
    list or other structure, you need (X = Y) => (X == Y)).  But so
    do hash tables.  There are at least two ways of calculating hash 
    functions.  You can use the print form (numbering variables from 
    left to right in the style of numbervars) or you can use the 
    internal addresses of variables, atoms, and functors, and the 
    values of integers.  The former is slow, the latter means that 
    the garbage collector not only has to be prepared to rehash all 
    your sets, it has to FIND all your sets.  (PopLog has hash
    tables, but they are pop11 data structures, not Prolog data 
    structures [Prolog can't do anything to them except call Pop 
    functions], and the garbage collector CAN find the tables, and 
    DOES rehash them.  Even so, ordered lists in C-Prolog are  
    competetive with hash-tables in PopLog.)  It must be admitted 
    that DEC-10 Prolog and C-Prolog do look at the source form of 
    atoms to compare them, however the cost of atom comparison can 
    easily be reduced to the cost of integer comparison in a new
    Prolog implementation at the cost of 3 pointers and 1 integer 
    extra per atom, and at the cost of making the creation of a 
    new atom O(lg|atoms|) instead of O(1).

4.  It follows that "If you are honestly trying to make Prolog a 
    useful language", then by all means provide a "hash_on_name
    (+Term,?Hash)" primitive so that people can build hash tables 
    (using vectors to hold the buckets, buckets being represented 
    perhaps as lists with unbound tails), but instead of forcing 
    the use of hash tables in every place, make them *JUST ONE* of 
    *SEVERAL* implementations of the same abstract data type.

5.  A great deal depends on the needs of the particular programmer 
    for the program he is actually writing.  Most of the time when 
    I call setof/3, I either want to perform some operation on the 
    whole set as it stands, in which case I exploit the fact that 
    in DEC-10 Prolog, C-Prolog, Prolog-X, and NIP the answer is an 
    ORDERED list, and <PROLOG>ORDSET.PL provides efficient operations 
    on ordered lists,or else I want to map down the entire set 
    transforming the elements in some way.  It is precisely these 
    tasks to which hash tables are least well suited.  I cannot think 
    of any occasion on which I have calculated a setof and followed
    it by a membership test (as opposed to an enumeration using 
    member).  This is NOT, and is not intended as, an argument
    against having hash tables around, nor indeed against having 
    hashed_setof/3 around.  What it IS an argument FOR is for having 
    setof/3 in its present form (see <PROLOG>SETOF.PL) and the
    ordered set library [as well].

6.  If you want to do a lot of membership tests, using binary trees
    is nearly as efficient as using hash tables.  For the next 
    couple of years on VAXen and M68000s (or on the DEC-10 with its 
    tiny address space), the difference between O(lgN) and O(1) is 
    likely to be swallowed up by the overheads of hashing.  In the 
    longer term, the fact that (provided you accept comparison as a 
    primitive operation, and if you don't like comparing variables, 
    you can imitate MU-Prolog and delay until the terms are 
    sufficiently instantiated) binary trees can be accessed and 
    UPDATED in pure Prolog is likely to make binary trees the 
    method of choice in massively parallel systems.  This fact 
    already provides binary trees with an important advantage 
    over hash tables when doing elementwise insertion and deletion.  
    See next paragraph.

7.  The scheme Cohen describes for maintaining multiple versions
    of a hash table has some problems.  He says "Each set has sort
    of internal clock which is modified whenever we create new 
    pointer to this set."  I presume this doesn't actually mean 
    that after doing "X = Set" we have to increment the counter, 
    because he says he has implemented his ideas in C-Prolog, and 
    having an extensive knowledge of the internals of C-Prolog (the 
    speed of old versions was 800 LIPS on a VAX 750.  The speed of 
    version 1.4b.edai is 1100 LIPS, and we are now up to version 
    1..4c.edai.) I am quite sure that there is no way of detecting 
    when we "create a new pointer to" anything at all without 
    substantial changes to main.c and unify.c and a MASSIVE overhead.

    So I surmise that he means that the clock is modifed whenever a 
    new VERSION of a set is created, that is that the "insert" and 
    "delete" operations update the clock, and nothing else.  Now we 
    have the question: what values does the clock take?  If the 
    answer is "numbers", then the scheme has a serious deficiency.

	%  element(Element, SetWithIt, SetWithoutIt)
	%  is the assumed primitive, insert/delete depending on the
	%  instantiation state of the variables, just 'insert' here.
	... :-
		...
		element(x, Set1, Set0),
		element(y, Set2, Set1),
		element(z, Set3, Set2),
		...
    works just fine, and operations on Set0..Set2 which just LOOK at
    their elements work just fine as well.  BUT
	... :-
		...
		element(x, Setx, Set0),
		element(y, Sety, Set0),
		element(w, Setw, Setx),
		element(v, Setv, Sety),
		...
    will almost certainly NOT work, because Setx and Sety will get
    the SAME clock value, and Setw and Setv will also get the same
    (but different from Setx and Sety) clock value.  There is a cure
    for this, which involves using structured clocks.  I won't go
    into the details, because it turns out that by the time you have
    something that resets itself when you backtrack and permits
    multiple versions AND lets the version space branch you would be
    MUCH better off with binary trees.

8.  The updatable array implementation in <PROLOG>ARRAYS.PL does not
    provide multiple versions.  About a week before I sent it to
    SU-SCORE I had this obvious idea about timestamping changes to
    the array and started coding it.  Then the deficiency struck me,
    and I spent a couple of days trying to figure out an efficient
    way of sharing as much structure as possible with a branching
    version space.  That's when I came up with <PROLOG>TREES.PL.
    Actual timing tests showed that the (compiled Prolog) binary 
    tree implementation of arrays was always much faster than the 
    (compiled Prolog) updatable vector implementation, and it was 
    purer Prolog and could be generalised to non-numeric indices.  
    Updatable hash tables which support branching versions spaces 
    can easily be built from <PROLOG>TREES.PL given the
    hash_on_name/2 operation (which is not available in any Prolog 
    available to me except Poplog, where I wrote the Prolog-term 
    hash function -- the Pop hash function hashes on addresses).  
    If Shimon Cohen or Ken Forbus has actually invented a technique
    for handling a branching version space in updatable structures 
    at LESS cost than O(lgN) space and O(lgN) time, I shall propose 
    his or their election to the pantheon and ask for a signed 
    photograph.

9.  I would very much like to see Shimon Cohen's code. Thank you, 
    sir, for offering to mail it.  If this means physically posting 
    it to me, fine, but it might be better if you posted it to SCORE 
    where anyone who is interested can FTP it without bothering you. 
    (Actually, I have an ulterior motive.  I want your stuff to sink 
    into the total oblivion that <PROLOG>ARRAYS.PL and <PROLOG>
    ORDSET.PL seem to have sunk into.  (:-})  I would also like to
    see Ken Forbus's code.  As implementing constant-time updatable 
    arrays in Prolog is basically trivial (have a look at <PROLOG>
    ARRAYS.PL if you don't believe me) I would expect that the
    ZetaLisp code would be similarly trivial. If anyone is really 
    interested in having multiple versions of arrays and can put 
    with with a purely sequential version space, I could send my 
    code for that to SU-SCORE as well.

    [ Shimons code is available from {SU-SCORE} as:

      SCORE:<Prolog>Cohen_Sets.Txt    ! note for using the PL code
                    Cohen_Sets.PL     ! the PL code           - ed ]

10. It just occurred to me that some people who haven't thought about 
    data structures in Prolog or read the manual or the Clocksin & 
    Mellish book with any degree of attention may wonder what I mean
    by "vectors". I refer to the use of the functor/3 and arg/3 
    primitives which support 1-dimensional origin-1 vectors very 
    efficiently in any usable Prolog.  I didn't call them "arrays" 
    because once you have bound an argument of a term you can't 
    change it.

11. PS: there are versions of many of the {SU-SCORE}<PROLOG>*.PL 
    files for C-Prolog.  If enough people are interested I could 
    post them to net.sources now that our VAX is on the air again.  
    It is not possible for me to post them to ARPANet.

-- Richard A. O'Keefe
   (Te Tohunga Prolog)