OKeefe.R.A.%EDXA%UCL-CS@sri-unix.UUCP (01/27/84)
From: "OKeefe.R.A." <OKeefe.R.A.%EDXA@UCL-CS> Re: Shimon Cohen's note on sets, V2 #5 Prolog Digest. 1. He expressed concern at "the inefficient implementation of sets in Prolog". This is like being concerned about the anti-clerical attitude of the present King of France. There *is* none. I do not know about LM-Prolog, but none of {DEC-10 Prolog, C-Prolog, UNSW Prolog, MU Prolog, PopLog, Prolog-X, NIP} offers you a Prolog data-type called "set" or, for that matter, "bag". 2. There is a very good reason for this. In various programs, I have used the following representations of sets: a) unordered lists (supported by <PROLOG>SETUTL.PL) -- this may be the library file whose inefficiency is worrisome b) ordered lists with no duplicates (supported by <PROLOG>ORDSET.PL) -- except for membership tests, these routines are as efficient -- as anything you can do with hash tables. Trouble is, they -- only really work with "sufficiently instantiated" elements, -- but then hash tables have precisely the same limitation (and -- you need an unusually co-operative garbage collector if you -- are not to be restricted to ground elements only) c) ordered binary trees (I haven't submitted this to the library at SU-SCORE yet, but <PROLOG>ASSOC.PL comes close) -- element testing takes O(lg N) time compared with O(N) for (a) -- or (b), all the other operations take O(N) time as in (b),(a) -- of course takes O(N^2) time. The union intersection ... code -- is rather more complex than in (b) d) vectors of var/nonvar (not submitted as I think it's hacky) -- this applies to subsets of [1..N] (N can be up to 100 in -- obsolete versions of C-Prolog, up to 200 in current versions, -- up to several thousand in DEC-10 Prolog, any number in NIP -- or PopLog), an element is in the set if arg(X,Set,A),nonvar(A) -- or absent from it if var(A). I have used this for colouring -- various kinds of graphs, where in the nonvar(A) case A is the -- colour assigned. e) bit-vectors using the /\ \/ / << and >> arithmetic operators.(If you think I'm going to submit this you must be joking, it's TRIVIAL, like so many things in Prolog if you look for a simple solution, but you have to beware of different word-lengths) -- this again applies to subsets of [1..N], but for much larger -- N. An integer is 18 bits in DEC-10 Prolog, 24 bits in Prolog-X -- and NIP, 30 bits in PopLog(?) and TIP. Note that I am talking -- about a *vector* of integers, not a list. Lists can be used -- too, but I've never had occasion to work with sets of integers -- where I didn't know N before I wanted to construct the set. f) a base set, represented as a table of elements (either vector form or binary tree, see <PROLOG>TREES.PL), and sets of integers representing subsets of the base set in (d) or (e) form I have also used (a), nested (b), (d), and (f)/(d) for graphs. In various programs, I have used the following representations of bags: A) unordered lists (there is no library file to support this) B) ordered sequences of <element>-<count> pairs (<PROLOG>BAGUTL.PL) C) binary trees of <element>-<count> pairs (no library file for this but it would be trivial to adapt <PROLOG>ASSOC.PL) D) for bags of 1..N, vectors of counts -- adding an element requires copying the vector, but see -- <PROLOG>ARRAYS.PL for constant-time updatable arrays in Prolog -- so adding an element to a bag takes constant time E) a base set and a bag of integers in (A) (B) (C) or (D) form. **** Not *ONE* of these representations is best for all purposes. 3. Hash tables are not best for all purposes either. If you spend a lot of time working on entire sets, or if you are interested in enumerating elements rather than testing for a single element, ordered lists (use compare/3, {@<,@>=,@>,@=<,==,\==}/2) are hard to beat. They do need the elements to be sufficiently instantiated for you to tell them apart (basically, when making a set from a list or other structure, you need (X = Y) => (X == Y)). But so do hash tables. There are at least two ways of calculating hash functions. You can use the print form (numbering variables from left to right in the style of numbervars) or you can use the internal addresses of variables, atoms, and functors, and the values of integers. The former is slow, the latter means that the garbage collector not only has to be prepared to rehash all your sets, it has to FIND all your sets. (PopLog has hash tables, but they are pop11 data structures, not Prolog data structures [Prolog can't do anything to them except call Pop functions], and the garbage collector CAN find the tables, and DOES rehash them. Even so, ordered lists in C-Prolog are competetive with hash-tables in PopLog.) It must be admitted that DEC-10 Prolog and C-Prolog do look at the source form of atoms to compare them, however the cost of atom comparison can easily be reduced to the cost of integer comparison in a new Prolog implementation at the cost of 3 pointers and 1 integer extra per atom, and at the cost of making the creation of a new atom O(lg|atoms|) instead of O(1). 4. It follows that "If you are honestly trying to make Prolog a useful language", then by all means provide a "hash_on_name (+Term,?Hash)" primitive so that people can build hash tables (using vectors to hold the buckets, buckets being represented perhaps as lists with unbound tails), but instead of forcing the use of hash tables in every place, make them *JUST ONE* of *SEVERAL* implementations of the same abstract data type. 5. A great deal depends on the needs of the particular programmer for the program he is actually writing. Most of the time when I call setof/3, I either want to perform some operation on the whole set as it stands, in which case I exploit the fact that in DEC-10 Prolog, C-Prolog, Prolog-X, and NIP the answer is an ORDERED list, and <PROLOG>ORDSET.PL provides efficient operations on ordered lists,or else I want to map down the entire set transforming the elements in some way. It is precisely these tasks to which hash tables are least well suited. I cannot think of any occasion on which I have calculated a setof and followed it by a membership test (as opposed to an enumeration using member). This is NOT, and is not intended as, an argument against having hash tables around, nor indeed against having hashed_setof/3 around. What it IS an argument FOR is for having setof/3 in its present form (see <PROLOG>SETOF.PL) and the ordered set library [as well]. 6. If you want to do a lot of membership tests, using binary trees is nearly as efficient as using hash tables. For the next couple of years on VAXen and M68000s (or on the DEC-10 with its tiny address space), the difference between O(lgN) and O(1) is likely to be swallowed up by the overheads of hashing. In the longer term, the fact that (provided you accept comparison as a primitive operation, and if you don't like comparing variables, you can imitate MU-Prolog and delay until the terms are sufficiently instantiated) binary trees can be accessed and UPDATED in pure Prolog is likely to make binary trees the method of choice in massively parallel systems. This fact already provides binary trees with an important advantage over hash tables when doing elementwise insertion and deletion. See next paragraph. 7. The scheme Cohen describes for maintaining multiple versions of a hash table has some problems. He says "Each set has sort of internal clock which is modified whenever we create new pointer to this set." I presume this doesn't actually mean that after doing "X = Set" we have to increment the counter, because he says he has implemented his ideas in C-Prolog, and having an extensive knowledge of the internals of C-Prolog (the speed of old versions was 800 LIPS on a VAX 750. The speed of version 1.4b.edai is 1100 LIPS, and we are now up to version 1..4c.edai.) I am quite sure that there is no way of detecting when we "create a new pointer to" anything at all without substantial changes to main.c and unify.c and a MASSIVE overhead. So I surmise that he means that the clock is modifed whenever a new VERSION of a set is created, that is that the "insert" and "delete" operations update the clock, and nothing else. Now we have the question: what values does the clock take? If the answer is "numbers", then the scheme has a serious deficiency. % element(Element, SetWithIt, SetWithoutIt) % is the assumed primitive, insert/delete depending on the % instantiation state of the variables, just 'insert' here. ... :- ... element(x, Set1, Set0), element(y, Set2, Set1), element(z, Set3, Set2), ... works just fine, and operations on Set0..Set2 which just LOOK at their elements work just fine as well. BUT ... :- ... element(x, Setx, Set0), element(y, Sety, Set0), element(w, Setw, Setx), element(v, Setv, Sety), ... will almost certainly NOT work, because Setx and Sety will get the SAME clock value, and Setw and Setv will also get the same (but different from Setx and Sety) clock value. There is a cure for this, which involves using structured clocks. I won't go into the details, because it turns out that by the time you have something that resets itself when you backtrack and permits multiple versions AND lets the version space branch you would be MUCH better off with binary trees. 8. The updatable array implementation in <PROLOG>ARRAYS.PL does not provide multiple versions. About a week before I sent it to SU-SCORE I had this obvious idea about timestamping changes to the array and started coding it. Then the deficiency struck me, and I spent a couple of days trying to figure out an efficient way of sharing as much structure as possible with a branching version space. That's when I came up with <PROLOG>TREES.PL. Actual timing tests showed that the (compiled Prolog) binary tree implementation of arrays was always much faster than the (compiled Prolog) updatable vector implementation, and it was purer Prolog and could be generalised to non-numeric indices. Updatable hash tables which support branching versions spaces can easily be built from <PROLOG>TREES.PL given the hash_on_name/2 operation (which is not available in any Prolog available to me except Poplog, where I wrote the Prolog-term hash function -- the Pop hash function hashes on addresses). If Shimon Cohen or Ken Forbus has actually invented a technique for handling a branching version space in updatable structures at LESS cost than O(lgN) space and O(lgN) time, I shall propose his or their election to the pantheon and ask for a signed photograph. 9. I would very much like to see Shimon Cohen's code. Thank you, sir, for offering to mail it. If this means physically posting it to me, fine, but it might be better if you posted it to SCORE where anyone who is interested can FTP it without bothering you. (Actually, I have an ulterior motive. I want your stuff to sink into the total oblivion that <PROLOG>ARRAYS.PL and <PROLOG> ORDSET.PL seem to have sunk into. (:-}) I would also like to see Ken Forbus's code. As implementing constant-time updatable arrays in Prolog is basically trivial (have a look at <PROLOG> ARRAYS.PL if you don't believe me) I would expect that the ZetaLisp code would be similarly trivial. If anyone is really interested in having multiple versions of arrays and can put with with a purely sequential version space, I could send my code for that to SU-SCORE as well. [ Shimons code is available from {SU-SCORE} as: SCORE:<Prolog>Cohen_Sets.Txt ! note for using the PL code Cohen_Sets.PL ! the PL code - ed ] 10. It just occurred to me that some people who haven't thought about data structures in Prolog or read the manual or the Clocksin & Mellish book with any degree of attention may wonder what I mean by "vectors". I refer to the use of the functor/3 and arg/3 primitives which support 1-dimensional origin-1 vectors very efficiently in any usable Prolog. I didn't call them "arrays" because once you have bound an argument of a term you can't change it. 11. PS: there are versions of many of the {SU-SCORE}<PROLOG>*.PL files for C-Prolog. If enough people are interested I could post them to net.sources now that our VAX is on the air again. It is not possible for me to post them to ARPANet. -- Richard A. O'Keefe (Te Tohunga Prolog)