[comp.lang.scheme] looking for balancing

lam@uicbert.eecs.uic.edu (Michael Lam) (05/20/91)

I am looking for scheme programs written for balancing 
binary/avl tree for database applications.  Any related
things (like B-trees) are also of interest.  Thanks in
advance.

Michael Lam
EECS Dept.
University of Illinois at Chicago
lam@uicbert.eecs.uic.edu

markf@zurich.ai.mit.edu (Mark Friedman) (05/21/91)

In article <1991May20.150025.16099@uicbert.eecs.uic.edu> lam@uicbert.eecs.uic.edu (Michael Lam) writes:

   I am looking for scheme programs written for balancing 
   binary/avl tree for database applications.  Any related
   things (like B-trees) are also of interest.  Thanks in
   advance.

I have placed pub/btree.scm on altdorf.ai.mit.edu available for
anonymous FTP. Chris Hanson wrote it (based on Knuth). I believe that
it is vanilla Scheme, except for some DEFINE-INTEGRABLE's that you may
feel free to change to DEFINE's.

-Mark
--

Mark Friedman
MIT Artificial Intelligence Lab
545 Technology Sq.
Cambridge, Ma. 02139

markf@zurich.ai.mit.edu

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (05/21/91)

In article <MARKF.91May20135551@montreux.ai.mit.edu>, markf@zurich.ai.mit.edu (Mark Friedman) writes:
> I have placed pub/btree.scm on altdorf.ai.mit.edu available for
> anonymous FTP. Chris Hanson wrote it (based on Knuth). I believe that
> it is vanilla Scheme, except for some DEFINE-INTEGRABLE's that you may
> feel free to change to DEFINE's.

I've added #|...|# syntax to my copy of Elk, but it isn't standard Scheme.
The calls to (error ...) will need to be changed too.  (Wouldn't it be nice
if the standard had specified the interface to 'error', like CLtL1 did?)
The thing that really worries me is that the last line of the file that
I picked up was incomplete.  It read
"	   (case-3)))))"
(minus the quotes), and the file just *stopped* at that point without
even the LF that one expects to terminate every line.  Is anything more
than the LF missing?

Is there some documentation for this file?  What is "wrapping" and
"unwrapping" a key about?  Not having a copy of Knuth V3 handy, what
is the method used here?  It looks as though it might be AVL, but
there seems to be too much code for it to be that.
-- 
There is no such thing as a balanced ecology; ecosystems are chaotic.

cutting@parc.xerox.com (Doug Cutting) (05/21/91)

In article <1991May20.150025.16099@uicbert.eecs.uic.edu> lam@uicbert.eecs.uic.edu (Michael Lam) writes:

> I am looking for scheme programs written for balancing 
> binary/avl tree for database applications.  Any related
> things (like B-trees) are also of interest.  Thanks in
> advance.

FYI, Skip lists are an attractive alternative to balanced trees.  The
order of the various operations is similar (log insert & delete,
linear enumeration) but the constant factors (space & time) tend to be
lower and the implementation is *much* simpler.  They're described in
the June 1990 CACM (v33#6) pp668-680.

B-trees are really designed for indices which are too large to fit in
main memory.  A proper implementation is a tricky matter.  For some
reason, people like to implement memory-resident B-trees, but there
other data structures better suited to this purpose (e.g. above).

	Doug

cph@altdorf.ai.mit.EDU (Chris Hanson) (05/21/91)

   Date: 21 May 91 01:43:27 GMT
   From: "Richard A. O'Keefe" <ok@goanna.cs.rmit.oz.au>

   In article <MARKF.91May20135551@montreux.ai.mit.edu>, markf@zurich.ai.mit.edu (Mark Friedman) writes:
   > I have placed pub/btree.scm on altdorf.ai.mit.edu available for
   > anonymous FTP. Chris Hanson wrote it (based on Knuth). I believe that
   > it is vanilla Scheme, except for some DEFINE-INTEGRABLE's that you may
   > feel free to change to DEFINE's.

   The thing that really worries me is that the last line of the file that
   I picked up was incomplete.  It read
   "          (case-3)))))"
   (minus the quotes), and the file just *stopped* at that point without
   even the LF that one expects to terminate every line.  Is anything more
   than the LF missing?

There's nothing else missing.  I never add a final newline to a Scheme
source file, because Scheme doesn't require it.  I refuse to give in
to the short-sighted unix designers who have decided that every file
should end in a newline.

   Is there some documentation for this file?  What is "wrapping" and
   "unwrapping" a key about?  Not having a copy of Knuth V3 handy, what
   is the method used here?  It looks as though it might be AVL, but
   there seems to be too much code for it to be that.

There's no documentation.  This is an implementation of balanced
binary trees -- the tree is rebalanced on each insertion or deletion.
I have never read the papers on AVL trees, but my naive understanding
is that the AVL algorithm is more sophisticated than this one.

WRAP-KEY and UNWRAP-KEY allow the objects stored in the tree to be
compound objects that contain the key as one of their components.
Alternatively you could build the functionality of UNWRAP-KEY into the
< predicate, in which case the wrapping wouldn't be needed.

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (05/21/91)

In article <CUTTING.91May20211202@kolsaas.parc.xerox.com>, cutting@parc.xerox.com (Doug Cutting) writes:
> In article <1991May20.150025.16099@uicbert.eecs.uic.edu> lam@uicbert.eecs.uic.edu (Michael Lam) writes:
> FYI, Skip lists are an attractive alternative to balanced trees.  The
> order of the various operations is similar (log insert & delete,
> linear enumeration) but the constant factors (space & time) tend to be
> lower and the implementation is *much* simpler.  They're described in
> the June 1990 CACM (v33#6) pp668-680.

I've not seen that CACM issue.  I _have_ seen two implementations of
skip lists in "The C Users' Journal".  Compared with a clean AVL code,
the code looked amazingly complicated.  Maybe that was just their
style.  I was not charmed by the statement that skip lists do on the
average twice as many comparisons as balanced trees, and I was not
charmed by the worst-case behaviour (O(N), not O(lgN)).

A node in a skip list has to hold the key (and any satellite information),
a randomly selected number of pointer fields, averaging 1.35ish, and a
number which tells you how many pointers there are.  In C, this would be
	struct foo { key_type the_key; short n; struct foo ptrs[n]; }
if only that were legal.  In Lisp you might use
	#(key ptr-1 ... ptr-n)
for a skip-list node with n pointers.  Now the space cost depends on
how vectors are implemented.  A possible implementation would take n+2
words for such a vector.  I'm not quite sure of the best way to encode
an AVL tree node in Lisp; in Prolog <bal>(Key,L,R) would do it,
e.g. <(K,L,R), =(K,L,R), or >(K,L,R), for a cost of 4 cells per node.
A similar approach in Lisp might use one of three record types, again
for a cost of 4 cells per node.  Compare then a cost of 4 cells per node
for a data structure with *guaranteed* logarithmic performance with an
average of 3.35 cells per node for a data structure with linear worst case.

Is a (average) constant factor of 3.35 really such a huge improvement
over 4?  Depending on how your Lisp implements structs and vectors,
skip-list nodes might _always_ be _bigger_ than AVL nodes!
There doesn't seem to be any time saving either; scanning a balanced
tree can be done iteratively:
	(let loop ((n (top-of-tree)))
	    (if (null? n)
		(not-found)
		(case (compare key n)
		    ((=) (found n))
		    ((<) (loop (left-son n)))
		    ((>) (loop (right-son n))))))
This is a lot simpler than the code for scanning a skip-list!
-- 
There is no such thing as a balanced ecology; ecosystems are chaotic.

cutting@parc.xerox.com (Doug Cutting) (05/21/91)

In article <5877@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> I've not seen that CACM issue.  I _have_ seen two implementations of
> skip lists in "The C Users' Journal".  Compared with a clean AVL code,
> the code looked amazingly complicated.  Maybe that was just their
> style.  I was not charmed by the statement that skip lists do on the
> average twice as many comparisons as balanced trees, and I was not
> charmed by the worst-case behaviour (O(N), not O(lgN)).

This is certainly the wrong group for this discussion.  Sorry.

I've not seen these C skip-lists and I've never seen clean AVL code.
As for the number of comparisons, Pugh's 'Skip List Cookbook'
describes a simple technique which mends this problem.  And as for the
worst case behaviour, it can only be achieved if your random number
generator delivers the same value N consecutive times -- a very
unlikely event which is completely independent of the objects stored.

C & Pascal code and the 'Skip List Cookbook' can be found in
mimsy.umd.edu:pub/skipLists.

I'm not the designer of skip lists, just a happy implementor & user.

	Doug

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (05/22/91)

I wrote
>    The thing that really worries me is that the last line of the file that
>    I picked up was incomplete.  It read
>    "          (case-3)))))"
>    (minus the quotes), and the file just *stopped* at that point without
>    even the LF that one expects to terminate every line.  Is anything more
>    than the LF missing?

In article <9105210031.aa21207@mc.lcs.mit.edu>, cph@altdorf.ai.mit.EDU (Chris Hanson) writes:
> There's nothing else missing.  I never add a final newline to a Scheme
> source file, because Scheme doesn't require it.  I refuse to give in
> to the short-sighted unix designers who have decided that every file
> should end in a newline.

These days, due largely to the influence of UNIX, we expect a file to be
a sequence of ``bytes''.  That works fine under UNIX, although as I
mentioned a lot of tools don't quite believe it, and it works fine on
Macintoshes.  It doesn't quite work under MS-DOS, where traces of CP/M
influence still linger.  It doesn't work under VMS.  VMS's "native"
file formats are all record oriented.  VMS does have "stream" record
formats (STREAM, STREAM_LF, STREAM_CR) which basically mimic the UNIX
kind of model, but RMS still requires each "record" thus delimited to
be at most 32k bytes long.  MVS/TSO and VM/CMS are very much record
oriented; their record formats (like VMS's "native" formats) do not
admit an unterminated record as a logical possibility.

It's not a matter of giving in to "unix designers", it's a matter of
not giving a tinker's for portability or for people who might want to
read your files into other file systems, let alone try to use what
you write.

I wouldn't be too sanguine about Scheme not requiring terminated lines,
either.  Common Lisp has an operation called "read-line", and from
p572 of CLtL2, "read-line reads in a line of text TERMINATED BY A NEWLINE".
What relevance has this to Scheme?  Well, several Schemes provide
read-line (though some call it read-string).  Scheme not having multiple
results, the Scheme versions of read-line that I've seen are even pickier
about proper termination than CL.

And at a minimum, leaving the last line of a file unterminated is a
discourtesy to human readers.  The result looks exactly like a file
which has been truncated at an inappropriate point by a mailer or a
broken 'tar' or a file system which filled up while the file was
being written (given the state of my 'quota', a very likely event),
and so on.  Some considerate programmers go so far as to include a
little comment like ";; eof" at the end of their files.  Don't think
I haven't had occasion to be grateful for that.  Another couple of
times and I'll start doing it myself.

-- 
There is no such thing as a balanced ecology; ecosystems are chaotic.