[comp.lang.c++] Implementing LISP in C++

jordan@aero.org (Larry M. Jordan) (03/13/91)

I'm endeavoring to implement an interpreter/compiler for a LISP-like language
in C++.  The question I wish to have answered is "Can I implement a
type-less lang. like LISP in C++ without resorting at some point to
actually asking an object for its type?" I just don't see how not to.
Also, I'm implementing a simple mark/sweep garbage collector which takes 
liberties with Cons cell pointers to avoid recursion.  I don't see a 
reasonable way to do this without asking objects during the mark phase
'are you a Cons?'. (Of course, I will need to implement the predicate 
functions like 'null, consp, listp, etc.').   I'd also like some
remarks on the 'reasonableness' of the implementation scheme I've 
chosen (to be described next paragraph).

Techniques I've seen in the past for implementing LISP objects in
languages like C or Modula-2 define an enumeration type and create
a discriminated union using the enum as a type 'tag' or discriminant.
XLISP and XSCHEME are implemented this way and it seems perfectly
reasonable.  I've taken a different approach since I'm using C++.
I'm implementing LISP objects as C++ classes--Cons, Vector, Fixnum, 
Closure, Environment...--derived from a common base class Node.  Most
methods are virtual and defined in class Node with a call to an error 
routine.  It is each derived class's responsibility to redefine the
methods which are meaningful--i.e. the Cons class redefines
'car()' and 'cdr()':

  class Cons : public Node {
    Node *theCar;
    Node *theCdr;
  public:
    virtual Node *car() { return theCar; }
    virtual Node *cdr() { return theCdr; }
    virtual void setCar(Node *n) { theCar = n; }
    virtual void setCdr(Node *n) { theCdr = n; }
    ...
  };
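
For concreteness, the base class itself comes out looking roughly like
this (the particular operations and the 'error' routine shown here are
only illustrative):

  void error(const char *msg);   // stand-in for the system's error routine

  class Node {
  public:
    // Every primitive operation gets a default that just reports a type
    // error; only the derived classes for which the operation makes
    // sense override it.
    virtual Node *car()          { error("not a Cons"); return 0; }
    virtual Node *cdr()          { error("not a Cons"); return 0; }
    virtual void setCar(Node *)  { error("not a Cons"); }
    virtual void setCdr(Node *)  { error("not a Cons"); }
    virtual long fixnumValue()   { error("not a Fixnum"); return 0; }
    // ...twenty-odd more, one per operation the evaluator needs
  };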

Consequently, there are a large number of virtual methods in the base
class Node--20+ and growing.  At first I thought there would be a lot
of space dedicated to large vft's--sizeof(char*) * 
NumberOfVirtualMethodsInBase for each LISP object
derived from Node!  I was concerned about using vast amounts of memory
for vft's.  But then I considered what the alternative would be--
using code space to check the type of the object I'm about to 
reference as a 'Cons' object (an operation that would be widely
distributed throughout the system) rather than just invoke the
method.  I get the error checking at the expense of calling a
virtual method.  I'm convinced that this experiment is worthwhile
to continue.

Thanks in advance.

--Larry

dsr@mir.mitre.org (Douglas S. Rand) (03/13/91)

In article <1991Mar12.221015.22144@aero.org>, jordan@aero.org (Larry M. Jordan) writes:
> I'm endeavoring to implement an interpreter/compiler for a LISP-like language
> in C++.  The question I wish to have answered is "Can I implement a
> type-less lang. like LISP in C++ without resorting at some point to
> actually asking an object for its type?" I just don't see how not to.
> Also, I'm implementing a simple mark/sweep garbage collector which takes 
> liberties with Cons cell pointers to avoid recursion.  I don't see a 
> reasonable way to do this without asking objects during the mark phase
> 'are you a Cons?'. (Of course, I will need to implement the predicate 
> functions like 'null, consp, listp, etc.').   I'd also like some
> remarks on the 'reasonableness' of the implementation scheme I've 
> chosen (to be described next paragraph).
> 

Well, to start with, why not have a virtual function like GCmarkPhase
which does something with conses or their descendants but not with other
LISP objects?
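
Something like this, roughly (the mark flag and the recursion are only
for illustration; the pointer-reversal trick can live inside the same
virtual just as well):

  class Node {
  public:
    int gcMark;
    // default: an object with no internal pointers just marks itself
    virtual void GCmarkPhase() { gcMark = 1; }
  };

  class Cons : public Node {
    Node *theCar, *theCdr;
  public:
    // a Cons also traces its children -- nobody ever asks "are you a Cons?"
    virtual void GCmarkPhase()
    {
      if (gcMark) return;
      gcMark = 1;
      if (theCar) theCar->GCmarkPhase();
      if (theCdr) theCdr->GCmarkPhase();
    }
  };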

> Techniques I've seen in the past for implementing LISP objects in
> languages like C or Modula-2 define an enumeration type and create
> a discriminated union using the enum as a type 'tag' or discriminant.
> XLISP and XSCHEME are implemented this way and it seems perfectly
> reasonable.  I've taken a different approach since I'm using C++.

It is, if the language doesn't do this for you with virtuals.

> ... some text deleted ...
> Consequently, there are a large number of virtual methods in the base
> class Node--20+ and growing.  At first I thought there would be a lot
> of space dedicated to large vft's--sizeof(char*) * 
> NumberOfVirtualMethodsInBase for each LISP object
> derived from Node!  I was concerned about using vast amounts of memory
> for vft's.  But then I considered what the alternative would be--

Why, do you think each instance would have a copy of the vtab?  The
vtab is equivalent to a class variable.  There is one copy per class.
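
A quick way to convince yourself (assuming the usual implementation,
where each object carries a single pointer to its class's table):

  #include <stdio.h>

  class Node {
  public:
    virtual Node *car() { return 0; }
    virtual Node *cdr() { return 0; }
    // ...twenty more virtuals grow the *table*, not each object...
  };

  class Cons : public Node {
    Node *theCar, *theCdr;
  };

  int main()
  {
    // typically: sizeof(Cons) == two data pointers + one vtable pointer,
    // no matter how many virtuals Node declares
    printf("%lu %lu\n", (unsigned long)sizeof(Node), (unsigned long)sizeof(Cons));
    return 0;
  }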

> using code space to check the type of the object I'm about to 
> reference as a 'Cons' object (which would be an operation that
> would widely distributed throughout the system) rather that just 
> envoke the method.  I get the error checking at the expense of a

And the type check via the virtual function call is much faster and
more foolproof.

> calling a virtual method.  I'm convinced that this experiment is 
> worthwhile to continue.
> 

Have fun.

-- 
Douglas S. Rand 
Internet:   <dsrand@mitre.org>
Snail:	    MITRE, Burlington Road, Bedford, MA 
Disclaimer: MITRE might agree with me - then again...
Amateur Radio: KC1KJ

jordan@aero.org (Larry M. Jordan) (03/14/91)

Newsgroups: comp.lang.c++
Subject: Re: Implementing LISP in C++ (type discrimination)
References: <17238@cadillac.CAD.MCC.COM> <1991Mar12.221015.22144@aero.org> <1991Mar13.145107.19724@linus.mitre.org>
Organization: The Aerospace Corporation, El Segundo, CA

Doug,

Could you elaborate on the marking technique via 'GCmarkPhase'?  I
just don't see what you are driving at.  What I'm currently planning
to do is this:

  [1] for each LISP object class (I realize there is only one vtab per
      class) I have overloaded 'new' and 'delete'.  Each class
      'manages' its own object space.  (i.e. when a new Cons cell is
      'allocated', Cons::operator new() will fetch and return a Cons
      cell from Cons::FreeList.  If none are available, the
      garbage collector will be called.  It (gc) will need to
      mark all objects in use, won't it?  See [2].  Then sweep.  If
      Cons::FreeList is still empty, then the Cons object
      space manager will allocate a new chunk of Cons's
      and return one of these.  A sketch of this follows the list.)

  [2] during the mark phase, each LISP object class will be asked to
      mark all its internal objects which are in use. (i.e. the Symbol
      class maintains the Obarray.  The Interpreter maintains the 
      evaluation stack, registers, etc.).
      
  [3] each LISP object class will be asked (after marking) to
      sweep, collecting objects into free lists.
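
The storage-management part of [1] might look something like this (gc()
stands for the mark [2] / sweep [3] driver, and the chunk size is
arbitrary):

  #include <stddef.h>

  class Node { /* ...base class as before... */ };

  void gc();                           // the mark [2] / sweep [3] driver

  class Cons : public Node {
    Node *theCar, *theCdr;
    Cons *nextFree;                    // chains cells on the free list
    static Cons *FreeList;
    static void expandHeap();          // carves out a fresh chunk of cells
  public:
    void *operator new(size_t)
    {
      if (FreeList == 0) gc();         // mark everything in use, then sweep
      if (FreeList == 0) expandHeap(); // still empty: allocate a new chunk
      Cons *c = FreeList;
      FreeList = c->nextFree;
      return c;
    }
    void operator delete(void *p)
    {
      Cons *c = (Cons *)p;
      c->nextFree = FreeList;          // the sweep returns cells the same way
      FreeList = c;
    }
  };

  Cons *Cons::FreeList = 0;

  void Cons::expandHeap()
  {
    const int CHUNK = 256;             // arbitrary
    Cons *chunk = (Cons *)::operator new(CHUNK * sizeof(Cons));
    for (int i = 0; i < CHUNK; i++) {  // thread the new cells onto the free list
      chunk[i].nextFree = FreeList;
      FreeList = &chunk[i];
    }
  }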

'sweep' will be a static member in each class.  There will be a static
'mark' as well, but this will call a generic mark function 'void mark(Node*)'
which knows how to traverse LISP structures, as I indicated before,
in a non-recursive way, saving back pointers when chasing car's and
cdr's and restoring them when unwinding.  Writing such a generic
mark function seems to require doing a case analysis, doesn't it?

I have envisioned two methods, one for marking during descent--
'int markAndDescend(?)', which marks and does the pointer reversal--
and a second for ascent--'int ascend(?)', which will restore pointers
up to a point where descending can continue or the marking phase is
complete.  These are two more virtual functions I will need to add
to the base class 'Node'!  (The number continues to grow.)  But, more
importantly, there are a lot of calls to virtual methods during
the marking phase.  How can this be as efficient as an up-front
case analysis?!  I am concerned about the overhead.  Is the technique
you've intimated substantially different?  Maybe more sophisticated
storage reclamation techniques are called for?!

--Larry
 

chip@tct.uucp (Chip Salzenberg) (03/16/91)

According to jordan@aero.org (Larry M. Jordan):
>... a generic mark function 'void mark(Node*)' which knows how to
>traverse LISP structures, as I indicated before in a non recursive
>way, saving backpointer when chasing car's and cdr's and restoring
>pointers when unwinding.  Writing such a generic mark function seems
>to require doing a case analysis, doesn't it?

In general, an "isKindOf()" test can almost always be replaced by a
virtual function.  For example:

     switch (p->type()) {
        case Cons:    do_cons_thing((Cons *)p);     break;
        case String:  do_string_thing((String *)p); break;
     }

can be replaced with a virtual do_thing() function, with
implementations Cons::do_thing() and String::do_thing().  Then the
above code becomes simply:

     p->do_thing();

>there are a lot of calls to virtual methods during the marking phase.
>How can this be as efficient as an up front case analysis?!

A switch() is typically a table lookup.  So is a virtual function
lookup.  Any more questions?
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "Most of my code is written by myself.  That is why so little gets done."
                 -- Herman "HLLs will never fly" Rubin

jordan@aero.org (Larry M. Jordan) (03/20/91)

I'm implementing a simple LISP interpreter in C++...

I've succeeded in implementing a marking routine (for a mark and sweep
garbage collector) WITHOUT type discrimination!

I'll present the type-discriminating version first (which is implemented
in Ada) and follow with the C++ version.

--Ada

  type NodeKind is (Free, Cons, Fixnum, ...);
  type MarkSet is (Unmarked, Marked, MarkLeft);

  type Node(kind: NodeKind);
  type NodePtr is access Node;
  type Node(kind: NodeKind) is
    record
      mark: MarkSet;
      case kind is
	when Free => null;
	when Cons => car, cdr: NodePtr;
	when Fixnum => fix: FixnumType;
	...
      end case;
    end record;
 
Here's the mark function, a tricky non-recursive routine for marking
all active structures (alg. thanks to David Betz):

  procedure mark(p: NodePtr) is
    current: NodePtr := p;
    previous: NodePtr := null;
  begin
    --trace everything reachable from p (p is assumed non-null)

    loop
	--descend as far as we can

        while current.mark = Unmarked loop

	    --mark this node and trace its children
	    case current.kind is
	      when Cons =>
		current.mark := Marked;
		if car(current) /= null then
		  declare
		    down: NodePtr := car(current);
		  begin
		    current.mark := MarkLeft;
		    rplaca(current, previous);
                    previous := current;
                    current := down;
		  end;
		elsif cdr(current) /= null then
		  declare
                    down: NodePtr := cdr(current);
		  begin
		    rplacd(current, previous);
                    previous := current;
                    current := down;
		  end;
		end if;

	      -- mark other nodes that require tracing
	      when ...
	    
	      -- mark objects that don't contain pointers
              when Fixnum =>
		current.mark := Marked;
	      when ...
	    end case;

        end loop;


	--backup to a point where we can continue descending
	loop

	    --make sure there is a previous node
	    if previous /= null then
	       if previous.mark = MarkLeft then  --came from left side
		 previous.mark := Marked;
		 declare
		   up: NodePtr := car(previous);
		 begin
                   rplaca(previous, current);
                   if cdr(previous) /= null then
                     current := cdr(previous);  --continue down the right side
                     rplacd(previous, up);
                     exit;
                   else
                     current := previous;       --step back up the branch
                     previous := up;
                   end if;
                 end;
	       else				--came from right side
		 declare
		   up: NodePtr := cdr(previous);
		 begin
		   rplacd(previous, current);
		   current := previous; 	--step back up the branch
		   previous := up;
		 end;
	       end if;
	    else
	      -- no previous node, must be done
		return;
    	    end if;
	 end loop;

    end loop;

end mark;


Granted, this is tricky code, and the logic of what is really
happening here is hard to follow.  Worse, every time I add a new kind of
object to the LISP system, say Symbols or Environments, I must
change this marking procedure (not to mention the many other places where
case analysis is going on).  

Here's the non-type-discriminating version implemented in C++!
I've made Node a base class and derive all Node kinds from this class.
The mark routine is implemented entirely abstractly in terms of Nodes
and the case analysis is replaced with dynamic binding--virtual method calls.

class Node;

typedef int boolean;
enum { FALSE, TRUE };

struct MarkState {
  Node *current;
  Node *previous;
};

class Node {
public:
  enum MarkSet { Unmarked, Marked, MarkLeft };
  MarkSet mark;

  virtual boolean descend(MarkState &m) = 0;   // implementation deferred
  virtual boolean ascend(MarkState &m) = 0;    // implementation deferred
};


void mark(Node *p)
{
   MarkState m;
   m.previous = NIL;
   m.current = p;

   while (1) {

     while (m.current->descend(m))
       ;
     while (1) {
       if (m.previous == NIL)
	  return;
       if (!m.previous->ascend(m))
	  break;
     }
   }
}

// that's it!

class Cons : public Node {

  ...
public:
  // mark this cell, reverse a pointer, and step down into the car or cdr;
  // return TRUE if still descending, else FALSE
  virtual boolean descend(MarkState &m);

  // restore one reversed pointer on the way back up;
  // return TRUE if still ascending, FALSE once descending can continue
  virtual boolean ascend(MarkState &m);
};
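
Filled in for Cons, the two methods come out roughly like this (an
untested sketch following the same pointer-reversal logic as the Ada
version; it assumes the theCar/theCdr members from my earlier Cons
class, the mark field in Node, and whatever NIL is bound to in the
driver above):

boolean Cons::descend(MarkState &m)     // called on m.current
{
  if (mark != Unmarked)
    return FALSE;                       // already traced (shared structure)
  mark = Marked;
  if (theCar != NIL) {
    Node *down = theCar;
    mark = MarkLeft;                    // remember we went down the car side
    theCar = m.previous;                // reverse the car pointer
    m.previous = this;
    m.current = down;
    return TRUE;                        // keep descending
  }
  if (theCdr != NIL) {
    Node *down = theCdr;
    theCdr = m.previous;                // reverse the cdr pointer
    m.previous = this;
    m.current = down;
    return TRUE;                        // keep descending
  }
  return FALSE;                         // no children: time to back up
}

boolean Cons::ascend(MarkState &m)      // called on m.previous
{
  if (mark == MarkLeft) {               // the traversal came back up our car
    mark = Marked;
    Node *up = theCar;                  // the saved back pointer
    theCar = m.current;                 // restore the car
    if (theCdr != NIL) {
      m.current = theCdr;               // descend the cdr next
      theCdr = up;                      // reverse it, keeping the back pointer
      return FALSE;                     // stop ascending; driver descends again
    }
    m.current = this;                   // no cdr: step back up the branch
    m.previous = up;
    return TRUE;                        // keep ascending
  } else {                              // the traversal came back up our cdr
    Node *up = theCdr;
    theCdr = m.current;                 // restore the cdr
    m.current = this;                   // step back up the branch
    m.previous = up;
    return TRUE;                        // keep ascending
  }
}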



Chip Salzenberg raised an interesting point: 'switch() is a table
lookup'.  Has anyone done a study to see just how much storage is
used in switch() jump tables for a large non-OOP implementation vs. an
OOP implementation?  I'm less concerned about the storage occupied by vft's
than I was earlier--I'd rather pay the price for the dispatch table once
(in the class) than many times (at each switch()!?).

--Larry

roger@eos.eos.scg.hac.com (Roger Sheldon (Loral)) (04/12/91)

In article <1991Mar12.221015.22144@aero.org> jordan@aero.org (Larry M. Jordan) writes:
>I'm endeavoring to implement an interpreter/compiler for a LISP-like language
>in C++.  

Larry,
I've been working on Lisp in C++ also.  I've discovered a garbage collection
technique which you and others may find useful.  Here's a summary
of the technique.

The class hierarchy is similar to yours:

                        Base -------------- Object
       |-----------------|---------------|
      Cons             Symbol         Integer
       |           |-----|-----|
      Null  Static_symbol    Function

(The relationship of the Object class will be explained below.)
Garbage collection is accomplished via a combination of reference maintenance
and temporary objects.  Reference maintenance is done with a
few member functions:

Base *Base::Ref()       { refs++; return this; }
void Base::Deref()      { if (--refs == 0) delete this; }
void Cons::Deref() {                 // hides inherited Base::Deref()
    Base *a = this;
    while (a != nil.value && --a->refs==0) {
        a->Car().Deref();            // delete car obj (clobber if last ref)
        Base *save = & a->Cdr();
        delete a;                    // may be deleting *this* !
        a = save;
    }
}

The Cons::Deref() method is a variation of the deref() method found in
Gnu's libg++ parameterized List class. (Thanks FSF) That's pretty much it 
for reference maintenance.  The Null and Static_symbol classes override the 
inherited Deref() method with no-ops, so there's no need to worry about 
refs and derefs to nil and t.
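
That is, roughly:

void Null::Deref()          { }      // nil is never reclaimed
void Static_symbol::Deref() { }      // nor are statically allocated symbols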

Now, the temporary objects work with the reference maintenance as follows.
The Object class is defined as: 

class Object { 
    Base *value;
public:
    Object() 			{ value = nil.value; }
    Object(Object &a) 	{ value = a.value->Ref(); }
    Object(Base &a) 	{ value = a.Ref(); }
    ~Object() 			{ value->Deref(); }
    Object &operator=(Object &a){ 
        Base *save=value; 
        value = a.value->Ref();
        save->Deref();
        return *this;
    }
    ...
};

User code manipulates Lisp objects only through the Object class.  Specifically,
all user functions must use the same signature:
	Object foo(Object &a, Object &b <possibly more args>);
The main point is that Lisp objects are always returned through copied objects.
This allows user code to be written without the need to do reference 
maintenance -- the Object class takes care of reference maintenance 
automatically.  Here's an example of some user code:

Object shove_pl(Object &variable, Object &item, Object &alist) {
    if (!alist)
        return list(list(variable, list(item)));
    else if (variable == caar(alist))
        return cons(list(variable, append(cadar(alist), list(item))),
                    cdr(alist));
    else
        return cons(car(alist), shove_pl(variable, item, cdr(alist)));
}

This is a translation of the Common Lisp shove_pl function found in
Winston's Lisp book:

(defun shove-pl (variable item a-list)
  (cond ((null a-list) (list (list variable (list item))))
        ((equal variable (caar a-list))
         (cons (list variable (append (cadar a-list) (list item)))
               (cdr a-list)))
        (t (cons (car a-list) (shove-pl variable item (cdr a-list))))))

OK, so this approach allows you to write elegant code.  But how
expensive is this garbage collection technique (which I'm calling
"Immediate" garbage collection since objects are clobbered immediately
after the last reference to them is removed)?
I'm not going to get into a big discussion on performance.  I will
say that I've converted user functions to do their own reference maintenance 
(they deal with Base *'s rather than Objects) and the performance
improvement has ranged from nothing to about a factor of three.

Feel free to call me at 301 805-0463.

Roger Sheldon

bpendlet@bambam.dsd.es.com (Bob Pendleton) (04/18/91)

In article <14342@hacgate.UUCP>, roger@eos.eos.scg.hac.com (Roger Sheldon (Loral)) writes:

> OK, so this approach allows you to write elegant code.  But how
> expensive is this garbage collection technique (which I'm calling
> "Immediate" garbage collection since objects are clobbered immediately
> after the last reference to it is removed) ?

It's called reference counting. It's been used since the dawn of time.
Reference counting usually uses more total time than traditional
garbage collection. The other cost of reference counting is the
storage used by the count associated with each object. 

For cons cells the counts add 2 to 4 bytes to the size of each cell.
Usually a cons cell is only 8 bytes to begin with. A cons cell can be
as small as 4 bytes if it contains indexes instead of pointers. So a
reference count can chew up a lot of memory.

In older languages keeping track of the reference count was
error-prone.  But C++ classes let you hide the details and be sure that
the reference count is incremented and decremented at the right times.

(Some of the folks around here have coffee cups with "How could the
reference count be zero?" (Hi Janet) on them. They're left over from a
project written in C that used reference counting. It was a famous and
much dreaded runtime error message.)

The advantage of reference counting is that the program never "goes to
lunch" garbage collecting.  Reference counting gives the user
consistent response times, which is something that users love.  I think
that this one benefit more than justifies the use of reference counts.

One thing to consider doing: when the reference count goes to zero, put
the object on a type-specific free list.  When you need a new object of
a given type, try to get it from its type-specific free list first.
This way you're not always going through the general-purpose heap code.
Avoiding the general-purpose code can give you a real performance
boost.

-- 
              Bob Pendleton, speaking only for myself.
   bpendlet@dsd.es.com or decwrl!esunix!bpendlet or utah-cs!esunix!bpendlet

                         Tools, not rules.

davis@barbes.ilog.fr (Harley Davis) (04/22/91)

In article <1991Apr17.233653.25149@dsd.es.com> bpendlet@bambam.dsd.es.com (Bob Pendleton) writes:

   It's called reference counting. It's been used since the dawn of time.
   Reference counting usually uses more total time than traditional
   garbage collection. The other cost of reference counting is the
   storage used by the count associated with each object. 

One of the most important problems with reference counting is that it
doesn't collect all the garbage.  In particular, circular data
structures never get collected.  Consider the following program:

(defvar *some-list* '(a b))
(setf (cddr *some-list*) *some-list*)
(setf *some-list* ())

Now, despite the fact that there is no way to access any of the cells
in *some-list*, all of them have a reference count of 1.  Oops.

-- Harley Davis

--
------------------------------------------------------------------------------
nom: Harley Davis			ILOG S.A.
net: davis@ilog.fr			2 Avenue Gallie'ni, BP 85
tel: (33 1) 46 63 66 66			94253 Gentilly Cedex, France

gintera@fsd.cpsc.ucalgary.ca (Andrew Ginter) (04/22/91)

In article <DAVIS.91Apr22105508@barbes.ilog.fr> davis@barbes.ilog.fr (Harley Davis) writes:
>One of the most important problems with reference counting is that it
>doesn't collect all the garbage.  In particular, circular data
>structures never get collected.

This was true of reference counting schemes until 1986.  An article
published in Software-Practice and Experience that year describes how
reference counting systems can reclaim circular lists fairly simply.
The exact name and author of the article escape me right now - they're
in my other office.  Drop me a note if you're desperate and I can dig
them up for you.

The circular list collection technique goes something like this:

* Scan the collected heap.  For each object in the heap, chase all
  pointers in the object and decrement the refcount on the target
  object.  At the end of the scan, all of the contribution of internal
  pointers to refcount fields has been eliminated.

* Scan the heap again.  Mark any object with a non-zero refcount
  field.  These are the objects which are referenced by "root
  pointers" (pointers which reside outside of the heap and which refer
  to collected objects).

* Scan the heap again.  Starting from each object marked in the previous
  scan, recursively mark everything reachable from it, incrementing the
  target's reference count as you traverse each pointer.

* Scan the heap again.  Any objects in the heap which still have a zero
  reference count can be reclaimed - these are the objects which were
  part of unreachable circular lists.

I'm writing this from memory, so this may not be exactly right, but it
gives the general idea.  I suspect that some of the heap scans can be
collapsed too, but again, I don't remember the details.
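
A rough C++ sketch of the four scans, for what it's worth (the Obj
layout, the fixed two outgoing pointers, and the all-objects list are
stand-ins for whatever the real collector provides; only the scans
themselves follow the description above):

  struct Obj {
      int  refs;        // the ordinary reference count
      int  marked;      // scratch: 0 = unknown, 1 = root, 2 = traced
      Obj *next;        // threads every object in the collected heap
      Obj *child[2];    // outgoing heap pointers (null if unused)
  };

  static Obj *heap;     // head of the all-objects list

  // Scan 3 helper: trace one live object exactly once, restoring the
  // contribution of each pointer it holds.
  static void trace(Obj *o)
  {
      if (o->marked == 2)
          return;                        // already traced
      o->marked = 2;
      for (int i = 0; i < 2; i++) {
          Obj *c = o->child[i];
          if (c == 0)
              continue;
          c->refs++;                     // put this internal reference back
          trace(c);
      }
  }

  void collectCycles()
  {
      Obj *o;

      // Scan 1: subtract all internal (heap-to-heap) references, so that
      // refs temporarily counts only references from root pointers.
      for (o = heap; o != 0; o = o->next) {
          o->marked = 0;
          for (int i = 0; i < 2; i++)
              if (o->child[i] != 0)
                  o->child[i]->refs--;
      }

      // Scan 2: objects still holding a non-zero count are referenced
      // from outside the heap; mark them as roots.
      for (o = heap; o != 0; o = o->next)
          if (o->refs > 0)
              o->marked = 1;

      // Scan 3: from each root, recursively mark what is reachable and
      // restore the counts contributed by pointers held in live objects.
      for (o = heap; o != 0; o = o->next)
          if (o->marked == 1)
              trace(o);

      // Scan 4: anything still at zero was reachable only through an
      // unreachable cycle; reclaim it.
      Obj **link = &heap;
      while ((o = *link) != 0) {
          if (o->refs == 0) {            // equivalently: o->marked == 0
              *link = o->next;           // unlink from the all-objects list
              delete o;                  // or return it to a free list
          } else
              link = &o->next;
      }
  }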

Andrew Ginter, 403-220-6320, gintera@cpsc.ucalgary.ca

tmb@ai.mit.edu (Thomas M. Breuel) (04/23/91)

   From: davis@barbes.ilog.fr (Harley Davis)
   In article <1991Apr17.233653.25149@dsd.es.com> bpendlet@bambam.dsd.es.com (Bob Pendleton) writes:
      It's called reference counting. It's been used since the dawn of time.
      Reference counting usually uses more total time than traditional
      garbage collection. The other cost of reference counting is the
      storage used by the count associated with each object. 

   One of the most important problems with reference counting is that it
   doesn't collect all the garbage.  In particular, circular data
   structures never get collected.  [...]

Reference counting is, however, ideal for garbage collecting large,
non-recursive objects like arrays of type real, complex, bit, etc.
Any other kind of garbage collection will have much higher overhead
in this case, and you can never build circular data structures out
of them.

hoelzle@neon.Stanford.EDU (Urs Hoelzle) (04/23/91)

tmb@ai.mit.edu (Thomas M. Breuel) writes:

>   [Others saying that reference counting is slower than other forms
>   of GC]

>Reference counting is, however, ideal for garbage collecting large,
>non-recursive objects like arrays of type real, complex, bit, etc.
>Any other kind of garbage collection will have much higher overhead
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>in this case, and you can never build circular data structures out
>of them.

What is the reason why this must be so?  Do you have any numbers to
show that reference counting is superior in this case?  (Read: I don't
believe it.)

-- 
------------------------------------------------------------------------------
Urs Hoelzle                                            hoelzle@cs.stanford.EDU
Center for Integrated Systems, CIS 42, Stanford University, Stanford, CA 94305

marti@mint.inf.ethz.ch (Robert Marti) (04/23/91)

In article <1991Apr23.000632.9175@neon.Stanford.EDU> hoelzle@neon.Stanford.EDU
(Urs Hoelzle) writes in response to an earlier message by tmb@ai.mit.edu
(Thomas M. Breuel):
>>[Any kind of garbage collection other than reference counting] will have
>>much higher overhead [in the case of non-recursive data structures].
>What is the reason why this must be so?  Do you have any numbers to
>show that reference counting is superior in this case?  (Read: I don't
>believe it.)

Neither do I.  In fact, I was surprised to find that using the conservative
garbage collector due to Hans Boehm and Mark Weiser (and described in
Software Practice and Experience in 1988?) instead of doing reference
counting myself was _considerably_ faster in a C++ implementation which
used many dynamically allocated arrays.  If my memory serves me correctly,
the running times dropped to about 60% of the original ones.  (And, yes,
the application ran long enough to incur several garbage collections!)

Robert Marti                      |  Phone:    +41 1 254 72 60
Institut fur Informationssysteme  |  FAX:      +41 1 262 39 73
ETH-Zentrum                       |  E-Mail:   marti@inf.ethz.ch
CH-8092 Zurich, Switzerland       |

davis@barbes.ilog.fr (Harley Davis) (04/24/91)

In article <1991Apr22.165825.13552@cpsc.ucalgary.ca> gintera@fsd.cpsc.ucalgary.ca (Andrew Ginter) writes:

   In article <DAVIS.91Apr22105508@barbes.ilog.fr> davis@barbes.ilog.fr (Harley Davis) writes:
   >One of the most important problems with reference counting is that it
   >doesn't collect all the garbage.  In particular, circular data
   >structures never get collected.

   This was true of reference counting schemes until 1986.  An article
   published in Software-Practice and Experience that year describes how
   reference counting systems can reclaim circular lists fairly simply.
   The exact name and author of the article escape me right now - they're
   in my other office.  Drop me a note if you're desperate and I can dig
   them up for you.

   The circular list collection technique goes something like this:

    [mark and sweep too]

Of course, if you use other gc methods along with reference counting, the
problem can be fixed.  But this was obvious way before 1986.  I think
Danny Bobrow did an implementation of a mixed ref count/full scan gc
in InterLisp in the late 70s/early 80s.

-- Harley Davis
--
------------------------------------------------------------------------------
nom: Harley Davis			ILOG S.A.
net: davis@ilog.fr			2 Avenue Gallie'ni, BP 85
tel: (33 1) 46 63 66 66			94253 Gentilly Cedex, France