[comp.arch] "General-purpose" architectures and symbolic languages

vanroy@bellatrix.uucp (02/22/89)

  Does anyone know of any comparisons of algorithms implemented in symbolic
languages (i.e. Prolog/Lisp) without any special annotations versus in an
imperative language?  This involves writing the algorithm in a natural style
in both languages, and using high-performance (perhaps experimental) compilers
to compare the execution speeds.  The results depend on many factors:
(1) the choice of algorithm, (2) programming style, (3) the quality of the
compilers, and (4) the suitability of the architecture for the symbolic
language.
  I know of one comparison done at Argonne in which a theorem prover was
written in Prolog and in C.  The C version consistently outperformed the
Prolog version by an order of magnitude.  This was due to (1) the poor
implementation of arrays in the Prolog version and (2) the awkwardness of the
architecture for Prolog (I believe it was a Sun-3).  Has anyone done any other
such comparisons?
  Present-day general-purpose architectures will naturally favor the imperative
language in a comparison of this sort.  This is because the traditional notion
of "general-purpose" architecture is not truly general-purpose.  It emphasizes
numeric calculation and does not do the things that symbolic languages require,
such as tag manipulation and multiway branching.  Therefore I propose that the
notion of "general-purpose" be extended to include support for symbolic
languages.
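To make the cost concrete, here is a minimal sketch of the tag manipulation and multiway branching that symbolic languages need.  The 2-bit tag layout and the tag names are illustrative assumptions, not any particular Prolog system; the point is that on a stock architecture each tag test costs an explicit mask, compare, and branch.

```c
#include <stdint.h>

/* Hypothetical 2-bit tag in the low bits of a 32-bit cell.
   On a stock architecture each tag test costs an AND, a compare,
   and a branch; a tagged architecture folds these into the
   instruction itself. */
enum { TAG_REF = 0, TAG_INT = 1, TAG_ATOM = 2, TAG_STRUCT = 3 };

static inline uint32_t tag_of(uint32_t cell)   { return cell & 0x3; }
static inline uint32_t value_of(uint32_t cell) { return cell >> 2; }
static inline uint32_t make_cell(uint32_t v, uint32_t t)
{
    return (v << 2) | t;
}

/* Multiway branch on the tag: one switch statement here, but
   several test-and-branch instructions on hardware without
   tag support. */
static const char *describe(uint32_t cell)
{
    switch (tag_of(cell)) {
    case TAG_REF:  return "unbound/reference";
    case TAG_INT:  return "integer";
    case TAG_ATOM: return "atom";
    default:       return "structure";
    }
}
```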
  At Berkeley, we are developing such an architecture for the Prolog language.
We are designing it from the start as a complete system, i.e. compiler,
instruction set architecture, and (single-chip) implementation are being
developed together with feedback in all directions.  In this way we hope to
overcome some of the limitations of previous implementations of Prolog.
  Thanks for any pointers,

	Peter Van Roy
	vanroy@bellatrix.berkeley.edu

wilson@carcoar.Stanford.EDU (Paul Wilson) (02/22/89)

I too am interested in such issues.

One thing I've been wondering about is the compatibility of RISC-style
caches and Lisp Machine-style forwarding pointers.

On Lisp Machines, the time required to check for invisible forwarding
pointers is generally hidden in time required for cache operation,
but I'm unsure of the details.  Do the new cache architectures still
allow this?  (I know they don't generally support such snazzy features.
The question is whether other design decisions directly conflict with
invisible forwarding.)
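For readers who haven't used a Lisp Machine: "invisible forwarding" means every read must check whether the fetched word is a forwarding pointer and, if so, follow it transparently.  A software sketch of the check (the cell layout and names are illustrative assumptions; on a Lisp Machine this is done by microcode or read-path hardware):

```c
#include <stdint.h>

/* Illustrative cell: a tag plus a payload that is either an
   immediate value or a pointer to another cell. */
enum { TAG_DATA, TAG_FORWARD };

typedef struct cell {
    int tag;
    union { uint32_t value; struct cell *forward; } u;
} cell;

/* Every read checks for a forwarding pointer and follows it
   invisibly -- the check a Lisp Machine hides in the memory
   path, but which costs explicit instructions on a stock CPU. */
static uint32_t read_cell(const cell *c)
{
    while (c->tag == TAG_FORWARD)   /* chase forwards transparently */
        c = c->u.forward;
    return c->u.value;
}
```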


Also, is anybody doing any work on coordinating cache operation with
garbage collection?  With a large cache (100 KB or larger), you should
be able to have a small Newest generation (say, 50 KB) that is scavenged
completely within the cache -- no thrashing.

Even if the Newest generation won't fit in the cache, you should be
able to avoid swapping a lot of cache blocks out by telling the cache
hardware if they only contain garbage.  (i.e., force them to appear
clean, so they won't be written back, or tell the cache to "create"
a block without actually reading in the corresponding block from
main memory).  It seems that this would cut cache-memory traffic by
approximately half (?) for the Newest generation.

Anybody out there doing such work?


Paul R. Wilson                         
Human-Computer Interaction Laboratory    lab ph.: (312) 413-0042
U. of Ill. at Chi. EECS Dept. (M/C 154) 
Box 4348   Chicago,IL 60680              wilson@carcoar.stanford.edu

jwmills@iuvax.cs.indiana.edu (Jonathan Mills) (02/22/89)

In article <28113@ucbvax.BERKELEY.EDU> vanroy@bellatrix.uucp () writes:

>Present-day general-purpose architectures will naturally favor the imperative
>language in a comparison of this sort.  This is because the traditional notion
>of "general-purpose" architecture is not truly general-purpose.  It emphasizes
>numeric calculation and does not do the things that symbolic languages require
>such as tag manipulation and multiway branching.  Therefore I propose that the
>notion of "general-purpose" be extended to include support for symbolic
>languages.

>At Berkeley, we are developing such an architecture for the Prolog language.
>We are designing it from the start as a complete system, i.e. compiler,
>instruction set architecture, and (single-chip) implementation are being
>developed together with feedback in all directions.  In this way we hope to
>overcome some of the limitations of previous implementations of Prolog.
>  Thanks for any pointers,


In response to the queries about extending the definition of "general purpose" 
architectures to include symbolic processing, it's been done, and the results
are very exciting.  I have a single chip design for a Prolog machine, and
frankly think it's pretty good (but we know it could be improved, don't we?).

Because the design was closely tied to discussions and experiments with Kevin
Buettner, the author of Applied Logic Systems Prolog compiler, the
architecture is the first "native code" Prolog processor below the level of the
WAM that is tied in to the optimizations that a Prolog compiler would typically
perform.

I've been flogging the notion of a Prolog RISC since 1986, and have evolved a
design which we are building now at Indiana University with some support from 
Motorola;  Frank McCabe & British Aerospace have been influenced in some small 
part by my earliest LOW RISC I, and have independently come up with a similar 
design (the DLM, a very nice machine based on the little-appreciated sigma 
machine).  Because of the similarity in designs, and because there is only so 
much you can do in a limited instruction set, I'd argue that we already have a 
good notion of what a general purpose processor needs.  First some history:

1985 - WAM taken apart, pieces used to build a Sub-WAM for Ken Bowen's
	 ALS PC Prolog compiler.  Fast implementation; taught us how to
	 compile Prolog to native code.

1986 - LOW RISC I.  Arcane 7 opcodes.  Verified need for integrated tag &
	 value processing; had problems with excessive branching & deref loops.

1987 - LOW RISC II.  Simulated, nominal speeds 300 - 800 KLIPS.  Still had
	 excessive branching problems (particularly in unification).

     - 88000e.  Brian Short extended the 88000 to a tagged architecture.
	 Similar performance to LOW RISC II.  His thesis is an excellent piece
	 of work, but you have to sign a non-disclosure to get it.  Try calling
	 Sam Daniel, Motorola GEG, (602) 897-3010 if you want a copy.  It might
	 not be releasable until late 1990, though.

1988 - LIBRA.  A "balanced" RISC with 35 instructions. ("Balance" is a concept 
	 described by Michael Flynn of Stanford and others - I recall a BISC
	 described at a DoD conference).

	 The LIBRA runs WAM macros with 2.3x fewer cycles than the PLM, but 
	 code optimizations can improve this even more.  Every opcode
	 is conditioned to reduce branch frequency (a la the ARM), and branch 	 
	 frequency is further reduced by a single cycle partial unifier.  It is 
	 possible that an ECL gate array version with an instruction prefetch
	 buffer, an ECL cache, and interleaved 100ns RAM could nominally run 	 
	 over1,000,000 inferences per second, and could max out on the inner
	 loop of NREV at more than 5.7 MLIPS (but not much more).


The architecture of the LIBRA is discussed in more detail in the following
posting, derived from an article being written for IEEE Micro.

jwmills@iuvax.cs.indiana.edu (Jonathan Mills) (02/22/89)

LIBRA: LOGICAL INFERENCE BALANCED RISC ARCHITECTURE (name by Tony Faustini)

The LIBRA is a 40-bit 4-stage pipelined processor with four major 
functional units, each pipelined & operating synchronously in parallel:

	1. Value ALU.  Contains separate hardware for arithmetic/logic,
			   dereferencing, and bounds checking.
	2. Tag ALU.	   Tag comparison, interface to partial unifier.
	3. GC ALU.	   Support for mark & sweep GC algorithms.
	4. PC ALU.	   Next-instruction fetching; many branches are simple
			   counter loads, but of a partial field.  The fastest
			   branches are within a page (yes, this is a paged
			   machine, at least for now.  It doesn't bother the
			   compiler writer I work with, but I do recognize the
			   problems for multiprocessing.)

The machine is microcoded, but uses only one control word per machine 
instruction.  An alternate microcode address is latched after every 
instruction that sets condition codes (more on this in partial unify).

The 40-bit data word is divided into a 3-bit type, 1 bit each for mark & 
reverse garbage collect flags, and a 35-bit value which can be further 
typed for use with 32-bit numeric host processors.
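The word layout described above can be sketched as bit operations on a 64-bit container.  The field widths come from the text; the bit positions are my assumption, since the article gives only widths.

```c
#include <stdint.h>

/* LIBRA 40-bit word, packed into the low bits of a uint64_t.
   Assumed layout (widths from the article, positions assumed):
   [39:37] type   [36] mark   [35] reverse-GC   [34:0] value   */
#define VALUE_BITS 35

static inline uint64_t pack(uint64_t type, uint64_t mark,
                            uint64_t rgc, uint64_t value)
{
    return (type << 37) | (mark << 36) | (rgc << 35)
         | (value & ((1ULL << VALUE_BITS) - 1));
}

static inline uint64_t word_type(uint64_t w)  { return (w >> 37) & 0x7; }
static inline uint64_t word_mark(uint64_t w)  { return (w >> 36) & 0x1; }
static inline uint64_t word_rgc(uint64_t w)   { return (w >> 35) & 0x1; }
static inline uint64_t word_value(uint64_t w)
{
    return w & ((1ULL << VALUE_BITS) - 1);
}
```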


NON-ORTHOGONAL REGISTER FILE

There are 32 registers.  4 are stack pointers (see why in a minute), 24 are
general purpose, and 4 are link registers for subroutine calling (or 
continuation pointers a la the WAM).  The register bank is non-orthogonal 
to allow single cycle instructions that can:

	push register A using register B as a pointer to memory,
	increment/decrement register B,
	store register B into register C, overwriting the tag in C with an
		immediate.

These instructions make use of parts of the datapath, provided for 
forwarding, that would otherwise be idle, and route their contents back 
to the register file.  This is useful in a structure-copying 
implementation of Prolog.


PARTIAL UNIFY

Microcode addressing is typically provided or modified by flags, fields in 
the microinstruction, and a microprogram counter.  In the LIBRA 
concatenated-tags are used for microcode addressing.  Using this technique, 
the microcode is viewed as a two-dimensional array of control words.  The 
two tags saved from the previous instruction's operands are used to index 
the microcode array, and select the control word for a replaceable 
instruction.  Single-cycle partial unify performs either a nop, a store, a 
call or a branch, thus condensing three to five tag checking instructions 
into one.  The partial unify instruction can eliminate as many as 30% of 
the subroutine calls performed by a general purpose RISC running Prolog.

By applying this operation to the LIBRA's partial unification and switch 
instructions, dereferencing can be made a conditional operation, performed 
only when necessary.  More significantly, the pair of checks for a bound 
variable (one is needed for each operand being unified) can be included in 
the operation of the partial unification instruction by simply enlarging 
the unifier table in ROM, thus removing as many as eight instructions 
(which include two branches) from the direct execution sequence.  The 
modified partial unify instruction thus reduces the number of instruction 
cycles spent dereferencing objects reached in eight or fewer indirections;  
this optimization improves the performance of unification from 60% to 100% 
by conditionally omitting dereferencing as well as explicit checks for 
bound variables. Here's an example:



Label   Condition   Instr   sc   Operands           Comment
-----------------------------------------------------------------
loop:   if bound1   ld           Xn  0   Xn
        if bound2   ld           Ai  0   Ai
enter:              sub     sc   Xn  Ai  r27
                    unify   sc   Xn  Ai  loop exit  ; no mode split
        if trail1   push+        TR  Xn
        if trail2   push+        TR  Ai
exit:
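The concatenated-tag dispatch can be sketched as a two-dimensional table indexed by the two operand tags.  The tag values and the action table below are illustrative assumptions, not the actual LIBRA microcode; the point is that one table lookup replaces a chain of three to five test-and-branch instructions.

```c
/* Illustrative 2-bit tags for the two operands being unified. */
enum { T_REF = 0, T_INT = 1, T_ATOM = 2, T_STRUCT = 3 };

/* What the selected control word does: the partial unifier
   performs a nop, a store (bind), a call, or a branch (fail). */
enum action { ACT_NOP, ACT_STORE, ACT_CALL, ACT_BRANCH };

/* Microcode viewed as a 2-D array: the concatenated operand tags
   directly index the control word.  Table entries are a plausible
   guess (e.g. int/int would still need a value compare, elided
   here); only the dispatch mechanism matters. */
static const enum action unify_table[4][4] = {
    /*             REF        INT         ATOM        STRUCT   */
    /* REF */    { ACT_STORE, ACT_STORE,  ACT_STORE,  ACT_STORE  },
    /* INT */    { ACT_STORE, ACT_NOP,    ACT_BRANCH, ACT_BRANCH },
    /* ATOM */   { ACT_STORE, ACT_BRANCH, ACT_NOP,    ACT_BRANCH },
    /* STRUCT */ { ACT_STORE, ACT_BRANCH, ACT_BRANCH, ACT_CALL   },
};

static enum action partial_unify(unsigned tag1, unsigned tag2)
{
    return unify_table[tag1 & 3][tag2 & 3];
}
```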


SMART CACHE

Tags can also be used to improve cache hit rates by preventing transient 
data, such as intermediate elements in a pointer chain, from entering a 
cache.  This implies that dereferencing should take place outside the 
cache, rather than at the chip (which was suggested in 1986 by Mats 
Carlson).  If a "dereferenced choice point" is built once for a procedure, 
then the bottleneck that this would otherwise form can be reduced.  Also, 
the only hardware interlock built into the LIBRA is used to stall the 
pipeline during dereferencing.  Moving dereferencing into the cache would 
clean up the design.

It is also possible to do a CDC-6600 like trick, and move trailing and
failing out of the CPU to a peripheral processor (also done by McCabe in 
the DLM).  If cache entries are monitored by the trail-fail processor, and 
reset if requested by the CPU, the failure of a previous goal can be 
overlapped with the execution of the subsequent goal, thus improving 
backtracking performance.


SYMBOLIC BOUNDS CHECKING & TRAIL-CHECK SCOREBOARDING

Automatic bounds checks are provided for trail and current environment 
checking. Some bounds checks are "sticky", to allow detection of an "almost 
stack collision" condition.  Then garbage collection can be initiated at 
programmer defined points.  This saves stack collision checking at every 
call, and further reduces branching.

Scoreboarding is generalized to include marking pre-processed data as well 
as data waiting to be processed.  The trail check is moved to occur during 
a load or the final stage of a dereference.  If the check shows that an 
unbound variable is being loaded (one that must be trailed if it is 
instantiated later), the loaded register is marked by setting a trail-check 
flag in the scoreboard.
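The scoreboarded trail check can be sketched as a per-register flag set at load time and consulted at bind time.  Register count, trail representation, and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define NREGS 32

/* Illustrative scoreboard: one trail-check flag per register, set
   when a load (or the final dereference step) sees an unbound
   variable, and consulted when that register is later bound. */
static int      trail_flag[NREGS];
static uint32_t trail_stack[1024];
static size_t   trail_top;

/* Load-time check, moved out of the bind path as in the text. */
static void load_reg(int r, int is_unbound_var)
{
    trail_flag[r] = is_unbound_var;
}

/* Bind-time: trail only if the scoreboard says we must. */
static void bind_reg(int r, uint32_t addr)
{
    if (trail_flag[r]) {
        trail_stack[trail_top++] = addr;  /* record for backtracking */
        trail_flag[r] = 0;
    }
}
```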


SYMBOLIC CONDITIONING OF EVERY INSTRUCTION

Conditional instruction execution is used to implement preferred branches 
and is also used with numeric conditions to control the execution of every 
instruction in the Acorn RISC Machine (Acorn Computers Limited 1986, 1987).  
The LIBRA architecture extends this concept by adding symbolic conditions, 
allowing operations which formerly needed several test-and-branch 
instructions to be coded instead as a sequence of instructions, all of 
which are conditionally nops.  This reduces the branch frequency of the 
LIBRA, and improves its ability to use interleaved memory.  Symbolic 
conditions include certain variable types (bound1, bound2), trail and 
environment checks (trail1, trail2, env1, env2), and collision checks 
("sticky overflow").

Overall, the use of conditional instruction execution, memory interleaving, 
tag-controlled caching, instruction prefetching and loop-buffering 
(lookahead, look-behind) allow the LIBRA to execute short inner loops and 
shallow backtracking at speeds limited primarily by the cache memory cycle 
time rather than the hit rate.


A GENERAL PURPOSE ARCHITECTURE?

Well, the LIBRA was compared only for Prolog, but it was compared to the 
PLM, the SPUR, the 68K and the 88K, and outperformed them all.  Peter 
Van Roy's assertion that general-purpose architectures should support 
symbolic processing is one I have been pushing for years.

The argument against it that has always been thrown back at me is, "We will 
have a 50-MIPS machine in a year or two; the performance advantage of a 
special-purpose processor is not worth it."

I disagree with that rebuttal, and consider Motorola's interest (which 
varies over time) to indicate that it might - just might - be possible to 
extract the essentials of the LIBRA and put them into place as an SFU in an 
88000.

Toward this end I have designed a 32-bit LIBRA with 16 instructions, that 
encodes the WAM in 2.16x fewer cycles than the PLM.  As an SFU this beast
would be capable of interpreting 32-bit data as tagged cells, would be able 
to take advantage of all of the 88K numeric and O/S specific support, and 
thus could deal with symbolic processing variants such as Constraint Logic 
Programming, as well as Prolog (and presumably Lisp - and after talking to 
the Schemers here at IU I can find nothing they do that the LIBRA can't 
encode).

CLOSING WORD:  I have probably left out some minor but interesting things, 
but if you are really interested & are free early this summer I will give a 
5-week seminar beginning around May 20th at IU on symbolic architectures, 
with an emphasis on their use in automated deduction systems.  E-mail me 
for the details.

Johnson@tilde.ti.com (Doug) (02/23/89)

In article <7051@polya.Stanford.EDU> you write:
  >I too am interested in such issues.
  >
  >One thing I've been wondering about is the compatibility of RISC-style
  >caches and Lisp Machine-style forwarding pointers.
  >
  >On Lisp Machines, the time required to check for invisible forwarding
  >pointers is generally hidden in time required for cache operation,
  >but I'm unsure of the details.  Do the new cache architectures still
  >allow this.  (I know they don't generally support such snazzy features.
  >The question is whether other design decisions directly conflict with
  >invisible forwarding.)

Forwarding pointers do not involve the cache on either the Explorer I or
Explorer II.  The Explorer I doesn't have a cache.  It uses microcode to
check the type of the data read and do the indirection if necessary.
It's strictly serial.

The Explorer II uses hardware in the read path to check the data type
and cause the CPU to do the indirection.  The major advantage is that it
eliminates the inline microcode check.  The Explorer II has a cache, but
the cache knows nothing about forwarding pointers -- they're just a
series of memory references.  

The essence of the problem is that the indirection is done based on the
type of the data fetched.  This is right in the critical path of the
memory interface.  Plus or minus a bit, the type check can't start until
the data could be presented to the CPU.  You either have to slow the
memory interface by the type check time, or have a way of saying "I
didn't mean it" after making the data available.  

Incidently, there is an analogous problem with implementing a read
barrier for incremental garbage collection.  You'd like to raise an
exception if a pointer to old space is read.  

  >
  >Also, is anybody doing any work on coordinating cache operation with
  >garbage collection?  With a large cache (100 KB or larger), you should
  >be able to have a small Newest generation (say, 50 KB) that is scavenged
  >completely within the cache -- no thrashing.
  >
  >Even if the Newest generation won't fit in the cache, you should be
  >able to avoid swapping a lot of cache blocks out by telling the cache
  >hardware if they only contain garbage.  (i.e., force them to appear
  >clean, so they won't be written back, or tell the cache to "create"
  >a block without actually reading in the corresponding block from
  >main memory).  It seems that this would cut cache-memory traffic by
  >approximately half (?) for the Newest generation.
  >
  >Anybody out there doing such work?

Bob Courts has a paper in the September 1988 CACM titled "Improving
Locality of Reference in a Garbage Collecting Memory Management System".
It focuses on the interaction between GC and virtual memory.  I suspect
much of the work applies to caches as well.

I strongly suspect that it's not worth worrying about for caches.
(Courts shows the significant advantages of worrying about virtual
memory.)  Unless your average cache lifetimes are long compared to the
interval between GC flips, the cache flushes caused by flips will be
small compared to those caused by normal running.  

This Explorer has been up for 236 hours and generation 0 (youngest) has
flipped 688 times -- averaging a flip every 21 minutes.  I couldn't turn
up any data on cache lifetimes in a quick look through my bookcase, but
I have the impression it's much shorter.  (Thought experiment:  A 5 mips
processor with a 128K byte cache, 20% load ratio, 1% miss rate, and
uniformly distributed misses will have an average cache life of about 3
seconds.)   Obviously, this comparison has a lot of holes in it, but
we're an order of magnitude apart.
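Doug's thought experiment works out as follows.  The miss granularity is my assumption: replacing one 4-byte word per miss is what lands the estimate near 3 seconds.

```c
/* Reproduce the back-of-envelope cache-lifetime estimate:
   5 MIPS, 20% of instructions are loads, 1% of loads miss,
   a 128 KB cache, and (assumed) 4 bytes replaced per miss. */
static double cache_life_seconds(void)
{
    double refs_per_sec   = 5e6 * 0.20;           /* 1e6 loads/s  */
    double misses_per_sec = refs_per_sec * 0.01;  /* 1e4 misses/s */
    double bytes_per_sec  = misses_per_sec * 4.0; /* 40 KB/s      */
    return (128.0 * 1024.0) / bytes_per_sec;      /* ~3.3 seconds */
}
```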

-- Doug

ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/23/89)

There has been a good deal of discussion about special architectures for
symbolic languages such as Lisp and Prolog.  I used to think that a
Lisp-specific architecture was the way to go, and the same for Prolog.
This is probably true *IF* you are talking about prototype/research
architectures.  But if you are going to sell a software/hardware product
for (or written in) a symbolic language, using stock processors seems to
be the way to go.

Designing a CPU, memory system, etc., plus an O/S, compiler, interpreter,
and environment takes an awful lot of time and money.  And by the time
they are ready, companies like MIPS and Motorola (maybe Intel, now with
the N10) have far more powerful processors built with cutting-edge
technologies.  So if your software product in a symbolic language is to
live more than a couple of years, pick stock hardware that seems likely
to live long and prosper, as well as stay competitive in the future (a
"scalable" architecture, as Sun calls it?), and build for it.  Afterward,
every time they come up with a faster processor, port your old compiler,
with constant improvements.  If you are building from scratch every time,
you cannot fine-tune your stuff.

It may be even better to compile into the source language of one of 
those well-honed compilers (such as MIPSCo's C), not bothering with the 
machine language.  You can take advantage of their optimizations...
(Not yet, probably....)

==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi	"Life's rock."   "Climb now, work later."
ohbuchi@cs.unc.edu	<on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================