[comp.arch] Optimal Computer Architectures

neal@uikku.tut.fi (Norwitz Neal) (11/09/90)

I have a recursive algorithm that I would like to run as fast as possible.  I am running a part of it on a SPARCstation SLC.  But at this rate, it might finish in a couple of years!!  I was wondering if someone could suggest a couple of architectures that may suit this recursive algorithm better than others.  The program takes almost no memory (about 12K), so all I need is speed.  There are no floating point operations, except a couple of increments.

Also, any information about speed ratings (in MIPS) for different computers would be appreciated.

Thanks a lot.

Neal Norwitz
neal@tut.fi
Tampere University of Technology

gillies@m.cs.uiuc.edu (11/09/90)

> I have a recursive algorithm that I would like to run as fast as
> possible.  I am running a part of it on a SPARCstation SLC.  But at
> this rate, it might finish in a couple of years!!  I was wondering if
> someone could suggest a couple of architectures that may suit this
> recursive algorithm better than others.  

Oh that's easy.  Any RISC BESIDES the SPARC should probably run much
faster.  In a recent benchmark I found that, on deeply recursive code,
the SPARCstation 1 was 2.8 times faster than a 16 MHz 68020 Mac II, but 6
times faster on non-recursive code.  SPARC seems to suck the big wazoo
on deep recursion, presumably because of its register-windowing
design.

tr@samadams.princeton.edu (Tom Reingold) (11/09/90)

In article <1990Nov8.212412.14317@funet.fi> neal@uikku.tut.fi (Norwitz
Neal) writes:

$ 
$ I have a recursive algorithm that I would like to run as fast as
$ possible.  I am running a part of it on a SPARCstation SLC.  But at
$ this rate, it might finish in a couple of years!!  I was wondering if
$ someone could suggest a couple of architectures that may suit this
$ recursive algorithm better than others.  The program takes almost no
$ memory (about 12K), so all I need is speed.  There are no floating
$ point operations, except a couple of increments.
$ 
$ Also, any information about speed ratings (in mips) from many different computers would be appreciated.

You pose a vague question, so I don't know what we (the net) can
suggest.  If it's really going to take years with the current
algorithm, it sounds worthwhile changing it to an iterative one.
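As a sketch of what I mean (with factorial standing in for your unposted
algorithm), a linear recursion like the first function below can usually be
rewritten as the loop in the second, which needs no call frames at all:

```c
/* Hypothetical stand-in -- the original algorithm was not posted. */
unsigned long fact_rec(unsigned long n)     /* recursive form */
{
    if (n <= 1)
        return 1;
    return n * fact_rec(n - 1);             /* one call frame per level */
}

unsigned long fact_iter(unsigned long n)    /* equivalent iteration */
{
    unsigned long acc = 1;
    while (n > 1)
        acc *= n--;                         /* no calls, no stack growth */
    return acc;
}
```

Not every recursion converts this mechanically, but when one does you save
every call, every return, and (on a SPARC) every window spill.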

Have you looked at lisp systems?  Some of them handle recursion pretty
efficiently.  I am not talking about lisp running under UNIX.  I am
talking about machines with lisp semantics in the processor.
--
        Tom Reingold
        tr@samadams.princeton.edu  OR  ...!princeton!samadams!tr
        "Warning: Do not drive with Auto-Shade in place.  Remove
	from windshield before starting ignition."

blc@mentat.COM (Bruce Carneal) (11/10/90)

For tail calls at least, the responsibility/blame for
excessive window overflows lies with the compiler, not
the architecture.  Hopefully the newer compilers
for the SPARC do/will handle tail calls without stack
growth.

seanf@sco.COM (Sean Fagan) (11/11/90)

In article <3300209@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>SPARC seems to suck the big wazoo
>on deep recursion, presumably because of its register-windowing
>design.

More to the point, *any* processor with register windows is not going to do
too well on deeply recursive code.

There are tricks a programmer can do to get rid of some recursion, and some
that the compiler can do, of course.
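One such trick, shown on a made-up example (the names and the depth bound are
mine, not from any post here): replace the machine's call stack with a small
explicit stack, which turns a non-tail recursion such as a binary-tree sum
into a plain loop.

```c
struct node { int val; struct node *left, *right; };

/* Iterative tree sum: each "recursive call" becomes a push onto an
   explicit stack, and the loop replaces the recursion entirely. */
int tree_sum(struct node *root)
{
    struct node *stack[64];     /* assumed depth bound for illustration */
    int top = 0, sum = 0;

    if (root)
        stack[top++] = root;
    while (top > 0) {
        struct node *n = stack[--top];  /* "return" from one pending call */
        sum += n->val;
        if (n->left)  stack[top++] = n->left;
        if (n->right) stack[top++] = n->right;
    }
    return sum;
}
```

The explicit stack holds one pointer per pending call instead of a whole
register window, so the traffic to memory drops accordingly.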

Some people at Berkeley did some studies with their RISC chip and LISP;
since LISP is generally very recursive, and the Berkeley people seem to like
register windows, those are probably worth reading.

-- 
-----------------+
Sean Eric Fagan  | "*Never* knock on Death's door:  ring the bell and 
seanf@sco.COM    |   run away!  Death hates that!"
uunet!sco!seanf  |     -- Dr. Mike Stratford (Matt Frewer, "Doctor, Doctor")
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (11/11/90)

In article <8662@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
 ....
   >SPARC seems to suck the big wazoo
   >on deep recursion, presumably because of its register-windowing
   >design.

   More to the point, *any* processor with register windows is not going to do
   too well on deeply recursive code.
 ....

Lest we forget, the SPARC ISA provides register windows, but their use
is up to sw conventions (viz. CALL does not result in a window shift).
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) (11/11/90)

In article <334@mentat.COM>, blc@mentat.COM (Bruce Carneal) writes:
|> Hopefully the newer compilers
|> for the SPARC do/will handle tail calls without stack
|> growth.

It is my understanding that the SunOS 4.x compiler does
detect and eliminate tail recursion.  This is verified by
compiling (-O4 -S)

ENUM(i) { if (i==0) return 1; else return ENUM(i-1); }

and seeing

_ENUM:
L77001:
        tst     %o0
        bne     L77003
        nop
        retl
        add     %g0,1,%o0
L77003:
        b       L77001
        dec     %o0


It seems odd that it would generate
a branch to another branch.  Can anyone
who knows comment?



			Consume
Scott Draves		Be Silent
spot@cs.cmu.edu		Die

Bruce.Hoult@actrix.co.nz (Bruce Hoult) (11/12/90)

>More to the point, *any* processor with register windows is not going
>to do too well on deeply recursive code.
 
I don't know much about this, but it seems to me that a machine without
register windows is going to have to put those values on the stack
in any case.  A machine with lots of registers that have to be stored
and loaded in deep recursion should at least be able to take advantage
of any burst mode that the memory subsystem supports, thus giving an
advantage over normal architectures.

firth@sei.cmu.edu (Robert Firth) (11/13/90)

In article <1990Nov11.102907.7706@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>It is my understanding that the SunOS 4.x compiler does
>detect and eliminate tail recursion.  This is verified by
>compiling (-O4 -S)
>
>ENUM(i) { if (i==0) return 1; else return ENUM(i-1); }
>
>and seeing
>
>_ENUM:
>L77001:
>        tst     %o0
>        bne     L77003
>        nop
>        retl
>        add     %g0,1,%o0
>L77003:
>        b       L77001
>        dec     %o0
>
>
>It seems odd that it would generate
>a branch to another branch.  can anyone
>who knows comment?

As a preliminary note: the program mentioned by the original poster
does not lend itself to tail-recursion elimination, so this won't
work for him.  Another option, since the code is fairly short, would
be to hack an assembler version that didn't use the register windows
(as was rightly pointed out, they are optional, not an ineradicable
part of the call protocol).  I actually tried that on Ackermann's
Function (what else?) and got a factor of 7 speedup.
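For anyone who wants to reproduce that experiment, here is Ackermann's
function as it is usually coded for this kind of benchmark (my
transcription, not the actual test source):

```c
/* Ackermann's function: tiny, integer-only, and brutally deep --
   ack(3, n) nests calls exponentially in n, which is exactly what
   stresses register-window spill/fill traffic so hard. */
unsigned long ack(unsigned long m, unsigned long n)
{
    if (m == 0)
        return n + 1;
    if (n == 0)
        return ack(m - 1, 1);
    return ack(m - 1, ack(m, n - 1));   /* doubly recursive case */
}
```

Timing something like ack(3, 6) on the compiled version versus a hand-coded
windowless version makes the cost of the window traffic directly visible.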

Turning to the above code - yes there does seem to be a defect in
the low-level optimiser here.  There are problems in eliding branch
chains on machines with delay slots, but that doesn't apply above:
the delay slot of the BNE is empty, so one could close the chain and
move the decrement of o0 to the label destination.

However, one can do better.  Since register o0 is dead on the
fall-through (it is destroyed by the add), one can float up both
the instructions at L77003:

	bne L77001
	dec %o0

thus closing the chain and filling the delay slot.  This reduces
the five-instruction loop to three instructions, for about a 67% speed up.

henry@zoo.toronto.edu (Henry Spencer) (11/14/90)

In article <8662@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
>>SPARC seems to suck the big wazoo
>>on deep recursion, presumably because of its register-windowing
>>design.
>
>More to the point, *any* processor with register windows is not going to do
>too well on deeply recursive code.

Actually, the AMD 29k ought to do all right.  The problem is not register
windows vs. recursion, it is *fixed-size* register windows vs. recursion.
SPARC ends up saving and restoring a lot of registers that aren't actually
in use, because the allocation quantum is 16 registers.  On the 29k the
quantum is 2 registers (1 if you don't care about being floating-point
compatible with the 29050) and it should almost never be necessary to
save and restore unused registers.
-- 
"I don't *want* to be normal!"         | Henry Spencer at U of Toronto Zoology
"Not to worry."                        |  henry@zoo.toronto.edu   utzoo!henry

preston@ariel.rice.edu (Preston Briggs) (11/14/90)

In article <1990Nov12.054723.18015@actrix.co.nz> Bruce.Hoult@actrix.co.nz (Bruce Hoult) writes:
>>More to the point, *any* processor with register windows is not going
>>to do too well on deeply recursive code.

>I don't know much about this, but it seems to me that a machine without
>register windows is going to have to put those values on the stack

The point is _fixed-size_ register windows.
A window-less machine might have to save 3 registers per call.
A fixed-size window machine might save 16 or 32 registers per call.
Even with burst-mode saves, the volume will overwhelm the cache,
and perhaps virtual memory.

Preston

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/14/90)

In article <1990Nov13.200051.12329@zoo.toronto.edu>
	henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <8662@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
>>More to the point, *any* processor with register windows is not going to do
>>too well on deeply recursive code.
>The problem is not register
>windows vs. recursion, it is *fixed-size* register windows vs. recursion.
>SPARC ends up saving and restoring a lot of registers that aren't actually
>in use, because the allocation quantum is 16 registers.

That's a reasonable argument, but not the only one.

First, compilers for non-window machines try to dribble values out to
memory, avoiding bursts that could cause processor stalls. (The
infamous VAX "calls" instruction writes a burst, and that seriously
stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
burst.

Second, it wasn't clear from the original posting that the SPARC was
actually running slower than some other machine - the poster just
said it was slow. Maybe he had a big computation? Did he try the
-O switch?

Third, programs that recurse can often be written as iterations.
All machines, SPARC included, run the iterative form faster.

I'm unconvinced that register windows are a good idea, but I'm also
unconvinced that SPARC is some sort of disaster area. 


-- 
Don		D.C.Lindsay

blc@mentat.COM (Bruce Carneal) (11/14/90)

In article <1990Nov11.102907.7706@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>It is my understanding that the SunOS 4.x compiler does
>detect and eliminate tail recursion.  This is verified by
>compiling (-O4 -S)
> (example of intra-procedural tail recursion deleted)

Unfortunately this doesn't work in the general case.
Compiling under SunOS 4.1 with 'cc -O4 -S ...' the trivial
program fragment:

	func1(arg) int arg;  { return func2(arg); }
	func2(arg) int arg;  { return func3(arg); }
	func3(arg) int arg;  { return arg; }

you get (with some junk removed):

	_func1:
		save	%sp,-96,%sp
		call	_func2,1
		mov	%i0,%o0
		ret
		restore	%g0,%o0,%o0
	_func2:
		save	%sp,-96,%sp
		call	_func3,1
		mov	%i0,%o0
		ret
		restore	%g0,%o0,%o0
	_func3:
		retl
		nop

As shown, no tail call optimization occurs.
This form of call is used along the main paths
of some Streams modules (and elsewhere of course).
Later compilers can/do optimize those calls.

jcallen@Encore.COM (Jerry Callen) (11/14/90)

In article <11092@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>First, compilers for non-window machines try to dribble values out to
>memory, avoiding bursts that could cause processor stalls. (The
>infamous VAX "calls" instruction writes a burst, and that seriously
>stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
>burst.

I'm curious what's wrong with a burst write of registers out to memory?
It seem to me that an intelligent memory system (maybe with a hint
from the processor that a burst is coming) could accept one word per
cycle and get those registers out to memory very fast. If the processor
is "stalled" while the registers go out, how is this different from
procedure prologue code in a RISC that does a bunch of stores in a row?
In fact, the RISC has to fetch the instructions, so the RISC should take
MORE time to save the registers.

I'll take a stab at answering my own question (I like talking to myself :-)).
Conceivably the RISC could NOT do the saves as a block in the prologue, but
(as Donald says) dribble them out as the register is needed. One complication 
here is that the epilogue has to have some way of knowing which registers
actually have to be restored (in the case of multiple returns, or, for
Ada, exceptions).

-- Jerry Callen
   jcallen@encore.com

henry@zoo.toronto.edu (Henry Spencer) (11/15/90)

In article <11092@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>First, compilers for non-window machines try to dribble values out to
>memory, avoiding bursts that could cause processor stalls. (The
>infamous VAX "calls" instruction writes a burst, and that seriously
>stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
>burst.

The tradeoffs here are subtle.  Many memory systems have much higher
bandwidth for sequential bursts than for random accesses, although your
bus and processor have to be designed to exploit this.  There is also
the question of stalls waiting for data to be *read* from memory, while
the window machine can charge ahead because the data is probably in a
register instead.

>I'm unconvinced that register windows are a good idea...

The MIPSCo people, who measure almost everything, say that for ordinary
C code it doesn't matter much either way, *if* you're willing to assume
sophisticated compilers.  (Of course, they are perhaps not entirely
unbiased... :-))  They also say that for languages with highly dynamic
calling patterns (e.g. C++ and Smalltalk), where the compiler has less
of a handle on what's happening, register windows would be a modest win.
They're definitely a big win if your compiler isn't too bright.

>but I'm also
>unconvinced that SPARC is some sort of disaster area. 

I wouldn't call it a disaster area exactly, at least not for this reason :-),
but rapid large excursions in stack depth are definitely expensive on it.
-- 
"I don't *want* to be normal!"         | Henry Spencer at U of Toronto Zoology
"Not to worry."                        |  henry@zoo.toronto.edu   utzoo!henry

brandis@inf.ethz.ch (Marc Brandis) (11/15/90)

In article <1990Nov12.054723.18015@actrix.co.nz> Bruce.Hoult@actrix.co.nz (Bruce Hoult) writes:
>>More to the point, *any* processor with register windows is not going
>>to do too well on deeply recursive code.
> 
>I don't know much about this, but it seems to me that a machine without
>register windows is going to have to put those values on the stack
>in any case.  A machine with lots of registers that have to be stored

When you look at typical recursive procedures, you will notice that they
very often have only a small number of local variables, which can be held
in registers. In an architecture with register windows, a trap occurs on
window underflow or overflow and causes the register windows to be dumped
to memory. Most of the time, too much is stored and restored.

In an architecture without register windows, a smart compiler can remove
much of this overhead. 

There are also quite a lot of cases where recursive procedures use variables
in one path through the procedure but not in another. In such cases it may
be better to keep those variables in memory anyway, so that nothing would
have to be stored on a call at all.


(* I speak only for myself.
   Marc-Michael Brandis, Institut fuer Computersysteme, ETH-Zentrum
   CH-8092 Zurich
   email: brandis@inf.ethz.ch brandis@iis.ethz.ch
*)

berg@cip-s04.informatik.rwth-aachen.de (AKA Solitair) (11/16/90)

Norwitz Neal writes:
>I have a recursive algorithm that I would like to run as fast as possible.
>I am running a part of it on a SPARCstation SLC.  But at this rate, it might
>finnish in a couple of years!!

Why don't you have a look at the FORTH chip (I don't remember the manufacturer;
the part number should be something like 4016 if I recall correctly)?  The
processor has no general-purpose registers, just a stack cache; the thing
was designed to execute *native* FORTH (a medium- to high-level threaded,
stack-based language) code.  Last I heard about it, they were designing a
new one.  This processor was practically designed for recursive algorithms.
--
Sincerely,                 berg%cip-s01.informatik.rwth-aachen.de@unido.bitnet
           Stephen R. van den Berg.
"I code it in 5 min, optimize it in 90 min, because it's so well optimized:
it runs in only 5 min.  Actually, most of the time I optimize programs."

shri@ncst.ernet.in (H.Shrikumar) (11/19/90)

In article <3700@rwthinf.UUCP> 

>Norwitz Neal writes:
>>I have a recursive algorithm that I would like to run as fast as possible.
>>I am running a part of it on a SPARCstation SLC.  But at this rate, it might
>>finnish in a couple of years!!

and berg%cip-s01.informatik.rwth-aachen.de@unido.bitnet writes:

>Why don't you have a look at the FORTH-chip (don't remember the manufacturer,
>the part number should be something like 4016 if I recall correctly), the
>processor has no general purpose registers, just a stack cache, the thing

   The start-up that thought up the 4016 had start-up problems, and
sold the technology to Harris. Harris cleaned up a lot of the hardware
bugs in the Novix 4016, christened it the RTX 2000 (*R*eal *T*ime e*X*press),
played with it for a little while, and decided to discontinue the line.

   Sad. See other posts in this newsgroup and comp.lang.forth for more
details.

   But your point on Stack Caching machines and recursive routines is
well taken.

   Just to comment, I think the original poster was talking about
mature machines, and the Novix/RTX was as yet no more than a chip and
an architecture concept. (25ns RAM, three banks of it, no OS yet ...)

-- shrikumar ( shri@ncst.in )

martin@ee.uni-sb.de (Martin Huber) (12/13/90)

I forward this for friend.   *Don't* reply to me, reply to
the address in the Reply-To header.

Here goes:

In article <1093@shakti.ncst.ernet.in> you write:
>In article <3700@rwthinf.UUCP> 
>
>>Norwitz Neal writes:
>>>I have a recursive algorithm that I would like to run as fast as possible.
>>>I am running a part of it on a SPARCstation SLC.  But at this rate, it might
>>>finnish in a couple of years!!
>
>and berg%cip-s01.informatik.rwth-aachen.de@unido.bitnet writes:
>
>>Why don't you have a look at the FORTH-chip (don't remember the manufacturer,
>>the part number should be something like 4016 if I recall correctly), the
>>processor has no general purpose registers, just a stack cache, the thing
>
>   The start-up that thought up the 4016 had start-up problems, and
>sold the technology to Harris. Harris cleaned up a lot of the hardware
>bugs in the Novix 4016, christened it the RTX 2000 (*R*eal *T*ime e*X*press),
>played with it for a little while, and decided to discontinue the line.
>
>   Sad. See other posts in this newgroup and comp.lan.forth for more
>details.
>
>   But your point on Stack Caching machines and recursive routines is
>well taken.
>
>   Just to comment, I think the original poster was talking about
>mature machines, and the Novix/RTX was yet no more than a chip and
>an architecture concept. (25ns RAM, three bank of it, no OS yet ...)
>
>-- shrikumar ( shri@ncst.in )

Another comment: I was at the November '90 trade show ELECTRONICA
in Munich, Germany, where I visited HARRIS. They were perfectly willing
to sell the RTX2000. I got a databook, info ... There was a Mr. Klaus
Flesch, who is apparently something like general manager of the company
FORTH SYSTEMS, Germany. He *sells* RTX2000-based products. IMHO, the
RTX2000 is available. It *is* mature: 10 MFIPS (FORTH micro-instructions
per second) is definitely state of the art among RISCs. The hardware is
micro-codable. I think this is better than many of today's machines.
Prices should be in the $500-$5000 range (please excuse my bad main
memory, it is *not* standard memory); in any case, they *are* cheaper
than a big useless number cruncher.

Take it or leave it, it's up to you!

Martin (mahu@ee.uni-sb.de)
Disclaimer: Mostly I wait for technology to improve, but sometimes
            improvements go by unnoticed. Moral: Watch out!