kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/28/91)
In article <1991Feb27.183718.638@jetsun.weitek.COM> weaver@jetsun.WEITEK.COM (Michael Weaver) writes:
>Does anyone know of a pipelined divider that gives a result every cycle?

I've read of at least one real product. No ref -- memory fault.

>As far as I know, they exist only on paper, and for good reason:
>every algorithm I have heard of for an 'array' divider starts with
>an iterative algorithm, and simply repeats the hardware for each
>iteration. This means that the amount of hardware is increased by

I thought they started with a good table lookup. I seem to remember that
it's been shown that a good initial approximation is more important than
the actual iterative scheme.

>the same factor as the increase in throughput, and the latency
>remains the same, which is not very attractive. Also, the actual
>number of transistors is horrendous -- it would take perhaps
>ten times as many transistors as an array multiplier, which
>is a large thing.

I don't know about the order of magnitude difference, but I'll agree I
wouldn't usually build such things on the basis of economics.

It always seemed like nature has it in for (integer) division -- from a
stats perspective many of the answers will be 0, yet we have to wait much
longer in the worst case (i.e. O(n^k) or O(n log n) as opposed to O(1)).
If the numerator & denominator are taken from uniform distributions
(which is pretty much the case in modular arithmetic & a few other
realms), half the time num < denom, and that case is obviously reasonably
quick. A reconfiguring pipe, anyone?

-kym
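[To put a number on that skew: for uniform operands the quotient is 0
about half the time, and 0 or 1 about three-quarters of the time. The
throwaway C sketch below estimates the distribution by Monte Carlo;
rand() is just a stand-in for a real operand stream, and the 15-bit
operand range is arbitrary (it keeps us within RAND_MAX on any libc).]

--- sketch ---
#include <stdio.h>
#include <stdlib.h>

/* Estimate the quotient distribution for uniform 15-bit operands:
   roughly half the quotients come out 0 (num < denom). */
int main(void)
{
	long counts[4] = {0};		/* quotients 0, 1, 2, and >2 */
	long i, trials = 1000000;

	srand(1);
	for (i = 0; i < trials; i++) {
		unsigned num   = rand() & 0x7fff;
		unsigned denom = (rand() & 0x7fff) + 1;	/* avoid /0 */
		unsigned q = num / denom;
		counts[q > 2 ? 3 : q]++;
	}
	printf("q=0: %4.1f%%  q=1: %4.1f%%  q=2: %4.1f%%  q>2: %4.1f%%\n",
	       100.0*counts[0]/trials, 100.0*counts[1]/trials,
	       100.0*counts[2]/trials, 100.0*counts[3]/trials);
	return 0;
}
--- end sketch ---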
baum@Apple.COM (Allen J. Baum) (03/01/91)
[]
In article <1991Feb28.010339.20055@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <1991Feb27.183718.638@jetsun.weitek.COM> weaver@jetsun.WEITEK.COM (Michael Weaver) writes:
>>Does anyone know of a pipelined divider that gives a result every cycle?
>
>I've read of at least one real product. No ref -- memory fault.

I believe that HP had a chip that was a purely combinatorial divider. I
don't remember what system it was shipped with, though, and I don't think
it was IEEE conforming.

--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/01/91)
In article <6968@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[there are pipes that divide in 1 cycle]
>Of the machines I know, this is not the case. The CYBER 205, which
>pipelines just about everything, does not pipeline divide, square root
>(the same unit), or convert between decimal and binary, which I believe
>also uses that unit. The CRAY does not provide a division at all,
>presumably because it could not be pipelined. It does have a pipelined
>approximation to reciprocal, and a pipelined correction factor.

I know the Cray-2 has problems with chaining (:-), but on the X-MP wasn't
it possible to chain recip & multiply?

-kym
lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (03/07/91)
In article <1991Mar1.033353.28445@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <6968@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>[there are pipes that divide in 1 cycle]
>>Of the machines I know, this is not the case. The CYBER 205, which
>>pipelines just about everything, does not pipeline divide, square root
>>(the same unit),
>>also uses that unit. The CRAY does not provide a division at all, presumably
>>because it could not be pipelined.
>... but on the X-MP
>wasn't it possible to chain recip & multiply?
>
>-kym

Several points here:

Many of the postings have confused result-within-1-cycle with
1-result-per-cycle (or some small number of cycles per result, due to
pipelining). I don't know of any machine which produces the result within
1 cycle. However, there are a number of machines which have pipelined
divides.

The Cyber 205 *scalar* divide unit was not pipelined. The vector unit
*was* pipelined. In the basic 2-pipe machine, it was 1 result per 3.57
cycles (that is not a typo -- I don't know why it is not an integral
number of cycles), with a faster (roughly twice as fast) pipeline
available (more hardware had to be added) on the four-pipe version.
Divide and square root were both provided. The Cray provides a similar
level of performance in a different way: by explicitly pipelining the
reciprocal approximation unit with add/multiply to get the final result.

A question that is sometimes asked by vendors is: how important is
divide? Because divide is slow, many codes avoid using it whenever
possible. Because they do, dynamic instruction counts show divide as
seldom used, etc. -- a familiar situation to readers of this group. I
don't know of any effort to determine the "natural" frequency of divides,
though codes written for vector machines with pipelined divides are
probably a lot closer to the truth than old codes written on machines
that took 60-100 cycles to produce one non-pipelined result.

Hugh LaMaster, M/S 233-9,     UUCP: ames!lamaster
NASA Ames Research Center     Internet: lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035       With Good Mailer: lamaster@george.arc.nasa.gov
Phone: 415/604-6117           #include <std.disclaimer>
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/07/91)
In article <1991Mar6.183927.16256@news.arc.nasa.gov> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>Many of the postings have confused result-within-1-cycle with
>1-result-per-cycle (or some small number, due to pipelining e.g.).

I don't think we are confused, just indulging (certainly I'm guilty) in
the kind of sloppy `shop talk' mathematicians accuse physicists of all
the time. :-)

>A question that is sometimes asked by vendors, is, how important is divide?
>Because divide is slow, many codes avoid using it whenever possible.
>Because they do, dynamic instruction counts show divide as seldom used. etc.
>A familiar situation to readers of this group. I don't know of
>any effort to determine the "natural" frequency of divides, though codes
>written for vector machines with pipelined divides are probably a lot
>closer to the truth than old codes written on machines that took 60-100
>cycles to produce one, non-pipelined result.

This is of course a nontrivial question with interesting ramifications
from a `philosophy of engineering' point of view. An engineer tries to
get a job done as economically as possible (in the ideal case, anyway).
One technique used is to follow the maxim ``what must be done must be
done well''. But how do you know what _needs_ to be done? Obviously the
RISC community and others use the approximation -- what is done often is
more important than what is done less. However, this brings up the
question -- is what _is_ done the best way that things _can_ be done? I
think the obvious answer is -- we can't be _sure_ (but the pocketbook has
gotta count for something :-).

As a bumbling attempt to figure out what the `natural frequency' of
divides might be, we have to consider quite a few things. The algebraic
properties of division -- as representing either an inverse over the
field of reals (:-) or, more simplistically, repeated subtraction --
show that it should occur less often than, say, addition/subtraction
(i.e. it can be made to distribute over these operations). So why not
take addition as the `model' of a low-cost operation that may represent
at least an upper bound on the natural frequency of division (or any
other arithmetic operation for that matter)?

Simple measurements show that addition can account for 20% or so
(actually less) of dynamic opcodes, whereas division normally occurs in
the 1% to 3% range. Let's presume all operations other than divide take
about the same work. If we say division occurs at a frequency alpha and
takes time D, then the proportion of time taken by all divisions is
obviously:

	              1
	-------------------------
	1 + (1-alpha)/(alpha D)

If we plug in .2 for alpha and make a little table, we find, for example
on a Magnum -- where (on average, for uniform random operands) a divide
takes about 4 adds -- that about 55% of the runtime _might_ be accounted
for by division, with the above (obviously very loose) assumptions. (See
the sketch at the end of this article.)

I also feel some rough analysis, akin (in accuracy) to the ``editor's
formula'', might be jury-rigged to provide other measures of natural
operator frequencies. Obviously Herman Rubin & others would be far more
qualified than I to do such an analysis, but I shall attempt something
very simple.

Let's say 3% of operations are divides, given that they are thought to be
very expensive. Of the remaining 97% of instructions _some_ of them also
represent divisions, although in disguise.
Let's presume the work of an average divide is D and, further, that at
most D instructions will `substitute' for a divide. If we start with N
instructions we will therefore have

	d/N = 3%
	other divides >= 3% of (N-d)/D

where d = number of divides. The lower bound on the `natural frequency'
of divisions would then be

	3% + other divides/N >= 3% + 3% (N-d)/(N D)
	                      = 3% + (1/4) 3% - (1/4) 3% 3%	[with D = 4]
	                      =about 3.7%

Thus the natural frequency of division is found to be only a little more
(as a lower bound) than actually measured -- somewhat contradicting the
previous discussion. But this may not be too far wrong -- the more
expensive division _is_, the fewer will be the divides that can result
(given division becomes really cheap) from substituting it back for these
other instructions.

I would be interested in seeing any other discussion (although less
interested in poking fun at my lame arguments above :-|).

-kym
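[The arithmetic above is easy to reproduce. A minimal C sketch,
evaluating the runtime fraction 1/(1 + (1-alpha)/(alpha D)) for a few
assumed values of alpha and D, plus the 3%-plus-hidden-divides lower
bound; the alpha = .2 and D = 4 figures are the assumptions from the
article, not measurements.]

--- sketch ---
#include <stdio.h>

int main(void)
{
	static const double alphas[] = { 0.03, 0.10, 0.20 };
	static const double costs[]  = { 2, 4, 8, 16 };
	int i, j;

	/* fraction of runtime spent dividing, given divide frequency
	   alpha and divide cost D (in units of one add) */
	for (i = 0; i < 3; i++)
		for (j = 0; j < 4; j++) {
			double a = alphas[i], D = costs[j];
			double frac = 1.0 / (1.0 + (1.0 - a) / (a * D));
			printf("alpha=%.2f D=%2.0f -> %5.1f%% of runtime\n",
			       a, D, 100.0 * frac);
		}

	/* lower bound on the `natural' divide frequency: 3% measured,
	   plus hidden divides at 3% of the remaining 97%, each worth
	   1/D = 1/4 of a divide */
	printf("natural freq lower bound = %.2f%%\n",
	       100.0 * (0.03 + 0.03 * 0.97 / 4.0));
	return 0;
}
--- end sketch ---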
mash@mips.com (John Mashey) (03/07/91)
In article <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>A question that is sometimes asked by vendors, is, how important is divide?
>>Because divide is slow, many codes avoid using it whenever possible.
>>Because they do, dynamic instruction counts show divide as seldom used. etc.
>>A familiar situation to readers of this group. I don't know of
>>any effort to determine the "natural" frequency of divides, though codes
....
>done as economically as possible (in the ideal case, anyway). One technique
>used is to follow the maxim ``what must be done must be done well''. But how
>do you know what _needs_ to be done? Obviously the RISC community and others
>use the approximation -- what is done often is more important than what is
>done less. However, this brings up the question -- is what _is_ done the best
>way that things _can_ be done? I think the obvious answer is -- we can't be
>_sure_ (but the pocketbook has gotta count for something :-).
...
>fewer will be the divides that can result (given division becomes really
>cheap) by substituting it for these other instructions.

(As one reference to the following, see Hennessy & Patterson's appendix
on computer arithmetic. Note: cheap = 1, fairly cheap = 4, moderate-cost
= 12, expensive = 32 or more, very expensive can be hundreds.)

You can make floating point go faster and faster by throwing more and
more hardware at it. Current and/or forthcoming chips illustrate the fact
that you can profitably spend 600K transistors or more on an FPU all by
itself.

+ & -	You can make integer addition cheap by throwing more hardware at
	it; the nature of these is that you must put enough hardware in
	to make integer add be as fast as it needs to be.

*	You can make integer multiplication moderate-cost by throwing
	some hardware at it, or fairly cheap by using more. It is a
	useful analysis to consider the tradeoffs here of best use of
	space.

/	You can make integer division very expensive by throwing zero
	hardware at it (SPARC), somewhat expensive by throwing a little
	bit of hardware at it, or somewhat less expensive by throwing
	more hardware at it (R3000, 88K, etc).

No matter what you do with CPU logic, you are NOT likely to make integer
division "very cheap" or even "somewhat cheap"; once you have a
reasonable implementation, it's HARD to speed it up by allocating it more
and more space.... (Of course, if you could afford the
appropriately-sized lookup table, you could do it by table lookup, but
that is a LARGE table not likely to be placed onto a CPU. :-)
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
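[For a feel of just how LARGE: a full n-bit by n-bit quotient table has
2^(2n) entries of n bits each, before any of the compression tricks
discussed in the follow-up. A back-of-envelope sketch:]

--- sketch ---
#include <stdio.h>

/* Raw size of a full n/n-bit quotient lookup table:
   2^(2n) entries, n bits apiece. */
int main(void)
{
	int n;

	for (n = 4; n <= 16; n += 4) {
		double entries = (double)(1L << n) * (double)(1L << n);
		double bytes = entries * n / 8.0;
		printf("n=%2d: 2^%2d entries, %.3g bytes\n",
		       n, 2 * n, bytes);
	}
	return 0;
}
--- end sketch ---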
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/08/91)
In article <777@spim.mips.COM> mash@mips.com (John Mashey) writes:
>You can make integer division very expensive by throwing zero hardware at it
>(SPARC), somewhat expensive by throwing a little bit of hardware
>at it, or somewhat less expensive by throwing more hardware at it
>(R3000, 88K, etc).
>No matter what you do with CPU logic, you are NOT likely to make
>integer division "very cheap" or even "somewhat cheap";
>once you have a reasonable implementation, it's HARD to speed it
>up by allocating it more and more space....
>(Of course, if you could afford the appropriately-sized lookup table,
>you could do it by table lookup, but that is a LARGE table not likely
>to be placed onto a CPU. :-)

A table lookup, sigh... Maybe! What with all these sneaky compression
techniques around these days -- not that I've looked very closely, mind
you -- maybe you _could_ fit a whole division table in there. Besides the
actual bits needed to store all the results (which works out at about 1
bit per result due to divide's nice nature (e.g. more than half the
results are either 0 or 1)), we need to also handle all those address
lines. Maybe some kind of `perfect hashing' scheme? (I remember Knuth
remarked somewhere that perfect hashing was going to be very difficult --
and then some smartys came _up_ with some fairly good approximations to
it in several domains.) I am not sure what has/can be done with a table
and (very) simple hardware to do interpolation.

And then again -- maybe you can just live with a (relatively) fast
reciprocal as in the program below. (Of course this _still_ isn't cheap
in terms of gates.) Although the algorithm appears to bristle with
multiplies (apart from the one we'll need to actually get a division
going), they aren't all general. One is simply a shift. (It turns out
playing around with exponents via C's standard library functions is not
worth the effort.) The other is a `small' multiply (although indicated as
a short x double in the program, it could be implemented in a relatively
small amount of dedicated hardware). The only remaining multiply is a
full one -- but it doesn't need to be performed in the majority of cases
(given uniform operands to recip). (Some extant hardware relegates
similar final `correction steps' to a separate function unit.)

Although it's similar to a one-step Newton-Raphson, it isn't. It's a
simpler scheme based on the expansion:

	(a+b)^-1 =  a^-1 (1 + a^-1 b)^-1
	   =about   a^-1 (1 - a^-1 b)		when b << a.

We ensure b << a by the appropriate FFO-like manipulations at the start
of the routine.

The software recip compares with the built-in (or not :-) versions on
several machines as follows:

	Machine		ratio of software/`builtin'
	VAX 6440	24%
	Mips Magnum	30%
	VAX 8530	30%
	Sun 3/60	55%
	Sun SS1		70% ?!

I shall not comment further.

In summary: it seems as though there are some mechanisms for obtaining
divide pipes that return a result per clock, but given the relatively low
proportion of overall dynamic opcodes that divide represents (<=3% as
pixie & I measure) -- and possibly not much more no matter HOW cheap
and/or fast it could be made -- further effort is probably wasted.

-kym

C code ahead -- BEWARE.
--------------------------------------------------------------------------------
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

#define NIT 100000
#define NTAB 1048

int mask[32];
double shift[32];
double rtab[NTAB];

/* Approximate reciprocal: table lookup on the top bits of x, a shift,
   and a `small' multiply correction when low bits are present.
   Leaves the result in `res' in the enclosing scope. */
#define recip(x) {\
	register y;\
	register n;\
	for(y=x,n=0; y>=NTAB; y>>=1,n++);\
	res = rtab[y];\
	if(n) {\
		register short b;\
		res *= shift[n];\
		b = x & mask[n];\
		if(b) res *= 1-res*b;\
	}}

double test1(){		/* software reciprocal */
	register i;
	register double ans=0;
	register double res;

	for(i=1;i<=NIT;i++){
		recip(i);
		ans+= res;
	}
	return ans;
}

double test2() {	/* built-in divide */
	register i;
	register double ans=0;

	for(i=1;i<=NIT;i++) ans+= 1.0/i;
	return ans;
}

main(){
	double t0,t1,t2;

	init();
	t0=clock();
	printf("%g\n", test1());
	t1=clock();
	printf("%g\n", test2());
	t2=clock();
	printf("%g\n",(t2-t1)/(t1-t0));	/* ratio of the two timings */
}

init(){
	int i;

	rtab[0]=1e38;
	for(i=1;i<NTAB;i++) rtab[i]=1.0/i;
	for(i=0;i<32;i++) mask[i]=(1<<i) - 1;
	for(i=0;i<32;i++) shift[i]=1.0/(1<<i);
}
--------------------------------------------------------------------------------
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/11/91)
My earlier mumblings about ``interpolation'' & ``table-lookups not being
quite as expensive as all that'' have resulted in a modest amount of
email asking for clarification. So after a bit of experimentation I am
posting yet _another_ reciprocal routine which attempts to illustrate
some of the ideas to which I alluded.

This present routine uses linear interpolation (after some initial
adjustment of the argument) to find the reciprocal of an unsigned 16-bit
number. It uses 512 words of table storage. It is based, again, on an
approximate expansion:

	1/(L a + b) =about 1/L 1/a (1 - 1/L 1/a b)
	            =      1/L (A[a] - B[a] (1/L b))

where A[a] = 1/a and B[a] = 1/a^2. It is convenient to choose L = 2^k
:-).

Over the argument range the absolute error in the result is small enough
to make any division accurate (i.e. the largest abs err * largest
numerator < 1 bit). Since the scheme only involves

	(a) a subtract,
	(b) two `small' shifts, and
	(c) a `small' multiply,

it is slightly :-) cheaper in terms of h/w than the scheme I posted
previously. It again could perform at the rate of one result per cycle
with a suitable tree multiplier and shifters.

The accuracy of the calculation can be improved somewhat if the
parameters A[a] and B[a] are displaced; I have tried a simple regression
to minimize the square error, and that seems satisfactory; but for the
numerical purists a minimax procedure might be better.

Since the abs error was originally largest for a power of two, I have
included a shift in the program to remove trailing 0's from the initial
argument; this would probably be reduced to a single-bit shift in a
hardware version, or eliminated entirely if the table entries A[a] and
B[a] were adjusted appropriately.

The program also includes a possible single Newton-Raphson correction
step; but it does not seem to be required for the purposes of
implementing integer division.

Although obviously a toy, expansion to your preference in wordsize is
straightforward (if expensive).

-kym

C code ahead.

--- program ---
#include <assert.h>
#include <math.h>
#include <stdio.h>

/*
	1/(La+b) = 1/L 1/a (1 + 1/L 1/a b)^-1
	   =about  1/L 1/a (1 - 1/L 1/a b)
	         = 1/L 1/a - 1/L^2 1/a^2 b
	         = 1/L A(a) - 1/L^2 B(a) b
	Good if L = 2^k.
*/

#define MAX 0xffff
#define NTAB (1<<8)

float atab[NTAB],btab[NTAB];
int mask[10];
float shift[16];

float recip0(x)
{
	int a;
	unsigned short b;
	unsigned char m=0;
	unsigned char n;

	/* trim trailing 0's from initial argument */
	for(a=x; (a&1)==0; a>>=1, m++);

	/* top `nbits' of x -> a ; rest -> b */
	for(n=0; a>=NTAB; a>>=1, n++);
	assert(n<=10);
	b=x & mask[n];
	m += n;
	assert(m<=15);

	return shift[m]*(atab[a] - shift[n]*btab[a]*b);
}

#ifdef NEWTON
float recip(x)
{
	/* perform single Newton-Raphson correction step */
	float r = recip0(x);

	r *= 2 - r*x;
	return r;
}
#else
/* perform no Newton-Raphson correction step */
#define recip(x) recip0(x)
#endif

main()
{
	int x;
	float mine,std;
	float abs_err,rel_err;
	float prod_rel_err=1,sum_abs_err=0;
	float max_abs_err=0,min_abs_err=1e38;
	int min_abs_err_x= -1,max_abs_err_x= -1;
	float max_rel_err=0,min_rel_err=1e38;
	int min_rel_err_x= -1,max_rel_err_x= -1;

	init();
	for(x=1; x<=MAX; x++) {
		std=1.0/x;
		mine=recip(x);
		assert(mine>=0);	/* just in case... */
		abs_err=fabs(mine-std);
		sum_abs_err+=abs_err;
		rel_err=abs_err/std;
		prod_rel_err *=rel_err;
		if(abs_err>max_abs_err)
			max_abs_err=abs_err,max_abs_err_x=x;
		else if(abs_err<min_abs_err)
			min_abs_err=abs_err,min_abs_err_x=x;
		if(rel_err>max_rel_err)
			max_rel_err=rel_err,max_rel_err_x=x;
		else if(rel_err<min_rel_err)
			min_rel_err=rel_err,min_rel_err_x=x;
	}
	printf("av abs err=%g ",sum_abs_err/MAX);
	printf("min abs err =%g ",min_abs_err);
	printf("(%d) ",min_abs_err_x);
	printf("max abs err =%g ",max_abs_err);
	printf("(%d) ",max_abs_err_x);
	putchar('\n');
	printf("av rel err=%g ",pow(prod_rel_err,1.0/MAX));
	printf("min rel err =%g ",min_rel_err);
	printf("(%d) ",min_rel_err_x);
	printf("max rel err =%g ",max_rel_err);
	printf("(%d) ",max_rel_err_x);
	putchar('\n');
	exit(0);
}

init()
{
	int i;

	atab[0] = 0;
	btab[0] = 0;
	for(i=1; i<NTAB; i++) {
		atab[i] = 1.0/i;
		btab[i] = 1.0/((long)i*i);
	}
	for(i=0; i<10; i++) mask[i] = (1L<<i) - 1;
	for(i=0; i<16; i++) shift[i] = 1.0/(1L<<i);
}
--- make #1 ---
cc recip.c -lm
a.out
--- log file #1 ---
av abs err=5.93951e-08 min abs err =0 (1) max abs err =7.59971e-06 (514)
av rel err=0 min rel err =0 (1) max rel err =0.00732425 (33008)
--- make #2 ---
cc -DNEWTON recip.c -lm
--- log file #2 ---
av abs err=1.51684e-10 min abs err =0 (1) max abs err =2.96859e-08 (514)
av rel err=0 min rel err =0 (1) max rel err =5.36168e-05 (33008)
--- end ---
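[To close the loop on the ``largest abs err * largest numerator < 1
bit'' claim: with that bound, a trial quotient computed from the
approximate reciprocal is off by at most one, so a single
compare-and-adjust step gives an exact integer divide. A sketch in the
same K&R style, assuming recip() and init() from the program above:]

--- sketch ---
/* Exact 16-bit unsigned division via the approximate reciprocal.
   Since |x*recip(y) - x/y| < 1/2 over the argument range, the
   truncated trial quotient is within 1 of the true quotient. */
unsigned idiv(x, y)
	unsigned x, y;
{
	unsigned q = x * recip(y);	/* trial quotient */
	long r = (long)x - (long)q*y;	/* trial remainder */

	if (r < 0) q--;			/* trial was one too high */
	else if (r >= (long)y) q++;	/* trial was one too low */
	return q;
}
--- end sketch ---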
bob@tera.com (Bob Alverson) (03/12/91)
Although the Tera computer is not built yet, it will be able to divide by
an arbitrary integer *constant* (or loop invariant value, etc.), while
producing one result per cycle. Mind you, I can't imagine any reason
anyone would care. The latency is the same as for a "fused multiply add"
a la IBM's RS/6000 (because the divide is done with the same logic).

For the unlucky whose divisors aren't known to the compiler and aren't
loop invariant, the divide rate drops to one result every nine ticks. The
only significant hardware dedicated to divide is a 256-entry lookup table
and an 8x8 -> 16 multiplier for the initial reciprocal approximation.

Now, if I could just figure out how to do float divides at one per
tick...

Bob (bob@tera.com)
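[The constant-divisor trick is the same one compilers play in software:
replace x/d by a multiply-high and a shift, using a precomputed
``reciprocal'' of d. A sketch for d = 10 on 32-bit unsigned operands --
the magic constant is ceil(2^35/10); this is the textbook recipe, not a
description of Tera's actual logic:]

--- sketch ---
#include <assert.h>
#include <stdio.h>

/* q = floor(x * ceil(2^35/10) / 2^35) equals x/10 for all 32-bit x. */
unsigned div10(unsigned x)
{
	return (unsigned)(((unsigned long long)x * 3435973837u) >> 35);
}

int main(void)
{
	unsigned long x;

	/* spot-check the low values and the extreme */
	for (x = 0; x < 5000000; x++)
		assert(div10((unsigned)x) == (unsigned)x / 10);
	assert(div10(0xffffffffu) == 0xffffffffu / 10);
	printf("div10 OK\n");
	return 0;
}
--- end sketch ---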
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <1991Mar12.043839.11068@tera.com> bob@tera.com (Bob Alverson) writes:
| For the unlucky whose divisors aren't known to the compiler and aren't loop
| invariant, the divide rate drops to one result every nine ticks. The only
| significant hardware dedicated to divide is a 256 entry lookup table and
| an 8x8 -> 16 multiplier for the initial reciprocal approximation.

  I got these results from a 386-25:

# System id: Dell 325, 4MB, 150MB, Xenix/386 2.3.3, 387
#
#	Math operations, effective instructions/sec (thousands)
#
#             Add       Sub       Mpy       Div     Wtd HM
#  short:   7451.0    7378.6    3023.3    2933.3    4656.6
#  long:    7600.0    7368.4    2692.3    2000.0    4031.5
#  float:   1168.8    1168.8     975.6     933.3    1074.9
#  double:  1025.6    1012.7     750.0     789.5     899.2
#
#	Test and branch timing:
#  integer compare and branch    0.688 uSec,   1453.5K/sec
#  float compare and branch      4.320 uSec,    231.5K/sec

  The divide speed would indicate about 9 ticks for 16 bit, about 12.5
for 32 bit. The 486-25 looks like this:

# System id: HP 486-25, SCO ODT 1.0, 10MB, 300MB, cc
#
#	Math operations, effective instructions/sec (thousands)
#
#             Add       Sub       Mpy       Div     Wtd HM
#  short:  17934.1   18000.0    3483.9    2936.2    6400.4
#  long:   20400.0   19695.6    3225.8    2042.6    5528.7
#  float:   4258.1    4252.0    3829.8    1515.8    3276.1
#  double:  4129.0    4087.9    3260.9    1345.8    2992.5
#
#	Test and branch timing:
#  integer compare and branch    0.247 uSec,   4054.0K/sec
#  float compare and branch      0.850 uSec,   1176.5K/sec

  This would indicate that the 486 didn't get much help on divide, but
add and subtract, as well as compare and branch, got a big boost. My
overall results for a bunch of programs were that the 486 was about 2.6x
faster than the 386 at the same clock speed.

  Note: these figures are presented as ballpark figures, and represent
measured performance obtained using C rather than assembler. While they
are proportional to actual hardware performance, they are not best-case
performance. On the other hand, I started building this benchmark suite
in 1970... it measures individual performance aspects, looking for those
"jackpot cases" where performance is really bad.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op
     are in hardware for efficiency"
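[The ``about 9 ticks'' estimate falls straight out of the tables:
implied cycles per operation = clock rate / operation rate. A two-line
check, assuming the 25 MHz clocks implied by the machine names:]

--- sketch ---
#include <stdio.h>

/* Implied cycles per divide = clock (Hz) / measured divide rate.
   The Div-column rates are taken from the tables above
   (thousands of operations per second). */
int main(void)
{
	double clock_hz = 25e6;

	printf("386 short div: %4.1f cycles\n", clock_hz / 2933.3e3);
	printf("386 long  div: %4.1f cycles\n", clock_hz / 2000.0e3);
	printf("486 short div: %4.1f cycles\n", clock_hz / 2936.2e3);
	printf("486 long  div: %4.1f cycles\n", clock_hz / 2042.6e3);
	return 0;
}
--- end sketch ---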