[comp.arch] Intel 860 Architecture

hamrick@convex.COM (Ed Hamrick) (12/10/89)

A few questions about the Intel 860 architecture:

1) The 860 doesn't seem to have any divide instructions, either integer
   or floating point.  It seems to depend on a floating point reciprocal
   instruction followed by a floating point multiplication.

   - What other architectures use reciprocal -> multiply for divide?

   - What are the numerical accuracy tradeoffs?

   - How many cycles does the reciprocal instruction take? Can it be
     pipelined?

2) How deep is the pipeline for 64 bit adds / multiplies? 32 bit?

3) What happens to the pipeline if there are page faults / exceptions
   during dual operation mode?  Does the pipeline advance one step
   per clock cycle, or one step per floating instruction?

4) Is it possible to do pipelined FP loads with non-unit stride?

5) Is it possible to do pipelined scatter/gather operations?

6) The 860 doesn't seem to have integer multiplication instructions,
   and also doesn't seem to have any integer to floating conversion
   instructions.  What are the best ways to do efficient integer
   multiplication with the 860?  Does this have something to do with
   the fmlow instruction?

All in all, it looks like a well thought out chip, with a lot of clever
architectural trade-offs to get everything on one chip.

Regards,
Ed Hamrick

ccplumb@rose.waterloo.edu (Colin Plumb) (12/10/89)

In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
>2) How deep is the pipeline for 64 bit adds / multiplies? 32 bit?

It's 3 stages for most things, and 2 for d.p. multiplies.  However,
in the latter case each stage takes 2 cycles, so you only get one
result per 2 clocks.

>3) What happens to the pipeline if there are page faults / exceptions
>   during dual operation mode?  Does the pipeline advance one step
>   per clock cycle, or one step per floating instruction?

I don't quite understand the question.  The pipeline advances one
stage per floating-point instruction, not per clock.  The
instruction's dest field specifies where to put the result now
emerging from the pipe, not the result of the operation you're
currently starting.
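
A toy C model may make that clearer (pfadd_model, STAGES, and apipe
are illustrative names, not i860 mnemonics; assume a 3-stage add
pipe).  Issuing an operation pushes it into the pipe and hands back
the result falling out of the last stage:

    #define STAGES 3
    static double apipe[STAGES];           /* results in flight */

    double pfadd_model(double a, double b)
    {
        double out = apipe[STAGES - 1];    /* result finishing this issue */
        for (int i = STAGES - 1; i > 0; i--)
            apipe[i] = apipe[i - 1];       /* advance one stage per issue */
        apipe[0] = a + b;                  /* operation just started      */
        return out;                        /* goes to this insn's dest    */
    }

So the first STAGES results are garbage, and after the last real
issue you drain the final STAGES results with dummy issues.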

The i860's exception handling is seriously weird.  It saves just
barely enough information for an exception-handling routine to
figure out what went wrong and fix it.  No fast context switches
on this puppy!  And even then, there are code constructs you have
to avoid, like branching to the shadow of a delayed branch.  It
only saves one address, so the exception handler has to look back
one instruction to see where it should resume... ugh.

>4) Is it possible to do pipelined FP loads with non-unit stride?

Certainly.  The pipelined load business just makes the latency
visible to the programmer; you still supply one address per
load.  There is no auto-increment feature.  A pipelined load is
just a load that doesn't get satisfied until after you've issued
the next pipelined load; other than that it's normal.
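
Modeled in C (pfld_model, DEPTH, and lpipe are made-up names, and
the real depth may differ; see the next answer): each call supplies
one explicit address and returns the data for the address supplied
DEPTH calls earlier, which is all "visible latency" means.

    #define DEPTH 2
    static const double *lpipe[DEPTH];      /* addresses in flight */

    double pfld_model(const double *addr)
    {
        const double *done = lpipe[DEPTH - 1];  /* load completing now   */
        for (int i = DEPTH - 1; i > 0; i--)
            lpipe[i] = lpipe[i - 1];            /* advance the load pipe */
        lpipe[0] = addr;                        /* load just issued      */
        return done ? *done : 0.0;              /* bogus until primed    */
    }

A stride of k doubles is then just pfld_model(&a[i * k]) once per
element; the chip never sees the stride, only the addresses.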

>5) Is it possible to do pipelined scatter/gather operations?

Again, sure, if you want to write the software to compute the
scatter/gather addresses.  I believe the load pipeline is 2 deep (I
may have forgotten).  This means that for the first two pipelined
loads you issue, you supply addresses and bogus destination
registers.  On the third pipelined load, you supply the third
address and the destination for the first load (which has hopefully
completed by now).  There's nothing you couldn't do with aggressive
scoreboarding and ordinary loads, except that not having to supply a
destination register until the data is ready gives you another
register for those few clocks.
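
Concretely, reusing pfld_model from the sketch above (gather, dst,
src, and idx are again made-up names), a software gather pairs each
new address with the destination of the load now completing:

    void gather(double *dst, const double *src, const int *idx, int n)
    {
        for (int i = 0; i < n + DEPTH; i++) {
            /* past n, issue dummy addresses just to drain the pipe */
            const double *a = &src[i < n ? idx[i] : 0];
            double v = pfld_model(a);
            if (i >= DEPTH)
                dst[i - DEPTH] = v;  /* dest lags address by DEPTH issues */
        }
    }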

>6) The 860 doesn't seem to have integer multiplication instructions,
>   and also doesn't seem to have any integer to floating conversion
>   instructions.  What are the best ways to do efficient integer
>   multiplication with the 860?  Does this have something to do with
>   the fmlow instruction?

Ug... I'm forgetting.  I believe the fmlow instruction can do an integer
multiply, and I'm pretty sure there are int<->fp conversion instructions.
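
For what it's worth, here's why a low-product-bits operation (as
fmlow is believed above to be) would be all an integer multiply
needs: the two's-complement 32-bit product is exactly the low 32
bits of the full 64-bit product, for signed and unsigned operands
alike.  A sketch in plain C, not i860 code:

    #include <stdint.h>

    uint32_t mul32(uint32_t a, uint32_t b)
    {
        uint64_t full = (uint64_t)a * b;   /* full product from the array */
        return (uint32_t)full;             /* keep only the low half      */
    }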

>All in all, it looks like a well thought out chip, with a lot of clever
>architectural trade-offs to get everything on one chip.

To be honest, I wasn't too impressed when I saw it.  Lots of weird
non-orthogonalities, and I still think the interrupt handling is
a pig.  But I believe some of the design team reads comp.arch; let
them refute.

(Note that I believe an interrupt take/return should take about
twice as long as a function call/return.  The 29000 is still too
slow, but it shows how simple an interrupt-handling structure can
be.  I still wonder what the chip is doing for all those cycles.
Freeze status registers, set supervisor mode, clear the pipeline,
and start fetching from a new address.  A non-delayed jump with a
little bit of fiddling.)
-- 
	-Colin

brooks@maddog.llnl.gov (Eugene Brooks) (12/10/89)

In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
>   - What other architectures use reciprocal -> multiply for divide?
The machines made at CRI do this.
>   - What are the numerical accuracy tradeoffs?
As good as Newton-Raphson can be.
>   - How many cycles does the reciprocal instruction take? Can it be
>     pipelined?
Yes, it can be pipelined; on the Cray it is.
I don't remember about the 860.
brooks@maddog.llnl.gov, brooks@maddog.uucp

mark@mips.COM (Mark G. Johnson) (12/11/89)

 Two ">>" for hamrick@convex.COM (Ed Hamrick);
 One ">"  for ccplumb@rose.waterloo.edu (Colin Plumb);
>>All in all, it looks like a well thought out chip, with a lot of clever
>>architectural trade-offs to get everything on one chip.
>
>To be honest, I wasn't too impressed when I saw it.  Lots of wierd
>non-orthogonalities and I still think the interrupt handling is
>a pig.  But I believe some of the design team reads comp.arch; let
>them refute.

I'd suggest that the Solbourne/Mitsubishi _Million_Transistor_SPARC_
chip (having CPU, caches, and floating point on one die, very much
like the 860) is a lot better thought out, with lots more
architectural cleverness, including the idea that the computer needs
to run an operating system efficiently, and that user programs
written in high-level languages should run quickly.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

mark@mips.COM (Mark G. Johnson) (12/11/89)

In article <33258@hal.mips.COM> I made a blunder
 > ... the Solbourne/Mitsubishi _Million_Transistor_SPARC_ chip ...
                     ^^^^^^^^^^
oops, make that Matsushita.  Sorry.

afgg6490@uxa.cso.uiuc.edu (12/12/89)

..> Reciprocal approximation

Extracting from the comparative tables in my recently completed
survey of computer arithmetic (yeah, I know, I wish they were much more
complete too):

Cray machines use Newton-Raphson reciprocal approximation.

Reciprocal approximation was designed into the Advanced Astronautics
ZS-1, but was replaced in the final design by commodity chips.
IEEE format, but not IEEE accuracy.  This algorithm's special feature
was omitting mantissa bits that were fixed in value in the
intermediate stages, obtaining a 1/x approximately 20% more accurate
than Cray's algorithm.

Reciprocal approximation was used in the Gould NP1, with hexadecimal
floating point.  Accuracy problems (in large part caused by the
hex format) caused a real divide to be backpatched in.

The AT&T DSP32C uses NR with a seed instruction and SW iteration for
its special 40-bit FP format.

The Motorola DSP96002 also uses Newton-Raphson with a seed instruction.

(Cyrix, as mentioned in an earlier post, uses NR to compute an
approximate 1/X for use in its 17-bit digit-selection divide.)



COMMENTS
--------
    Use of N-R reciprocal algorithms was set back quite a bit by
IEEE FP's exactness requirements: a true remainder is needed for
correct rounding.  Also, in non-binary floating point (decimal or
hex), accuracy problems when division is implemented by reciprocal
multiplication can be extreme.  Note that NR reciprocals typically
converge from below.  Most other problems with reciprocal are caused
by laziness: implementations that do not bring the result out to the
last possible bit.
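
For the curious, a minimal C sketch of the iteration (the magic
constant is a crude exponent-flip seed, good to only a few bits;
real hardware seeds from a small table).  Since 1 - x*r' = (1 - x*r)^2,
the error is a square after the first step, so the estimate sits at
or below 1/x from then on; that is the convergence from below noted
above:

    #include <stdint.h>
    #include <string.h>

    float nr_recip(float x)               /* assumes normal x > 0 */
    {
        uint32_t u;
        memcpy(&u, &x, sizeof u);
        u = 0x7EF00000u - u;              /* crude seed, ~4 bits correct */
        float r;
        memcpy(&r, &u, sizeof r);
        r = r * (2.0f - x * r);           /* ~8 bits correct  */
        r = r * (2.0f - x * r);           /* ~16 bits correct */
        r = r * (2.0f - x * r);           /* ~24 bits correct */
        return r;                         /* not correctly rounded */
    }
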
    With a fast enough multiplier, however, the true remainder can be
computed with only a little extra work (for Y/X:  Q=Y*(1/X); R=Y-Q*X;
QQ = Q +/- delta depending on comparison of R and X).
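
In C, for y, x > 0 and a quotient already within one ulp, that
fix-up might look like the sketch below (div_fixup is a made-up
name; fma computes y - q*x with a single rounding, so the remainder
is effectively exact here, and the "comparison of R and X" becomes a
comparison against x scaled by one ulp of q; round-to-nearest would
compare against half that):

    #include <math.h>

    double div_fixup(double y, double x, double q)
    {
        double r   = fma(-q, x, y);           /* R = Y - Q*X          */
        double ulp = q - nextafter(q, 0.0);   /* one ulp of q         */
        if (r < 0.0)
            return nextafter(q, 0.0);         /* q too big: step down */
        if (r >= x * ulp)
            return nextafter(q, HUGE_VAL);    /* q too small: step up */
        return q;                             /* truncated quotient   */
    }
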
    I have already heard compiler guys say that reciprocal
instructions opened up possibilities for optimization that are hidden
by the asymmetry of divide.
    I see some possibility of using both an explicit reciprocal and a
remainder correction to advantage, as separable operations:
(1) a reciprocal instruction; (2) a remainder correction instruction
to achieve a true divide result.  (2) can be optimized away, or
combined when redundant.

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (12/14/89)

In article <112400013@uxa.cso.uiuc.edu> afgg6490@uxa.cso.uiuc.edu writes:
>
>    Use of N-R reciprocal algorithms was set back quite a bit by
>IEEE FP's exactness requirements: a true remainder is needed for
>correct rounding. 

The i860 supports some IEEE things in software. That is, the logic is
in trap handlers, rather than on the chip. 

This software is not yet available for divide. So, multiplies and
adds can underflow, but divides can't.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

jesup@cbmvax.commodore.com (Randell Jesup) (12/14/89)

In article <112400013@uxa.cso.uiuc.edu> afgg6490@uxa.cso.uiuc.edu writes:
>
>..> Reciprocal approximation

[list of machines that use NR approx to do divides...]

	You can add the RPM-40 (with its FPU) to the list.  A FAST ?56x56?
multiplier array made it fairly easy to do that way.  (I think it was
3 40MHz cycles per pass through the array.)

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"