[comp.arch] Vector processors, i860

hsong@nvuxl.UUCP (g hugh song) (02/02/91)

Hi.
Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
missing in this chip for building a UNIX machine out of this chip?   What I 
want to see is a cheap vector workstation that I really need.   How about
Moto's 68040 (or 88000) with a 68951(?) DSP chip?   Is it possible to build 
a vector machine with this config?   I am not interested in single precision
vector processors (if they exist).  One of my friends told me that Stardent
is working with i860 for their next generation machine.  Is this true?

The machine on my desk, a DEC 5000/200, has an i860 for its graphics.  Is there
any compiler vendor who is working on a compiler which takes advantage of
this i860 chip?   I know a company which makes a coprocessor board with
i860 for VME bus and, later this year, Turbo Channel (DS 5000/200).   Their
price is rather steep.   I might buy one more DS 5000/200 with the money.
OK.  The company name is CSPI.  The sales person said that it is going to
release a kind of a scheduler software which allows multitasking.  However,
I do not get the picture of how transparent it is for the users.  If anybody
in the net knows how it works or will work, please explain.

Thanks.

	-hsong-

henry@zoo.toronto.edu (Henry Spencer) (02/03/91)

In article <798@nvuxl.UUCP> hsong@nvuxl.UUCP (g hugh song) writes:
>Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
>missing in this chip for building a UNIX machine out of this chip? ...

An architecture whose potential performance can be exploited in high-level
languages without driving the compiler writers up the wall. :-(
-- 
"Maybe we should tell the truth?"      | Henry Spencer at U of Toronto Zoology
"Surely we aren't that desperate yet." |  henry@zoo.toronto.edu   utzoo!henry

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (02/03/91)

In article <798@nvuxl.UUCP>, hsong@nvuxl.UUCP (g hugh song) writes:

song> .... [asks about the i860 cpu]....
song> What I want to see is a cheap vector workstation....

Short answer:  
	Buy an IBM RISC System 6000.  The Model 320 has the best
	price/performance ratio in the server configuration.

Long Answer:
	There is a good bit of confusion about "vector" processing 
	out there, so I will be a bit tutorial here....  You should
	type 'n' now if you find this topic boring....

	The original "vector" processing machines (Star 100, Cyber
	205, Cray 1, etc) and their modern equivalents (Crays 2,X,Y;
	IBM 3090VF, Convex, Alliant) had an instruction set which 
	allowed a *single* instruction to specify an operation
	on a *vector* of data. 
	
	Note that this does not imply that the operations on the
	vector of data are done simultaneously.
	In fact, on all of these machines the calculations were
	largely serial, so the specification of many calculations
	with a single instruction does not provide any performance
	gain.  

	The real performance gain from these machines comes
	from "pipelining", whereby the different stages of the
	floating-point calculations are separated to enable an 
	"assembly-line" operation with a new result available each
	clock cycle even though each individual calculation requires
	many cycles in the pipeline.
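The assembly-line argument can be put into numbers. This is a minimal sketch in Python, assuming an idealized pipeline that never stalls; the function names and the 5-stage depth are illustrative and not taken from any of the machines named above.

```python
def pipelined_cycles(n_ops, stages):
    """Cycles to produce n_ops results from an ideal pipeline of the given depth.

    The first result appears after `stages` cycles; each later result
    follows one cycle behind the previous one.
    """
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)


def unpipelined_cycles(n_ops, stages):
    """Cycles when each operation must finish before the next one starts."""
    return n_ops * stages


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, unpipelined_cycles(n, 5), pipelined_cycles(n, 5))
```

For 1000 operations on a 5-stage unit the model gives 1004 cycles pipelined against 5000 unpipelined, which is the "one result per clock" behaviour described above.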

	Modern microprocessor floating-point units are pipelined, 
	but do not have vector instructions.  Operations on "vectors"
	are performed by code loops rather than by single instructions
	but the performance is the same since instruction fetching in
	loops is almost never a performance bottleneck.  Examples of
	modern microprocessors with fully pipelined floating-point
	units are the i860, the IBM RIOS, and (I believe) the Motorola
	88000.  None of these have vector instructions.

	So what is the difference between a "supercomputer" vector
	processor and a pipelined microprocessor?  The main difference
	which causes performance differences is the memory bandwidth.
	"Supercomputers" are generally not built around cached memory
	systems, but are instead designed to have very high speed 
	transfers between the main memory and the vector registers 
	used by the pipelined vector instructions.

	Microprocessor-based systems are all cache-oriented and none
	of the currently available systems have enough memory
	bandwidth to keep the pipelined arithmetic units busy if
	the data to be worked on does not fit in the cache.  The 
	job of writing code which re-uses data in the cache
	effectively and compiling that code into optimum machine
	instruction sequences is very difficult and severely limits
	the performance of current pipelined microprocessors.
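As one illustration of "code which re-uses data in the cache", here is a sketch of a blocked (tiled) matrix multiply next to the plain triple loop. It is written in Python only to show the loop structure; Python itself will not show the cache effect, and the block size of 2 is an arbitrary assumption. On a real machine the block would be chosen so that the working tiles fit in the data cache.

```python
def matmul_naive(a, b):
    """Plain triple loop: walks all of b once for every row of a."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, reordered so each small tile is reused before moving on."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                # Work entirely inside one block x block tile of a, b and c.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, p)):
                        s = c[i][j]
                        for k in range(k0, min(k0 + block, m)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c
```

Both routines perform the same multiplications and additions in the same order for each output element; only the order in which memory is visited changes.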

	The IBM RIOS and Intel 860 machines are good examples.
	Each is capable of a theoretical peak of 60 MFLOPS for
	30 MHz parts and each is able to attain around 50 MFLOPS
	on very carefully coded operations (typically matrix
	multiplication).  On the other hand, standard vectorizable
	Fortran codes typically run at about 4-6 MFLOPS on either
	of those architectures with existing compiler technology.
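The figures quoted above can be turned into ratios. This is simple arithmetic on the numbers in the post (60 MFLOPS peak at 30 MHz, about 50 hand-coded, 4-6 compiled); the helper names are made up for the example.

```python
def flops_per_cycle(peak_mflops, clock_mhz):
    """Floating-point operations the peak figure implies per clock cycle."""
    return peak_mflops / clock_mhz


def fraction_of_peak(achieved_mflops, peak_mflops):
    """Achieved rate as a fraction of the theoretical peak."""
    return achieved_mflops / peak_mflops


if __name__ == "__main__":
    print("flops/cycle at peak:", flops_per_cycle(60, 30))
    for label, rate in (("hand-coded", 50), ("compiled, low", 4), ("compiled, high", 6)):
        print(label, round(100 * fraction_of_peak(rate, 60), 1), "% of peak")
```

The peak works out to two floating-point operations per cycle, hand-coded kernels reach roughly 83% of it, and compiled Fortran lands at roughly 7-10%.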

	I recommend getting the IBM over an i860-based system 
	since it is clear now there will be a lot more RIOS
	machines out there than 860 machines, and therefore
	a lot more people working on compiler technology,
	systems integration, third-party software, etc....

song> One of my friends told me that Stardent is working with i860 for
song> their next generation machine.  Is this true?

The Stardent "Stiletto" has been announced, I think.  It uses a MIPS
R3000 for the main cpu and intel i860's for "vector" and graphics
co-processors.  I don't know about availability, but the performance
of the unit will certainly be limited by the considerations I outlined
above. 
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/03/91)

hsong@nvuxl.UUCP (g hugh song) wrote:
> Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
> missing in this chip for building a UNIX machine out of this chip?

Return from interrupt.  When the chip takes an exception, it sort of
drops all the bits in the pipeline on the floor and lets software
put the pieces back together.  The code to restart from an interrupt
is, I'm told, 10,000 lines of assembler.  And you still have to
avoid one or two code sequences.

It's a great hot box, but for taking interrupts, it's a pig.

(It's also, as Henry points out, a pain in the ass to program... nobody
has a compiler which can come close to hand coding yet.)
-- 
	-Colin

sef@kithrup.COM (Sean Eric Fagan) (02/03/91)

In article <1991Feb3.061217.21988@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>Return from interrupt.  When the chip takes an exception, it sort of
>drops all the bits in the pipeline on the floor and lets software
>put the pieces back together.  The code to restart from an interrupt
>is, I'm told, 10,000 lines of assembler.  And you still have to
>avoid one or two code sequences.

You're told wrong, or I am.

To do a context switch, you need to do something like this:

	st	f2, regs[f2]
	st	f3, regs[f3]
	st	f4, regs[f4]
	fadd.p	f0, f0, f2
	fadd.p	f0, f0, f3
	fadd.p	f0, f0, f4
	st	f2, fadd_pipeline[0]
	st	f3, fadd_pipeline[1]
	st	f4, fadd_pipeline[2]
	ld	f2, old_fadd_pipeline[0]
	ld	f3, old_fadd_pipeline[1]
	ld	f4, old_fadd_pipeline[2]
	fadd.p	f0, f2, f0
	fadd.p	f0, f3, f0
	fadd.p	f0, f4, f0

Etc.  Note that  a) you need to do that for every pipelined unit on the
chip, and b) you need to know how many steps the pipeline has.  While it is
not trivial, it won't take 10,000 lines of assembly code.
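The drain-and-refill idea in the sequence above can be modelled in a few lines. This is a toy sketch in Python, not i860 code: it assumes a 3-stage adder where every new operation pushes the oldest result out, and the class and function names are invented for the illustration. It ignores pending exception state, rounding modes and the other pipelined units.

```python
from collections import deque


class AddPipe:
    """Toy 3-stage adder pipeline: each push advances every stage by one."""

    def __init__(self, depth=3):
        self.stages = deque([0.0] * depth)

    def push(self, a, b):
        """Start a + b and return the result that falls out of the last stage."""
        self.stages.append(a + b)
        return self.stages.popleft()


def save_pipe(pipe, depth=3):
    # Push dummy 0 + 0 operations; the in-flight results drain out in order.
    return [pipe.push(0.0, 0.0) for _ in range(depth)]


def restore_pipe(pipe, saved):
    # Re-inject each saved result as value + 0; the outputs are discarded.
    for value in saved:
        pipe.push(value, 0.0)


if __name__ == "__main__":
    pipe = AddPipe()
    for a, b in ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)):
        pipe.push(a, b)
    saved = save_pipe(pipe)       # old process state
    pipe.push(9.0, 9.0)           # another process uses the adder
    restore_pipe(pipe, saved)     # switch back
    print(list(pipe.stages))
```

The point of the model is that the saved state is a list of already-computed results, and that the save and restore code has to know the pipeline depth.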

Interrupt code should not have to do floating point code, so none of the
pipelines should need to be saved/restored; that arduous task is saved for
context switches alone.

I think Intel has very nicely come up with a microprocessor where CPU-state
storage / restorage is a considerable portion of the context switch time...

All of the above is based on having looked at an i860 manual for a couple of
weeks, mostly trying to figure out *why* anyone would want to do this. 8-;

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/04/91)

sef@kithrup.COM (Sean Eric Fagan) wrote:
> In article <1991Feb3.061217.21988@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
> > Return from interrupt.  [Is a bitch]
>
> You're told wrong, or I am.
>
> To do a context switch, you need to do something like this:
>
>  [Example of reloading add pipeline]
>
> Etc.  Note that  a) you need to do that for every pipelined unit on the
> chip, and b) you need to know how many steps the pipeline has.  While it is
> not trivial, it won't take 10,000 lines of assembly code.
>
> Interrupt code should not have to do floating point code, so none of the
> pipelines should need to be saved/restored; that arduous task is saved for
> context switches alone.

By the way, you have to put exceptions back in the pipeline as well.
For the multiply pipeline at least, which is variable-length, there
is a special "reload pipe" instruction.

However, when I said it drops the pipeline on the floor, I meant the
instruction pipeline, where you are, in fact, provided with enough
information to reconstruct its state, but it's not just a PC.

The worst case, as is usual with chips, is taking a page fault, since
it can occur mid-instruction and you usually want to do a context
switch while the page is loaded.

And flushing the cache requires a software loop to map each entry to
an inaccessible region of memory (no valid bits, it seems).

But the nastiest cases arise because certain instructions can't be
re-executed.  You have to emulate them in software.  Being in the
delay slot of a branch is marked with a status register flag, and if
it's set you have to go back and emulate the branch as well.  I haven't
got a manual handy, but I seem to recall there are some restrictions
on clobbering the branch decision register in the delay slot to allow
this to happen.

BTW, I've always wondered why MIPS backs up the PC to untaken branch
instructions.  At least, the reference to "determining if the
conditions of the branch were met" in the reserved instruction
exception description of Kane's book suggests it does.  It would be
fully backward-compatible to only back up on taken branches, and it
would also simplify the emulation logic if you knew you had these
semantics, because you wouldn't have to test the branch condition.

Then, you could be in two-instructions-per-cycle mode when this happens,
and I think there are a few other cases to worry about.

It's really a royal flaming mess.
-- 
	-Colin

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/04/91)

Oh, yes, excuse me for not remembering this in my first post... one of the
more amusing forbidden code sequences is jumping into a delay slot.
This is because there's no MIPS-like branch delay bit; you have to
examine the instruction at PC-4 (or PC-8 if in double-instruction mode)
to see if it's a taken branch.  This instruction, of course, might be
on a different page, and it might not be paged in.  If you're sure it
wasn't paged in one (user) cycle ago, you can assume you just branched
to the faulting instruction and don't have to read the other page, but
are you completely sure of that?
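A rough sketch of the check described above, in Python. Everything here is a hypothetical model: instructions are tuples rather than real i860 encodings, "taken" is supplied as a field instead of being re-evaluated by the handler, and the 4096-byte page size is an assumption of the toy.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


def classify_fault(pc, memory, resident_pages, dual_mode=False):
    """Decide whether the faulting instruction sits in a branch delay slot.

    memory maps address -> (opcode, taken) tuples; resident_pages is a set
    of page numbers currently in memory.  Returns 'delay_slot',
    'not_delay_slot', or 'must_page_in'.
    """
    prev = pc - (8 if dual_mode else 4)
    if prev // PAGE_SIZE not in resident_pages:
        # The previous instruction cannot even be read without a page-in.
        return "must_page_in"
    opcode, taken = memory.get(prev, ("nop", False))
    if opcode == "branch" and taken:
        return "delay_slot"
    return "not_delay_slot"


if __name__ == "__main__":
    memory = {0x1000: ("branch", True), 0x1004: ("add", False)}
    print(classify_fault(0x1004, memory, {1}))
    print(classify_fault(0x1000, memory, {1}))
```

Note that the classifier only looks at what is stored at the previous address. If some other code jumped straight to 0x1004, it would still answer "delay_slot", which is why jumping into a delay slot has to be forbidden.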
-- 
	-Colin

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/04/91)

In article <1991Feb03.082253.12458@kithrup.COM> 
	sef@kithrup.COM (Sean Eric Fagan) writes:
>In article <1991Feb3.061217.21988@watdragon.waterloo.edu> 
	ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>>The code to restart from an interrupt
>>is, I'm told, 10,000 lines of assembler.  And you still have to
>>avoid one or two code sequences.
>
>You're told wrong, or I am.
>To do a context switch, you need to do something like this:
	[..simple looking code sequence..]

I believe that the problem is faults, not context switches.  Doesn't
the 860 need a fair bit of software support to do e.g.  IEEE
underflow of a divide?
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

gillies@cs.uiuc.edu (Don Gillies) (02/04/91)

henry@zoo.toronto.edu (Henry Spencer) writes:

>In article <798@nvuxl.UUCP> hsong@nvuxl.UUCP (g hugh song) writes:
>>Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
>>missing in this chip for building a UNIX machine out of this chip? ...

>An architecture whose potential performance can be exploited in high-level
>languages without driving the compiler writers up the wall. :-(


From a naive standpoint, isn't an i860 functionally similar to an IBM
6000 cpu?  I have seen citations to IBM-confidential reports on new
scheduling algorithms for pipeline and nonhomogeneous/superscalar
processors.

Does IBM know something about processor scheduling and compiler
writing that the rest of the world doesn't?  If compiler writers go
"up the wall" trying to generate i860 code, perhaps it's because they
are ignorant of, or unwilling to develop, effective scheduling
techniques.

Don Gillies	     |  University of Illinois at Urbana-Champaign
gillies@cs.uiuc.edu  |  Digital Computer Lab, 1304 W. Springfield, Urbana IL
---------------------+------------------------------------------------------
"UGH!  WAR! ... What is it GOOD FOR?  ABSOLUTELY NOTHING!"  - 60's music lyrics

gillies@cs.uiuc.edu (Don Gillies) (02/05/91)

Someone has informed me that the i860 can only issue three *different*
kinds of instructions at the same time (i.e. Integer, FPU, branch),
while the IBM 6000 can issue three instructions of the *same* kind at
the same time (i.e. FPU, FPU, FPU).

This is a major difference.  I have thought for a long time that the
i860 approach was fundamentally unsound, because the scheduling has
historically been difficult [in a theoretical sense].  This is because
the best theoretical heuristics (known to me) for scheduling n
unit-length jobs with precedence constraints on m ALU's are:

machine  model        execution time   performance bound     source

i860     "unrelated"  O(n^m)           2 * optimal           [Lenstra et al. 89]
6000     "identical"  O(n log n)       (2 - 2/m) * optimal   [Coffman-Graham 72]

The "identical" model assumes any job can execute on any ALU.  The
"unrelated" model assumes that every job takes a different amount of
time (possibly infinity) on each ALU.

In one respect, these algorithms are simplifications (i.e. no register
allocation, no pipelining, etc.)  In another respect, the Lenstra
algorithm solves a harder problem, but there are lesser
generalizations (i.e. resource scheduling) where the algorithms are
polynomial time, but no known algorithms have worst-case performance
better than s*optimal, where s is the number of functional units (i.e.
worst-case performance no better than a single functional unit).

The "identical" processors problem has been studied (theoretically)
since the mid 1960's, and the "unrelated" processors problem has been
studied since at least the mid 1970's.

The Lenstra algorithm uses linear programming and is the first
algorithm that always produces a schedule better than the trivial m *
optimal.
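For readers who have not met the "identical units" problem, here is a minimal greedy list scheduler for unit-length jobs with precedence constraints. It is a sketch only: it takes jobs in numeric order, so it is plain list scheduling and not the Coffman-Graham labeling algorithm or the linear-programming method cited above, and the function and parameter names are assumptions of the example.

```python
def list_schedule(n_jobs, preds, m):
    """Greedy list scheduling of unit-length jobs on m identical units.

    preds maps a job to the set of jobs that must finish before it starts.
    Returns a list of time steps, each a list of at most m job numbers.
    """
    done = set()
    remaining = list(range(n_jobs))
    schedule = []
    while remaining:
        # A job is ready once all of its predecessors ran in earlier steps.
        ready = [j for j in remaining if preds.get(j, set()) <= done]
        if not ready:
            raise ValueError("precedence constraints contain a cycle")
        step = ready[:m]
        schedule.append(step)
        done.update(step)
        remaining = [j for j in remaining if j not in step]
    return schedule


if __name__ == "__main__":
    preds = {2: {0, 1}, 3: {2}, 4: {0}}
    for t, step in enumerate(list_schedule(5, preds, 2)):
        print("cycle", t, "->", step)
```

In the "unrelated" case a ready job could no longer be handed to whichever unit happens to be free, which is where the extra difficulty comes from.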


Don Gillies	     |  University of Illinois at Urbana-Champaign
gillies@cs.uiuc.edu  |  Digital Computer Lab, 1304 W. Springfield, Urbana IL
---------------------+------------------------------------------------------
"UGH!  WAR! ... What is it GOOD FOR?  ABSOLUTELY NOTHING!"  - 60's music lyrics

preston@ariel.rice.edu (Preston Briggs) (02/05/91)

gillies@cs.uiuc.edu (Don Gillies) writes:

>Someone has informed me that the i860 can only issue three *different*
>kinds of instructions at the same time (i.e. Integer, FPU, branch),
>while the IBM 6000 can issue three instructions of the *same* kind at
>the same time (i.e. FPU, FPU, FPU).

This isn't quite right, for either machine.

The 860 can specify, in a single instruction,
	1 integer op (load, store, branch, integer arithmetic)
	1 FP op (arithmetic), where the FP op might be a multiply-add.
	  (Some restrictions apply)

Sort of a wide instruction approach, where the instructions are
normally issued every cycle.

The RS 6000 can specify, in a single instruction,
	1 operation, which might be a multiply-add.

The RS 6000 is more of a superscalar machine in that it can issue
many of these instructions in a single cycle (a branch, a conditional
code operation, an integer op, and an FP op).

Generally, I'd say the RS 6000 is an easier target,
allowing much more flexibility in instruction scheduling.
Register renaming and shorter pipelines are also helpful
(means less penalty for non-optimal schedules).

I'd say the RS 6000 has a simpler and more flexible instruction
set than the i860.  The cost is clever implementation to make that
simple instruction set run very fast.

This might be a good example of trading hardware complexity
against compiler complexity.

Preston Briggs

dik@cwi.nl (Dik T. Winter) (02/05/91)

In article <1991Feb4.194521.8384@cs.uiuc.edu> gillies@cs.uiuc.edu (Don Gillies) writes:
 > 
 > Someone has informed me that the i860 can only issue three *different*
 > kinds of instructions at the same time (i.e. Integer, FPU, branch),
 > while the IBM 6000 can issue three instructions of the *same* kind at
 > the same time (i.e. FPU, FPU, FPU).
As far as I know, both are wrong.  The i860 can issue at most two
*different* kinds of instructions at the same time (one FPU the other non-FPU).
I believe some versions of the i960 can issue three instructions at the same
time (but I understand the next cycle they can issue at most one instruction).
The RS6000 can issue three *different* kinds of instructions at the same time
(where different is different from the different of the i860).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

cet1@cl.cam.ac.uk (C.E. Thompson) (02/05/91)

In article <2896@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>The RS6000 can issue three *different* kinds of instructions at the same time
>(where different is different from the different of the i860).
>--
and there have been similar postings in this thread. Misunderstandings tend
to arise here, because there are different constraints coming from different 
stages in the various pipelines:

1. The ICU can, on any one cycle, do all of:
   a. Execute a branch instruction
   b. Execute a condition register instruction
   c. Dispatch two other instructions to the FXU & FPU. These can be both
      fixed, both floating, or one of each.
2. The FXU can execute at most one fixed point instruction each cycle
   (and most such instructions do only take one cycle).
3. The FPU is a bit more complicated because of the parallel load and 
   arithmetic pipelines, but sticking to arithmetic instructions, it can
   begin executing one new floating point operation each cycle. (They 
   usually have 2 or 3 cycle latency.) The operation can be a multiply-
   and-add, which you can count as two FLOPS if you want to.

To keep up a rate of two non-ICU-executed instructions per cycle therefore
requires equal numbers of fixed and floating-point instructions. However,
because both the FXU and FPU contain buffers of instructions issued by the
ICU and not yet executed, the instructions don't have to strictly alternate
in type, and it is not necessary for one of each type to be issued by the 
ICU on each cycle.
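The effect of the buffers can be seen in a small simulation. This is a toy model in Python, not a description of the real hardware: it assumes a dispatch width of two, one queue per unit with an arbitrary depth of four, one instruction started per unit per cycle, and no data dependences. 'X' stands for a fixed-point instruction and 'P' for a floating-point one.

```python
from collections import deque


def cycles_to_run(stream, queue_depth=4, dispatch_width=2):
    """Cycles until every instruction in `stream` has started executing."""
    queues = {"X": deque(), "P": deque()}
    pending = deque(stream)
    cycles = 0
    while pending or queues["X"] or queues["P"]:
        cycles += 1
        # Each unit starts at most one previously queued instruction.
        for q in queues.values():
            if q:
                q.popleft()
        # The dispatcher sends up to two instructions, in program order,
        # stalling as soon as the target queue is full.
        for _ in range(dispatch_width):
            if not pending:
                break
            q = queues[pending[0]]
            if len(q) >= queue_depth:
                break
            q.append(pending.popleft())
    return cycles


if __name__ == "__main__":
    for name, stream in (("alternating", "XP" * 8),
                         ("pairs", "XXPP" * 4),
                         ("all fixed", "X" * 16)):
        print(name, cycles_to_run(stream))
```

Under these assumptions sixteen instructions take 9 cycles when strictly alternating, 10 cycles when issued as same-type pairs, and 17 cycles when all of one type, so an equal mix sustains close to two per cycle without strict alternation.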

Chris Thompson
JANET:    cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/06/91)

In article <1991Feb4.194521.8384@cs.uiuc.edu> 
	gillies@cs.uiuc.edu (Don Gillies) writes:
>Someone has informed me that the i860 can only issue three *different*
>kinds of instructions at the same time (i.e. Integer, FPU, branch),
>while the IBM 6000 can issue three instructions of the *same* kind at
>the same time (i.e. FPU, FPU, FPU).

I don't believe that this is correct. The IBM can (peak) issue *four*
instructions per clock, but they have to be of the four different
kinds that the machine distinguishes.

There is only one bus from the I-cache/despatcher to the FPU. At
peak, one FPU instruction travels over it, and is queued in the FPU
for actual execution. Of course, they talk about widening everything
in future implementations. Plus, one can imagine superscalar logic in
the dequeueing logic, allowing multiple dequeues per clock, but I
don't recall hearing that they had done that.

-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

brandis@inf.ethz.ch (Marc Brandis) (02/06/91)

In article <1991Feb4.194521.8384@cs.uiuc.edu> gillies@cs.uiuc.edu (Don Gillies) writes:
>
>Someone has informed me that the i860 can only issue three *different*
>kinds of instructions at the same time (i.e. Integer, FPU, branch),
>while the IBM 6000 can issue three instructions of the *same* kind at
>the same time (i.e. FPU, FPU, FPU).
>
You were misinformed about both the i860 and the IBM S/6000. The i860 can only
issue two instructions at once, where one has to be an integer instruction
(branch counts as integer instruction) and one has to be a floating point
instruction. Moreover, you have to statically schedule these instructions.
That is, you enter a special mode (so-called dual instruction mode), in which
the i860 reads two instructions every cycle and issues the first one to the
integer unit and the second to the floating point unit.

However, the floating-point instruction can be a multiply-and-add instruction,
so that you can say that three operations may be issued at once in this mode.

The IBM S/6000, on the other hand, can issue four instructions at once, but
all have to be of a different kind: one integer, one fp, one branch and one
condition-code operation. The fp instruction can be multiply-and-add, so you
get a maximum of 5 operations per cycle (but this is an instruction mix that
you will never find).  The S/6000 is dynamically scheduled. That is, the
programmer implements just a sequential stream of instructions, the hardware
determines what can be issued in parallel.
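To see why static pairing puts the burden on the programmer or compiler, consider a toy packer in Python. It is an assumption-laden sketch, not the i860's actual rules: 'I' is an integer instruction, 'F' a floating-point one, only adjacent instructions of different kinds may share an issue slot, and there is no dependence checking or mode-switch cost.

```python
def pair_static(stream):
    """Pack a sequential stream of 'I' and 'F' instructions into issue slots.

    Adjacent instructions of different kinds share a slot; anything else
    issues alone.
    """
    slots = []
    i = 0
    while i < len(stream):
        if i + 1 < len(stream) and {stream[i], stream[i + 1]} == {"I", "F"}:
            slots.append((stream[i], stream[i + 1]))
            i += 2
        else:
            slots.append((stream[i],))
            i += 1
    return slots


if __name__ == "__main__":
    for stream in ("IFIF", "IIFF", "FFFF"):
        print(stream, len(pair_static(stream)), "slots")
```

The same four instructions need two slots when written as IFIF but three as IIFF. With static scheduling the code has to be arranged in the favourable order ahead of time, whereas a dynamically scheduled machine leaves that decision to the hardware.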


Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/06/91)

In article <1991Feb3.061217.21988@watdragon.waterloo.edu>,
ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
|> hsong@nvuxl.UUCP (g hugh song) wrote:
|> > Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
|> > missing in this chip for building a UNIX machine out of this chip?
|> 
|> Return from interrupt.  When the chip takes an exception, it sort of
|> drops all the bits in the pipeline on the floor and lets software
|> put the pieces back together.  The code to restart from an interrupt
|> is, I'm told, 10,000 lines of assembler.
|> 

I don't believe this number.  It clearly takes some work, but I would guess
it's more on the order of 100 - 200 instructions.  Anyone know?

|> 
|> (It's also, as Henry points out, a pain in the ass to program... nobody
|> has a compiler which can come close to hand coding yet.)
|> -- 

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0451			   ==
-----------------------------------------------------------------------------

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/06/91)

In article <1991Feb4.023042.21714@cs.uiuc.edu>, gillies@cs.uiuc.edu (Don
Gillies) writes:
|> henry@zoo.toronto.edu (Henry Spencer) writes:
|> 
|> From a naive standpoint, isn't an i860 functionally similar to an IBM
|> 6000 cpu?
|> 
|> Does IBM know something about processor scheduling and compiler
|> writing that the rest of the world doesn't?  If compiler writers go
|> "up the wall" trying to generate i860 code, perhaps it's because they
|> are ignorant of, or unwilling to develop, effective scheduling
|> techniques.
|> 

The i860 has the ability to execute up to 3 instructions at a time, giving
a theoretical possibility of 120 mips at 40MHz.  Unfortunately, you can
only do this with specific combinations of instructions (and by changing modes
and incurring a startup penalty).

For normal programming (even in assembler) this parallelism is rarely
available, and finding it with a compiler is difficult.  Generally, the i860
performs like a normal processor -- its mips rating is its clock speed minus
some percentage for pipeline stalls and memory delays.

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

jlitvin@st860.intel.com (John Litvin) (02/07/91)

In article <538@decvax.decvax.dec.com.UUCP> kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:

   In article <1991Feb3.061217.21988@watdragon.waterloo.edu>,
   ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
   |> hsong@nvuxl.UUCP (g hugh song) wrote:
   |> > Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
   |> > missing in this chip for building a UNIX machine out of this chip?
   |> 
   |> Return from interrupt.  When the chip takes an exception, it sort of
   |> drops all the bits in the pipeline on the floor and lets software
   |> put the pieces back together.  The code to restart from an interrupt
   |> is, I'm told, 10,000 lines of assembler.
   |> 

>   I don't believe this number.  It clearly takes some work, but I would guess
>   it's more on the order of 100 - 200 instructions.  Anyone know?

Yes, I do :-).  From the SVR4 port, we have about 1000 lines of assembly
in ml/ttrap.s to handle this.  (OK, it's more than 200, but far less than
the 10,000 lines we were accused of requiring).

John Litvin

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/08/91)

In article <11798@pt.cs.cmu.edu> I wrote:
>In article <1991Feb4.194521.8384@cs.uiuc.edu> 
>	gillies@cs.uiuc.edu (Don Gillies) writes:
>>...the IBM 6000 can issue three instructions of the *same* kind at
>>the same time (i.e. FPU, FPU, FPU).
>
>I don't believe that this is correct. The IBM can (peak) issue *four*
>instructions per clock, but they have to be of the four different
>kinds that the machine distinguishes.
>
>There is only one bus from the I-cache/despatcher to the FPU. At
>peak, one FPU instruction travels over it, and is queued in the FPU
>for actual execution. 

Evidently I misspoke. There are two buses from the I-cache/despatcher
to the FPU and FXU (integer unit). IBM paid the pins to send both
buses to both units. So, you really can issue two FPU instructions
per clock - or two FXUs - or one of each. The queue in each execution
unit can dequeue/initiate one per clock, but can enqueue two per
clock.

For comparison, the Omron Luna on my desk can initiate four
instructions per clock, in any mix. That's a cheat: it contains four
88000's. For some applications (such as mine, this week), this is
actually better, because it gives a different balance of resources -
mostly, for me, a big CPU-cache bandwidth. It was fun, the first time
I did a process list, and saw four entries with %CPU at 98+%.

The big issue with high-end processors is keeping them fed. The R4000
press release "disclosed" 128 bits of data path to the external
cache:  I expect several announcements this year that are at least as
wide. 
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/08/91)

I received the following reply to a previous post of mine which I thought I
would pass along:


In article <538@decvax.decvax.dec.com.UUCP> you write:
>In article <1991Feb3.061217.21988@watdragon.waterloo.edu>,
>ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>|> hsong@nvuxl.UUCP (g hugh song) wrote:
>|> > Why is it so hard to build a UNIX machine with Intel's i860 chip?  What is
>|> > missing in this chip for building a UNIX machine out of this chip?
>|> 
>|> Return from interrupt.  When the chip takes an exception, it sort of
>|> drops all the bits in the pipeline on the floor and lets software
>|> put the pieces back together.  The code to restart from an interrupt
>|> is, I'm told, 10,000 lines of assembler.
>|> 
>
>I don't believe this number.  It clearly takes some work, but I would guess
>it's more on the order of 100 - 200 instructions.  Anyone know?
>


Sorry, I can't post, but you can post my answer...
It's under 1000 lines (including comments, etc.) for the assembly-level
save and restore code, including all the trap type identification.
Fortunately a whole lot less than 10,000 lines.

Doug Doucette
doug@swdc.stratus.com
Stratus Western Development Center
San Jose, CA


-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

pc@Stardent.COM (Paul Cantrell) (02/08/91)

Re:	the discussion about what makes the i860 a difficult chip to use.

I was at Alliant when they were porting their SMP parallel mini-supercomputer
to i860. Although there are a lot of things Intel could have done to make the
chip easier for people to integrate into systems, basically the chip works ok.
I don't think that the problems that Colin has pointed out are really reasons
not to use the chip. And the chip *does* have many qualities which can make it
be a really fine chip for a multiprocessing application.

In article <1991Feb3.223732.3581@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>However, when I said it drops the pipeline on the floor, I meant the
>instruction pipeline, where you are, in fact, provided with enough
>information to reconstruct its state, but it's not just a PC.
>
>The worst case, as is usual with chips, is taking a page fault, since
>it can occur mid-instruction and you usually want to do a context
>switch while the page is loaded.

Well, the problem occurs whether or not you need to context switch. Colin
is right that it's a lot of trouble to continue from the exception because
of all the special cases you have to worry about continuing - delay slots,
dual instruction mode, pipeline, etc. The actual processor time consumed
isn't all that horrible, it's just that the exception handling code is
very complex. This isn't to say I think the i860 exception processing is
all that great - I think Intel could have produced a much better design
without a noticeable increase in silicon. My suspicion is that they simply
had a lack of software expertise involved in that part of the design.
There is nothing in the current exception processing design, however,
which prevents the chip from functioning correctly in a multiprocessing
application.

I've never looked at the 88000 exception code, but I'm told that it is
quite complex as well. The major difference that I see is that all the
people I've talked to who are using 88000 say that the exception code as
provided by Motorola works fine, and they never had to modify it. In the
case of the i860, the original exception code as provided by Intel
didn't even come close to handling all the cases for continuing from an
exception. Because of this, Alliant spent a great deal of time (1 person
for many months) trying to get this code to work correctly. Had Intel
delivered a working exception handler to i860 customers, the level of pain
and suffering would have been reduced by a few orders of magnitude. This
is one area Intel should try to improve as they come out with future
chips.

So the real problem here was that the chip design mandated a complex piece
of code, which Intel didn't supply. Many customers have spent a lot of money
developing this code themselves, when it would have been more appropriate for
the chip manufacturer to supply it.

>And flushing the cache requires a software loop to map each entry to
>an inaccessible region of memory (no valid bits, it seems).

Well, I'm not sure what Colin meant by "no valid bits". The cache is a
writeback cache, and rather than have a chip function to write back all
the cache lines, the operating system simply has to set things up so that
the normal cache line replacement algorithm will write the dirty line back
to memory, and replace it with an invalid entry. There certainly is a
valid bit; only dirty cache lines actually get written back. In a funny
sort of way, I kinda like this. It seems very RISCy to me - why provide a
special function which doesn't get used that often (only at context switch
time) when you can have the software do it? In the case of the Alliant,
the main memory subsystem couldn't keep up with the chip anyway, so there
wasn't really a speed disadvantage to doing it this way. In systems with
very fast main memory, people might think this was a performance problem;
I can't comment.
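
The flush-by-replacement idea can be illustrated with a small simulation.
This is a minimal sketch in Python of a toy direct-mapped write-back cache,
not a model of the actual i860 cache. The sizes, the FLUSH_BASE reserved
region, and all names are assumptions chosen for the example. One
simplification: after the flush, the toy cache holds clean lines from the
reserved region rather than invalid entries.

```python
LINE_WORDS = 4      # words per cache line (toy value)
NUM_LINES = 8       # lines in the direct-mapped toy cache
FLUSH_BASE = 128    # hypothetical reserved region used only for flushing


class WriteBackCache:
    """Toy direct-mapped write-back cache in front of a list used as memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = [None] * NUM_LINES   # None means the line is invalid
        self.writebacks = 0

    def _line(self, addr):
        block = addr // LINE_WORDS
        index, tag = block % NUM_LINES, block // NUM_LINES
        line = self.lines[index]
        if line is None or line["tag"] != tag:
            # Miss: the normal replacement path writes back a dirty victim.
            if line is not None and line["dirty"]:
                old = (line["tag"] * NUM_LINES + index) * LINE_WORDS
                self.memory[old:old + LINE_WORDS] = line["data"]
                self.writebacks += 1
            start = block * LINE_WORDS
            line = {"tag": tag, "dirty": False,
                    "data": self.memory[start:start + LINE_WORDS]}
            self.lines[index] = line
        return line

    def read(self, addr):
        return self._line(addr)["data"][addr % LINE_WORDS]

    def write(self, addr, value):
        line = self._line(addr)
        line["data"][addr % LINE_WORDS] = value
        line["dirty"] = True


def flush_by_replacement(cache):
    """Read one address per cache line from the reserved region.

    Each read evicts whatever occupied that line, so dirty lines are
    written back by the ordinary replacement logic; clean lines cost
    no memory write.
    """
    for i in range(NUM_LINES):
        cache.read(FLUSH_BASE + i * LINE_WORDS)
```

In this toy, writing to two ordinary addresses and then calling
flush_by_replacement(cache) pushes both values out to memory with exactly
two write-backs; the other six lines are replaced without any write.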

I think it's safe to say that this isn't a problem that should cause you to
think twice about using the i860 in your system.

If I had to list the major problems with i860 as they affected Alliant,
they would be:

	1) Exception processing code as provided by Intel was inadequate,
	   causing us to dedicate significant amounts of developer time
	   to this. Not a problem with the chip as much as a problem with
	   the way Intel helped bootstrap customers.

	2) No cache coherence scheme, making multi-processor implementations
	   suffer severe performance problems (shared data can't be cached).

	3) Time consuming to produce a compiler which can get a significant
	   amount of the potential performance out of the chip. Let's face it,
	   the chip has been out for quite a while, and good compilers are
	   just now appearing.

Of course, you can argue that #2 isn't fair because Intel didn't design
the chip for MP applications. The follow-on chip handles that. You can
argue whether that was a smart move when more and more systems are MP
systems. Still, I'd accept that it's not fair to criticize the chip for
this.

What's interesting is that that leaves #1 and #3 which are both software
issues. So perhaps the lesson is that chip manufacturers who want to go
the RISC route should make sure they have the software expertise to do
this, rather than simply push the problem back onto their customers. MIPS
is probably a good example of a company that has a good balance of
software and hardware design experience in their team, allowing them to
ship product to their customers which doesn't take a year to integrate.

I've never worked on a MIPS based product, so perhaps this isn't true.
However, I think the principle is still the same - it's not enough to
produce a hot chip if it takes the customers another year or two to
successfully integrate it into their product and come up with a compiler
which gets a significant percentage of the possible performance. If you do
it that way the product is obsolete by the time system manufacturers get
it to market.

I can assure you that these opinions are my own, and may not be shared by
my employers; past, present, or future...
-- 
uucp: pc@stardent

meissner@osf.org (Michael Meissner) (02/13/91)

In article <1991Feb8.144035.1076@Stardent.COM> pc@Stardent.COM (Paul
Cantrell) writes:

| I've never looked at the 88000 exception code, but I'm told that it is
| quite complex as well. The major difference that I see is that all the
| people I've talked to who are using 88000 say that the exception code as
| provided by Motorola works fine, and they never had to modify it. 

The exception code provided by Motorola may be reasonable now, but the
original exception code did not properly handle NaNs, denorms, and
such.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?