hsong@nvuxl.UUCP (g hugh song) (02/02/91)
Hi. Why is it so hard to build a UNIX machine with Intel's i860 chip? What is missing in this chip for building a UNIX machine out of this chip? What I want to see is a cheap vector workstation, which I really need. How about Moto's 68040 (or 88000) with a 68951(?) DSP chip? Is it possible to build a vector machine with this config? I am not interested in single precision vector processors (if they exist). One of my friends told me that Stardent is working with i860 for their next generation machine. Is this true? The machine on my desk, a DEC 5000/200, has an i860 for its graphics. Is there any compiler vendor who is working on a compiler which takes advantage of this i860 chip? I know a company which makes a coprocessor board with i860 for VME bus and, later this year, Turbo Channel (DS 5000/200). Their price is rather steep. I might buy one more DS 5000/200 with the money. OK. The company name is CSPI. The sales person said that it is going to release a kind of scheduler software which allows multitasking. However, I do not get the picture of how transparent it is for the users. If anybody on the net knows how it works or will work, please explain. Thanks. -hsong-
henry@zoo.toronto.edu (Henry Spencer) (02/03/91)
In article <798@nvuxl.UUCP> hsong@nvuxl.UUCP (g hugh song) writes: >Why is it so hard to build a UNIX machine with Intel's i860 chip? What is >missing in this chip for building a UNIX machine out of this chip? ... An architecture whose potential performance can be exploited in high-level languages without driving the compiler writers up the wall. :-( -- "Maybe we should tell the truth?" | Henry Spencer at U of Toronto Zoology "Surely we aren't that desperate yet." | henry@zoo.toronto.edu utzoo!henry
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (02/03/91)
In article <798@nvuxl.UUCP>, hsong@nvuxl.UUCP (g hugh song) writes:
song> .... [asks about the i860 cpu]....
song> What I want to see is a cheap vector workstation....
Short answer:
Buy an IBM RISC System 6000. The Model 320 has the best
price/performance ratio in the server configuration.
Long Answer:
There is a good bit of confusion about "vector" processing
out there, so I will be a bit tutorial here.... You should
type 'n' now if you find this topic boring....
The original "vector" processing machines (Star 100, Cyber
205, Cray 1, etc) and their modern equivalents (Crays 2,X,Y;
IBM 3090VF, Convex, Alliant) had an instruction set which
allowed a *single* instruction to specify an operation
on a *vector* of data.
Note that this does not imply that the operations on the
vector of data are done simultaneously.
In fact, on all of these machines the calculations were
largely serial, so the specification of many calculations
with a single instruction does not provide any performance
gain.
The real performance gain from these machines comes
from "pipelining", whereby the different stages of the
floating-point calculations are separated to enable an
"assembly-line" operation with a new result available each
clock cycle even though each individual calculation requires
many cycles in the pipeline.
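
To put rough numbers on the assembly-line picture, here is a minimal sketch. The function names and the 5-stage depth are assumptions for illustration only, not figures for any particular machine.

```python
def serial_cycles(n_elements: int, stages: int) -> int:
    """Cycles if each calculation must finish before the next one starts."""
    return n_elements * stages


def pipelined_cycles(n_elements: int, stages: int) -> int:
    """Cycles if a new calculation enters the pipeline on every clock.

    The first result appears after `stages` cycles; after that one
    result completes per cycle.
    """
    if n_elements == 0:
        return 0
    return stages + (n_elements - 1)


# Hypothetical 5-stage floating-point pipeline, 100-element vector.
print(serial_cycles(100, 5), pipelined_cycles(100, 5))
```

For long vectors the pipelined count approaches one result per clock, which is the whole gain, whether the loop is expressed as one vector instruction or as a code loop.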
Modern microprocessor floating-point units are pipelined,
but do not have vector instructions. Operations on "vectors"
are performed by code loops rather than by single instructions
but the performance is the same since instruction fetching in
loops is almost never a performance bottleneck. Examples of
modern microprocessors with fully pipelined floating-point
units are the i860, the IBM RIOS, and (I believe) the Motorola
88000. None of these have vector instructions.
So what is the difference between a "supercomputer" vector
processor and a pipelined microprocessor? The main difference
which causes performance differences is the memory bandwidth.
"Supercomputers" are generally not built around cached memory
systems, but are instead designed to have very high speed
transfers between the main memory and the vector registers
used by the pipelined vector instructions.
Microprocessor-based systems are all cache-oriented and none
of the currently available systems have enough memory
bandwidth to keep the pipelined arithmetic units busy if
the data to be worked on does not fit in the cache. The
job of writing code which re-uses data in the cache
effectively and compiling that code into optimum machine
instruction sequences is very difficult and severely limits
the performance of current pipelined microprocessors.
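
As an illustration of what "re-using data in the cache" means in code, here is a sketch of a blocked (tiled) matrix multiply next to the naive one. This only shows the loop restructuring; the block size is a made-up parameter, and pure Python will not show the cache effect itself.

```python
def matmul_naive(a, b):
    """Textbook triple loop: walks a whole column of b for every c[i][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, reordered so each block-by-block tile of a, b and c
    is reused many times before the loops move on to the next tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

On a real machine the block size would be chosen so that three tiles fit in the data cache together; picking it, and then scheduling the inner loop well, is the hard part described above.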
The IBM RIOS and Intel 860 machines are good examples.
Each is capable of a theoretical peak of 60 MFLOPS for
30 MHz parts and each is able to attain around 50 MFLOPS
on very carefully coded operations (typically matrix
multiplication). On the other hand, standard vectorizable
Fortran codes typically run at about 4-6 MFLOPS on either
of those architectures with existing compiler technology.
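
The arithmetic behind those figures can be sketched as follows. The assumption that the peak comes from two floating-point operations per clock (one multiply plus one add) is mine; the MFLOPS numbers are the ones quoted above.

```python
def peak_mflops(clock_mhz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every cycle retires the maximum number of flops."""
    return clock_mhz * flops_per_cycle


def fraction_of_peak(achieved_mflops: float, peak: float) -> float:
    """How much of the theoretical peak a given code actually reaches."""
    return achieved_mflops / peak


peak = peak_mflops(30, 2)  # assumed: multiply + add each clock at 30 MHz
for label, mflops in [("hand-tuned matrix multiply", 50),
                      ("typical Fortran, low end", 4),
                      ("typical Fortran, high end", 6)]:
    print(f"{label}: {fraction_of_peak(mflops, peak):.0%} of peak")
```

In other words, ordinary compiled code reaches something like a tenth or less of the advertised peak on either chip.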
I recommend getting the IBM over an i860-based system
since it is clear now there will be a lot more RIOS
machines out there than 860 machines, and therefore
a lot more people working on compiler technology,
systems integration, third-party software, etc....
song> One of my friends told me that Stardent is working with i860 for
song> their next generation machine. Is this true?
The Stardent "Stiletto" has been announced, I think. It uses a MIPS
R3000 for the main cpu and intel i860's for "vector" and graphics
co-processors. I don't know about availability, but the performance
of the unit will certainly be limited by the considerations I outlined
above.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/03/91)
hsong@nvuxl.UUCP (g hugh song) wrote: > Why is it so hard to build a UNIX machine with Intel's i860 chip? What is > missing in this chip for building a UNIX machine out of this chip? Return from interrupt. When the chip takes an exception, it sort of drops all the bits in the pipeline on the floor and lets software put the pieces back together. The code to restart from an interrupt is, I'm told, 10,000 lines of assembler. And you still have to avoid one or two code sequences. It's a great hot box, but for taking interrupts, it's a pig. (It's also, as Henry points out, a pain in the ass to program... nobody has a compiler which can come close to hand coding yet.) -- -Colin
sef@kithrup.COM (Sean Eric Fagan) (02/03/91)
In article <1991Feb3.061217.21988@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>Return from interrupt. When the chip takes an exception, it sort of
>drops all the bits in the pipeline on the floor and lets software
>put the pieces back together. The code to restart from an interrupt
>is, I'm told, 10,000 lines of assembler. And you still have to
>avoid one or two code sequences.
You're told wrong, or I am.
To do a context switch, you need to do something like this:
	st	f2, regs[f2]
	st	f3, regs[f3]
	st	f4, regs[f4]
	fadd.p	f0, f0, f2
	fadd.p	f0, f0, f3
	fadd.p	f0, f0, f4
	st	f2, fadd_pipeline[0]
	st	f3, fadd_pipeline[1]
	st	f4, fadd_pipeline[2]
	ld	f2, old_fadd_pipeline[0]
	ld	f3, old_fadd_pipeline[1]
	ld	f4, old_fadd_pipeline[2]
	fadd.p	f0, f2, f0
	fadd.p	f0, f3, f0
	fadd.p	f0, f4, f0
Etc. Note that a) you need to do that for every pipelined unit on the
chip, and b) you need to know how many steps the pipeline has. While it is
not trivial, it won't take 10,000 lines of assembly code.
Interrupt code should not have to do floating point code, so none of the
pipelines should need to be saved/restored; that arduous task is saved for
context switches alone.
I think Intel has very nicely come up with a microprocessor where CPU-state
storage / restorage is a considerable portion of the context switch time...
All of the above is based on having looked at an i860 manual for a couple of
weeks, mostly trying to figure out *why* anyone would want to do this. 8-;
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+     -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
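
For readers who find the register shuffling in the post above hard to follow, the same drain-and-reload idea can be modelled as a toy. This is only a sketch: the `Pipeline` class, the 3-stage depth and the helper names are assumptions for illustration, and a FIFO of values is a much simpler thing than the real chip's adder pipeline.

```python
from collections import deque


class Pipeline:
    """Toy fixed-depth pipeline: each push advances every stage by one."""

    def __init__(self, stages: int):
        self.slots = deque([0.0] * stages, maxlen=stages)

    def push(self, value: float) -> float:
        """Insert a new value and return the one that falls out the far end."""
        oldest = self.slots[0]
        self.slots.append(value)  # maxlen discards the oldest entry
        return oldest


def save_pipeline(pipe: Pipeline, stages: int) -> list:
    """Push dummy zeros through, collecting the in-flight results."""
    return [pipe.push(0.0) for _ in range(stages)]


def restore_pipeline(pipe: Pipeline, saved: list) -> None:
    """Push the saved results back in, in their original order."""
    for value in saved:
        pipe.push(value)
```

The software has to know the depth of every pipelined unit, because the number of dummy pushes must match it exactly; that is point b) in the post.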
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/04/91)
sef@kithrup.COM (Sean Eric Fagan) wrote: > In article <1991Feb3.061217.21988@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: > > Return from interrupt. [Is a bitch] > > You're told wrong, or I am. > > To do a context switch, you need to do something like this: > > [Example of reloading add pipeline] > > Etc. Note that a) you need to do that for every pipelined unit on the > chip, and b) you need to know how many steps the pipeline has. While it is > not trivial, it won't take 10,000 lines of assembly code. > > Interrupt code should not have to do floating point code, so none of the > pipelines should need to be saved/restored; that arduous task is saved for > context switches alone. By the way, you have to put exceptions back in the pipeline as well. For the multiply pipeline at least, which is variable-length, there is a special "reload pipe" instruction. However, when I said it drops the pipeline on the floor, I meant the instruction pipeline, where you are, in fact, provided with enough information to reconstruct its state, but it's not just a PC. The worst case, as is usual with chips, is taking a page fault, since it can occur mid-instruction and you usually want to do a context switch while the page is loaded. And flushing the cache requires a software loop to map each entry to an inaccessible region of memory (no valid bits, it seems). But the nastiest cases arise because certain instructions can't be re-executed. You have to emulate them in software. Being in the delay slot of a branch is marked with a status register flag, and if it's set you have to go back and emulate the branch as well. I haven't got a manual handy, but I seem to recall there are some restrictions on clobbering the branch decision register in the delay slot to allow this to happen. BTW, I've always wondered why MIPS backs up the PC to untaken branch instructions.
At least, the reference to "determining if the conditions of the branch were met" in the reserved instruction exception description of Kane's book suggests it does. It would be fully backward-compatible to only back up on taken branches, and it would also simplify the emulation logic if you knew you had these semantics, because you wouldn't have to test the branch condition. Then, you could be in two-instructions-per-cycle mode when this happens, and I think there are a few other cases to worry about. It's really a royal flaming mess. -- -Colin
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/04/91)
Oh, yes, excuse me for not remembering this in my first post... one of the more amusing forbidden code sequences is jumping into a delay slot. This is because there's no MIPS-like branch delay bit; you have to examine the instruction at PC-4 (or PC-8 if in double-instruction mode) to see if it's a taken branch. This instruction, of course, might be on a different page, and it might not be paged in. If you're sure it wasn't paged in one (user) cycle ago, you can assume you just branched to the faulting instruction and don't have to read the other page, but are you completely sure of that? -- -Colin
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/04/91)
In article <1991Feb03.082253.12458@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes: >In article <1991Feb3.061217.21988@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: >>The code to restart from an interrupt >>is, I'm told, 10,000 lines of assembler. And you still have to >>avoid one or two code sequences. > >You're told wrong, or I am. >To do a context switch, you need to do something like this: [..simple looking code sequence..] I believe that the problem is faults, not context switches. Doesn't the 860 need a fair bit of software support to do e.g. IEEE underflow of a divide? -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
gillies@cs.uiuc.edu (Don Gillies) (02/04/91)
henry@zoo.toronto.edu (Henry Spencer) writes: >In article <798@nvuxl.UUCP> hsong@nvuxl.UUCP (g hugh song) writes: >>Why is it so hard to build a UNIX machine with Intel's i860 chip? What is >>missing in this chip for building a UNIX machine out of this chip? ... >An architecture whose potential performance can be exploited in high-level >languages without driving the compiler writers up the wall. :-( From a naive standpoint, isn't an i860 functionally similar to an IBM 6000 cpu? I have seen citations to IBM-confidential reports on new scheduling algorithms for pipeline and nonhomogeneous/superscalar processors. Does IBM know something about processor scheduling and compiler writing that the rest of the world doesn't? If compiler writers go "up the wall" trying to generate i860 code, perhaps it's because they are ignorant of, or unwilling to develop, effective scheduling techniques. Don Gillies | University of Illinois at Urbana-Champaign gillies@cs.uiuc.edu | Digital Computer Lab, 1304 W. Springfield, Urbana IL ---------------------+------------------------------------------------------ "UGH! WAR! ... What is it GOOD FOR? ABSOLUTELY NOTHING!" - 60's music lyrics
gillies@cs.uiuc.edu (Don Gillies) (02/05/91)
Someone has informed me that the i860 can only issue three *different* kinds of instructions at the same time (i.e. Integer, FPU, branch), while the IBM 6000 can issue three instructions of the *same* kind at the same time (i.e. FPU, FPU, FPU). This is a major difference. I have thought for a long time that the i860 approach was fundamentally unsound, because the scheduling has historically been difficult [in a theoretical sense]. This is because the best theoretical heuristics (known to me) for scheduling n unit-length jobs with precedence constraints on m ALU's are:

	execution time	performance bound	source
i860	O(n^m)		2 * optimal		[Lenstra et al. 89]
6000	O(n log n)	(2 - 2/m) * optimal	[Coffman-Graham 72]

The "identical" model assumes any job can execute on any ALU. The "unrelated" model assumes that every job takes a different amount of time (possibly infinity) on each ALU. In one respect, these algorithms are simplifications (i.e. no register allocation, no pipelining, etc.). In another respect, the Lenstra algorithm solves a harder problem, but there are lesser generalizations (i.e. resource scheduling) where the algorithms are polynomial time, but no known algorithms have worst-case performance better than s*optimal, where s is the number of functional units (i.e. worst-case performance no better than a single functional unit). The "identical" processors problem has been studied (theoretically) since the mid 1960's, and the "unrelated" processors problem has been studied since at least the mid 1970's. The Lenstra algorithm uses linear programming and is the first algorithm that always produces a schedule better than the trivial m * optimal.
Don Gillies | University of Illinois at Urbana-Champaign
gillies@cs.uiuc.edu | Digital Computer Lab, 1304 W. Springfield, Urbana IL
---------------------+------------------------------------------------------
"UGH! WAR! ... What is it GOOD FOR? ABSOLUTELY NOTHING!" - 60's music lyrics
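
To make the "identical" model in the post above concrete, here is a sketch of scheduling unit-length jobs with precedence constraints on m interchangeable units. Note that this is a plain greedy list scheduler, not the Coffman-Graham labelling algorithm or the Lenstra linear-programming method cited in the post; the function name and the input format are assumptions made for the example.

```python
def list_schedule(num_jobs: int, deps: dict, m: int) -> list:
    """Greedy schedule of unit-length jobs on m identical units.

    deps maps a job number to the set of jobs that must finish first.
    Returns one list of job numbers per time step.
    """
    preds = {j: set(deps.get(j, ())) for j in range(num_jobs)}
    done = set()
    schedule = []
    while len(done) < num_jobs:
        # A job is ready once all of its predecessors have completed.
        ready = sorted(j for j in preds if j not in done and preds[j] <= done)
        if not ready:
            raise ValueError("precedence constraints contain a cycle")
        step = ready[:m]  # any ready job may run on any free unit
        schedule.append(step)
        done.update(step)
    return schedule


# Job 2 needs jobs 0 and 1; job 3 needs job 2.
print(list_schedule(4, {2: {0, 1}, 3: {2}}, m=2))
```

In the "unrelated" model the line `step = ready[:m]` no longer works, because a ready job can only go to a unit of the right kind; that restriction is what makes the i860-style problem harder.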
preston@ariel.rice.edu (Preston Briggs) (02/05/91)
gillies@cs.uiuc.edu (Don Gillies) writes: >Someone has informed me that the i860 can only issue three *different* >kinds of instructions at the same time (i.e. Integer, FPU, branch), >while the IBM 6000 can issue three instructions of the *same* kind at >the same time (i.e. FPU, FPU, FPU).
This isn't quite right, for either machine.
The 860 can specify, in a single instruction,
	1 integer op (load, store, branch, integer arithmetic)
	1 FP op (arithmetic), where the FP op might be a multiply-add.
	(Some restrictions apply)
Sort of a wide instruction approach, where the instructions are normally issued every cycle.
The RS 6000 can specify, in a single instruction,
	1 operation, which might be a multiply-add.
The RS 6000 is more of a superscalar machine in that it can issue many of these instructions in a single cycle (a branch, a conditional code operation, an integer op, and an FP op).
Generally, I'd say the RS 6000 is an easier target, allowing much more flexibility in instruction scheduling. Register renaming and shorter pipelines are also helpful (means less penalty for non-optimal schedules). I'd say the RS 6000 has a simpler and more flexible instruction set than the i860. The cost is clever implementation to make that simple instruction set run very fast. This might be a good example of trading hardware complexity against compiler complexity.
Preston Briggs
dik@cwi.nl (Dik T. Winter) (02/05/91)
In article <1991Feb4.194521.8384@cs.uiuc.edu> gillies@cs.uiuc.edu (Don Gillies) writes: > > Someone has informed me that the i860 can only issue three *different* > kinds of instructions at the same time (i.e. Integer, FPU, branch), > while the IBM 6000 can issue three instructions of the *same* kind at > the same time (i.e. FPU, FPU, FPU). As far as I know, both are wrong. The i860 can issue at most two *different* kinds of instructions at the same time (one FPU the other non-FPU). I believe some versions of the i960 can issue three instructions at the same time (but I understand the next cycle they can issue at most one instruction). The RS6000 can issue three *different* kinds of instructions at the same time (where different is different from the different of the i860). -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
cet1@cl.cam.ac.uk (C.E. Thompson) (02/05/91)
In article <2896@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >The RS6000 can issue three *different* kinds of instructions at the same time >(where different is different from the different of the i860). >--
and there have been similar postings in this thread. Misunderstandings tend to arise here, because there are different constraints coming from different stages in the various pipelines:
1. The ICU can, on any one cycle, do all of:
   a. Execute a branch instruction
   b. Execute a condition register instruction
   c. Dispatch two other instructions to the FXU & FPU. These can be both fixed, both floating, or one of each.
2. The FXU can execute at most one fixed point instruction each cycle (and most such instructions do only take one cycle).
3. The FPU is a bit more complicated because of the parallel load and arithmetic pipelines, but sticking to arithmetic instructions, it can begin executing one new floating point operation each cycle. (They usually have 2 or 3 cycle latency.) The operation can be a multiply-and-add, which you can count as two FLOPS if you want to.
To keep up a rate of two non-ICU-executed instructions per cycle therefore requires equal numbers of fixed and floating-point instructions. However, because both the FXU and FPU contain buffers of instructions issued by the ICU and not yet executed, the instructions don't have to strictly alternate in type, and it is not necessary for one of each type to be issued by the ICU on each cycle.
Chris Thompson JANET: cet1@uk.ac.cam.phx Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk
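
The three constraints in the post above imply a simple lower bound on execution time, sketched here. This is a back-of-envelope model only: it ignores latencies, dependencies, loads and finite buffer sizes, and the function names are made up for the example.

```python
def min_cycles(n_fixed: int, n_float: int) -> int:
    """Lower bound on cycles for a mix of fixed and floating instructions.

    The ICU dispatches at most two per cycle; the FXU and the FPU each
    start at most one instruction per cycle.
    """
    dispatch_bound = -(-(n_fixed + n_float) // 2)  # integer ceiling of total / 2
    return max(dispatch_bound, n_fixed, n_float)


def best_rate(n_fixed: int, n_float: int) -> float:
    """Best-case non-ICU instructions per cycle for the given mix."""
    return (n_fixed + n_float) / min_cycles(n_fixed, n_float)


print(best_rate(50, 50))  # balanced mix
print(best_rate(80, 20))  # mostly fixed-point
```

The busier of the two units always sets the bound, which is why the two-per-cycle rate needs equal numbers of fixed and floating-point instructions.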
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/06/91)
In article <1991Feb4.194521.8384@cs.uiuc.edu> gillies@cs.uiuc.edu (Don Gillies) writes: >Someone has informed me that the i860 can only issue three *different* >kinds of instructions at the same time (i.e. Integer, FPU, branch), >while the IBM 6000 can issue three instructions of the *same* kind at >the same time (i.e. FPU, FPU, FPU). I don't believe that this is correct. The IBM can (peak) issue *four* instructions per clock, but they have to be of the four different kinds that the machine distinguishes. There is only one bus from the I-cache/despatcher to the FPU. At peak, one FPU instruction travels over it, and is queued in the FPU for actual execution. Of course, they talk about widening everything in future implementations. Plus, one can imagine superscalar logic in the dequeueing logic, allowing multiple dequeues per clock, but I don't recall hearing that they had done that. -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
brandis@inf.ethz.ch (Marc Brandis) (02/06/91)
In article <1991Feb4.194521.8384@cs.uiuc.edu> gillies@cs.uiuc.edu (Don Gillies) writes: > >Someone has informed me that the i860 can only issue three *different* >kinds of instructions at the same time (i.e. Integer, FPU, branch), >while the IBM 6000 can issue three instructions of the *same* kind at >the same time (i.e. FPU, FPU, FPU). > You were misinformed about both the i860 and the IBM S/6000. The i860 can only issue two instructions at once, where one has to be an integer instruction (branch counts as integer instruction) and one has to be a floating point instruction. Moreover, you have to statically schedule these instructions. That is, you enter a special mode (so-called dual instruction mode), in which the i860 reads two instructions every cycle and issues the first one to the integer unit and the second to the floating point unit. However, the floating-point instruction can be a multiply-and-add instruction, so that you can say that three operations may be issued at once in this mode. The IBM S/6000, on the other hand, can issue four instructions at once, but all have to be of a different kind: one integer, one fp, one branch and one condition-code operation. The fp instruction can be multiply-and-add, so you get a maximum of 5 operations per cycle (but this is an instruction mix that you will never find). The S/6000 is dynamically scheduled. That is, the programmer implements just a sequential stream of instructions, the hardware determines what can be issued in parallel. Marc-Michael Brandis Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology) CH-8092 Zurich, Switzerland email: brandis@inf.ethz.ch
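
The peak counts in the post above reduce to a small piece of arithmetic, sketched here. The `IssueModel` class and its field names are invented for the example; the numbers are the ones given in the post.

```python
from dataclasses import dataclass


@dataclass
class IssueModel:
    name: str
    instructions_per_cycle: int  # peak instructions issued per clock
    fp_multiply_add: bool        # one FP instruction may perform two operations

    def peak_operations_per_cycle(self) -> int:
        """Count a multiply-and-add as two operations, everything else as one."""
        return self.instructions_per_cycle + (1 if self.fp_multiply_add else 0)


I860_DUAL = IssueModel("i860, dual instruction mode", 2, True)
S6000 = IssueModel("IBM S/6000", 4, True)

for cpu in (I860_DUAL, S6000):
    print(cpu.name, cpu.peak_operations_per_cycle())
```

The count says nothing about who finds the parallelism: on the i860 the compiler must pair the instructions statically, while on the S/6000 the hardware does it at run time.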
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/06/91)
In article <1991Feb3.061217.21988@watdragon.waterloo.edu>, ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: |> hsong@nvuxl.UUCP (g hugh song) wrote: |> > Why is it so hard to build a UNIX machine with Intel's i860 chip? What is |> > missing in this chip for building a UNIX machine out of this chip? |> |> Return from interrupt. When the chip takes an exception, it sort of |> drops all the bits in the pipeline on the floor and lets software |> put the pieces back together. The code to restart from an interrupt |> is, I'm told, 10,000 lines of assembler. |> I don't believe this number. It clearly takes some work, but I would guess it's more on the order of 100 - 200 instructions. Anyone know? |> |> (It's also, as Henry points out, a pain in the ass to program... nobody |> has a compiler which can come close to hand coding yet.) |> -- ----------------------------------------------------------------------------- == jeff kenton Consulting at kenton@decvax.dec.com == == (617) 894-4508 (603) 881-0451 == -----------------------------------------------------------------------------
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/06/91)
In article <1991Feb4.023042.21714@cs.uiuc.edu>, gillies@cs.uiuc.edu (Don Gillies) writes: |> henry@zoo.toronto.edu (Henry Spencer) writes: |> |> From a naive standpoint, isn't an i860 functionally similar to an IBM |> 6000 cpu? |> |> Does IBM know something about processor scheduling and compiler |> writing that the rest of the world doesn't? If compiler writers go |> "up the wall" trying to generate i860 code, perhaps it's because they |> are ignorant of, or unwilling to develop, effective scheduling |> techniques. |> The i860 has the ability to execute up to 3 instructions at a time, giving a theoretical possibility of 120 mips at 40MHz. Unfortunately, you can only do this with specific combinations of instructions (and by changing modes and incurring a startup penalty). For normal programming (even in assembler) this parallelism is rarely available, and finding it with a compiler is difficult. Generally, the i860 performs like a normal processor -- its mips rating is its clock speed minus some percentage for pipeline stalls and memory delays. ----------------------------------------------------------------------------- == jeff kenton Consulting at kenton@decvax.dec.com == == (617) 894-4508 (603) 881-0011 == -----------------------------------------------------------------------------
jlitvin@st860.intel.com (John Litvin) (02/07/91)
In article <538@decvax.decvax.dec.com.UUCP> kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes: In article <1991Feb3.061217.21988@watdragon.waterloo.edu>, ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: |> hsong@nvuxl.UUCP (g hugh song) wrote: |> > Why is it so hard to build a UNIX machine with Intel's i860 chip? What is |> > missing in this chip for building a UNIX machine out of this chip? |> |> Return from interrupt. When the chip takes an exception, it sort of |> drops all the bits in the pipeline on the floor and lets software |> put the pieces back together. The code to restart from an interrupt |> is, I'm told, 10,000 lines of assembler. |> > I don't believe this number. It clearly takes some work, but I would guess > it's more on the order of 100 - 200 instructions. Anyone know? Yes, I do :-). From the SVR4 port, we have about 1000 lines of assembly in ml/ttrap.s to handle this. (OK, it's more than 200, but far less than the 10,000 lines we were accused of requiring). John Litvin
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/08/91)
In article <11798@pt.cs.cmu.edu> I wrote: >In article <1991Feb4.194521.8384@cs.uiuc.edu> > gillies@cs.uiuc.edu (Don Gillies) writes: >>...the IBM 6000 can issue three instructions of the *same* kind at >>the same time (i.e. FPU, FPU, FPU). > >I don't believe that this is correct. The IBM can (peak) issue *four* >instructions per clock, but they have to be of the four different >kinds that the machine distinguishes. > >There is only one bus from the I-cache/despatcher to the FPU. At >peak, one FPU instruction travels over it, and is queued in the FPU >for actual execution. Evidently I misspoke. There are two buses from the I-cache/despatcher to the FPU and FXU (integer unit). IBM paid the pins to send both buses to both units. So, you really can issue two FPU instructions per clock - or two FXUs - or one of each. The queue in each execution unit can dequeue/initiate one per clock, but can enqueue two per clock. For comparison, the Omron Luna on my desk can initiate four instructions per clock, in any mix. That's a cheat: it contains four 88000's. For some applications (such as mine, this week), this is actually better, because it gives a different balance of resources - mostly, for me, a big CPU-cache bandwidth. It was fun, the first time I did a process list, and saw four entries with %CPU at 98+%. The big issue with high-end processors is keeping them fed. The R4000 press release "disclosed" 128 bits of data path to the external cache: I expect several announcements this year that are at least as wide. -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (02/08/91)
I received the following reply to a previous post of mine which I thought I would pass along: In article <538@decvax.decvax.dec.com.UUCP> you write: >In article <1991Feb3.061217.21988@watdragon.waterloo.edu>, >ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: >|> hsong@nvuxl.UUCP (g hugh song) wrote: >|> > Why is it so hard to build a UNIX machine with Intel's i860 chip? What is >|> > missing in this chip for building a UNIX machine out of this chip? >|> >|> Return from interrupt. When the chip takes an exception, it sort of >|> drops all the bits in the pipeline on the floor and lets software >|> put the pieces back together. The code to restart from an interrupt >|> is, I'm told, 10,000 lines of assembler. >|> > >I don't believe this number. It clearly takes some work, but I would guess >it's more on the order of 100 - 200 instructions. Anyone know? > Sorry, I can't post, but you can post my answer... It's under 1000 lines (including comments, etc.) for the assembly-level save and restore code, including all the trap type identification. Fortunately a whole lot less than 10,000 lines. Doug Doucette doug@swdc.stratus.com Stratus Western Development Center San Jose, CA ----------------------------------------------------------------------------- == jeff kenton Consulting at kenton@decvax.dec.com == == (617) 894-4508 (603) 881-0011 == -----------------------------------------------------------------------------
pc@Stardent.COM (Paul Cantrell) (02/08/91)
Re: the discussion about what makes the i860 a difficult chip to use. I was at Alliant when they were porting their SMP parallel mini-supercomputer to i860. Although there are a lot of things Intel could have done to make the chip easier for people to integrate into systems, basically the chip works ok. I don't think that the problems that Colin has pointed out are really reasons not to use the chip. And the chip *does* have many qualities which can make it a really fine chip for a multiprocessing application. In article <1991Feb3.223732.3581@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes: >However, when I said it drops the pipeline on the floor, I meant the >instruction pipeline, where you are, in fact, provided with enough >information to reconstruct its state, but it's not just a PC. > >The worst case, as is usual with chips, is taking a page fault, since >it can occur mid-instruction and you usually want to do a context >switch while the page is loaded. Well, the problem occurs whether or not you need to context switch. Colin is right that it's a lot of trouble to continue from the exception because of all the special cases you have to worry about continuing - delay slots, dual instruction mode, pipeline, etc. The actual processor time consumed isn't all that horrible, it's just that the exception handling code is very complex. This isn't to say I think the i860 exception processing is all that great - I think Intel could have produced a much better design without a noticeable increase in silicon. My suspicion is that they simply had a lack of software expertise involved in that part of the design. There is nothing in the current exception processing design, however, which prevents the chip from functioning correctly in a multiprocessing application. I've never looked at the 88000 exception code, but I'm told that it is quite complex as well.
The major difference that I see is that all the people I've talked to who are using 88000 say that the exception code as provided by Motorola works fine, and they never had to modify it. In the case of the i860, the original exception code as provided by Intel didn't even come close to handling all the cases for continuing from an exception. Because of this, Alliant spent a great deal of time (1 person for many months) trying to get this code to work correctly. Had Intel delivered a working exception handler to i860 customers, the level of pain and suffering would have been reduced by a few orders of magnitude. This is one area Intel should try to improve as they come out with future chips. So the real problem here was that the chip design mandated a complex piece of code, which Intel didn't supply. Many customers have spent a lot of money developing this code themselves, when it would have been more appropriate for the chip manufacturer to supply it. >And flushing the cache requires a software loop to map each entry to >an inaccessible region of memory (no valid bits, it seems). Well, I'm not sure what Colin meant by "no valid bits". The cache is a writeback cache, and rather than have a chip function to write back all the cache lines, the operating system simply has to set things up so that the normal cache line replacement algorithm will write the dirty line back to memory, and replace it with an invalid entry. There certainly is a valid bit; only dirty cache lines actually get written back. In a funny sort of way, I kinda like this. It seems very RISCy to me - why provide a special function which doesn't get used that often (only at context switch time) when you can have the software do it? In the case of the Alliant, the main memory subsystem couldn't keep up with the chip anyway, so there wasn't really a speed disadvantage to doing it this way. In systems with very fast main memory, people might think this was a performance problem; I can't comment.
I think it's safe to say that this isn't a problem that should cause you to think twice about using the i860 in your system.

If I had to list the major problems with the i860 as they affected Alliant, they would be:

1) Exception processing code as provided by Intel was inadequate, causing us to dedicate significant amounts of developer time to this. Not a problem with the chip as much as a problem with the way Intel helped bootstrap customers.

2) No cache coherence scheme, making multi-processor implementations suffer severe performance problems (shared data can't be cached).

3) Time-consuming to produce a compiler which can get a significant amount of the potential performance out of the chip. Let's face it, the chip has been out for quite a while, and good compilers are just now appearing.

Of course, you can argue that #2 isn't fair because Intel didn't design the chip for MP applications. The follow-on chip handles that. You can argue whether that was a smart move when more and more systems are MP systems. Still, I'd accept that it's not fair to criticize the chip for this.

What's interesting is that this leaves #1 and #3, which are both software issues. So perhaps the lesson is that chip manufacturers who want to go the RISC route should make sure they have the software expertise to do this, rather than simply push the problem back onto their customers.

MIPS is probably a good example of a company that has a good balance of software and hardware design experience in their team, allowing them to ship product to their customers which doesn't take a year to integrate. I've never worked on a MIPS based product, so perhaps this isn't true. However, I think the principle is still the same - it's not enough to produce a hot chip if it takes the customers another year or two to successfully integrate it into their product and come up with a compiler which gets a significant percentage of the possible performance.
If you do it that way, the product is obsolete by the time system manufacturers get it to market.

I can assure you that these opinions are my own, and may not be shared by my employers, past, present, or future...

--
uucp: pc@stardent
meissner@osf.org (Michael Meissner) (02/13/91)
In article <1991Feb8.144035.1076@Stardent.COM> pc@Stardent.COM (Paul Cantrell) writes:
| I've never looked at the 88000 exception code, but I'm told that it is
| quite complex as well. The major difference that I see is that all the
| people I've talked to who are using 88000 say that the exception code as
| provided by Motorola works fine, and they never had to modify it.

The exception code provided by Motorola may be reasonable now, but the original exception code did not properly handle NaNs, denorms, and such.

--
Michael Meissner email: meissner@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?