fay@encore.UUCP (Peter Fay) (09/14/87)
In article <347@erc3ba.UUCP> sd@erc3ba.UUCP (S.Davidson) writes:
>It's happened already, though they are not all the rage yet. They are
>called Very Long Instruction Word machines, and one of the originators,
>Josh Fisher, did his dissertation on global compaction of horizontal
>microcode. Josh moved to Yale after he graduated, and then moved to a
>company to build a VLIW machine. I don't know the current status of
>this machine, though. At Yale, though, Josh got some very impressive
>speedups from unrolling loops and basically running compaction on them,
>assuming a lot of available resources. I don't know of any results on
>real hardware, however.

Funny you should mention this. I was just reading "Unix on a VLIW"
(P. Clancy et al. - Proc. Summer 1987 Usenix Conf.), which describes
some of Multiflow's hardware and software. Truly incredible stuff, if
it's for real. Their high-end system (Trace 28/200) claims 28
operations per instruction, 120 MFLOPS, and 215 VLIW MIPS.

The most intriguing aspect to me, though, is not just their hardware
doing 28 formerly sequential instructions in parallel, but their
compiler techniques. Normally "conditional jumps occur every five to
eight instructions", making parallelization very difficult. So simply
take a trace of normal program execution (yes, I know, somewhat awkward
when compiling new programs) and have the compiler assume the program
will USUALLY execute that trace. Then compile the new program as if it
were not going to take the seldom-used branches, and plunge ahead. Of
course, if those unlikely branches happen, just do "compensation" (undo
what you did wrong). The authors claim that instead of a handful of
branch-free instructions, "hundreds or thousands of operations become
candidates for overlap". Unfortunately, no hard, cold numbers on the
improved code are presented in this paper.

My question to those parallel-machine compiler writers out there: is
anyone writing compilers for non-VLIW machines using the same methods?
Why can't, say, an Alliant-type (or Cedar-type, etc.) machine with
hardware lock-step between computational elements get a trace
execution, recompile assuming no branches, and, when the 1000th
instruction diverts from the "chosen path", just back up the CE's and
undo the damage?
--
peter fay
fay@multimax.arpa
{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay
lindsay@k.gp.cs.cmu.edu.UUCP (09/18/87)
Yes, trace scheduling is a useful technique on non-VLIW machines.

To recap: the basic trick is to eliminate constraints from the
precedence graph by placing fixup code on paths which are thought to be
less likely. For example:

    if(a) then { x; b } else c;

might become

    x; if(a) then b else { undo-x; c };

A VLIW machine does this because you can't "schedule" two ops into the
same large instruction word if one is constrained to be before/after
the other.

On more normal machines, scheduling still has wins. Anyone with a
floating point coprocessor can try for integer/float overlap. The MIPS
cpu can be scheduled. Crays have multiple functional units: they can be
scheduled because opcodes are issued faster than functions complete.
Other vector machines, such as the Alliant, have scheduling. So, the
trace scheduling method can be used to improve the scheduling on all of
these. The win is basically that the machine's average throughput gets
nearer the peak throughput. If your application is already running at
peak, then you don't need to know.

The Multiflow machine can do branch logic in every instruction, so they
hope to do "junk code" better than anyone else. Supposedly their Unix
is quite quick: I will try to get some numbers.
--
Don lindsay@k.gp.cs.cmu.edu CMU Computer Science
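[A minimal C rendering of Don's example, to make the compensation idea
concrete. The hoisted operation, the profile assumption, and the "undo"
are all invented for illustration; this is a sketch of the technique,
not Multiflow's actual compiler output.]

    /* Before trace scheduling: the branch blocks the scheduler,
       because the add may not move above the test of a. */
    if (a) {
        x = y + z;          /* the op we would like to hoist */
        b();
    } else {
        c();
    }

    /* After trace scheduling, assuming the profile says the
       then-branch is the common case: the add moves above the
       branch, where it can overlap with earlier operations. */
    old_x = x;              /* save enough state to compensate  */
    x = y + z;              /* hoisted; executes speculatively  */
    if (a) {
        b();
    } else {
        x = old_x;          /* compensation: undo the hoisted op */
        c();
    }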
aglew@urbsdc.Urbana.Gould.COM (11/06/88)
>The 88000 does even better. Rather than requiring all instructions to
>contain several operations, each instruction *starts* one operation.
>The target register (source for a store operation) is marked "busy" so
>that the next reference will wait for the operation to finish. This
>allows the same parallelism as VLIW without wasting code memory
>bandwidth on empty slots. From what I understand, the current chip does
>addition and logic in one instruction cycle (thus not parallelizing
>these operations), but load, store, multiply/divide, and floating point
>use the scheme described. A neat advantage of the hardware bit is that
>the compiler does not need to know exact timings to ensure correct
>execution. Timing data enhances optimization, but is not necessary to
>ensure correctness.
>
>I believe this technique is called "scoreboarding".
>
>A later version could parallelize short instructions also if
>instruction fetching became much faster than addition and logic.
>
>Stuart D. Gathman <stuart@bms-at.uucp>
> <..!{vrdxhq|daitc}!bms-at!stuart>

I have to be careful saying this, since I now work for Motorola, but it
should be obvious that scoreboarding cannot take you as far as VLIW.
Scoreboarding is an appropriate choice for the current level of
microprocessor technology, but any computer architect will tell you
that you eventually have to get past the one operation/cycle dispatch
limit (well, maybe not Norm Jouppi, at DEC, who published an
interesting paper, titled something like "Superpipelined vs.
Superparallel Computers", in CAN a while back).

Scoreboarding lets you have multiple operations in flight at once, but,
typically, you still dispatch only one operation per instruction cycle.
Which means that only one operation per instruction cycle can complete,
which puts a limit on throughput. To get faster, you either have to
decrease the cycle time or increase the number of operations
dispatched/completed per instruction cycle. Note that scoreboarding
doesn't even get you to one operation/cycle dispatch; you still have
stalls when a register is busy.

The next step past scoreboarding is Tomasulo instruction scheduling,
which lets you continue to dispatch instructions even though previous
instructions have not yet even received the data to begin execution.
Berkeley's Aquarius project was the last group I know of to try this.
Tomasulo scheduling seems to be a hard subject, but every group to try
it makes it a little bit easier. Tomasulo on a single-operation
instruction set lets you approach one operation per instruction cycle
dispatch.

Both scoreboarding and Tomasulo can be used to dispatch one or multiple
instructions per cycle, getting past the instruction dispatch limit.
This is just easier to do in a VLIW instruction set, where the
operations are guaranteed to be independent; it can be done, but gets
expensive, for dispatch of multiple possibly dependent operations per
cycle.

Andy "Krazy" Glew.
at: Motorola Microcomputer Division, Champaign-Urbana Development
    Center (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com - preferred, if you have MX records
    aglew@fang.gould.com - if you don't
    ...!uunet!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

PS. I promise to shorten this .signature as soon as our new mail paths
are set.
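[For concreteness, a toy C model of the register-busy scheme Stuart
describes and of the one-dispatch-per-cycle limit Andy points to. The
register count, latencies, and in-order stall policy are invented for
illustration; this is a sketch, not the 88000's actual mechanism.]

    #include <stdio.h>

    #define NREGS 32

    static int busy_until[NREGS]; /* cycle when each reg's result is ready */
    static int now;               /* current cycle */

    /* Issue one operation: stall while any source or the destination
       register is marked busy, then mark the destination busy for the
       operation's latency. At most one op is dispatched per cycle. */
    static void issue(const char *op, int dst, int s1, int s2, int latency)
    {
        int ready = busy_until[s1];
        if (busy_until[s2]  > ready) ready = busy_until[s2];
        if (busy_until[dst] > ready) ready = busy_until[dst];
        if (ready > now) {
            printf("cycle %2d: stall %d cycle(s) on busy register\n",
                   now, ready - now);
            now = ready;          /* in-order issue: dispatch waits */
        }
        printf("cycle %2d: issue %s r%d,r%d,r%d\n", now, op, dst, s1, s2);
        busy_until[dst] = now + latency;
        now++;                    /* one dispatch per cycle, at best */
    }

    int main(void)
    {
        issue("load", 1, 0, 0, 4); /* r1 <- mem, 4-cycle latency     */
        issue("add",  2, 3, 4, 1); /* independent: issues next cycle */
        issue("add",  5, 1, 2, 1); /* uses r1: stalls until the load */
        return 0;
    }

[The third op stalls until cycle 4 waiting for r1, and throughput never
exceeds one operation per cycle - which is exactly the dispatch limit
being argued about.]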
aglew@urbsdc.Urbana.Gould.COM (11/11/88)
>..> Peter da Silva makes the plea for scoreboarding being better
>..> than VLIW, "because the next version of the machine is likely
>..> to have a different mix of instruction timings."
>
>The next version of the machine have different instruction timings?
>How? Remember, a VLIW is effectively executing microcode. Multiplies
>on a CISC processor, currently implemented in microcode, can be
>speeded up by adding a hardware multiplier. But adding a new
>functional unit like that to a VLIW effectively makes the instruction,
>err, control word longer. You'll have to re-compile to take advantage
>of it anyhow!
>
>Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
> Snail Mail P.O. Box 92191 Lafayette, LA 70509

Ummh, what is to prevent a VLIW from having a multiply instruction
(doesn't MULTIFLOW, Bob?). And what is to prevent version 1 of the
machine from having a multiply instruction that does two bits at a
time, taking ~16 cycles, while version 2 does 8 bits at a time, taking
~4 cycles, for a 32-bit word? I.e., what is to prevent execution units
that are already in the VLIW from changing their timings? Ditto
especially for the memory system.

VLIW and techniques such as scoreboarding are not mutually exclusive -
it is possible to combine them, although whether that is worthwhile is
the subject of ongoing research. So far, I'd guess that the evidence
says that VLIW+scoreboarding doesn't win you much performance, but it
can give you binary compatibility.

Binary compatibility will be an issue until (1) there is a standard
machine-independent distribution format for software; (2) there is a
standard way of handling multi-machine executables; and (3) the
problems of process migration between inhomogeneous machines are
either solved or considered unimportant.

Andy "Krazy" Glew.
at: Motorola Microcomputer Division, Champaign-Urbana Development
    Center (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com - preferred, if you have MX records
    aglew@fang.gould.com - if you don't
    ...!uunet!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

PS. I promise to shorten this .signature soon.
spectre@mit-vax.LCS.MIT.EDU (Joseph D. Morrison) (11/12/88)
In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>I have to be careful saying this, since I now work for Motorola,
>but it should be obvious that scoreboarding cannot take you as far as
>VLIW. Scoreboarding is an appropriate choice for the current level
>of microprocessor technology, but any computer architect will tell
>you that you eventually have to get past the one operation/cycle
>dispatch limit ...
(more info about VLIW and scoreboarding)

It seems to me that the issue of VLIW versus scoreboarding is the wrong
one to discuss.

Scoreboarding is but one of several techniques for managing a pipeline.
(Some alternative techniques are micro-dataflow, simple stalling, or
letting the compiler stick no-ops in the right places. The simple
schemes can also be combined with "register bypass" to improve pipeline
performance.)

So I think we were actually arguing about "which is better for getting
parallelism: pipelining or VLIW?" Phrased that way, I think the answer
is obviously "use both".

If each of your functional units takes 4 cycles to perform its
operation, and you have a VLIW machine with 8 functional units, your
average throughput will be 2 instructions per cycle. The obvious thing
to do is to use pipelined functional units, and get the 8 instructions
per cycle you deserve :-)

Naturally, as soon as you do this you will need some mechanism for
handling the various conflicts that occur when two instructions in the
pipeline want to use the same register. This is when you can use
scoreboarding, or whatever you want.

In fact, what better way to test pipeline strategies! With all those
functional units, the pipeline management will be pretty hairy...

Joe Morrison
--
MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
Cambridge, Massachusetts, 02139         (617) 253-5881
--
"Back off, man; I'm a scientist!"
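[A back-of-the-envelope check of the arithmetic above, using the
poster's numbers; nothing here is specific to any real machine.]

    /* 8 functional units, 4-cycle operations.
       Unpipelined: each unit is tied up for all 4 cycles, so at most
       8 ops complete every 4 cycles -> 8/4 = 2 ops/cycle average.
       Pipelined: each unit accepts a new op every cycle -> 8 ops/cycle. */
    int peak_ops_per_cycle(int units, int latency, int pipelined)
    {
        return pipelined ? units : units / latency;
    }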
frisch@mfci.UUCP (Michael Frisch) (11/12/88)
In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>
>So I think we were actually arguing about "which is better for getting
>parallelism: pipelining or VLIW?" Phrased that way, I think the answer
>is obviously "use both".
>
>If each of your functional units takes 4 cycles to perform its
>operation, and you have a VLIW machine with 8 functional units, your
>average throughput will be 2 instructions per cycle. The obvious thing
>to do is to use pipelined functional units, and get the 8 instructions
>per cycle you deserve :-)
>
>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.
>
>In fact, what better way to test pipeline strategies! With all those
>functional units, the pipeline management will be pretty hairy...

This is already done in Multiflow's VLIW: instructions which take more
than one cycle (floating add, floating multiply, memory references) are
pipelined, with the pipes exposed to the compiler. So these operations
can each be initiated every cycle.

The pipelining is managed in software, at compile time, rather than by
a scoreboard at runtime. It may be hairy, but a) the compiler has much
more information available to it than the limited look-ahead of a
scoreboard, b) the compiler can rearrange operations as needed to keep
the pipes full, while the scoreboard can at best execute those future
operations which happen to be data-ready, and c) making the hardware
simpler (i.e., no scoreboard) makes the system more cost-effective.

(Someone asked about integer multiplies... they're done in one cycle in
hardware already... it's the flops and memory refs which gain from
pipelining.)

Mike Frisch
-------------------------------------------------------------------------------
The opinions above are mine and not necessarily those of my employer.
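[A toy illustration of what "pipes exposed to the compiler" means: the
scheduler below knows each operation's latency and places each use of a
result no earlier than its producer's completion, so no runtime
interlock is ever consulted. The latencies and the three-op trace are
invented; this is a sketch of the idea, not Multiflow's compiler.]

    #include <stdio.h>

    struct op { const char *name; int dep1, dep2; int latency; };

    /* -1 means "no dependence". Op 2 consumes the results of 0 and 1. */
    static struct op trace[] = {
        { "load  r1",       -1, -1, 7 },  /* memory reference */
        { "fmul  r2",       -1, -1, 5 },  /* independent work */
        { "fadd  r3,r1,r2",  0,  1, 4 },  /* needs r1 and r2  */
    };

    int main(void)
    {
        int done[3], i;
        for (i = 0; i < 3; i++) {
            int start = 0;  /* earliest cycle all inputs are ready */
            if (trace[i].dep1 >= 0 && done[trace[i].dep1] > start)
                start = done[trace[i].dep1];
            if (trace[i].dep2 >= 0 && done[trace[i].dep2] > start)
                start = done[trace[i].dep2];
            done[i] = start + trace[i].latency;
            printf("%-16s issue cycle %d, result ready cycle %d\n",
                   trace[i].name, start, done[i]);
        }
        return 0;
    }

[The load and the multiply both issue at cycle 0 - in a VLIW they can
share one instruction word - and the add is placed at cycle 7, exactly
when its operands arrive, with no hardware checking anything.]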
colwell@mfci.UUCP (Robert Colwell) (11/13/88)
In article <28200234@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>>..> Peter da Silva makes the plea for scoreboarding being better
>>..> than VLIW, "because the next version of the machine is likely
>>..> to have a different mix of instruction timings."
>>
>>The next version of the machine have different instruction timings?
>>How? Remember, a VLIW is effectively executing microcode. Multiplies
>>on a CISC processor, currently implemented in microcode, can be
>>speeded up by adding a hardware multiplier. But adding a new
>>functional unit like that to a VLIW effectively makes the instruction,
>>err, control word longer. You'll have to re-compile to take advantage
>>of it anyhow!
>
>Ummh, what is to prevent a VLIW from having a multiply instruction
>(doesn't MULTIFLOW, Bob?).

Yes, we do have integer multiply (in fact, they're implemented in those
great big AMD/Cypress chips).

>I.e., what is to prevent execution units that are already in the
>VLIW from changing their timings?

Not a thing. And if you've made your compiler table-driven, where all
the significant pipe lengths and resource usages in the machine reside
in one table, it's not hard at all to retarget the compiler. Further,
it's relatively easy to experiment with what the effects on performance
would be if one could shorten a pipe here or add a register file port
there.

>VLIW and techniques such as scoreboarding are not mutually exclusive
>- it is possible to combine them, although whether that is worthwhile
>is the subject of ongoing research. So far, I'd guess that the
>evidence says that VLIW+scoreboarding doesn't win you much
>performance, but it can give you binary compatibility.

Yes, that's one way to get binary compatibility, similar in spirit to
the Vax's 11-compatibility mode. Kinda high price to pay, though,
considering I could have used that hardware to buy higher performance
on recompiled code for the same hardware cost (and less design time,
since the scoreboard would be no trivial design to get right).

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090
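[A guess at what such a machine-description table might look like, for
readers who haven't built one. The unit names, latencies, and port
counts are invented for illustration, except the one-cycle integer
multiply mentioned elsewhere in this thread.]

    /* Retargeting the compiler to a new pipe length or an extra
       register file port means editing this table and rebuilding -
       the scheduling algorithms themselves do not change. */
    struct fu_desc {
        const char *name;
        int latency;       /* cycles from issue to result  */
        int rf_ports;      /* register file ports consumed */
    };

    static const struct fu_desc machine[] = {
        { "ialu", 1, 2 },
        { "imul", 1, 2 },  /* integer multiply: one cycle  */
        { "fadd", 4, 2 },
        { "fmul", 5, 2 },
        { "mem",  7, 1 },
    };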
colwell@mfci.UUCP (Robert Colwell) (11/13/88)
In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>It seems to me that the issue of VLIW versus scoreboarding is the
>wrong one to discuss.
>
>Scoreboarding is but one of several techniques for managing a
>pipeline. (Some alternative techniques are micro-dataflow, simple
>                                           ^^^^^^^^^^^^^^

Would you elaborate a little on that? Never heard of it.

>stalling, or letting the compiler stick no-ops in the right places.
>The simple schemes can also be combined with "register bypass" to
>improve pipeline performance.)

We do register bypassing, and it's not free in terms of gates in the
register files, but it's worthwhile.

>So I think we were actually arguing about "which is better for getting
>parallelism: pipelining or VLIW?" Phrased that way, I think the answer
>is obviously "use both".

We did, so I don't see any argument here.

>If each of your functional units takes 4 cycles to perform its
>operation, and you have a VLIW machine with 8 functional units, your
>average throughput will be 2 instructions per cycle. The obvious thing
>to do is to use pipelined functional units, and get the 8 instructions
>per cycle you deserve :-)

We do. If you put in a functional unit that requires 4 cycles to
complete, and you DON'T pipeline it, then your first machine will be
your last, because nobody will buy it; the performance will be too low.
The question is, does the compiler manage the pipes, or do you devote
complicated runtime hardware to the task?

>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.

We let the compiler do it. The only reason to make the hardware do it
is to try to handle object code compatibility across different pipeline
latencies. See other recent articles for more on this.

>In fact, what better way to test pipeline strategies! With all those
>functional units, the pipeline management will be pretty hairy...

So if you do it in software and you get a wrong answer, you fix your
tables and recompile the compiler (not that that's ever happened to us,
you understand :-)). And if you do it in hardware, you respin the chip
at enormous expense and then wait for the next time.
aglew@urbsdc.Urbana.Gould.COM (11/14/88)
>It seems to me that the issue of VLIW versus scoreboarding is the
>wrong one to discuss.
>
>Scoreboarding is but one of several techniques for managing a
>pipeline. (Some alternative techniques are micro-dataflow, simple
>stalling, or letting the compiler stick no-ops in the right places.
>The simple schemes can also be combined with "register bypass" to
>improve pipeline performance.)
>
> Joe Morrison
>
>MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
>545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
>Cambridge, Massachusetts, 02139         (617) 253-5881

This is pedantic, but "managing a pipeline" is overly restrictive. What
you mean is managing an instruction and resource scheduling system, of
which a pipeline is only one possibility. To me, pipeline implies
sequentiality - saying "pipe network" lets you get out of order, but
I'd still prefer a better term.
stevew@nsc.nsc.com (Steve Wilson) (11/15/88)
In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.

This is where the discussion comes from. VLIW advocates letting the
compiler manage both the pipe and register utilization, since the
compiler knows about global resource utilization where a scoreboard
doesn't.

Steve Wilson
National Semiconductor

[The above opinion is mine, not that of my employer.]
spectre@mit-vax.LCS.MIT.EDU (Joseph D. Morrison) (11/16/88)
In article <556@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>>Scoreboarding is but one of several techniques for managing a
>>pipeline. (Some alternative techniques are micro-dataflow, simple
>                                            ^^^^^^^^^^^^^^
>Would you elaborate a little on that? Never heard of it.

Micro-dataflow is an interesting pipeline management mechanism that was
used in the IBM 360/91 computer. (I think IBM used it in one other
machine as well, but I can't remember which.) The idea was that
instructions could be started out of order, and could finish out of
order, with a "reservation station" making sure that no dependencies
were violated.

If you are interested, here are the details! (If not, hit 'n' now!)

What are reservation stations?
==============================

The CPU has several of these "reservation stations" in its hardware.
Each reservation station is either:

  (a) empty, or
  (b) contains an instruction to be performed as soon as possible,
      in the form:

      +----+------------------+------------------+------------+
      | OP | TAG/DATA (src 1) | TAG/DATA (src 2) | TAG (dest) |
      +----+------------------+------------------+------------+

i.e. an operation, two source operands (each of which can be a valid
datum or a tag), and one destination operand which is always a tag.

The Instruction-Processing Algorithm
====================================

You can think of this as three processes running "in parallel" on the
hardware. (Assume, for simplicity, that all instructions are in the
form (OP SRC1 SRC2 DEST). Assume also that all sources and destinations
are registers.)

PROCESS 1
=========

PROCESS-1 continually loads the reservation stations. Every time a
reservation station becomes free, PROCESS-1 loads the next instruction
from memory into the free reservation station.

To load an instruction into a reservation station:

- Say the instruction was (+ R0 R0 R1)
- First, generate a new tag for the solution (which will go in R1).
  (Say we generate the tag "q".)
- Stick the tag "q" in register R1.
- Next, check if R0 contains data or a tag. If it contains data (say
  the number 53), put the following in the reservation station:
      (+ 53 53 q)
  If R0 contained tag "p", we would put:
      (+ p p q)
  into the reservation station.
- IN GENERAL:
  - A tag is always generated for the destination of each instruction.
  - This tag is always placed in BOTH the destination register, and in
    the "destination" position of the reservation station.
  - The operands are copied right into the reservation station if they
    are available, but if a tag is in an operand register, that tag is
    placed in the reservation station instead.
  - Informally, tags represent "data in transit".

PROCESS 2
=========

PROCESS-2 dispatches instructions from the reservation stations.
PROCESS-2 is continually scanning the reservation stations, checking
for stations in which both source operands contain valid data. If one
is found, the operation is "shoved into" the pipelined ALU, along with
the tag for the result.

For example, if a station contained (+ 53 53 q), the dispatcher would
shove [+ 53 53, q] into the ALU. (The ALU goes off and does its thing,
and when it's done, the pair [106, q] will appear on the bus.)

If PROCESS-2 finds something like (+ p p q) in a reservation station,
it just ignores that station for the time being.

Every time an instruction is dispatched from a reservation station,
that station becomes empty and can be refilled by PROCESS-1.
PROCESS 3
=========

This process "watches" the bus where results come out, looking for
[result, tag] pairs. Whenever it sees such a pair, PROCESS-3 finds all
places (in the register file and in the reservation stations) with that
tag, and shoves the result into all those places.

Discussion
==========

Notice that this is a "micro" version of a dataflow machine.
Micro-dataflow gives you as much parallelism as is permitted by any
reordering of the instructions, while properly handling any data
dependencies, BUT: it degenerates to a normal interlocked pipeline if

  (a) the reservation stations become full, or
  (b) it runs out of tags

Write/write conflicts are elegantly handled, as in an ordinary dataflow
machine; if two instructions both write to R0, the result slot of each
instruction is still allocated a different tag, and the last
instruction is the one whose tag ends up in R0. Thus "the right thing
will happen".

It's really a very clever scheme! The only bug? It's not worth the
hardware cost. You need associative lookup to handle copy-back of
results into the registers and reservation stations, and that takes
chip area. You need a tag generator, and some extra datapaths... And
unless you use a large reservation station, you really don't find that
much extra parallelism.

Details on the 360/91 system can be found in the following article:

  Anderson, Sparacio and Tomasulo; "The IBM System/360 Model 91:
  Machine Philosophy and Instruction-Handling", IBM Journal, January
  1967, pp. 8-24.

Joe Morrison
--
MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
Cambridge, Massachusetts, 02139         (617) 253-5881
--
"Back off, man; I'm a scientist!"
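[The description above is compact enough to run. Below is a toy,
single-ALU C version of the three processes, following the (+ R0 R0 R1)
example; the station count, tag values, and the absence of a real
pipeline are simplifications invented for illustration, not the
360/91's actual hardware.]

    #include <stdio.h>

    #define NREG 4
    #define NSTA 2

    struct slot    { int is_tag; int v; };  /* a datum or a tag */
    struct station { int busy; struct slot s1, s2; int dtag; };

    static struct slot    reg[NREG];        /* register file        */
    static struct station rs[NSTA];         /* reservation stations */
    static int next_tag = 100;

    /* PROCESS 1: load (+ src1 src2 dst) into a free station. A fresh
       tag goes into both the station and the destination register. */
    static void load_insn(int src1, int src2, int dst)
    {
        int i;
        for (i = 0; i < NSTA && rs[i].busy; i++)
            ;                               /* assume one is free */
        rs[i].busy = 1;
        rs[i].s1 = reg[src1];               /* datum or tag, as found */
        rs[i].s2 = reg[src2];
        rs[i].dtag = next_tag++;
        reg[dst].is_tag = 1;
        reg[dst].v = rs[i].dtag;
    }

    /* PROCESS 3: broadcast [result, tag]; fill every matching place
       in the register file and the stations (associative lookup). */
    static void broadcast(int result, int tag)
    {
        int i;
        for (i = 0; i < NREG; i++)
            if (reg[i].is_tag && reg[i].v == tag)
                { reg[i].is_tag = 0; reg[i].v = result; }
        for (i = 0; i < NSTA; i++) {
            if (rs[i].busy && rs[i].s1.is_tag && rs[i].s1.v == tag)
                { rs[i].s1.is_tag = 0; rs[i].s1.v = result; }
            if (rs[i].busy && rs[i].s2.is_tag && rs[i].s2.v == tag)
                { rs[i].s2.is_tag = 0; rs[i].s2.v = result; }
        }
    }

    /* PROCESS 2: dispatch any station whose operands are both data.
       Stations still holding tags are simply skipped for now. */
    static void dispatch(void)
    {
        int i;
        for (i = 0; i < NSTA; i++)
            if (rs[i].busy && !rs[i].s1.is_tag && !rs[i].s2.is_tag) {
                int r = rs[i].s1.v + rs[i].s2.v;   /* the ALU */
                rs[i].busy = 0;
                broadcast(r, rs[i].dtag);
            }
    }

    int main(void)
    {
        reg[0].v = 53;       /* R0 = 53, as in the example            */
        load_insn(0, 0, 1);  /* (+ R0 R0 R1): both operands are data  */
        load_insn(1, 0, 2);  /* (+ R1 R0 R2): holds R1's tag for now  */
        dispatch();          /* station 0 fires, R1 <- 106; its
                                broadcast wakes station 1, which fires
                                later in the same scan, R2 <- 159     */
        printf("R1 = %d, R2 = %d\n", reg[1].v, reg[2].v);
        return 0;
    }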
lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (11/16/88)
In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>Micro-dataflow is an interesting pipeline management mechanism that
>was used in the IBM 360/91 computer.

I think that this is more commonly known as Tomasulo instruction
scheduling. There was a study, a few years ago, showing that a Cray-1
would have had higher throughput if it had used this method. This
system is essentially the high-price/high-win version of a scoreboard.
Many modern systems have chosen to go with compile-time scheduling,
some retaining a few hardware interlocks, some not.

The argument is actually deeper than just fancy compilers versus fancy
(or self-reliant) hardware. There are two basic issues.

The first issue is branches. They happen very often, and the hardware
solutions don't mind. The innovation that made VLIW possible was a
compiler innovation for scheduling in the presence of branches. It
works well in certain kinds of code: only Multiflow has much
understanding about how well it works on the rest of the code.

The second issue is cycle counts and synchronization. It used to be
common for instructions to take a data-dependent number of clocks. For
example, a multiply by a small number would run faster than a multiply
by a big number. Also, there were machines with asynchronous units:
they were done when they were done, and that was that. (The latest
buzzword is "self-timed circuits", but they aren't necessarily like
that.) All in all, the hardware solutions coped fine with all this. The
compilers give up and rely on fond hopes.

There are several reasons that data-dependent instruction timing has
fallen into disfavor. For one, hardware interlocks only look ahead just
so far, and are rarely as clever as the Tomasulo scheme. So, the
compilers were generating code that interlocked a lot. By making the
machines more predictable, we've made it possible for compilers to
compare possible overlap sequences and compute - at compile time -
which will run faster.

That still leaves conditional branches. The approach of HEP was
straightforward enough: run someone else as a crack-stuffer. I wonder
what the follow-on will look like.
--
Don lindsay@k.gp.cs.cmu.edu CMU Computer Science
--
aglew@urbsdc.Urbana.Gould.COM (11/16/88)
>/* Written 3:41 pm Nov 14, 1988 by stevew@nsc.nsc.com in urbsdc:comp.arch */
>In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>>In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>>Naturally, as soon as you do this you will need some mechanism for
>>handling the various conflicts that occur when two instructions in the
>>pipeline want to use the same register. This is when you can use
>>scoreboarding, or whatever you want.
>
>This is where the discussion comes from. VLIW advocates letting the
>compiler manage both the pipe and register utilization, since the
>compiler knows about global resource utilization where a scoreboard
>doesn't.
>
>Steve Wilson
>National Semiconductor
>[The above opinion is mine, not that of my employer.]

I normally don't bother, but I'd like to point out that the quote is
not mine.
henry@utzoo.uucp (Henry Spencer) (11/17/88)
In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>It's really a very clever scheme! The only bug? It's not worth the
>hardware cost...

There is also a small problem with trying to make the situation look
sensible when an interrupt or a trap strikes. I don't think any of
those beasts had an MMU (too old), but taking a page fault in a system
like that would really be something. And you thought the 68020's "stack
puke" was bad...
sher@sunybcs.uucp (David Sher) (11/23/88)
This is just an idea that has been floating around my mind for some
time. The CMU (and now perhaps GE) WARP is an MIMD systolic array full
of powerful pipelined processors. Its microinstruction set is designed
to be as orthogonal as possible. So is the WARP a good candidate for
VLIW techniques? The architecture is a bit regular for such, but that
may not be a disadvantage. I was considering doing some research along
those lines myself, but I find myself too busy to do that for a few
years.

-David Sher
ARPA: sher@cs.buffalo.edu
BITNET: sher@sunybcs
UUCP: {rutgers,ames,boulder,decvax}!sunybcs!sher
ian@armada.UUCP (Ian L. Kaplan) (11/24/88)
In article <2828@cs.Buffalo.EDU>, sher@sunybcs.uucp (David Sher) writes:
> This is just an idea that has been floating around my mind for some
> time. The CMU (and now perhaps GE) WARP is an MIMD systolic array
> full of powerful pipelined processors. Its microinstruction set is
> designed to be as orthogonal as possible. So is the WARP a good
> candidate for VLIW techniques? The architecture is a bit regular for
> such, but that may not be a disadvantage. I was considering doing
> some research along those lines myself, but I find myself too busy
> to do that for a few years.

The Warp has a compiler for a CMU-developed language named W2. This
compiler does, in fact, use some VLIW techniques. The work on the
compiler was done by Monica Lam, Thomas Gross and their colleagues. It
is described in Dr. Lam's Ph.D. thesis ("A Systolic Array Optimizing
Compiler", Monica Sin-Ling Lam, May 1987, CMU-CS-87-187).

The Warp is an interesting machine, but its technology is several years
old. CMU is working with Intel on a next-generation machine, known as
the iWarp. Last I heard, the iWarp would be a parallel processor with
72 PEs, arranged in a grid (I was tempted to write 2-D systolic array
here, but just as the original Warp is much more flexible than a simple
pipeline systolic array, the iWarp will be much more flexible than a
simple 2-D systolic array). I expect that some interesting work on
parallel programming languages and environments will arise out of the
iWarp project. My guess is that there may end up being some
similarities between the languages used to program the iWarp and the
languages that are (or could be) used to program the Connection
Machine.

As far as work on VLIW goes, I would start soon, if I were you. RISC is
passe'. The next high performance microprocessor architecture will be
VLIW.

Ian Kaplan
MassPar Inc.
I speak for myself and no one else.
lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (11/26/88)
I've had trouble responding to inquiries by mail, so, a posting.

The study that simulated a Cray-1 with Tomasulo interlocks was:

  "Instruction Issue Logic for High-Performance Interruptable Pipelined
  Processors", 14th Annual Int'l Symposium on Computer Architecture
  (also Computer Arch. News, vol. 15, #2, June 1987), p. 27.

The original:

  "An Efficient Algorithm For Exploiting Multiple Arithmetic Units",
  R. Tomasulo, IBM Journal of R&D, Jan 1967, p. 25.
--
Don lindsay@k.gp.cs.cmu.edu CMU Computer Science
--
cs4342ac@evax.arl.utexas.edu (Ytivitaler) (02/20/91)
I am looking for any information on books, articles, etc., on VLIW
computers. Any help would be appreciated.

Thanx,
Fred