[comp.arch] VLIW

fay@encore.UUCP (Peter Fay) (09/14/87)

In article <347@erc3ba.UUCP> sd@erc3ba.UUCP (S.Davidson) writes:
>
>
>It's happened already, though they are not all the rage yet.  They are
>called Very Long Instruction Word machines, and one of the originators,
>Josh Fisher, did his dissertation on global compaction of horizontal
>microcode.  Josh moved to Yale after he graduated, and then moved to a
>company to build a VLIW machine.  I don't know the current status of
>this machine, though.  At Yale, Josh got some very impressive speedups
>from unrolling loops and basically running compaction on them, assuming
>a lot of available resources.
>I don't know of any results on real hardware, however.
>

Funny you should mention this. I was just reading "Unix on a VLIW" (P.
Clancy et al. - Proc. Summer 1987 Usenix Conf.) which describes some of
Multiflow's hardware and software. Truly incredible stuff, if it's for
real. Their high end system (Trace 28/200) claims 28 operations per
instruction, 120 MFLOPS and 215 VLIW MIPS.

The most intriguing aspect to me, though, is not just their hardware doing
28 formerly sequential instructions in parallel, but their compiler
techniques. Normally "conditional jumps occur every five to eight
instructions", making parallelization very difficult. So simply take a
trace of normal program execution (yes, I know, somewhat awkward when compiling
new programs) and have the compiler assume it will USUALLY execute that
trace. Then compile the new program as if it were not going to take the
seldom-used branches and plunge ahead. Of course, if those unlikely
branches happen, just do "compensation" (undo what you did wrong). The
authors claim that instead of a few branch-free instructions to work
with, "hundreds or thousands of operations become candidates for
overlap".

Unfortunately, no cold, hard numbers on the improved code are presented
in this paper.

My question to those parallel-machine compiler writers out there: is anyone
writing compilers for non-VLIW machines using the same methods? Why can't,
say, an Alliant-type (or Cedar-type, etc.) machine with hardware lock-step
between computational elements take an execution trace, recompile assuming
no branches, and, when the 1000th instruction diverges from the "chosen
path", just back up the CEs and undo the damage?

-- 
			peter fay
			fay@multimax.arpa
{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay

lindsay@k.gp.cs.cmu.edu.UUCP (09/18/87)

Yes, trace scheduling is a useful technique on non-VLIW machines.

To recap: the basic trick is to eliminate constraints from the precedence
graph, by placing fixup code on paths which are thought to be less likely.
For example: 
	if(a) then { x; b } else c;
might become
	x; if(a) then b else { undo-x; c };

A VLIW machine does this because you can't "schedule" two ops into the
same large instruction word if one is constrained to be before/after the
other.

On more normal machines, scheduling still has wins. Anyone with a floating
point coprocessor can try for integer/float overlap. The MIPS cpu can be
scheduled. Crays have multiple functional units: they can be scheduled
because opcodes are issued faster than functions complete.  Other vector
machines, such as the Alliant, have scheduling.

So, the trace scheduling method can be used to improve the scheduling on
all of these. The win is basically that the machine's average throughput
gets nearer the peak throughput. If your application already runs at
peak, then you don't need any of this.

The Multiflow machine can do branch logic in every instruction,
so they hope to do "junk code" better than anyone else. Supposedly
their Unix is quite quick: I will try to get some numbers.
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

aglew@urbsdc.Urbana.Gould.COM (11/06/88)

>The 88000 does even better.  Rather than requiring all instructions to
>contain several operations, each instruction *starts* one operation.  The
>target register (source for a store operation) is marked "busy" so that
>the next reference will wait for the operation to finish.  This
>allows the same parallelism as VLIW without wasting code memory bandwidth
>on empty slots.  From what I understand, the current chip does addition
>and logic in one instruction cycle (thus not parallelizing these operations),
>but load, store, multiply/divide, floating point use the scheme described.
>A neat advantage of the hardware bit is that the compiler does not need to
>know exact timings to ensure correct execution.  Timing data enhances
>optimization, but is not necessary to ensure correctness.
>
>I believe this technique is called "scoreboarding".
>
>A later version could parallelize short instructions also if instruction
>fetching became much faster than addition and logic.
>
>Stuart D. Gathman	<stuart@bms-at.uucp>
>			<..!{vrdxhq|daitc}!bms-at!stuart>

I have to be careful saying this, since I now work for Motorola,
but it should be obvious that scoreboarding cannot take you as far as
VLIW. Scoreboarding is an appropriate choice for the current level
of microprocessor technology, but any computer architect will tell
you that you eventually have to get past the one operation/cycle
dispatch limit (well, maybe not Norm Jouppi, at DEC, who published
an interesting paper titled something like "Superpipelined vs.
Superparallel Computers" in CAN a while back).

Scoreboarding lets you have multiple operations in flight at once,
but still, typically, you dispatch only one operation per instruction
cycle, which means that only one operation per cycle can complete -
and that puts a ceiling on throughput.
   To get faster, you either have to decrease the cycle time
or increase the number of operations dispatched/completed per
instruction cycle.

Note that scoreboarding doesn't even get you to 1 operation/cycle dispatch;
you still have stalls, when the register is busy.
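
A toy illustration of that stall in Python (my sketch of the busy-bit
idea only, not the 88000's actual scoreboard logic):

    # One-dispatch-per-cycle scoreboard: every register carries a
    # "ready at cycle N" mark; an operation cannot issue until its
    # sources and its destination are free, so the issue slot stalls
    # even though the functional units sit idle.

    def run(program):
        # program: list of (dest, sources, latency_in_cycles)
        ready_at = {}                     # reg -> cycle result is ready
        cycle = 0
        for dest, srcs, latency in program:
            operands = max([ready_at.get(r, 0) for r in srcs + [dest]],
                           default=0)
            cycle = max(cycle, operands)  # the stall, if any
            cycle += 1                    # one dispatch per cycle
            ready_at[dest] = cycle - 1 + latency
        return cycle

    # a 3-cycle load followed by an add that uses the loaded value:
    # the add issues two cycles later than pure 1/cycle dispatch allows
    print(run([("r1", ["r0"], 3),         # load r1 <- (r0)
               ("r2", ["r1"], 1)]))       # add  r2 <- r1 + ... (stalls)
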
   The next step past scoreboarding is Tomasulo instruction scheduling,
which lets you continue to dispatch instructions even though previous
instructions have not yet even received the data to begin execution.
Berkeley's Aquarius project was the last group I know of to try this.
Tomasulo scheduling seems to be a hard subject, but every group that
tries it makes it a little bit easier.
   Tomasulo on a one-operation-per-instruction set lets you approach
1 operation/instruction cycle dispatch.

Both scoreboarding and Tomasulo can be used to dispatch one or multiple
instructions per cycle, getting past the instruction dispatch limit.
This is just easier to do in a VLIW instruction set, where the operations
are guaranteed to be independent; it can be done for dispatch of multiple
possibly dependent operations/cycle, but it gets expensive.



Andy "Krazy" Glew. 
at: Motorola Microcomputer Division, Champaign-Urbana Development Center
    (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com     	    - preferred, if you have MX records
    aglew@fang.gould.com     	    - if you don't
    ...!uunet!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

PS. I promise to shorten this .signature as soon as our new mail paths are set.

aglew@urbsdc.Urbana.Gould.COM (11/11/88)

>..> Peter da Silva makes the plea for scoreboarding being better
>..> than VLIW, "because the next version of the machine is likely
>..> to have a different mix of instruction timings."
>
>The next version of the machine have different instruction timings?
>How? Remember, a VLIW is effectively executing microcode. Multiplies
>on a CISC processor, currently implemented in microcode, can be
>speeded up by adding a hardware multiplier. But adding a new
>functional unit like that to a VLIW effectively makes the instruction,
>err, control word longer. You'll have to re-compile to take advantage
>of it anyhow!
>
>Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
>          Snail Mail P.O. Box 92191 Lafayette, LA 70509              

Ummh, what is to prevent a VLIW from having a multiply instruction
(doesn't MULTIFLOW, Bob?).

And what is to prevent version 1 of the machine from having a multiply
instruction that does two bits at a time, taking ~16 cycles, while
version 2 does 8 bits at a time, taking ~4 cycles, for a 32-bit word?
    I.e., what is to prevent execution units that are already in the
VLIW from changing their timings?

Ditto especially for the memory system. 

VLIW and techniques such as scoreboarding are not mutually exclusive
- it is possible to combine them, although whether such is worthwhile
is the subject of ongoing research. So far, I'd guess that the evidence
says that VLIW+scoreboarding doesn't win you much performance, but
it can give you binary compatibility.

Binary compatibility will be an issue until (1) there is a standard
machine-independent distribution format for software; (2) there is
a standard way of handling multi-machine executables; and (3) the
problems of process migration between inhomogeneous machines are
either solved, or considered unimportant.



Andy "Krazy" Glew. 
at: Motorola Microcomputer Division, Champaign-Urbana Development Center
    (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com     	    - preferred, if you have MX records
    aglew@fang.gould.com     	    - if you don't
    ...!uunet!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

PS. I promise to shorten this .signature soon.

spectre@mit-vax.LCS.MIT.EDU (Joseph D. Morrison) (11/12/88)

In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:

>I have to be careful saying this, since I now work for Motorola,
>but it should be obvious that scoreboarding cannot take you as far as
>VLIW. Scoreboarding is an appropriate choice for the current level
>of microprocessor technology, but any computer architect will tell
>you that you eventually have to get past the one operation/cycle
>dispatch limit ...

(more info about VLIW and scoreboarding)

It seems to me that the issue of VLIW versus scoreboarding is the
wrong one to discuss.

Scoreboarding is but one of several techniques for managing a
pipeline.  (Some alternative techniques are micro-dataflow, simple
stalling, or letting the compiler stick no-ops in the right places.
The simple schemes can also be combined with "register bypass" to
improve pipeline performance.)

So I think we were actually arguing about "which is better for getting
parallelism: pipelining or VLIW?" Phrased that way, I think the answer
is obviously "use both".

If each of your functional units takes 4 cycles to perform its
operation, and you have a VLIW machine with 8 functional units, your
average throughput will be 2 instructions per cycle. The obvious thing
to do is to use pipelined functional units, and get the 8 instructions
per cycle you deserve :-)
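
The arithmetic, spelled out (back-of-the-envelope only):

    # 8 functional units, each operation taking 4 cycles
    units, latency = 8, 4
    print(units / latency)  # unpipelined: a unit accepts a new op only
                            # every 4th cycle -> 2.0 ops/cycle average
    print(units / 1)        # pipelined: a new op enters each unit every
                            # cycle -> 8 ops/cycle peak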

Naturally, as soon as you do this you will need some mechanism for
handling the various conflicts that occur when two instructions in the
pipeline want to use the same register. This is when you can use
scoreboarding, or whatever you want.

In fact, what better way to test pipeline strategies! With all those
functional units, the pipeline management will be pretty hairy...

        Joe Morrison
--
MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
Cambridge, Massachusetts, 02139         (617) 253-5881
--
"Back off, man; I'm a scientist!"

frisch@mfci.UUCP (Michael Frisch) (11/12/88)

In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>
>So I think we were actually arguing about "which is better for getting
>parallelism: pipelining or VLIW?" Phrased that way, I think the answer
>is obviously "use both".
>
>If each of your functional units takes 4 cycles to perform its
>operation, and you have a VLIW machine with 8 functional units, your
>average throughput will be 2 instructions per cycle. The obvious thing
>to do is to use pipelined functional units, and get the 8 instructions
>per cycle you deserve :-)
>
>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.
>
>In fact, what better way to test pipeline strategies! With all those
>functional units, the pipeline management will be pretty hairy...
>
This is already done in Multiflow's VLIW ... instructions which take
more than one cycle (floating add, floating multiply, memory references)
are pipelined, with the pipes exposed to the compiler.  So these operations
can each be initiated every cycle.  The pipelining is managed in software,
at compile time, rather than by a scoreboard at runtime.  It may be hairy,
but a) the compiler has much more information available to it than the limited
look-ahead of a scoreboard, b) the compiler can rearrange operations as
needed to keep the pipes full while the scoreboard can at best execute those
future operations which happen to be data-ready, and c) making the hardware
simpler (i.e., no scoreboard) makes the system more cost-effective.
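
To sketch what "managed in software" means, here is a toy Python
rendering (my illustration only; opcodes and latencies are invented,
and I ignore resource constraints, which the real compiler must model):

    # Compile-time scheduling with exposed pipes: the compiler knows
    # every latency up front and places each operation at the earliest
    # cycle its operands exist.  No scoreboard is consulted at runtime.
    LATENCY = {"load": 3, "fmul": 2, "fadd": 2}

    def schedule(ops):
        # ops: list of (opcode, dest, sources); assumes enough units
        value_at = {}                    # reg -> cycle its value exists
        placed = []
        for opcode, dest, srcs in ops:
            t = max([value_at.get(r, 0) for r in srcs], default=0)
            placed.append((t, opcode, dest))
            value_at[dest] = t + LATENCY[opcode]
        return placed

    prog = [("load", "a", ["p"]), ("load", "b", ["q"]),
            ("fmul", "c", ["a", "b"]), ("fadd", "d", ["c", "a"])]
    for t, opcode, dest in schedule(prog):
        print("cycle %d: %s -> %s" % (t, opcode, dest))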

(Someone asked about integer multiplies ... they're done in one cycle in
hardware already ... it's the flops and memory refs which gain from
pipelining.)

                                       Mike Frisch

-------------------------------------------------------------------------------
The opinions above are mine and not necessarily those of my employer.

colwell@mfci.UUCP (Robert Colwell) (11/13/88)

In article <28200234@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>>..> Peter da Silva makes the plea for scoreboarding being better
>>..> than VLIW, "because the next version of the machine is likely
>>..> to have a different mix of instruction timings."
>>
>>The next version of the machine have different instruction timings?
>>How? Remember, a VLIW is effectively executing microcode. Multiplies
>>on a CISC processor, currently implemented in microcode, can be
>>speeded up by adding a hardware multiplier. But adding a new
>>functional unit like that to a VLIW effectively makes the instruction,
>>err, control word longer. You'll have to re-compile to take advantage
>>of it anyhow!
>
>Ummh, what is to prevent a VLIW from having a multiply instruction
>(doesn't MULTIFLOW, Bob?).

Yes, we do have integer multiply (in fact, they're implemented in
those great big AMD/Cypress chips).

>    I.e., what is to prevent execution units that are already in the
>VLIW from changing their timings?

Not a thing.  And if you've made your compiler table-driven, where
all the significant pipe lengths and resource usages in the machine
reside in one table, it's not hard at all to retarget the compiler.
And further, it's relatively easy to experiment with what the effects
on performance would be if one could shorten a pipe here or add a
register file port there.
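
A hypothetical fragment of such a table, just to fix the idea (the
entries here are invented, not ours):

    # one place holds the pipe lengths and unit assignments; retargeting
    # the compiler, or running a what-if, is an edit here plus a rebuild
    MACHINE = {
        "fadd": {"pipe_stages": 2, "unit": "FALU"},
        "fmul": {"pipe_stages": 2, "unit": "FMUL"},
        "load": {"pipe_stages": 3, "unit": "MEM"},
    }
    MACHINE["load"]["pipe_stages"] = 2   # what if loads got a cycle faster?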

>VLIW and techniques such as scoreboarding are not mutually exclusive
>- it is possible to combine them, although whether such is worthwhile
>is the subject of ongoing research. So far, I'd guess that the evidence
>says that VLIW+scoreboarding doesn't win you much performance, but
>it can give you binary compatibility.

Yes, that's one way to get binary compatibility, similar in spirit to
the Vax's 11-compatibility mode.  Kinda high price to pay, though,
considering I could have used that hardware to buy higher performance
on recompiled code for the same hardware cost (and less design time,
since the scoreboard would be no trivial design to get right.)

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

colwell@mfci.UUCP (Robert Colwell) (11/13/88)

In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>It seems to me that the issue of VLIW versus scoreboarding is the
>wrong one to discuss.
>
>Scoreboarding is but one of several techniques for managing a
>pipeline.  (Some alternative techniques are micro-dataflow, simple
                                             ^^^^^^^^^^^^^^
Would you elaborate a little on that?  Never heard of it.

>stalling, or letting the compiler stick no-ops in the right places.
>The simple schemes can also be combined with "register bypass" to
>improve pipeline performance.)

We do register bypassing, and it's not free in terms of gates in the
register files, but it's worthwhile.

>So I think we were actually arguing about "which is better for getting
>parallelism: pipelining or VLIW?" Phrased that way, I think the answer
>is obviously "use both".

We did, so I don't see any argument here.

>If each of your functional units takes 4 cycles to perform its
>operation, and you have a VLIW machine with 8 functional units, your
>average throughput will be 2 instructions per cycle. The obvious thing
>to do is to use pipelined functional units, and get the 8 instructions
>per cycle you deserve :-)

We do.  If you put in a functional unit that requires 4 cycles to 
complete, and you DON'T pipeline it, then your first machine will be
your last, because nobody will buy it; the performance will be too low.
The question is, does the compiler manage the pipes, or do you devote
complicated runtime hardware to the task?

>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.

We let the compiler do it.  The only reason to make the hardware do it
is to try to handle object code compatibility across different pipeline
latencies.  See other recent articles for more on this.

>In fact, what better way to test pipeline strategies! With all those
>functional units, the pipeline management will be pretty hairy...

So if you do it in software, you get a wrong answer, you fix your tables,
and recompile the compiler (not that that's ever happened to us, you
understand :-)).  And if you do it in hardware, you respin the chip at 
enormous expense and then wait for the next time.

aglew@urbsdc.Urbana.Gould.COM (11/14/88)

>It seems to me that the issue of VLIW versus scoreboarding is the
>wrong one to discuss.
>
>Scoreboarding is but one of several techniques for managing a
>pipeline.  (Some alternative techniques are micro-dataflow, simple
>stalling, or letting the compiler stick no-ops in the right places.
>The simple schemes can also be combined with "register bypass" to
>improve pipeline performance.)
>
>        Joe Morrison
>
>MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
>545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
>Cambridge, Massachusetts, 02139         (617) 253-5881

This is pedantic, but "managing a pipeline" is overly restrictive.
What you mean is managing an instruction and resource scheduling
system, of which a pipeline is only one possibility. To me,
pipeline implies sequentiality - saying "pipe network" lets you
get out of order, but I'd still prefer a better term.

stevew@nsc.nsc.com (Steve Wilson) (11/15/88)

In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>Naturally, as soon as you do this you will need some mechanism for
>handling the various conflicts that occur when two instructions in the
>pipeline want to use the same register. This is when you can use
>scoreboarding, or whatever you want.
This is where the discussion comes from.  VLIW advocates letting the 
compiler manage both the pipe and register utilization since the 
compiler knows about global resource utilization where a scoreboard 
doesn't.

Steve Wilson 
National Semiconductor
[The above opinion is mine, not that of my employer. ]

spectre@mit-vax.LCS.MIT.EDU (Joseph D. Morrison) (11/16/88)

In article <556@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>>Scoreboarding is but one of several techniques for managing a
>>pipeline.  (Some alternative techniques are micro-dataflow, simple
>                                             ^^^^^^^^^^^^^^
>Would you elaborate a little on that?  Never heard of it.

Micro-dataflow is an interesting pipeline management mechanism that
was used in the IBM 360/91 computer. (I think IBM used it in one other
machine as well, but I can't remember which.)

The idea was that instructions could be started out of order, and could
finish out of order, with a "reservation station" making sure that no
dependencies were violated.

If you are interested, here are the details! (If not, hit 'n' now!)

What are reservation stations?
==============================
The CPU has several of these "reservation stations" in its hardware.
Each reservation station is either:

	(a) empty, or
	(b) contains an instruction to be performed as soon as
	    possible, in the form:

        +----+------------------+------------------+------------+
        | OP | TAG/DATA (src 1) | TAG/DATA (src 2) | TAG (dest) |
        +----+------------------+------------------+------------+

	i.e. an operation, two source operands (each of which can be
	     a valid datum or a tag), and one destination operand which
	     is always a tag.

The Instruction-Processing Algorithm
====================================
You can think of this as three processes running "in parallel" on the
hardware.  (Assume, for simplicity, that all instructions are in the
form (OP SRC1 SRC2 DEST).  Assume also that all sources and destinations
are registers.)

PROCESS 1
=========
PROCESS-1 continually loads the reservation stations.  Every time a
reservation station becomes free, PROCESS-1 loads the next instruction
from memory into the free reservation station.

To load an instruction into a reservation station:

	- Say the instruction was (+ R0 R0 R1)

	- First, generate a new tag for the solution (which will go in
	  R1). (Say we generate the tag "q".)

	- Stick the tag "q" in register R1.

	- Next, check if R0 contains data or a tag. If it contains data
	  (say the number 53) put the following in the reservation
	  station:
		(+  53 53 q)
	  If R0 contained tag "p", we would put:
		(+  p p q)
	  into the reservation station.

	- IN GENERAL:
		- A tag is always generated for the destination of each
		  instruction.

		- This tag is always placed in BOTH the destination
		  register, and in the "destination" position of the
		  reservation station.

		- The operands are copied right into the reservation
		  station if they are available, but if a tag is in an
		  operand register, that tag is placed in the
		  reservation station instead.

		- Informally, tags represent "data in transit".
PROCESS 2
=========
PROCESS-2 dispatches instructions from the reservation stations.
PROCESS-2 is continually scanning the reservation stations, checking for
stations in which both source operands contain valid data.  If one is
found, the operation is "shoved into" the pipelined ALU, along with the
tag for the result.

For example, if a station contained (+ 53 53 q), the dispatcher would
shove [+ 53 53, q] into the ALU.  (The ALU goes off and does its thing,
and when it's done, the pair [106, q] will appear on the bus.)

If PROCESS-2 finds something like (+ p p q) in a reservation station,
it just ignores that station for the time being.

Every time an instruction is dispatched from a reservation station, that
station becomes empty and can be refilled by PROCESS-1.

PROCESS 3
=========
This process "watches" the bus where results come out, looking for
[result, tag] pairs.  Whenever it sees such a pair, PROCESS-3 finds all
places (in the register file and in the reservation stations) with that
tag, and shoves the result into all those places.

Discussion
==========
Notice that this is a "micro" version of a dataflow machine.

Micro-dataflow gives you as much parallelism as is permitted by any
reordering of the instructions, while properly handling any data
dependencies, BUT: degenerates to a normal interlocked pipeline if

	(a) the reservation stations become full, or
	(b) it runs out of tags

Write/write conflicts are elegantly handled, as in an ordinary dataflow
machine; if two instructions both write to R0, the result slot of each
instruction is still allocated a different tag, and the last instruction
is the one whose tag ends up in R0.  Thus "the right thing will happen".

It's really a very clever scheme!  The only bug?  It's not worth the
hardware cost.  You need associative lookup to handle copy-back of
results into the registers and reservation stations, and that takes chip
area.  You need a tag generator, and some extra datapaths...  And unless
you use a large reservation station, you really don't find that much
extra parallelism.
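
For the terminally curious, the whole scheme compresses into a small
runnable Python sketch (entirely my own rendering of the three
processes; I assume one uniform latency, as many ALUs as there are
ready stations, and an idealized result bus):

    from itertools import count
    from operator import add, mul

    def tomasulo(program, nstations=3, latency=2):
        # registers hold real data (ints) or tags ("data in transit")
        regs = {"R0": 53, "R1": 0, "R2": 0, "R3": 0}
        tags = count()
        stations = []               # (op, src1, src2, dest_tag)
        alu = []                    # in flight: (done_cycle, result, tag)
        pc, cycle = 0, 0
        while pc < len(program) or stations or alu:
            cycle += 1
            # PROCESS 1: load the next instruction into a free station;
            # the destination register gets a freshly generated tag (a
            # second write to the same register just gets a newer tag,
            # which is how write/write conflicts sort themselves out)
            if pc < len(program) and len(stations) < nstations:
                op, s1, s2, dest = program[pc]; pc += 1
                dtag = "t%d" % next(tags)
                stations.append((op, regs[s1], regs[s2], dtag))
                regs[dest] = dtag
            # PROCESS 2: dispatch every station whose operands are data
            for st in [s for s in stations if isinstance(s[1], int)
                                          and isinstance(s[2], int)]:
                stations.remove(st)
                op, a, b, dtag = st
                alu.append((cycle + latency, op(a, b), dtag))
            # PROCESS 3: broadcast finished [result, tag] pairs to every
            # register and reservation station holding that tag
            for entry in [e for e in alu if e[0] <= cycle]:
                alu.remove(entry)
                _, res, tag = entry
                regs = {r: (res if v == tag else v)
                        for r, v in regs.items()}
                stations = [(op, res if a == tag else a,
                                 res if b == tag else b, d)
                            for op, a, b, d in stations]
        return regs, cycle

    prog = [(add, "R0", "R0", "R1"),  # R1 = 53+53 = 106 (the "q" above)
            (mul, "R1", "R1", "R2"),  # waits in its station for R1
            (add, "R0", "R0", "R3")]  # independent: done before the mul
    print(tomasulo(prog))  # R1=106, R2=11236, R3=106, in 6 cycles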

Details on the 360/91 system can be found in the following article:

Anderson, Sparacio and Tomasulo; "The IBM System/360 Model 91: Machine
Philosophy and Instruction-Handling", IBM Journal, January 1967.
(pp.8-24)

        Joe Morrison
--
MIT Laboratory for Computer Science     UUCP: ...!mit-eddie!vx!spectre
545 Technology Square, Room 425         ARPA: spectre@vx.lcs.mit.edu
Cambridge, Massachusetts, 02139         (617) 253-5881
--
"Back off, man; I'm a scientist!"

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (11/16/88)

In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>Micro-dataflow is an interesting pipeline management mechanism that
>was used in the IBM 360/91 computer.

I think that this is more commonly known as Tomasulo instruction
scheduling.  There was a study, a few years ago, showing that a Cray-1
would have had higher throughput if it had used this method.

This system is essentially the high-price/high-win version of a
scoreboard. Many modern systems have chosen to go with compile-time
scheduling, some retaining a few hardware interlocks, some not.

The argument is actually deeper than just fancy compilers versus fancy
(or self-reliant) hardware. There are two basic issues.

The first issue is branches. They happen very often, and the hardware
solutions don't mind. The innovation that made VLIW possible was a
compiler innovation for scheduling in the presence of branches.  It works
well in certain kinds of code: only Multiflow has much understanding
about how well it works on the rest of the code.

The second issue is cycle counts and synchronization. It used to be
common for instructions to take a data-dependent number of clocks.  For
example, a multiply by a small number would run faster than a multiply by
a big number. Also, there were machines with asynchronous units: they
were done when they were done, and that was that. (The latest buzzword is
"self timed circuits", but they aren't necessarily like that.)
All in all, the hardware solutions coped fine with all this. The compilers
give up and rely on fond hopes.

There are several reasons that data-dependent instruction timing has come
to disfavor. For one, hardware interlocks only look ahead just so far,
and are rarely as clever as the Tomasulo scheme. So, the compilers were
generating code that interlocked a lot. By making the machines more
predictable, we've made it possible for compilers to compare possible
overlap sequences, and compute - at compile time - which will run faster.

That still leaves conditional branches. The approach of HEP was
straightforward enough: run someone else as a crack-stuffer. I wonder
what the follow-on will look like.

-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

aglew@urbsdc.Urbana.Gould.COM (11/16/88)

>/* Written  3:41 pm  Nov 14, 1988 by stevew@nsc.nsc.com in urbsdc:comp.arch */
>In article <5087@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>>In article <28200228@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>>Naturally, as soon as you do this you will need some mechanism for
>>handling the various conflicts that occur when two instructions in the
>>pipeline want to use the same register. This is when you can use
>>scoreboarding, or whatever you want.
>This is where the discussion comes from.  VLIW advocates letting the 
>compiler manage both the pipe and register utilization since the 
>compiler knows about global resource utilization where a scoreboard 
>doesn't.
>
>Steve Wilson 
>National Semiconductor
>[The above opinion is mine, not that of my employer. ]

I normally don't bother, but I'd like to point out that the quote
is not mine.

henry@utzoo.uucp (Henry Spencer) (11/17/88)

In article <5097@mit-vax.LCS.MIT.EDU> spectre@mit-vax.UUCP (Joseph D. Morrison) writes:
>It's really a very clever scheme!  The only bug?  It's not worth the
>hardware cost...

There is also a small problem with trying to make the situation look
sensible when an interrupt or a trap strikes.  I don't think any of
those beasts had an MMU (too old), but taking a page fault in a system
like that would really be something.  And you thought the 68020's "stack
puke" was bad...

sher@sunybcs.uucp (David Sher) (11/23/88)

This is just an idea that has been floating around my mind for some time.
The CMU (and now perhaps GE) WARP is an MIMD systolic array full of 
powerful pipelined processors.  Its microinstruction set is designed to be
as orthogonal as possible.  So is the WARP a good candidate for VLIW
techniques?  The architecture is a bit regular for such, but that may not
be a disadvantage.  I was considering doing some research along those
lines myself, but I find myself too busy to do that for a few years.
-David Sher
ARPA: sher@cs.buffalo.edu	BITNET: sher@sunybcs
UUCP: {rutgers,ames,boulder,decvax}!sunybcs!sher

ian@armada.UUCP (Ian L. Kaplan) (11/24/88)

In article <2828@cs.Buffalo.EDU>, sher@sunybcs.uucp (David Sher) writes:
> This is just an idea that has been floating around my mind for some time.
> The CMU (and now perhaps GE) WARP is an MIMD systolic array full of 
> powerful pipelined processors.  Its microinstruction set is designed to be
> as orthogonal as possible.  So is the WARP a good candidate for VLIW 
> techniques?  The architecture is a bit regular for such, but that may not
> be a disadvantage.  I was considering doing some research along those
> lines myself, but I find myself too busy to do that for a few years.


  The Warp has a compiler for a CMU developed language named W2.  This
  compiler does, in fact, use some VLIW techniques.  The work on the compiler
  was done by Monica Lam, Thomas Gross and their colleagues.  It is
  described in Dr. Lam's Ph.D. thesis ("A Systolic Array Optimizing
  Compiler", Monica Sin-Ling Lam, May 1987, CMU-CS-87-187).

  The Warp is an interesting machine, but its technology is several years old.
  CMU is working with Intel on a next-generation machine, known as the iWarp.
  Last I heard, the iWarp would be a parallel processor with 72 PEs, arranged
  in a grid (I was tempted to write 2-D systolic array here, but just as
  the original Warp is much more flexible than a simple pipeline systolic
  array, the iWarp will be much more flexible than a simple 2-D systolic 
  array).  I expect that some interesting work on parallel programming 
  languages and environments will arise out of the iWarp project.  My guess
  is that there may end up being some similarities between the languages used
  to program the iWarp and the languages that are (or could be) used to
  program the Connection Machine.

  As far as work on VLIW goes, I would start soon, if I were you.  RISC
  is passé.  The next high-performance microprocessor architecture will
  be VLIW.

                               Ian Kaplan
                               MassPar Inc.
 
                    I speak for myself and no one else.

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (11/26/88)

I've had trouble responding to inquiries by mail, so, a posting.
The study that simulated a Cray-1 with Tomasulo interlocks was:

"Instruction Issue Logic for High-Performance Interruptable
Pipelined Processors", 14th Annual Int'l Symposium on Computer
Architecture (also Computer Arch. News vol.15 #2 June 1987) P.27

The original:
"An Efficient Algorithm For Exploiting Multiple Arithmetic Units",
R. Tomasulo, IBM Journal of R&D Jan 1967, p. 25
-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

cs4342ac@evax.arl.utexas.edu (Ytivitaler) (02/20/91)

       I am looking for any information on books, articles, etc., on
  VLIW computers.   Any help would be appreciated.
 
                         Thanx,  Fred