[comp.arch] VLIW Architecture

cs4g6ad@maccs.dcss.mcmaster.ca (Custeau RD) (09/27/89)

   I am looking for references or information on VLIW architectures for
a fourth-year architecture seminar.  Any help would be greatly appreciated.

   Please reply by e-mail, as I don't often read the news.

   Thanks in advance!

cutler@mfci.UUCP (Ben Cutler) (09/29/89)

In article <251FCB3F.12366@maccs.dcss.mcmaster.ca> cs4g6ad@maccs.dcss.mcmaster.ca (Custeau     RD) writes:
>
>   I am looking for references or information on VLIW architectures for
>a fourth-year architecture seminar.  Any help would be greatly appreciated.
>
>   Please reply by e-mail, as I don't often read the news.
>
>   Thanks in advance!


A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
by John Ellis, which won the ACM Doctoral Dissertation Award in 1985.  It's
available from MIT Press.  For information on a commercial VLIW
implementation, contact Multiflow Computer at (203) 488-6090.

slackey@bbn.com (Stan Lackey) (09/29/89)

In article <1050@m3.mfci.UUCP> cutler@mfci.UUCP (Ben Cutler) writes:
>In article <251FCB3F.12366@maccs.dcss.mcmaster.ca> cs4g6ad@maccs.dcss.mcmaster.ca (Custeau     RD) writes:
>>   I am looking for references or information on VLIW architectures for
>>a fourth-year architecture seminar.  Any help would be greatly appreciated.
>A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',

It's a great work.  Nearly all of the attention is on the compiler;
it doesn't get into hardware or OS issues.  A main difference between
the book and what Multiflow ended up shipping is in the memory
interface.  The book expected a separate path from each main data path
in the CPU to its corresponding bank of memory, with relatively slow
transfer paths in between.  Thus, for peak performance, the compiler
would have to be able to predict which memory bank a datum is in, and
give the instruction to the corresponding CPU.

In many cases, this is not a problem, as scalar variables (in FORTRAN
anyway) have their addresses known at compile time, and of course the
bases of all arrays are known.  But apparently calculated indices into
arrays are common enough to be a problem.  There was much attention
paid to this "memory bank disambiguation" in the book, which even
included a syntax for allowing the user to give hints to the compiler
to aid in disambiguation.
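
As an illustrative sketch (in C, with the bank count and word size chosen
arbitrarily, not taken from the book or from Multiflow hardware): the bank
holding a datum is a simple function of its address, so a constant address
folds to a known bank at compile time, while a computed index leaves the
bank unknowable until runtime.

```c
#include <stdint.h>

/* Hypothetical interleaved-memory model: memory striped word-by-word
 * across NBANKS banks.  NBANKS and the 8-byte word are assumptions. */
#define NBANKS 8
#define WORD   8

unsigned bank_of(uint64_t addr) {
    return (addr / WORD) % NBANKS;
}

/* For a scalar or a constant index, the address is a compile-time
 * constant and the compiler can fold bank_of() itself.  For a computed
 * index k, the bank is (base/WORD + k) % NBANKS -- runtime only. */
unsigned bank_of_element(uint64_t base, uint64_t k) {
    return bank_of(base + k * WORD);
}
```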

From what I could tell from looking at Multiflow's released
documentation, Multiflow eliminated this problem with a kind of a
"distributed crossbar" (as I would call it) between the CPUs and the
memory banks.  This allows the disambiguation to be resolved at
runtime, and answered my main criticism of the technology.

Another difference is that the prototype compiler for the thesis was
written in Lisp.  I think Multiflow ended up using C.

-Stan
Not affiliated with Multiflow except as an interested observer.  Not
necessarily the views of BBN either.

cik@l.cc.purdue.edu (Herman Rubin) (10/01/89)

In article <1050@m3.mfci.UUCP>, cutler@mfci.UUCP (Ben Cutler) writes:
> In article <251FCB3F.12366@maccs.dcss.mcmaster.ca> cs4g6ad@maccs.dcss.mcmaster.ca (Custeau     RD) writes:
| >
| >   I am looking for references or information on VLIW architectures for
| >a fourth-year architecture seminar.  Any help would be greatly appreciated.
  
> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985.  It's
> available from MIT Press.  For information on a commercial VLIW
> implementation, contact Multiflow Computer at (203) 488-6090.

As one who always finds ways to use the architecture that the compiler
writers did not think about, I maintain that this book helps little.

Why is it the case that people in the computing field think that someone
can understand a computer in ignorance of its instruction set and the
temporal operation of those instructions?
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

rodman@mfci.UUCP (Paul Rodman) (10/02/89)

In article <1626@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>  
>> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
>> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985.  It's
>> available from MIT Press.  For information on a commercial VLIW
>> implementation, contact Multiflow Computer at (203) 488-6090.
>
>As one who always finds ways to use the architecture that the compiler
>writers did not think about, I maintain that this book helps little.

Helps little for what?  Compilers ALWAYS involve tradeoffs, as does
hardware, as does any engineering endeavor.  I would be surprised if you
_couldn't_ find ways to use computers that the compiler writers didn't
think about (or were too stupid to realize, or decided to blow off...).

IMHO, this book is _extremely_ well written, and filled with good ideas.
When I first read it there were some things I disagreed with, but I was
very impressed with it overall. 

>
>Why is it the case that people in the computing field think that someone
>can understand a computer in ignorance of its instruction set and the
>temporal operation of those instructions?

What do you mean by "understand"? The Bulldog effort broke important
ground, period. Throwing stones at it because it didn't cover everything
is silly. It was a phd thesis, not a product!

Believe me, in order to generate code for a VLIW you _do_ need to understand
the instruction set and the temporal operation of those instructions!
But the kind of stuff Ellis talks about is at a higher level and
has to come first.  Obviously the best product will try to integrate
things from the lowest gate level back to the phase-one front end of the
compiler, but it is human nature to try to compartmentalize things.

Too _many_ folks have designed computers and focused on the low-level
instruction set / pipelining issues without having a clue as to
how the compiler will actually find a way to use them.  When it comes time
to emit a .o file, hand waving won't work and bad algorithms will
fall apart on somebody's code.

Written any compilers lately? 

>-- 
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

Paul K. Rodman / KA1ZA /   rodman@multiflow.com
Multiflow Computer, Inc.   Tel. 203 488 6090 x 236
Branford, Ct. 06405
    

hankd@pur-ee.UUCP (Hank Dietz) (10/03/89)

In article <1626@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <1050@m3.mfci.UUCP>, cutler@mfci.UUCP (Ben Cutler) writes:
>> In article <251FCB3F.12366@maccs.dcss.mcmaster.ca> cs4g6ad@maccs.dcss.mcmaster.ca (Custeau     RD) writes:
>| >
>| >   I am looking for references or information on VLIW architectures for
>| >a fourth-year architecture seminar.  Any help would be greatly appreciated.
>  
>> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
>> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985.  It's
>> available from MIT Press.  For information on a commercial VLIW
>> implementation, contact Multiflow Computer at (203) 488-6090.
>
>As one who always finds ways to use the architecture that the compiler
>writers did not think about, I maintain that this book helps little.

VLIW machine compilers essentially search for optimal schedules of
instructions; I don't see how a full-width search could be omitting things
that Dr. Rubin would want to do.  ;-)
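
As a toy illustration of the kind of search such a scheduler performs (the
operation count, latencies, dependences, and two-wide instruction word are
all made up for the example), a greedy list scheduler packs ready operations
into wide instruction words:

```c
/* Toy list scheduler: N ops, each with a latency and a dependence mask.
 * Ops are packed into WIDTH-wide "long instruction words"; an op issues
 * in the first cycle in which all predecessors have completed and a
 * slot is free.  Purely illustrative, not any real compiler's method. */
#define N 5
#define WIDTH 2

int schedule(int cycle_of[])
{
    int latency[N] = {2, 1, 1, 2, 1};
    /* deps[i] = bitmask of ops that must complete before op i issues */
    int deps[N] = {0, 0, 1 << 0, (1 << 1) | (1 << 2), 1 << 3};
    int done_at[N], issued = 0, cycle = 0;
    for (int i = 0; i < N; i++)
        done_at[i] = -1;                  /* -1 means not yet issued */
    while (issued < N) {
        int slots = WIDTH;
        for (int i = 0; i < N && slots > 0; i++) {
            int ready = (done_at[i] < 0);
            for (int p = 0; ready && p < N; p++)
                if ((deps[i] >> p) & 1)
                    if (done_at[p] < 0 || done_at[p] > cycle)
                        ready = 0;       /* result not available yet */
            if (ready) {
                cycle_of[i] = cycle;
                done_at[i] = cycle + latency[i];
                issued++;
                slots--;
            }
        }
        cycle++;
    }
    return cycle;   /* first cycle after the last issue */
}
```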

>Why is it the case that people in the computing field think that someone
>can understand a computer in ignorance of its instruction set and the
>temporal operation of those instructions?

The primary reason humans have done so well is not that we are so much
better than other life forms, but rather that we build much better tools.
To accomplish any large task, a human must be able to be ignorant of at least
some details -- VLIW compiler technology is a prime example of a mechanism
keeping track of, and optimizing use of, a structure too complex for humans
to manage directly.

I suppose you'd rather not use any tools...  but then why do you want to use
the tools called "computers?"

						-hankd@ecn.purdue.edu

>-- 
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>Phone: (317)494-6054
>hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

PS: Dr. Rubin's views do not represent those of Purdue University.  ;-)

cik@l.cc.purdue.edu (Herman Rubin) (10/03/89)

In article <13038@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <1626@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

> >As one who always finds ways to use the architecture that the compiler
> >writers did not think about, I maintain that this book helps little.
> 
> VLIW machine compilers essentially search for optimal schedules of
> instructions; I don't see how a full-width search could be ommitting things
> that Dr. Rubin would want to do.  ;-)
> 
> >Why is it the case that people in the computing field think that someone
> >can understand a computer in ignorance of its instruction set and the
> >temporal operation of those instructions?

What kind of integer arithmetic is there in hardware?  One example of the
problem is which kinds of integer products can be formed in one operation.
Some machines can form long x long -> double long in one operation, and
some cannot.  Some allow this to be done unsigned and some do not.  Some
machines allow double/single getting quotient and remainder in one operation
and some do not.  Some allow this to be unsigned and some do not.

Some machines have hardware square root and some do not.  A few allow 
fixed-point arithmetic, but most do not.  Different machines have different
addressing modes; vector operations should be coded differently depending on
that.  I have profitably used bit operations on floating-point numbers; this
means that I must know the exact representation.
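
The operations in question can at least be written down portably; whether
each one compiles to a single hardware instruction depends entirely on the
target, which is exactly the complaint.  A C sketch, assuming nothing beyond
the standard library:

```c
#include <stdint.h>
#include <stdlib.h>

/* 32 x 32 -> 64 widening multiply (signed).  Some machines do this in
 * one instruction; on others the compiler must synthesize it. */
int64_t widening_mul(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

/* Quotient and remainder as one conceptual operation.  Many targets
 * compute both in a single divide; ldiv() exposes the pair. */
void divmod(long n, long d, long *q, long *r) {
    ldiv_t t = ldiv(n, d);
    *q = t.quot;
    *r = t.rem;
}
```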

How do you code the following:  look at one bit and transfer if it is a 1,
setting pointers so that that bit will not be looked at again?  In the cases
where I envision using this, it is a tool and will be used frequently.  
Another such is the spacing between 1's in a bit stream, and in this case 
there are alternatives.
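
One reading of the bit operation described here, sketched in portable C
(the scan loop is what a machine without a bit-scan instruction must do;
some machines can do the whole thing in one or two operations):

```c
#include <stdint.h>

/* "Look at one bit, transfer if it is a 1, and never look at it again":
 * find the index of the lowest set bit of a word and clear that bit.
 * The scan also yields the spacing between consecutive 1's. */
int take_lowest_one(uint64_t *w) {
    if (*w == 0)
        return -1;                   /* no 1's left */
    int i = 0;
    while (!((*w >> i) & 1))
        i++;                         /* distance to the next 1 */
    *w &= *w - 1;                    /* clear the lowest set bit */
    return i;
}
```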

What does your HLL have to say about these?

> The primary reason humans have done so well is not that we are so much
> better than other life forms, but rather that we build much better tools.
> To accomplish any large task, a human must be able to be ignorant of at least
> some details -- VLIW compiler technology is a prime example of a mechanism
> keeping track of, and optimizing use of, a structure too complex for humans
> to manage directly.

The human knows when not to use the tools provided.  Are you so dead sure that
I cannot manage such structures better than the compiler?  The most complicated
instruction set I have seen is MUCH simpler than HLLs.  The compiler maps a
route on Interstates; I find a short cut on good roads.  Will an autodriver
let you back the car out of the garage so you can clean the garage?

> I suppose you'd rather not use any tools...  but then why do you want to use
> the tools called "computers?"

Use the best tool for the user and the job.  The crude mechanic must put the
car on the analyzer and use what it tells him.  There are a few good mechanics
who can do better listening to the engine.
	
> 						-hankd@ecn.purdue.edu
> 
> PS: Dr. Rubin's views do not represent those of Purdue University.  ;-)

PPS:  Neither do those of Dr. Dietz.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

hankd@pur-ee.UUCP (Hank Dietz) (10/04/89)

In article <1629@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
...[numerous "why don't HLLs let me say..." flames omitted]...
>What does your HLL have to say about these?

What HLL?  What does this have to do with VLIW techniques?  Dr. Rubin is
complaining about languages -- but we are talking about VLIW compiler
analysis and program transformation technology, not language design.

I hope that this confusion is not common.  Perhaps a lot of compilers do
blindly spit-out the "obvious" code for badly-designed language constructs,
but that certainly isn't the state of the art.  I would think that a person
who has spent some time counting T-states would really appreciate the VLIW
work that Ellis presents in his award-winning PhD thesis...  I know I do.

>... Are you so dead sure that
>I cannot manage such structures better than the compiler?  The most complicated
>instruction set I have seen is MUCH simpler than HLLs.  ...

VLIW technology is complex because of its use of parallelism -- it has very
little to do with instruction set complexity issues.  Generating good code
for a VLIW is most like microcode scheduling/compaction for a huge,
asymmetric, microcoded machine.  You really don't even want to try it by
hand...  well, I know I don't.  And why bother?  VLIW compiler techniques
come very close to optimal schedules every time.  I can't match it, let
alone beat it.

						-hankd@ecn.purdue.edu

"A good workman is known by his tools."
Intro to Chapter 12, F. Brooks, "The Mythical Man-Month," p. 127.

cik@l.cc.purdue.edu (Herman Rubin) (10/04/89)

In article <13050@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <1629@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> ...[numerous "why don't HLLs let me say..." flames omitted]...
> >What does your HLL have to say about these?
> 
> What HLL?  What does this have to do with VLIW techniques?  Dr. Rubin is
> complaining about languages -- but we are talking about VLIW compiler
> analysis and program transformation technology, not language design.

If the language the compiler compiles does not have the operations to be used
in my program, I cannot use them.  I have seen NO language designed for 
the efficient use of hardware operations.  The choice of the algorithm 
depends on the hardware, and there can be lots of different algorithms.

An example: most of the efficient ways of generating non-uniform random 
variables are acceptance-rejection.  This is only partly vectorizable on
CRAYs, and is especially difficult on the CRAY 1.  This is due to the
non-existence of some hardware instructions.  It would be necessary to
try out a number of computationally less efficient methods to find the
ones which are best on the machine.  I would not consider any of them on
the CYBER 205, as they are obviously worse.

> I hope that this confusion is not common.  Perhaps a lot of compilers do
> blindly spit-out the "obvious" code for badly-designed language constructs,
> but that certainly isn't the state of the art.  I would think that a person
> who has spent some time counting T-states would really appreciate the VLIW
> work that Ellis presents in his award-winning PhD thesis...  I know I do.
> 
> >... Are you so dead sure that
> >I cannot manage such structures better than the compiler?  The most complicated
> >instruction set I have seen is MUCH simpler than HLLs.  ...
> 
> VLIW technology is complex because of its use of parallelism -- it has very
> little to do with instruction set complexity issues.  Generating good code
> for a VLIW is most like microcode scheduling/compaction for a huge,
> asymmetric, microcoded machine.  You really don't even want to try it by
> hand...  well, I know I don't.  And why bother?  VLIW compiler techniques
> come very close to optimal schedules every time.  I can't match it, let
> alone beat it.

What you are saying is that an instruction scheduler can do better than a
human.  This is mostly true.  It is necessary to allow human override, as
the scheduler only takes into account what its designers thought was 
of value.

But this does not answer my complaint.  I need to know these instructions
to make use of them.  I certainly can use information about their timing
and overlap to select those algorithms which will run faster.  I do not
insist that an assembler carry out the instructions in the order I present
them, as long as the program logic is maintained.  This is what an 
optimizer does, and I do not oppose this.

I do not even object to the optimizer-compiler-scheduler even suggesting
alternates which are more efficient, but which it cannot be sure will work.
One of our programmers asked me whether some transfers could be deleted
from code.  It was possible by essentially duplicating some code, and
making inessential changes in the program.  I doubt a compiler could
figure this out.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

preston@titan.rice.edu (Preston Briggs) (10/04/89)

In article <1630@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>One of our programmers asked me whether some transfers could be deleted
>from code.  It was possible by essentially duplicating some code, and
>making inessential changes in the program.  I doubt a compiler could
>figure this out.

Steve Carr showed me this transformation, which can be done automatically
(and *is* done, in a local experimental compiler).
This is slightly contrived, for expository simplicity.

A simple DO-loop

	DO i = 1, 2*n
	    x(i) = f(y(i))
	    z(i) = g(x(i-1))
	ENDDO

where x, y, and z are arrays and f and g are arbitrary functions.
Note the reuse of each x(i) value one iteration after it has been set.
We can take advantage of this reuse

	x1 = x(0)
	DO i = 1, 2*n
	    x0 = f(y(i))
	    x(i) = x0
	    z(i) = g(x1)
	    x1 = x0
	ENDDO

This is nice because we save a memory access per iteration.
(Assuming a smart backend (normal optimizing compiler) that
 will allocate x0 and x1 to registers.)
The problem is that we'll end up with an extra reg-reg copy
(x1=x0) at the end of the loop.
This can be avoided by unrolling slightly to subsume the copy.

	x1 = x(0)
	DO i = 1, n
	    x0 = f(y(i))
	    x(i) = x0
	    z(i) = g(x1)
	    x1 = f(y(i+1))
	    x(i+1) = x1
	    z(i) = g(x0)
	ENDDO

The first transformation, dependence-based register allocation,
is due to Randy Allen and Ken Kennedy.  The second transform,
unrolling to subsume copies, is in Kennedy's thesis (1971).
Both have been implemented at Rice by Steve Carr.

Of course, this is a simple example; but these transformations
(with others of a similar nature) can give tremendous improvements
in real code.

What's the point?  Compilers are getting better.  Humans had to
come up with the original ideas (certainly compiler writers learn
from human coders), but a compiler can apply the ideas repeatedly,
without human effort, to amazingly nasty code.

Regards,
Preston Briggs

preston@titan.rice.edu (Preston Briggs) (10/05/89)

In article <1914@brazos.Rice.edu> 
preston@titan.rice.edu (Preston Briggs) writes:

>	DO i = 1, 2*n
>	    x(i) = f(y(i))
>	    z(i) = g(x(i-1))
>	ENDDO

becomes

>	x1 = x(0)
>	DO i = 1, 2*n
>	    x0 = f(y(i))
>	    x(i) = x0
>	    z(i) = g(x1)
>	    x1 = x0
>	ENDDO

becomes

>	x1 = x(0)
>	DO i = 1, n
>	    x0 = f(y(i))
>	    x(i) = x0
>	    z(i) = g(x1)
>	    x1 = f(y(i+1))
>	    x(i+1) = x1
>	    z(i) = g(x0)
>	ENDDO


but this last is wrong.
It should be

	x1 = x(0)
	DO i = 1, 2*n, 2
	    x0 = f(y(i))
	    x(i) = x0
	    z(i) = g(x1)
	    x1 = f(y(i+1))
	    x(i+1) = x1
	    z(i+1) = g(x0)
	ENDDO
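
The corrected loop above can be checked mechanically.  Here is a C
translation with stand-in pure functions f and g (both invented for the
check, as the transformation requires pure functions), comparing the
original loop against the scalar-replaced, unrolled form:

```c
/* Arrays are indexed 0..2n to mirror x(0)..x(2n) in the FORTRAN. */
#define NN 4                         /* n; the loop runs i = 1..2n */

static int f(int v) { return 2 * v + 1; }    /* stand-in for f */
static int g(int v) { return v * v; }        /* stand-in for g */

/* DO i = 1, 2*n:  x(i) = f(y(i));  z(i) = g(x(i-1)) */
void original(int x[], int z[], const int y[]) {
    for (int i = 1; i <= 2 * NN; i++) {
        x[i] = f(y[i]);
        z[i] = g(x[i - 1]);
    }
}

/* Scalar replacement plus unroll-by-2 to subsume the x1 = x0 copy. */
void transformed(int x[], int z[], const int y[]) {
    int x1 = x[0], x0;
    for (int i = 1; i <= 2 * NN; i += 2) {   /* DO i = 1, 2*n, 2 */
        x0 = f(y[i]);
        x[i] = x0;
        z[i] = g(x1);
        x1 = f(y[i + 1]);
        x[i + 1] = x1;
        z[i + 1] = g(x0);
    }
}
```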

I'm sorry for any confusion.
Think of this as an example of why
I'd prefer an optimizer to transform my program.

	Preston

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/05/89)

In article <1630@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>I have seen NO language designed for the efficient use of hardware operations.
Gee, I have, albeit for specific computers, e.g.:

	Bliss-10	PDP-10
	Fortran		IBM 709

Neither was very good in the portability department, at least not in its
initial form, as befits a hardware specific language.

To return to the original argument, John Ellis' book doesn't concern itself
with specific instruction sets because at the time he wrote it, VLIWs
existed only on paper.  The concrete design of a VLIW happened only after
most of the rest of the Yale VLIW project left to build real hardware at
Multiflow.  I was at Yale at the time and the longest instruction word we
had was 36 bits in our DEC-20.  Or maybe 48 bits in a PDP-11.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@esegue.segue.boston.ma.us, {ima|lotus}!esegue!johnl, Levine@YALE.edu
Massachusetts has 64 licensed drivers who are over 100 years old.  -The Globe

ok@cs.mu.oz.au (Richard O'Keefe) (10/05/89)

In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> If the language the compiler compiles does not have the operations to be used
> in my program, I cannot use them.  I have seen NO language designed for 
> the efficient use of hardware operations.  The choice of the algorithm 
> depends on the hardware, and there can be lots of different algorithms.

How about PL-360 and Bliss-10, both of which let you generate any instruction
you like?  Come to that, how about the C compiler that comes with UNIX in
System V.3 for the 386, which lets you write assembly-code routines that
are expanded in-line?  Or how about the SunOS C/Fortran/Pascal/Modula2
compilers, which have a pass that replaces apparent procedure calls with
arbitrary assembly code of your choice (specified in .il files) before
final code optimisation?

Herman Rubin clearly enunciates a policy which I believe is responsible
for much of the unreliability of present-day software:
	the machine is what it is,
	the programmer's job is to exploit the machine,
	the language's job is to expose the machine to the programmer.
The view that I take is
	the programmer's job is to express his intentions clearly,
	the machine's job is to obey the programmer,
	the language's job is to provide a *simple* conceptual model.
Rubin is willing to be the servant of his machines.  I'm not.
I've exchanged E-mail with a floating-point expert who had some
illuminating things to say about the work being done for Ada; it turns
out that they can do amazing things without needing to know the details
of the hardware instructions, but what does hurt them is the extent to
which Ada already lets hardware vagaries show through.

cik@l.cc.purdue.edu (Herman Rubin) (10/05/89)

In article <2307@munnari.oz.au>, ok@cs.mu.oz.au (Richard O'Keefe) writes:
> In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> > If the language the compiler compiles does not have the operations to be used
> > in my program, I cannot use them.  I have seen NO language designed for 
> > the efficient use of hardware operations.  The choice of the algorithm 
> > depends on the hardware, and there can be lots of different algorithms.
> 
> How about PL-360 and Bliss-10, both of which let you generate any instruction
> you like?  Come to that, how about the C compiler that comes with UNIX in
> System V.3 for the 386, which lets you write assembly-code routines that
> are expanded in-line?  Or how about the SunOS C/Fortran/Pascal/Modula2
> compilers, which have a pass that replaces apparent procedure calls with
> arbitrary assembly code of your choice (specified in .il files) before
> final code optimisation?

I do not believe that the .il approach is quite adequate.  As to the others
you mention, I am unfamiliar with them.  And even that is inadequate; I want
to be able even to introduce overloaded infix operators in addition to the
usual ones. Another feature missing from most languages since day 1 is to
allow the return of a list (NOT a struct).

> Herman Rubin clearly enunciates a policy which I believe is responsible
> for much of the unreliability of present-day software:
> 	the machine is what it is,
> 	the programmer's job is to exploit the machine,
> 	the language's job is to expose the machine to the programmer.
> The view that I take is
> 	the programmer's job is to express his intentions clearly,
> 	the machine's job is to obey the programmer,
> 	the language's job is to provide a *simple* conceptual model.
> Rubin is willing to be the servant of his machines.  I'm not.

I would like to be able to travel to any place in the world at low cost
by stepping into a booth and pressing a button.  I would like a computer
which has a rather large number of machine instructions now lacking, with
a gigabyte of memory, a reasonable screen, good color graphics, and a WYSIWYG
editor capable of handling mathematical expressions well, at a cost of $1000.

These machines do not exist.  Does that mean that I should not use the ones
that are available?  I should not travel by airplane because there are no
teleportation booths?  That I should do no computing because the machines
I would like do not exist?  That I should write no papers because the 
processors I want are not there?  What would O'Keefe do in these situations?

My travel intentions depend on what is available and how long it will take.
I would like to go to more meetings than I do.  So I state that my intention
is to go to such-and-such meetings for X dollars and return home to sleep
each night.  Okay, machines; obey me!

I have clearly stated that the performance of an algorithm depends on the 
hardware, so much so that a single hardware instruction can be crucial.
The current languages do not provide a conceptual model for the operations
which I have thought of.  I ask O'Keefe to tell me how to find his Utopian
machine and language situation for my problems.

> I've exchanged E-mail with a floating-point expert who had some
> illuminating things to say about the work being done for Ada; it turns
> out that they can do amazing things without needing to know the details
> of the hardware instructions, but what does hurt them is the extent to
> which Ada already lets hardware vagaries show through.

How can it be otherwise?

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

ok@cs.mu.oz.au (Richard O'Keefe) (10/06/89)

In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin)
complained that he had "seen no language designed for the efficient use of
hardware operations".
In article <2307@munnari.oz.au>, ok@cs.mu.oz.au (I) listed several
languages (PL-360, Bliss-10, SunOS C) which provide full access to
all hardware instructions.
In article <1632@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin)
replied

> I do not believe that the .il approach is quite adequate.

What, precisely, can it not do?  Admittedly, this approach relies on
the back-end optimiser being able to handle any assembly code you throw
at it, but then, so must any scheme giving you access to all the operations.

> As to the others
> you mention, I am unfamiliar with them.  And even that is inadequate;

If you don't know them, how can you know they are inadequate?

> I want
> to be able even to introduce overloaded infix operators in addition to the
> usual ones. Another feature missing from most languages since day 1 is to
> allow the return of a list (NOT a struct).

Having used Fortran, Algol, SNOBOL, Lisp, Prolog, APL, COBOL, PL/I,
and lots of other languages, I gave up thinking that the syntax of
programming languages had any great importance a long time ago.
I should have mentioned Ada as one of the languages which provides
full machine access; it has both assembly code inserts and overloaded
operators.  I don't know precisely what Rubin means by returning a list;
Algol 68 permits the return of anything (except pointers into the local
frame).  The reason that most languages don't return lists is precisely
that they are designed to let the hardware show through, and you can't
fit much in a register.

I wrote
> Herman Rubin clearly enunciates a policy which I believe is responsible
> for much of the unreliability of present-day software:
> 	the machine is what it is,
> 	the programmer's job is to exploit the machine,
> 	the language's job is to expose the machine to the programmer.
> The view that I take is
> 	the programmer's job is to express his intentions clearly,
> 	the machine's job is to obey the programmer,
> 	the language's job is to provide a *simple* conceptual model.
> Rubin is willing to be the servant of his machines.  I'm not.

Rubin replied that he would like transmatters and the ultimate computer.
> These machines do not exist.  Does that mean that I should not use the ones
> that are available?  I should not travel by airplane because there are no
> teleportation booths?  That I should do no computing because the machines
> I would like do not exist?  That I should write no papers because the 
> processors I want are not there?  What would O'Keefe do in these situations?

Why, I would do the same as I do now.  Nowhere in the text cited above
do I say that machines should be ideal!  The language's job is to
conceal the deficiencies of the computer.  The computer can only do
32-bit arithmetic?  Then it is the language's job to implement
arbitrary-precision integers on top of that and not bother me with
stupid little details.  For a more common example, I really do not want
to be bothered with the MC68010 limitation that a shift-by-literal can
shift by at most 8.  (I have used a language where this limit showed
through, and I did _not_ like it.)  A C compiler papers over this limit,
and so it should.  I normally use a language which runs 2 to 4 times
slower than C; I do this gladly because the reduction in _language_
complexity means that I can manage more complex _algorithms_, and win
back O(n) speed that way.  (Yes, such algorithms can then in principle
be recoded in C.  But it is much harder to _develop_ complex algorithms
in complex languages.)
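
What a compiler does to paper over such a limit can be sketched in C, with
MAXIMM standing in for the machine's immediate-shift maximum (the value 8
is the MC68010 limit mentioned above; everything else here is invented):

```c
#include <stdint.h>

/* Decompose a large shift into a sequence of shifts the hardware can
 * encode, as a compiler would when the immediate field maxes out. */
#define MAXIMM 8

uint32_t shift_left(uint32_t v, unsigned n) {   /* assumes n < 32 */
    while (n > MAXIMM) {     /* emit shifts of MAXIMM until the rest fits */
        v <<= MAXIMM;
        n -= MAXIMM;
    }
    return v << n;           /* one final shift of n <= MAXIMM */
}
```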

I wrote
> I've exchanged E-mail with a floating-point expert who had some
> illuminating things to say about the work being done for Ada; it turns
> out that they can do amazing things without needing to know the details
> of the hardware instructions, but what does hurt them is the extent to
> which Ada already lets hardware vagaries show through.

Rubin replied:
> How can it be otherwise?

But that's _my_ point; letting the hardware show through is precisely
what _he_ is asking for, and _that_ is the problem!

This is not unlike the RISC/CISC debate.  I want simple regular
languages (C is a monster of complexity) which I can master completely.
I don't care whether it's a ROMP or a WM or a /360 underneath.  I
particularly don't want to _have_ to know.

lewitt@Alliant.COM (Martin Lewitt) (10/06/89)

In article <1989Oct5.025841.2046@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
*** much deleted ***
>To return to the original argument, John Ellis' book doesn't concern itself
>with specific instruction sets because at the time he wrote it, VLIWs
>existed only on paper.  The concrete design of a VLIW happened only after
>most of the rest of the Yale VLIW project left to build real hardware at
>Multiflow.  I was at Yale at the time and the longest instruction word we
>had was 36 bits in our DEC-20.  Or maybe 48 bits in a PDP-11.

Ouch!!  Why don't the Floating Point System's AP120B and FPS-164 processors
qualify as VLIW?  They were commercial products introduced in 1975 and 1981
respectively:
	     1) Both have 64 bit instructions, with fields controlling
	        10 operations in parallel, i.e., micro-coded.

	     2) Both have FORTRAN compilers, the 164 from its
	        introduction in 1981 and the 120B in 1985 (though
	        one may have been available from a third party earlier).

             3) By 1983, the FORTRAN compiler for the 164 was "software
                pipelining", i.e., executing operations from different 
                iterations of a loop in parallel in an instruction.
 
             4) Both were installed on the Yale campus by 1985
                (correct me on this one).    8-)
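Item 3's "software pipelining" can be illustrated with a hand-pipelined loop
in C.  This is only a sketch of the idea (the FPS compiler did this across the
machine's parallel functional units, not in source; the function name is
hypothetical):

```c
#include <stddef.h>

/* Illustrative sketch of software pipelining: the load for iteration
   i+1 is issued alongside the add for iteration i, mimicking how a
   compiler packs operations from different iterations into one
   instruction.  Names are hypothetical. */
static double pipelined_sum(const double *a, size_t n) {
    if (n == 0) return 0.0;
    double s = 0.0;
    double next = a[0];               /* prologue: first load */
    for (size_t i = 1; i < n; i++) {
        double cur = next;
        next = a[i];                  /* "load" from iteration i   */
        s += cur;                     /* "add" from iteration i-1  */
    }
    return s + next;                  /* epilogue: drain the pipe  */
}
```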

I've tried to think of possible objections to their classification
as VLIW:
             1) The 64 bit instruction word isn't long enough.
                This quantitative objection falls to the qualitative
                one: 10 operations controlled by a horizontal micro-instruction.

	     2) They were only "array processors" or "subroutine
		boxes".  This is true, especially for the AP120B,
		but the 164 was able to run complete FORTRAN jobs
		from 1982 on.  It is more of a back end or
		attached processor, like the CRAY.

             3) No virtual memory or multi-tasking.  What does this
                have to do with VLIW?

             4) They were too easy to micro-program, humans could do
                it.  True!  And they were fun to program as well.  The
                software pipelining compiler was faster though, and
                produced fine code.  There were only a few hand coding
                tricks it couldn't handle, e.g., moving the loop
                control calculations to the floating point adder when
                the integer/address unit was too busy.

The Multiflow machine was no mystery to FPS sales analysts.  It was the 
FPS dream machine: UNIX, virtual memory, better compilers, etc.  We had 
asked FPS to give us these features right from the beginning of the 164 
back in 1981.  By the time Multiflow delivered, it was too late:
Convex, Alliant and FPS's own ECL machine, the 264, were already on the scene,
and the high-end workstations arrived in short order.

It will be interesting to see if FPS gets the credit it deserves when the
history is written; I doubt it.  Will the historians do any better than
the contemporaries?  In 1985, an article about micro-programming appeared
in Scientific American, written by a Stanford professor.  He anticipated
the day when there would be commercial machines compiling directly to
micro-code.  FPS had sold nearly 150 machines already.

Let's get the history right.
       1) The first commercially available VLIW machine was the FPS-164,
          not the Multiflow (by 6 years).
       2) The "first affordable supercomputer" was the FPS-164, not the
	  Convex (by 4 years).  FPS was using the "affordable
	  supercomputer" and "departmental supercomputing" phrases
	  long before the Convex advertisements and literature took
	  them up.
       3) The first commercially available machine to compile complete
          HLL applications to micro-code was once again, the FPS-164
          (well, actually the VAX since it was cross-compilation).
       4) The first commercially available machine to successfully exploit
          parallel processors automatically using "dusty deck", serial
          FORTRAN was the Alliant FX8 (by 4 years and counting).
       5) The first commercially available RISC machine was the FPS-164.
          (I'd love to see this one discussed, are VLIWs RISCy?)   8-)

I don't know if I've got the history right; I've only been in the industry
since '81, so feel free to propose "minor" adjustments.  I'll gracelessly hide
out when the heavyweights start swinging.

> --
> John R. Levine, Segue Software, POB 349, Cambridge MA 02238,

lewitt@Alliant.COM (Martin Lewitt) (10/06/89)

Apologies to John Levine and the net if the authorship of my last
posting on this subject was unclear.  I'm posting without a home
directory for the first time, so my signature wasn't automatically
appended.  It is manually appended here.  Please double-check before
flaming John; you may really intend to flame me.
---
Phone: (206) 931-8364			Martin E. Lewitt      My opinions are
Domain: lewitt@alliant.COM		2945 Scenic Dr. SE    my own, not my
UUCP: {linus|mit-eddie}!alliant!lewitt  Auburn, WA 98002      employer's. 

dricejb@drilex.UUCP (Craig Jackson drilex1) (10/06/89)

In article <2307@munnari.oz.au> ok@cs.mu.oz.au (Richard O'Keefe) writes:
>In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
>> If the language the compiler compiles does not have the operations to be used
>> in my program, I cannot use them.  I have seen NO language designed for 
>> the efficient use of hardware operations.  The choice of the algorithm 
>> depends on the hardware, and there can be lots of different algorithms.
>
>Herman Rubin clearly enunciates a policy which I believe is responsible
>for much of the unreliability of present-day software:
>	the machine is what it is,
>	the programmer's job is to exploit the machine,
>	the language's job is to expose the machine to the programmer.
>The view that I take is
>	the programmer's job is to express his intentions clearly,
>	the machine's job is to obey the programmer,
>	the language's job is to provide a *simple* conceptual model.
>Rubin is willing to be the servant of his machines.  I'm not.
>I've exchanged E-mail with a floating-point expert who had some
>illuminating things to say about the work being done for Ada; it turns
>out that they can do amazing things without needing to know the details
>of the hardware instructions, but what does hurt them is the extent to
>which Ada already lets hardware vagaries show through.

I think you have probably guessed Herman Rubin's view pretty well (I don't
know him).  It is a view which was very common up until around 1980,
when computers finally began to be fast enough for many uses.  (Before then,
there were few users who really thought the computer that they worked on
was fast enough.)  I choose 1980 because that is when the VAX 11/780 and
the 4341 became common; before those machines, few organizations had
more than one 32-bit machine with more than one MIPS or so of power.

Your view is conventional wisdom in computer science today, partially
driven by the common surfeit of computational power available today.
When there is enough computing power so that few tasks run into
compute-hours, and compilers are good enough to translate rather
abstract programming languages to good-enough machine code, this is a
reasonable idea.

I think that the difference is that Mr. Rubin still doesn't see himself
as having a surfeit of computing power available to him.  I am not familiar
with his area of work, but I know that there remain many problems which
still cannot be solved in reasonable time by today's fastest computers.
When you are at the bleeding edge of technology, you *do* want to know
about the relative speeds of various instructions, and other capabilities
of the processor.  It might make a difference if multiply were 3 or 10
times faster than divide--a difference of minutes, if not hours.  If
your compute times were on that order, you would care about how you used
the underlying hardware.

We should be thankful for the people at the bleeding edge of computability
--it is they who keep the MIPS (VUPS, whatever) increasing every year...
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,ll-xn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

cik@l.cc.purdue.edu (Herman Rubin) (10/06/89)

In article <2311@munnari.oz.au>, ok@cs.mu.oz.au (Richard O'Keefe) writes:
> In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin)
> complained that he had "seen no language designed for the efficient use of
> hardware operations".
> In article <2307@munnari.oz.au>, ok@cs.mu.oz.au (I) listed several
> languages (PL-360, Bliss-10, SunOS C) which provide full access to
> all hardware instructions.
> In article <1632@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin)
> replied
  
< > I do not believe that the .il approach is quite adequate.
  
> What, precisely, can it not do?  Admittedly, this approach relies on
> the back-end optimiser being able to handle any assembly code you throw
> at it, but then, so must any scheme giving you access to all the operations.

I admit I am not familiar with .il.  How would you do the following:

	fabs(x) implemented as		x &~ CLEARSIGN;

where this clears the sign bit, and may depend on the precision of x.  No
moving of x to specific registers, etc., should be used.
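In portable C, such a fabs might be sketched as follows, assuming IEEE 754
doubles; the mask is exactly the precision-dependent detail Rubin mentions
(the helper name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of fabs() as a sign-bit clear, assuming an IEEE 754 64-bit
   double.  The mask depends on the precision: bit 63 for double,
   bit 31 for float -- the detail Rubin wants the language to expose. */
static double clear_sign(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type pun */
    bits &= ~(1ULL << 63);            /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```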


< > As to the others
< > you mention, I am unfamiliar with them.  And even that is inadequate;
  
> If you don't know them, how can you know they are inadequate?
  
< > I want
< > to be able even to introduce overloaded infix operators in addition to the
< > usual ones. Another feature missing from most languages since day 1 is to
< > allow the return of a list (NOT a struct).
  
And suppose I want to use << and >> on double longs?  How do I get this in
without using any unnecessary memory references, and keeping things which
are already in registers there?
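For concreteness, a double-long left shift might be sketched like this in C;
a compiler mapping it to a hardware double-word shift would keep both halves
in registers (the type and names here are illustrative):

```c
#include <stdint.h>

/* Sketch of << on a "double long": a 128-bit value in two 64-bit
   halves.  For 0 < n < 64 the bits shifted out of the low half are
   carried into the high half; the n == 0 case is special-cased to
   avoid an undefined 64-bit shift. */
typedef struct { uint64_t hi, lo; } dlong;

static dlong shl(dlong a, unsigned n) {
    dlong r;
    r.hi = (a.hi << n) | (n ? a.lo >> (64 - n) : 0);
    r.lo = a.lo << n;
    return r;
}
```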

I want to be able to define a heavily overloaded power operator with a
reasonable notation such as a!b or something similar.  I see no more reason
why I should have to use prenex form for this than I should have to write
sum(a,b) instead of a+b.  To paraphrase, why should I be a slave to the 
compiler's notation rather than using familiar mathematical notation?

> Having used Fortran, Algol, SNOBOL, Lisp, Prolog, APL, COBOL, PL/I,
> and lots of other languages, I gave up thinking that the syntax of
> programming languages had any great importance a long time ago.
> I should have mentioned Ada as one of the languages which provides
> full machine access; it has both assembly code inserts and overloaded
> operators.  I don't know precisely what Rubin means by returning a list;
> Algol 68 permits the return of anything (except pointers into the local
> frame).  The reason that most languages don't return lists is precisely
> that they are designed to let the hardware show through, and you can't
> fit much in a register.

When computers first came out, most of them would multiply an integer by
an integer, giving a most and least significant part.  Most of them would
divide a double length integer by an integer, giving a quotient and a
remainder.
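Portable C can recover the two halves only by widening, e.g. (a hypothetical
sketch; the hardware Rubin describes delivers both parts in one instruction):

```c
#include <stdint.h>

/* Sketch of a widening multiply: the "most and least significant
   part", reconstructed in C by widening 32x32 -> 64.  Names are
   illustrative. */
static void mulwide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo) {
    uint64_t p = (uint64_t)a * b;
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}
```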
> I wrote
< > Herman Rubin clearly enunciates a policy which I believe is responsible
< > for much of the unreliability of present-day software:
< > 	the machine is what it is,
< > 	the programmer's job is to exploit the machine,
< > 	the language's job is to expose the machine to the programmer.
< > The view that I take is
< > 	the programmer's job is to express his intentions clearly,
< > 	the machine's job is to obey the programmer,
< > 	the language's job is to provide a *simple* conceptual model.
< > Rubin is willing to be the servant of his machines.  I'm not.
> 
> Rubin replied that he would like transmatters and the ultimate computer.
< > These machines do not exist.  Does that mean that I should not use the ones
< > that are available?  I should not travel by airplane because there are no
< > teleportation booths?  That I should do no computing because the machines
< > I would like do not exist?  That I should write no papers because the 
< > processors I want are not there?  What would O'Keefe do in these situations?
> 
> Why, I would do the same as I do now.  Nowhere in the text cited above
> do I say that machines should be ideal!  The language's job is to
> conceal the deficiencies of the computer.  The computer can only do
> 32-bit arithmetic?  Then it is the language's job to implement
> arbitrary-precision integers on top of that and not bother me with
> stupid little details.  For a more common example, I really do not want
> to be bothered with the MC68010 limitation that a shift-by-literal can
> shift by at most 8.  (I have used a language where this limit showed
> through, and I did _not_ like it.)  A C compiler papers over this limit,
> and so it should.  I normally use a language which runs 2 to 4 times
> slower than C; I do this gladly because the reduction in _language_
> complexity means that I can manage more complex _algorithms_, and win
> back O(n) speed that way.  (Yes, such algorithms can then in principle
> be recoded in C.  But it is much harder to _develop_ complex algorithms
> in complex languages.)

Getting O(n) may or may not be appropriate.  Algorithms for generating
non-uniform random variables efficiently tend to be short, but with
a fairly intricate structure.  They will be used millions of times.
The most efficient, from the standpoint of computational complexity,
are difficult even knowing the hardware, and may be not even worth
programming.  Two of my operations are transferring on a bit and
"removing" the bit whether or not the transfer occurs, and finding the
distance to the next one in a bit stream and "removing" all bits scanned. 
These are tools, as basic as addition.
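A rough C rendering of those two primitives, assuming bits are consumed from
the low end of a 64-bit word (the names and in-word representation are
illustrative; real hardware, or a count-trailing-zeros instruction, would do
this in one step):

```c
#include <stdint.h>

/* Sketch of the two bit-stream tools.  take_bit: branch on the next
   bit and remove it.  dist_to_one: distance to the next 1 bit,
   removing every bit scanned (assumes a 1 bit is present). */
static unsigned take_bit(uint64_t *s) {
    unsigned b = (unsigned)(*s & 1);
    *s >>= 1;
    return b;
}

static unsigned dist_to_one(uint64_t *s) {
    unsigned d = 0;
    while ((*s & 1) == 0) { *s >>= 1; d++; }
    *s >>= 1;                         /* remove the 1 as well */
    return d;
}
```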

As for not worrying about those stupid little details about multiple
precision arithmetic, it is the difference between factoring a large
number in a month and several years.  I can give algorithms which anyone
who understands even what is taught in computing will understand, but
which the HLLs cannot express.

In a university, it is almost impossible to get programming assistance
in mathematics and statistics.  Also, these supposedly highly optimizing
compilers do not seem to be available.  I doubt that I would have any
more difficulty writing in a decent machine language than in any HLL,
because a decent machine language would be weakly typed and have
overloaded operators.

> I wrote
< > I've exchanged E-mail with a floating-point expert who had some
< > illuminating things to say about the work being done for Ada; it turns
< > out that they can do amazing things without needing to know the details
< > of the hardware instructions, but what does hurt them is the extent to
< > which Ada already lets hardware vagaries show through.
> 
> Rubin replied:
< > How can it be otherwise?
> 
> But that's _my_ point; letting the hardware show through is precisely
> what _he_ is asking for, and _that_ is the problem!
> 
> This is not unlike the RISC/CISC debate.  I want simple regular
> languages (C is a monster of complexity) which I can master completely.
> I don't care whether it's a ROMP or a WM or a /360 underneath.  I
> particularly don't want to _have_ to know.

When I design an algorithm, I _have_ to know.  O'Keefe wants to master
his language completely, but he is a slave to the language, whether or
not he knows it.  I will use different algorithms on different computers,
and have no problem keeping them straight.  I find it no problem whatever
in seeing how the hardware interacts with the algorithm and exploiting it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

ECULHAM@vm.ucs.UAlberta.CA (Earl Culham) (10/06/89)

In article <1630@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
 
 .... all sorts of things about accessing the hardware
      through computer languages
 
This is an open letter to Herman Rubin.
========================================================================
Rubin, I realise that what you are trying to say in your postings is
  "Don't throw away the power of the computer by limiting yourself to
   high level languages."
But you are going overboard when you say that NO language is designed
for the efficient use of hardware operations. I can name two right off:
Assembler and PL360.
 
We have a dichotomy between wanting a program to run as fast as possible
on our current machine, and wanting it to keep running as we migrate
to new machines.
 
Often, your comments are along the line of
  "We need access to the nitty-gritty in order to make this algorithm fly."
Nice ideas, but they don't belong in comp.arch. Please take them to
a language discussion.
 
On the other hand, comments along the lines of
  "Having feature 'x' in hardware would sure make algorithm 'y' run fast"
are welcome grist.

schwartz@psuvax1.cs.psu.edu (Scott Schwartz) (10/06/89)

In article <2311@munnari.oz.au> ok@cs.mu.oz.au (Richard O'Keefe) writes:
   Herman Rubin writes:
   > I want
   > to be able even to introduce overloaded infix operators in addition to the
   > usual ones. Another feature missing from most languages since day 1 is to
   > allow the return of a list (NOT a struct).
   The reason that most languages don't return lists is precisely
   that they are designed to let the hardware show through, and you can't
   fit much in a register.

Isn't the idea to return out-parameters on the left side of an
assignment?  I.e.
	(out1, out2, out3) := foo (in1, in2, in3);
corresponds to
	procedure foo (in in1, in2, in3; out out1, out2, out3) 	...
	foo (in1, in2, in3, out1, out2, out3);

This doesn't seem that bad to me.  A byte full of syntactic
sugar helps the eyestrain go down. :-)
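In C, which has neither form, the out-parameter version is the usual
workaround; a hypothetical divmod shows the transformation a compiler could
apply to "(q, r) := divmod(a, b)":

```c
/* Sketch of the out-parameter lowering of a multiple-value return.
   The function name is illustrative; many machines produce quotient
   and remainder in a single divide instruction. */
static void divmod(int a, int b, int *q, int *r) {
    *q = a / b;
    *r = a % b;
}
```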

--
Scott Schwartz		<schwartz@shire.cs.psu.edu>
Now back to our regularly scheduled programming....

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/07/89)

In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
>Ouch!!  Why don't Floating Point Systems' AP120B and FPS-164 processors
>qualify as VLIW? ...
>4) Both were installed on the Yale campus by 1985 ...

They were indeed there; excuse me, my mind curdled.  (The fact that I never
used either of them while I was there may have something to do with it.)
Ellis discusses the FPS-164 in his book.  He calls it an LIW, kind of a
predecessor to a VLIW.  It's a horizontally microcoded machine rather than a
RISC with a lot of parallel functional units.  The trace scheduling compiler
used for VLIWs is quite different from the one used for the FPS.  Someone did
try trace scheduling for the FPS, but gave up because the asymmetry of the
architecture made it very hard.  There are several pages in Ellis' book that
talk about this.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@esegue.segue.boston.ma.us, {ima|lotus}!esegue!johnl, Levine@YALE.edu
Massachusetts has 64 licensed drivers who are over 100 years old.  -The Globe

cutler@mfci.UUCP (Ben Cutler) (10/08/89)

In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
>Ouch!!  Why don't Floating Point Systems' AP120B and FPS-164 processors
>qualify as VLIW?  They were commercial products introduced in 1975 and 1981
>respectively:
>
>   [ lots of deleted text ]
>
>Let's get the history right.
>       1) The first commercially available VLIW machine was the FPS-164,
>          not the Multiflow (by 6 years).

VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand.  Ellis
describes why the FPS-164 doesn't qualify as a good compilation target in
his thesis:

``Many micro-programmed architectures have so-called `hot spots', latches,
that will hold a value for only one cycle or until another value displaces it.
For example, the outputs of the multiplier and adder pipelines in the FPS-164
[ APAL64 Programmer's Guide, FPS 1982 ] are latched and can be fed back to the
register banks or directly to the inputs of the pipelines.  There isn't
enough bandwidth into the register banks, so to keep the pipelines full the
programmer is often forced to use the latched pipeline outputs directly as
pipeline inputs.  But the programmer must be very careful to synchronize the
pipelines' inputs and outputs, since the next value coming out of the
pipeline will overwrite the previous latched value.''

Ellis goes on to describe how hot spots are unpleasant artifacts for a
compiler to deal with, how it isn't always possible to move a value in a
latch out of the way and into a register bank, or if you can do so by making
room in the register bank, that in turn might require displacing a value in
another latch, and so on.  If you want to do a good job on complex loops,
a backtracking compiler (blech) may be in order.

According to Ellis, ``Every value producing operation and every data transfer
reads its operands from a register bank and delivers its result to a register
bank.''  Ellis notes that, ``Architectures with hot spots are easy to build,
but building compilers for them is hard. (It took years to build the FPS-164
compiler.)  Better to build the compiler and hardware in parallel, avoiding
hardware features that can't be used easily by the compiler.''

If you want to know more, then read the thesis, which is extremely
well-written and informative.

( My apologies for any typos in the above quotations. )

cutler@mfci.UUCP (Ben Cutler) (10/08/89)

Oops, in case it wasn't clear, the sentence reproduced below describes a
fundamental characteristic of VLIWs (and NOT FPS machines), that operations
don't leave values sitting around in hot spots waiting to be trashed or
pushed along someplace else:

    According to Ellis, ``Every value producing operation and every data
    transfer reads its operands from a register bank and delivers its result
    to a register bank.''

khb%chiba@Sun.COM (Keith Bierman - SPD Advanced Languages) (10/09/89)

In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
... stuff about FPS being first VLIW...

It is my understanding that FPS purchased the basic design from Glen
Culler. Culler continued his design activities and spun up a company
for rev 7 of the basic design (The infamous Culler-7) down in Santa
Barbara. 

Glen is still at it, and perhaps the 8th, 9th or 10th edition will
gain a popular following.

Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

lewitt@Alliant.COM (Martin E. Lewitt) (10/09/89)

In article <1068@m3.mfci.UUCP> cutler@mfci.UUCP (Ben Cutler) writes:
*** some deleted ***
>VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand.  Ellis
>describes why the FPS-164 doesn't qualify as a good compilation target in
>his thesis:

I guess what I find strange is that the definition of a type of computer
is so closely tied to a (trademarked) compiler technology.  The definition
doesn't seem as general as the RISC and CISC definitions (which I won't
attempt to give here).  It is strange that the definition doesn't encompass
the similarities with the FPS architecture.

Ex-FPS persons I know immediately saw the similarities with the
Multiflow and correctly anticipated (in a general way) how it would
perform.  We would have welcomed the improvements in register unit
bandwidth that Ellis requires, though we would not have given up
the ability to latch the result of one pipeline directly to the
input of another, if there were a performance penalty.  A sophisticated
compiler technology should be able to handle this (but why bother
if you don't have to?).  From a marketing viewpoint the virtual
memory and multiuser OS were more important features.  The Multiflow
architecture just seemed a natural evolution of the FPS architecture.

From your posting, it appears Ellis correctly understood the limitations
of the FPS architecture and his analysis of it seems to acknowledge
some related place for it in history.

*** much deleted ***

>If you want to know more, then read the thesis, which is extremely
>well-written and informative.

Yes, I hope to have the opportunity to read this.
-- 
Phone: (206) 931-8364			Martin E. Lewitt      My opinions are
Domain: lewitt@alliant.COM		2945 Scenic Dr. SE    my own, not my
UUCP: {linus|mit-eddie}!alliant!lewitt  Auburn, WA 98002      employer's. 

colwell@mfci.UUCP (Robert Colwell) (10/09/89)

In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
>In article <1989Oct5.025841.2046@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>*** much deleted ***
>>To return to the original argument, John Ellis' book doesn't concern itself
>>with specific instruction sets because at the time he wrote it, VLIWs
>>existed only on paper.  The concrete design of a VLIW happened only after
>>most of the rest of the Yale VLIW project left to build real hardware at
>>Multiflow.  I was at Yale at the time and the longest instruction word we
>>had was 36 bits in our DEC-20.  Or maybe 48 bits in a PDP-11.
>
>Ouch!!  Why don't Floating Point Systems' AP120B and FPS-164 processors
>qualify as VLIW?  They were commercial products introduced in 1975 and 1981
>respectively:

FPS failed to realize a couple of key concepts, and this is why they don't
get to retrofit the "VLIW" label to their boxes (after all, Josh Fisher
invented this technology and came up with that label, so he gets to provide
the definition for VLIW).

The first concept is that you can't just invent whatever architecture and
instruction set comes to mind, or is easiest to build, or is cheapest.  You
must start with what the compiler wants to see as a target.  A trace-scheduling
compiler is hard enough to do that you certainly do not want to add to its
burdens without a compelling reason.

If you have only a wide-word (and I won't quibble about the actual width)
machine with parallel functional units then you're not even halfway there.

>The Multiflow machine was no mystery to FPS sales analysts.  It was the 
>FPS dream machine: UNIX, virtual memory, better compilers, etc.  We had 
>asked FPS to give us these features right from the beginning of the 164 
>back in 1981.  By the time, Multi-flow delivered, it was too, late,
>Convex, Alliant and FPS's own ECL machine, 264 were already on the scene,
>and the high end workstations arrived in short order.

Too late for whom or for what?  What a weird paragraph.

>It will be interesting to see if FPS gets the credit it deserves when the
>history is written, I doubt it.  Will the historians do any better than
>the contemporaries?  In 1985, an article about micro-programming appeared
>in Scientific American, written by a Stanford professor.  He anticipated
>the day when there would be commercial machines compiling directly to
>micro-code.  FPS had sold nearly 150 machines already.

There *is* a relationship between the FPS boxes and the VLIW work done at
Multiflow and Yale.  As Ellis reports, the FPS boxes showed what would be
wrong with a machine that was not designed at the outset to be a good target
for compiled code.  There are hot spot registers, the functional units are 
not decoupled at the instruction set level, the datapaths are not regular,
and there aren't enough registers to support the flops.  When you design a
machine to be completely scheduled at compile-time, it shows in the design.
Retrofitting a compiler to an existing architecture will never work.  The
influence that the FPS work had on the Multiflow machines was a negative
one -- we knew that was one approach which would not get us what we needed.

>Let's get the history right.
>       1) The first commercially available VLIW machine was the FPS-164,
>          not the Multiflow (by 6 years).

The FPS machines do indeed enjoy a unique place in the computer hall of
fame, but the label under the exhibit won't say VLIW, nor should it.
The FPS machines were not statically scheduled and
controlled by the compiler, and therefore don't qualify as VLIWs under
Josh's definition of the term.  Feel free to make up some other name if
you want; "attached processor" has always seemed pretty appropriate to me.

The FPS designs also try to compress the instruction stream in a sub-optimal way, 
pre-combining certain operations (your "parallel micro-operations") in
much the same way (and for the same reasons) that CISCs did.  A VLIW,
on the other hand, completely decouples the control of all functional
units, and avoids the baggage of carrying around the whole instruction
word width during sequences of sequential code by using an encoding scheme
described in our IEEE Transactions paper.  This makes the compiler much
happier.  Ours smiles a lot.

>       2) The "first affordable supercomputer" was the FPS-164, not the
>	  Convex (by 4 years).  FPS was using the "affordable
>	  supercomputer" and "departmental supercomputing" phrases
>	  long before the Convex advertisements and literature took
>	  them up.

Agreed.  Although calling an attached processor a "computer" smacks of
marketing hype, to me.  And if your references to having a workable compiler
were legit, then why did Convex's first machine beat the 164, which had more
mflops under the hood?

>       3) The first commercially available machine to compile complete
>          HLL applications to micro-code was once again, the FPS-164
>          (well, actually the VAX since it was cross-compilation).

Yes, although with the cart before the horse this effort was largely 
doomed from the start.

>       4) The first commercially available machine to successfully exploit 
>          parallel processors automatically using "dusty deck", serial 
>          FORTRAN, the Alliant FX8, (by 4 years and counting).

"And counting"?  What does that mean?

>       5) The first commercially available RISC machine was the FPS-164.
>          (I'd love to see this one discussed, are VLIWs RISCy?)   8-)

What about the FPS-164 qualified it as a RISC?  You must have some interesting
definition of RISC.  From Hwang and Briggs: "The AP-120B and the FPS-164
are back-end attached arithmetic processors specially designed to process
large vectors or arrays (matrices) of data.  Operationally, these processors
must work with a host computer, which can be either a minicomputer (such as
the VAX-11 series) or a mainframe computer (IBM 308X series)."  Whatever else
the RISC acronym implies, the "C" stands for computer, and to me, attached
processors aren't computers.

I think discussing VLIWs as RISCs is interesting, partly because it so aptly
illustrates why the early insistence on counting the number of instructions
before declaring something a RISC was so misguided.  The important thing is
to be able to implement the most-used ops such that they're as fast as can be,
which usually means they get hard-wired control.  If you can manage to implement
your design such that you get fast, hard-wired simple ops, but lots of them
in parallel (controlled by the wide word) then I think you've got the essence
of the RISC approach without ending up hamstrung by religious zealotry.

>I don't know if I've got the history right, I've only been in the industry
>since '81, so feel free to propose "minor" adjustments.  I'll gracelessly hide 
>out when the heavyweights start swinging.

Aw, you're just a baby.  You should have stated this earlier and I'd have 
used smaller words.  :-)

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

hankd@pur-ee.UUCP (Hank Dietz) (10/10/89)

In article <1071@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
>>Let's get the history right.
>>       1) The first commercially available VLIW machine was the FPS-164,
>>          not the Multiflow (by 6 years).
>
>The FPS machines do indeed enjoy a unique place in the computer hall of
>fame, but the label under the exhibit won't say VLIW, nor should it.

I quite agree that the FPS-164 isn't trace-scheduled etc. as a VLIW, but I'm
not sure that means the hardware isn't VLIW....  However, it isn't the first
in either case.  Way back, Burroughs had some nifty WCS stuff that probably
qualifies as the first "VLIW hardware," but Multiflow's Trace clearly
deserves the title "first VLIW computer system."

>>       2) The "first affordable supercomputer" was the FPS-164, not the
>>	  Convex (by 4 years).  FPS was using the "affordable
>>	  supercomputer" and "departmental supercomputing" phrases
>>	  long before the Convex advertisements and literature took
>>	  them up.
>
>Agreed.  Although calling an attached processor a "computer" smacks of
>marketing hype, to me.

Define "supercomputer."  Intel's IPSC tried to claim the first title, and
there are plenty more if you define "affordable" the right way.  ;-)

>>       3) The first commercially available machine to compile complete
>>          HLL applications to micro-code was once again, the FPS-164

Were "complete HLL applications" really compiled to microcode?  I thought
they ran partly on the host machine?

>>       4) The first commercially available machine to successfully exploit 
>>          parallel processors automatically using "dusty deck", serial 
            ^^^^^^^^
>>          FORTRAN, the Alliant FX8, (by 4 years and counting).
>
>"And counting"?  What does that mean?

Are you defining "parallel" to be MIMD, but not SIMD/Vector?  In either
case, doesn't Cray count?  I don't know what compilers have been marketed,
but I know of more than a few "dusty deck" compilers for MIMD targets....

>>       5) The first commercially available RISC machine was the FPS-164.
>>          (I'd love to see this one discussed, are VLIWs RISCy?)   8-)
>...  The important thing is
>to be able to implement the most-used ops such that they're as fast as can be,
>which usually means they get hard-wired control.

Most early machines looked pretty RISCy...  in exactly the sense stated
above.  The main difference in recent RISC design is the use of compiled-code
statistics rather than applications algorithms in determining which ops are
frequently used.  I see no reason to single out the relatively modern
FPS-164; of course, just as the "first VLIW" title goes to whoever coined
the name, "first RISC" really belongs to UCB.

VLIW does tend to imply RISC...  and RISC implies VLIW if you really think
about it.  After all, parallelism is the *OBVIOUS* way to speed up things,
and the compiler emphasis of RISC + parallelism = VLIW.  Not that the full
VLIW concept is trivial, rather that it is natural & elegant.  Best of all,
it even seems to work pretty well.  ;-)
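
The "wide word controlling lots of simple ops in parallel" idea above can be
sketched in a few lines.  This is my own toy illustration, not any vendor's
actual scheduler: a greedy in-order packer that fills hypothetical 4-slot
wide words with register-transfer ops until a slot limit or a data
dependence forces a new word.

```python
# Toy sketch of VLIW instruction packing (illustrative only).
# Ops are hypothetical (dest, src1, src2) register-transfer triples.

SLOTS = 4  # functional units controlled by one wide instruction word

def pack(ops):
    """Greedily pack ops into wide words: an op may join the current
    word only if there is a free slot and none of its sources were
    written by an op already placed in that same word."""
    words, current, written = [], [], set()
    for dest, *srcs in ops:
        # a dependence on a result produced in this word, or a full
        # word, forces the op into the next wide instruction
        if len(current) == SLOTS or any(s in written for s in srcs):
            words.append(current)
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    if current:
        words.append(current)
    return words

# four independent adds share one wide word; the dependent fifth op
# must wait for the next word
prog = [("r1", "a", "b"), ("r2", "c", "d"), ("r3", "e", "f"),
        ("r4", "g", "h"), ("r5", "r1", "r2")]
```

A real VLIW compiler, of course, reorders operations and schedules across
basic blocks rather than packing them in program order.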

						-hankd@ecn.purdue.edu

lewitt@Alliant.COM (Martin E. Lewitt) (10/10/89)

In article <13109@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
= In article <1071@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
= >In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
= >>       3) The first commercially available machine to compile complete
= >>          HLL applications to micro-code was once again, the FPS-164
= 
= Were "complete HLL applications" really compiled to microcode?  I thought
= they ran partly on the host machine?

The FPS-164 and its successors almost always ran complete applications, the
"subroutine box" mode was not used by most customers.  This was made possible
by an advanced (for FPS) OS called the Single Job Executive (SJE).

= >>       4) The first commercially available machine to successfully exploit 
= >>          parallel processors automatically using "dusty deck", serial 
=             ^^^^^^^^
= >>          FORTRAN, the Alliant FX8, (by 4 years and counting).
= >
= >"And counting"?  What does that mean?
= 
= Are you defining "parallel" to be MIMD, but not SIMD/Vector?  In either
= case, doesn't Cray count?  I don't know what compilers have been marketed,
= but I know of more than a few "dusty deck" compilers for MIMD targets....

I was thinking MIMD.  Until recently, didn't Cray MIMD require
micro-tasking directives inserted in the code based on user analysis
of the code?

= 						-hankd@ecn.purdue.edu
-- 
Phone: (206) 931-8364			Martin E. Lewitt      My opinions are
Domain: lewitt@alliant.COM		2945 Scenic Dr. SE    my own, not my
UUCP: {linus|mit-eddie}!alliant!lewitt  Auburn, WA 98002      employer's. 

lewitt@Alliant.COM (Martin E. Lewitt) (10/10/89)

In article <1071@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <3449@alliant.Alliant.COM> lewitt@Alliant.COM (Martin Lewitt) writes:
>>The Multiflow machine was no mystery to FPS sales analysts.  It was the 
>>FPS dream machine: UNIX, virtual memory, better compilers, etc.  We had 
>>asked FPS to give us these features right from the beginning of the 164 
>>back in 1981.  By the time, Multi-flow delivered, it was too, late,
>>Convex, Alliant and FPS's own ECL machine, 264 were already on the scene,
>>and the high end workstations arrived in short order.
>
>Too late for whom or for what?  What a weird paragraph.

I agree, what a weird paragraph!  By "too late", I meant that Multiflow
was too late to exploit the wide open market opportunities that the
FPS analysts saw when they were specifying the machine they wanted FPS
to build, i.e., a wide instruction machine supporting virtual memory
and UNIX.  FPS's analysts saw these requirements of the market place 
in plenty of time to preempt the later arrivals of Convex and Alliant.

I was with Alliant when the Multi-flow Trace 7 first started shipping.
While we didn't fear the Trace 7 in most application areas, the Trace 7
did turn out to be a tough competitor in certain areas, e.g.,
the integer (non-floating point) ECAD codes.  However, after some initial
competitive success for the Trace 7, I noticed that Alliant was no longer
losing this market to the Multi-flow, but to the SUN-4 Sparcstation instead.
While this was bad news for Alliant, it had to be terrible news for Multi-flow.
A key area of competitive strength against the product that Alliant was
shipping at that time was being undercut by cheap integer mips from SUN.
So this is another sense in which I meant "too late".

*** stuff deleted ***
>marketing hype, to me.  And if your references to having a workable compiler
>were legit, then why did Convex's first machine beat the 164, which had more
>mflops under the hood?

Convex's first machine, the C1-XL, had more megaflops: 40 single and
20 double peak megaflops vs. the FPS's 11 peak megaflops.  Even so,
the Convex rarely beat the FPS-164 on benchmarks unless the codes
were highly vectorizable.  The XL had poor scalar performance and
the FPS-164 often beat it by a factor of two.  Good scalar performance
is a strength of these wide instruction machines.  The C1-XP had
much improved scalar performance but still faced tough competition
from the 164.  Of course, performance is only one purchase criterion
and FPS could lose sales because of the lack of virtual memory,
lack of a multi-user operating system, reliability concerns, etc.
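
The peak-vs.-scalar point above is just harmonic-mean arithmetic: a machine
with a huge vector peak but a weak scalar unit loses to a modest-peak,
scalar-strong machine unless nearly everything vectorizes.  A small sketch,
with all rates and the vectorizable fraction being made-up illustrative
numbers (not measured figures for the C1-XL or FPS-164):

```python
# Amdahl-style sustained rate: fraction f of the work runs at
# vector_rate, the rest at scalar_rate.  Numbers are illustrative.

def effective_mflops(f, vector_rate, scalar_rate):
    """Harmonic-mean combination of vector and scalar execution rates."""
    return 1.0 / (f / vector_rate + (1.0 - f) / scalar_rate)

# hypothetical vector box: high peak, weak scalar unit
vector_box = effective_mflops(0.7, vector_rate=40.0, scalar_rate=2.0)
# hypothetical scalar-strong box: modest peak, solid scalar rate
scalar_box = effective_mflops(0.7, vector_rate=11.0, scalar_rate=6.0)
```

With these assumed numbers the 40-peak machine sustains about 6 mflops while
the 11-peak machine sustains about 8.8, which is the shape of the benchmark
outcome described above.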

>>       4) The first commercially available machine to successfully exploit 
>>          parallel processors automatically using "dusty deck", serial 
>>          FORTRAN, the Alliant FX8, (by 4 years and counting).
>
>"And counting"?  What does that mean?

No competitors seem to have replicated Alliant's success in automatically
producing code for a MIMD machine.  So Alliant's lead is 4 years and growing.
By success, I mean that this is the standard way most customers use
the machine.   Most Cray jobs still only use one CPU.

*** more stuff deleted ***
>Bob Colwell               ..!uunet!mfci!colwell

-- 
Phone: (206) 931-8364			Martin E. Lewitt      My opinions are
Domain: lewitt@alliant.COM		2945 Scenic Dr. SE    my own, not my
UUCP: {linus|mit-eddie}!alliant!lewitt  Auburn, WA 98002      employer's. 

cutler@mfci.UUCP (Ben Cutler) (10/11/89)

In article <3456@alliant.Alliant.COM> lewitt@alliant.Alliant.COM (Martin E. Lewitt) writes:
>In article <1068@m3.mfci.UUCP> cutler@mfci.UUCP (Ben Cutler) writes:
>*** some deleted ***
>>VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand.  Ellis
>>describes why the FPS-164 doesn't qualify as a good compilation target in
>>his thesis:
>
>I guess what I find strange is that the definition of a type of computers
>and a (trademarked) compiler technology are so closely tied.

That's precisely the point.  If you want to design the best high level
language possible, you don't have one team design the control structures and
a second team design the data structures and hope the two fit together well!

By analogy, if you want a high performance computer system, you want the
fastest overall system, not a superfast cpu for which you cannot generate a
compiler (and hence fast code).  So you design all the pieces together, in
this case VLIW hardware and Trace Scheduling compilation.  And by extension,
you want a real computer, not an awkward attached processor.
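
As a rough illustration of what the compiler half of that co-design does
(my own toy sketch, not Multiflow's actual algorithm), trace scheduling
begins by selecting a "trace": the most probable acyclic path through the
flow graph, which is then scheduled as if it were one long basic block.
The graph and branch probabilities below are invented.

```python
# Toy sketch of trace *selection*, the first step of trace scheduling.
# Follow the highest-probability successor edge to build a likely
# acyclic path; stop at exits or already-visited blocks (back edges).

def pick_trace(succs, probs, entry):
    """Walk from `entry`, always taking the successor the (assumed)
    profile says is most likely, and return the resulting trace."""
    trace, seen, block = [], set(), entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        out = succs.get(block, [])
        # pick the most probable outgoing edge, or stop at an exit
        block = max(out, key=lambda b: probs[(trace[-1], b)], default=None)
    return trace

# hypothetical flow graph: B0 branches to B1 (90%) or B2 (10%)
succs = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
probs = {("B0", "B1"): 0.9, ("B0", "B2"): 0.1,
         ("B1", "B3"): 1.0, ("B2", "B3"): 1.0}
```

The hard part, which this sketch omits entirely, is the compensation code
needed at the off-trace branch entries and exits so that the aggressively
reordered trace stays correct.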

FPS never saw the big picture.

>From your posting, it appears Ellis correctly understood the limitations
>of the FPS architecture and his analysis of it seems to acknowledge
>some related place for it in history.

Historical footnote: The FPS machine was an attached numerical processor.
It bit the dust when general purpose computers with as much or more power and
greater ease of use appeared on the scene for many fewer dollars.

brooks@vette.llnl.gov (Eugene Brooks) (10/12/89)

In article <3460@alliant.Alliant.COM> lewitt@alliant.Alliant.COM (Martin E. Lewitt) writes:
>I was with Alliant when the Multi-flow Trace 7 first started shipping.
>While we didn't fear the Trace 7 in most application areas, the Trace 7
>did turn out to be a tough competitor in certain areas, e.g.,
>the integer (non-floating point) ECAD codes.  However, after some initial
>competitive success for the Trace 7, I noticed that Alliant was no longer
>losing this market to the Multi-flow, but to the SUN-4 Sparcstation instead.
You vendors of custom processors had better take note of this statement.  It
is really amusing to see the Multi-flow and Alliant fellows bragging at how
well they are doing when single chip microprocessors run circles around their
entire multiprocessor complexes.  The SUN-4 Sparcstation which did all this
"damage" is an absolute DOG compared to the most recent Sparc, MIPS, or i860
chip sets.  No doubt a real terror from Motorola is not far away.  The lifetime
of custom processors, even processors like the Cray line, is real short.
Imagine the surprise of CRI executives when they realize that they are no
longer losing market share to the Japanese supercomputer companies, but are
now losing market share to microprocessors.  We are watching this happen
internally at LLNL and do not doubt that it is happening elsewhere.

There is no defense against the ATTACK OF THE KILLER MICROS!


brooks@maddog.llnl.gov, brooks@maddog.uucp

mash@mips.COM (John Mashey) (10/13/89)

In article <35596@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
....
>There is no defense against the ATTACK OF THE KILLER MICROS!

Well, even if I agree with the fundamental point, this might be a teeny bit
strong :-)

1) Supers and mini-supers and mainframes still do things that micros
don't yet, like huge memories, and very fast I/O, or very high
connectivity, or very high bandwidths in more places.  Those things
cost money.  One can hope that having fast cheap micros will induce people to
build more fast cheap peripheral chips to help some parts of the problems.
Some things will always cost money, no matter what.

2) It's hard to believe that there will not always be a niche for:
	a) The fastest box, at almost any cost
	b) Very fast boxes with good cost/performance

3) The issue raised really is:
	How big are those niches?  How many companies can thrive in them?

4) OPINION: 1-2 companies each in both a) and b).
	This is a high-stakes game, and the ante to play keeps rising.

5) OBSERVATION: RISC micros are munching away at the lower levels of the
minisuper business; going up (i.e., CONVEX) was a good strategy.
Trying to fight it out with them where they can do the job is like the
fight that the huge horde of 1970s mini-makers had with CISC micros:
once the latter got to be reasonably competitive on performance,
the former just got run over, except for the very biggest players.

6) An interesting paper is: "SUPERCOMPUTING, Myth & Reality", by
George J. Luste, of the Physics Department at U. Toronto.
(I think this was in Supercomputing Symposium '89). This was a nice
intro to some of the issues of vector code versus scalar code,
and cost/performance issues of supercomputers versus RISC micros.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086