[comp.arch] Bandwidth and RISC vs. CISC

schow@bnr-public.uucp (Stanley Chow) (04/20/89)

In article <38853@bbn.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>
>  I'm not sure memory bandwidth has anything to do with RISC vs. CISC.
>Remember that there are (at least) two kinds of bandwidth: instructions,
>and data.  I guess I'll concede that RISC forces instruction bandwidth
>up, or requires somewhat larger instruction caches.  However data
>bandwidth is a much more severe limitation on certain programs.  I have
>in mind numerical or scientific codes, which spend most of their time in
>small loops (instructions all in cache) sweeping through large arrays
>(which may well not fit in cache).  The average scientific code appears
>to do roughly 1.5 memory references per floating-point operation.  Can
>your 10-Megaflop (64-bit precision) micro-processor move 120 Megabytes
>per second?  RISC vs. CISC seems largely irrelevant in this domain.
>
>	-- Richard
>	schooler@bbn.com

You are in effect saying the CPU architecture is not related to the
bandwidth requirement. I'd like to point out some ways that they do
interact. 

First of all, there are non-numeric programs; in fact, I would guess that
number crunching is no longer the major user of computing power. Some 
programs have very poor hit-rates in any cache. But, IMO, even in the number
crunching area, RISC is still sub-optimal. 

Second of all, I agree with you that data is a much harder problem. It is
here that I have the most trouble with RISC. It appears to me that to solve
the data bandwidth problem, one must give more information to the CPU. In
particular, a well designed architecture should work to minimize the impact
of data latency.  The basic premise of RISC is to not tell the CPU anything
until the last moment. This strikes me as a funny way of optimizing throughput.

To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
instructions, 1.5 address-adjusting instructions, say 0.5 instructions for
boundary condition checking and 0.5 jump instructions. This adds up to 5
instructions to do 1 FLOP. Many CISC machines can do a FLOP with only 2
instructions.
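
To make the arithmetic concrete, here is a sketch in C (my example, not one
from the original posting): a DAXPY-style loop does 2 flops per iteration
against 2 loads and 1 store, i.e. the 1.5 memory references per flop quoted
above.  On a load/store RISC each iteration also spends instructions on the
loads, the store, the index update and the branch, which is roughly where
the 5-instructions-per-flop figure comes from; a CISC with auto-increment
memory operands folds most of that into the two arithmetic instructions.

    /* 2 flops, 3 memory references per iteration */
    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }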

I can hear it now, everyone is jumping up and down saying, "what a fool,
doesn't he know that all those cycles are free?", "Hasn't he heard of 
pipelining and register scoreboarding?", "but the CISC instructions are slower
so the RISC will still run faster." 

In response, I can only say, work through some real examples and see
how many cycles are wasted. Alternatively, see how many stages of
pipelining are needed to have no wasted cycles. A suitable CISC will find
out earlier that it will be doing another memory reference and can prepare
accordingly. It is even possible to have scatter/gather type hardware to
offload the CPU while maximizing data throughput.

"Compilers can do optimizations", I hear the yelling. This is another
interesting phenomenon - reduce the complexity in the CPU so that the 
compiler must do all these other optimizations. I have also not seen any
indications that a compiler can do anywhere close to an optimal job of
scheduling code or pipelining. Even discounting the NP-completeness of
just about everything, theoretical indications point the other way,
especially when the compiler has to juggle so many conflicting constraints.

It would be interesting to speculate on total system complexity: is it
higher for CISC or for RISC (with its attendant memory and compiler
requirements)?


Stanley Chow  ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public


As soon as flames start to show up, I will probably disown these
opinions to save my skin, at which point, these opinions will no 
longer represent anyone at all. Anyone wishing to be represented 
by these opinions need only say so.

bcase@cup.portal.com (Brian bcase Case) (04/21/89)

>Second of all, I agree with you that data is a much harder problem. It is
>here that I have the most trouble with RISC. It appears to me that to solve
>the data bandwidth problem, one must give more information to the CPU. In
>particular, a well designed architecture should work to minimize the impact
>of data latency.  The basic premise of RISC is to not tell the CPU anything
>until the last moment. This strikes me as a funny way of optimizing throughput.

This strikes me as a funny way of interpreting RISC!!! :-)  There are
several "basic premises" of RISC, and as far as I know, none of them is "to
not tell the CPU anything until the last moment."  Conversely, as far as I
know, one of the basic premises of CISC is not "to tell the CPU everything as
early as possible."  RISC emphasizes exposing hardware to the software (and
vice versa, I guess) so that as much work as possible is avoided.  CISC
emphasizes binding operations together.  This has the effect of doing more
work than is necessary, and therefore taking more cycles.

>To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
>instructions, 1.5 address-adjusting instructions, say 0.5 instructions for
>boundary condition checking and 0.5 jump instructions. This adds up to 5
>instructions to do 1 FLOP. Many CISC machines can do a FLOP with only 2
>instructions.

Who cares about instructions?  How much of the work of the RISC instructions
can be reused (i.e., addressing calculations)?  How much of the CISC work
is wasted?  Try an optimizing compiler....
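
One instance of the reuse Case is pointing at (my illustration, not his):
after strength reduction, the address arithmetic for a[i] collapses to a
single pointer bump that serves both the load and the loop test, instead of
being recomputed on every access or buried inside a complex addressing mode.

    double sum(const double *a, int n)
    {
        const double *p = a, *end = a + n;
        double s = 0.0;
        while (p != end)
            s += *p++;   /* the same pointer feeds the load, the bump, and the loop test */
        return s;
    }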

>"Compilers can do optimizations", I hear the yelling. This is another
>interesting phenomenon - reduce the complexity in the CPU so that the 
>compiler must do all these other optimizations. I have also not seen any
>indications that a compiler can do anywhere close to an optimal job of
>scheduling code or pipelining. Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.

Well, I wasn't yelling, so much as muttering.  So, since the compiler can't
do an optimal job, it might as well not do anything.  Why bother?  I can
tell you that a compiler can do a whole lot better job than some microcode!
At least the compiler has a view of the whole program!  (Or at least a whole
procedure.)

>It would be interesting to speculate on total system complexity: is it
>higher for CISC or for RISC (with its attendant memory and compiler
>requirements)?

Why speculate?  Design a CISC and then a RISC, of equal performance.  Or
look at existing implementations.  And don't confuse on-chip resources with
real complexity (the kind that makes it hard) like the editor of UNIX
<somethingorother> did in his comments on the i860.  He claimed it is the
CISCiest processor yet!

>As soon as flames start to show up, I will probably disown these
>opinions to save my skin, at which point, these opinions will no 
>longer represent anyone at all. Anyone wishing to be represented 
>by these opinions need only say so.

You are entitled to your opinion, of course.  This is not intended to
be a flame.  I'm just putting in my $0.02 too.

alan@rnms1.paradyne.com (Alan Lovejoy) (04/21/89)

In article <423@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>"Compilers can do optimizations", I hear the yelling. This is another
>interesting phenomenon - reduce the complexity in the CPU so that the 
>compiler must do all these other optimizations. I have also not seen any
>indications that a compiler can do anywhere close to an optimal job of
>scheduling code or pipelining. Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.

If optimization is too difficult for compilers, how in the *&^%$#@! is the
hardware going to be able to do it??????!!!!

The compiler knows a LOT more about the instruction stream--and the intent of
the instruction stream--than the hardware does, with one big exception:  the 
hardware knows the dynamic instruction sequence; the compiler does not.  
Unfortunately, even the hardware doesn't have all that much advance notice of 
dynamically variable parameters.

The reason for increasing the primitiveness (a better characterization than
"reducing the complexity") of machine instructions is so that the compiler
CAN "do all these optimizations," not so that it "must."  RISC does not force
optimization; it permits it.  "Complex" or "high-level" instructions
necessarily become too application-specific.  The greater the semantic content
of an instruction, the less its generality.  The more primitive the instruction
semantics, the greater the probability that the instruction only does what
you need, and not what you don't.  Have you ever tried emulating unsigned
multiplication/division on a system that only provides signed integer
arithmetic and comparisons?  
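
A small taste of what that emulation involves (my sketch, not Lovejoy's
code): on two's-complement hardware that only compares signed integers, an
unsigned comparison has to be rebuilt by flipping the sign bits first, and a
full unsigned divide then needs a shift-and-subtract loop on top of that.

    #include <stdint.h>

    /* unsigned a < b, built from a signed comparison only:
       flipping the sign bit maps unsigned order onto signed order */
    int unsigned_less_than(int32_t a, int32_t b)
    {
        return (int32_t)(a ^ INT32_MIN) < (int32_t)(b ^ INT32_MIN);
    }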

Because primitive instructions do less work than "complex" instructions,
they can execute in less TIME.  This means either fewer and/or SHORTER
clock cycles.  At first blush, it would seem that complex instructions
can compensate for this by means of parallelism (e.g., pipelining,
multiple identical, parallel functional units).  But in practice, there
is no such thing as a free lunch.  The steps in the hardware "algorithm"
that implements a complex instruction are usually inherently sequential (fetch
data from memory, do operation, store data to memory).  The parallelizable parts
tend to be the same functions that are parallelizable for primitive 
instructions.  So complex instructions usually gain nothing from this, 
relative to primitive instructions.  
 
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
_________________________Design Flaws Travel In Herds_________________________
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

slackey@bbn.com (Stan Lackey) (04/22/89)

In article <17417@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>of data latency.  The basic premise of RISC is to not tell the CPU anything
>>until the last moment. This strikes me as a funny way of optimizing throughput.
>
>This strikes me as a funny way of interpreting RISC!!! :-)  There are
>several "basic premises" of RISC, and as far as I know, none of them is "to
>not tell the CPU anything until the last moment."  Conversely, as far as I

Stuff like vector instructions, the VAX character string instructions,
VAX CALL/RET, the 680x0 MOVEM, etc.  give the CPU a real strong hint
as to what near-future memory accesses will be.  As memory access times 
become even longer [relative to cycle time], this becomes more important.
And will begin to widen the performance gap, if implemented properly.
RISC architectures don't have the ability to communicate this class 
of information, and if it is added, they won't be RISC's anymore (unless 
Marketing SAYS they are, I guess...)
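
At the source level the contrast looks something like this (my sketch): the
first form announces up front that n consecutive doubles will be moved, so a
block-move instruction or a smart memory system can stream them; the second
reveals the addresses to the hardware one load and one store at a time.

    #include <string.h>

    void copy_block(double *dst, const double *src, int n)
    {
        memcpy(dst, src, n * sizeof(double));   /* one "complex" operation */
    }

    void copy_loop(double *dst, const double *src, int n)
    {
        int i;
        for (i = 0; i < n; i++)                 /* one element per trip */
            dst[i] = src[i];
    }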

>>To execute your 1.0 FLOP, the typical RISC will do about 1.5 memory access
>>instructions, 1.5 address-adjusting instructions, say 0.5 instructions for
>>boundary condition checking and 0.5 jump instructions. This adds up to 5
>>instructions to do 1 FLOP. Many CISC machines can do a FLOP with only 2
>>instructions.
>
>Who cares about instructions?

If each instruction consumes one cycle to issue, I sure do!
(BTW, the Alliant takes one cycle to do MULF (a2)+,fp0; don't tell me
it's impossible.)

>look at existing implementations.  And don't confuse on-chip resources with
>real complexity (the kind that makes it hard) like the editor of UNIX
><somethingorother> did in his comments on the i860.  He claimed it is the
>CISCiest processor yet!

I have a real problem with anything that includes IEEE floating point
AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
features that compilers will never use (rounding modes), features that
are rarely needed that slow things down (denormalized operands), and
features that make things complex that nobody needs (round-to-even).
I'd really like to see someone stand up and say, "Boy, the IEEE
round-to-even is much more accurate than DEC's round .5 up.  I have an
application right here that proves it."  Or, "Gradual underflow is
much better.  I have an application that can be run in single precision
that would need to be run double precision without it."
:-) Stan

ingoldsb@ctycal.COM (Terry Ingoldsby) (04/22/89)

In article <423@bnr-fos.UUCP>, schow@bnr-public.uucp (Stanley Chow) writes:
> "Compilers can do optimizations", I hear the yelling. This is another
> interesting phenomenon - reduce the complexity in the CPU so that the 
> compiler must do all these other optimizations. I have also not seen any
> indications that a compiler can do anywhere close to an optimal job of
> scheduling code or pipelining. Even discounting the NP-completeness of
> just about everything, theoretical indications point the other way,
> especially when the compiler has to juggle so many conflicting constraints.
> 

I am quite naive on this subject (but that won't stop me from throwing in my
two cents worth :^)), but it seems to me that if we still programmed mostly
in assembler, then CISC would beat RISC.  I did a lot of programming of the
8086 using assembler language, and I became (painfully) aware of some of the
unusual instructions, and the difference the choice of a particular register,
or way of doing something, would make on overall performance.  By skillfully
picking my instructions, I could improve performance *significantly*.  On the
other hand, I can't imagine any compiler (apologies to the compiler writers)
smart enough to have figured out what I wanted to do, and to choose the
optimal instructions if I had coded the algorithm in a high level language.
In fact, I suspect that a lot of the weird instructions were never used by
compilers at all.  This means that compilers often generate RISC code for CISC
machines (ie. they use the simplest instructions they can).

On the other hand, I can see that while a RISC processor programmed in assembler
might not be quite as quick as an expertly assembled CISC program, the compiler
has a reasonable chance of generating a good sequence of instructions.  At
least it doesn't have to ask questions like:
  1) If I use register A for this operation, the next 10 instructions will
     be quick, but
  2) would it be better not to use A, use B (slow) and wait until a really
     critical set of instructions comes up to use A?
Even if your compiler is brainy enough to figure that out, there is almost
no way it can recognize that the algorithm I'm performing is a Fast Fourier
Transform.  It will generate the code to perform it instead of using a
(hypothetical) FFT CISC instruction.

My point is that since almost everything is written in high level languages
today, they are better suited for RISC.  For applications that still use
assembler (eg. control systems) CISC makes sense.

But what do I know??

                                  Terry Ingoldsby
                                  Land Information Related Systems
                                  The City of Calgary
                                  ctycal!ingoldsb@calgary.UUCP
                                           or
                             ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (04/22/89)

In article <38971@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <17417@cup.portal.com) bcase@cup.portal.com (Brian bcase Case) writes:
)))of data latency.  The basic premise of RISC is to not tell the CPU anything
)))until the last moment. This strikes me as a funny way of optimizing throughput.
))
))This strikes me as a funny way of interpreting RISC!!! :-)  There are
))several "basic premises" of RISC, and as far as I know, none of them is "to
))not tell the CPU anything until the last moment."  Conversely, as far as I
)
)Stuff like vector instructions, the VAX character string instructions,
)VAX CALL/RET, the 680x0 MOVEM, etc.  give the CPU a real strong hint
)as to what near-future memory accesses will be.  As memory access times 
)become even longer [relative to cycle time], this becomes more important.
)And will begin to widen the performance gap, if implemented properly.
)RISC architectures don't have the ability to communicate this class 
)of information, and if it is added, they won't be RISC's anymore (unless 
)Marketing SAYS they are, I guess...)

I thought that many RISC chips have this property already: load delays.
You tell it to load some register or something or other, but it won't
be valid until n cycles later.  In the meantime, though, you can have it
run the exact instructions that YOU want it to do for your program, and not
what the microcode programmer thought would be a commonly used bundle.  It's
the same effect--just more general purpose.
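
What "using the load delay yourself" means at the source level (my sketch):
fetch the next element while the current one is being consumed, so the load
latency is hidden behind useful work chosen by the compiler or programmer
rather than by a fixed microcoded sequence.

    double sum_pipelined(const double *a, int n)
    {
        double s = 0.0, cur, next;
        int i;
        if (n <= 0)
            return 0.0;
        cur = a[0];                 /* first load issued early */
        for (i = 1; i < n; i++) {
            next = a[i];            /* load for the next iteration ... */
            s += cur;               /* ... overlaps this add */
            cur = next;
        }
        return s + cur;
    }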

)
)I have a real problem with anything that includes IEEE floating point
)AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
)features that compilers will never use (rounding modes), features that
)are rarely needed that slow things down (denormalized operands), and
)features that make things complex that nobody needs (round-to-even).
)I'd really like to see someone stand up and say, "Boy, the IEEE
)round-to-even is much more accurate than DEC's round .5 up.  I have an
)application right here that proves it."  Or, "Gradual underflow is
)much better.  I have an application that can be run in single precision
)that would need to be run double precision without it."
):-) Stan

What do you think would be better?  


Matt Kennel
mbkennel@phoenix.princeton.edu

dgh%dgh@Sun.COM (David Hough) (04/22/89)

In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> 
> I have a real problem with anything that includes IEEE floating point
> AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
> features that compilers will never use (rounding modes), features that
> are rarely needed that slow things down (denormalized operands), and
> features that make things complex that nobody needs (round-to-even).
> I'd really like to see someone stand up and say, "Boy, the IEEE
> round-to-even is much more accurate than DEC's round .5 up.  I have an
> application right here that proves it."  Or, "Gradual underflow is
> much better.  I have an application that can be run in single precision
> that would need to be run double precision without it."

This is certainly the position that DEC took through the IEEE 754 and 854
meetings.  For better or worse, however, all RISC chips that I'm aware of
that have hardware floating point support implement IEEE arithmetic
more or less fully.

The anomaly here, of course, is that common scientific applications that,
by dint of great effort, have been debugged to the point of running
efficiently unchanged on IBM 370, VAX, and Cray, run about as well but
not much better on IEEE systems since they don't exploit any specific
feature of any particular arithmetic system.  Sometimes they run slower
if they underflow a lot in situations that don't matter, AND the
hardware doesn't support subnormal operands and results efficiently.
This is properly viewed as a shortcoming of the hardware/software
system that purportedly implements IEEE arithmetic: even on synchronous
systems you have to be able to hold the FPU for cache misses and page
faults, so similarly you should be able to hold the CPU for exception
misses in the FPU that take a little longer to compute.  On asynchronous
CISC systems like 68881 or 80387 this isn't a problem, but they are
slower in the non-exceptional case, which is why RISC systems are
mostly synchronous.

Conversely, however, programs that take advantage of IEEE arithmetic,
usually unknowingly, don't work nearly as well on 370, VAX, or Cray,
where simple assumptions like 

	if (x != y) /* then it's safe to divide by (x-y) */

no longer hold.
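
A concrete instance of the hazard (my example, in IEEE double): x and y
below are distinct normal numbers whose difference is subnormal.  With
gradual underflow the guarded division is safe; on a flush-to-zero machine
x - y becomes 0 and the "safe" division blows up anyway.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double x = 1.5 * DBL_MIN;   /* 1.5 times the smallest normal */
        double y = DBL_MIN;
        if (x != y)                 /* "then it's safe to divide by (x-y)" */
            printf("%g\n", 1.0 / (x - y));   /* x-y is subnormal, not 0 */
        return 0;
    }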

> an application that can be run in single precision
> that would need to be run double precision without [gradual underflow].

There will never be such an example that satisfies everyone since
you never "need" any particular precision.  After all, any integer
or floating-point computation is fabricated out of one-bit integer
operations.  It's just a matter of dividing up the cleverness between
the hardware and the software.  What you CAN readily demonstrate
are programs (written entirely in one precision)
that are no worse affected by underflow than by normal
roundoff, PROVIDED that underflow be gradual.  Demmel and Linnainmaa
contributed many pages of such analyses to the IEEE deliberations
and to subsequent proceedings of the Symposia on Computer
Arithmetic published by IEEE-CS.  Of course if you are sufficiently
clever you can use higher precision explicitly if provided by the
compiler or implicitly otherwise to produce robust code in the
face of abrupt underflow or even Cray arithmetic.  Many mathematical
software experts are good at this but most regard this as an
evil only made necessary by hardware, system, and language designs
that through ignorance or carelessness become part of the problem
rather than part of the solution.

Not all code is compiled.  For instance, there is a great body of theory
and practice in obtaining computational error bounds in computations
based on interval arithmetic.  Interval arithmetic is
efficient to implement with the directed rounding modes required by IEEE
arithmetic, but you can't write the implementation in standard C or
Fortran.  In integer arithmetic, the double-precise product of 
two single-precise operands,
and the single-precise quotient and remainder of a double-precise
dividend and single-precise divisor, are important in a number of
applications such as base conversion and random number generation,
but there is no way to express the required computations in standard
higher-level languages.
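
For the integer case, here is what "no way to express it" forces on you (a
portable sketch, not any particular machine's idiom): with only 32-bit
arithmetic the 64-bit product of two 32-bit operands has to be reassembled
from 16-bit halves, even though most processors can deliver it in a single
widening-multiply instruction.

    #include <stdint.h>

    void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
    {
        uint32_t al = a & 0xFFFF, ah = a >> 16;
        uint32_t bl = b & 0xFFFF, bh = b >> 16;

        uint32_t p0 = al * bl;          /* contributes to bits  0..31 */
        uint32_t p1 = al * bh;          /* contributes to bits 16..47 */
        uint32_t p2 = ah * bl;          /* contributes to bits 16..47 */
        uint32_t p3 = ah * bh;          /* contributes to bits 32..63 */

        uint32_t mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF);

        *lo = (p0 & 0xFFFF) | (mid << 16);
        *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
    }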

As to rounding halfway cases to even, the advantage over biased
rounding is perhaps simplest understood by the observation
that 1+(eps/2) rounds to 1 rather than 1+eps.  The "even" result
is more likely to be the one you wanted if you had a preference.
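
The halfway case is easy to exhibit (my example, in IEEE double): 1 + eps/2
sits exactly between 1 and 1 + eps, and round-to-nearest-even picks 1, the
neighbour with the even last bit.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double halfway = 1.0 + DBL_EPSILON / 2.0;
        printf("%d\n", halfway == 1.0);   /* 1 under round-to-nearest-even */
        return 0;
    }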

Such rounding is no more expensive than biased rounding
on a system that is required to provide directed rounding modes as well.  
It's not the bottleneck on any hardware IEEE implementation of which I'm aware.
I have heard that adder carry propagate time and multiplier array size 
are the key constraints with a floating-point chip; hardware experts
will correct me if I'm wrong.  Memory bandwidth tends to be the key constraint
on overall system performance unless floating-point division and sqrt
dominate.  The last describes a minority of programs but they are quite
important in some influential circles.

David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

yuval@taux01.UUCP (Gideon Yuval) (04/22/89)

My previous posting got garbled. Here's an ungarbled version.
Stan Lackey, in his message <38971@bbn.com>, says:
>I'd really like to see someone stand up and say, "Boy, the IEEE
>round-to-even is much more accurate than DEC's round .5 up.  I have an
>application right here that proves it."  Or, "Gradual underflow is
>much better.  I have an application that can be run in single precision
>that would need to be run double precision without it."

The  video-tapes  of  Kahan's  "floating-point  indoctrination"  course (Sun,
May-Jul/88) have "somebody" (i.e. W.Kahan)  standing  up &  saying  precisely
that. Sneak a view if you can.


-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
                                                TWX: 33691, fax: +972-52-558322

slackey@bbn.com (Stan Lackey) (04/25/89)

In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:
>In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>> 
>> I have a real problem with anything that includes IEEE floating point
>> AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
>> features that compilers will never use (rounding modes), features that
>> are rarely needed that slow things down (denormalized operands), and
>> features that make things complex that nobody needs (round-to-even).
>> (emotional stuff deleted)
>Not all code is compiled.
I agree - I was just quoting the RISC guys.
>Interval arithmetic is
>efficient to implement with the directed rounding modes required by IEEE
>arithmetic, but you can't write the implementation in standard C or...
>Such rounding is no more expensive than biased rounding
>on a system that is required to provide directed rounding modes as well.  
>It's not the bottleneck on any hardware IEEE implementation of which I'm aware.

Having to detect EXACTLY .5 is a bottleneck in terms of transistor
count, design time, and diagnostics.  The extra execution time may not
affect overall cycle time, but the RISC guys say that any added
hardware increases cycle time (they usually use it in the context of
instruction decode).

>I have heard that adder carry propagate time and multiplier array size 
>are the key constraints with a floating-point chip; hardware experts
>will correct me if I'm wrong.  
These are probably the largest single elements in most implementations.
But, as the hardware guys will tell you, it's the exceptions that get you.
Note: It's prealigning a denormalized operand before a multiplication
that REALLY hurts.

>Memory bandwidth tends to be the key constraint
>on overall system performance unless floating-point division and sqrt
>dominate.
Absolutely true, but not very relevant.

>David Hough

Lots of valid uses of IEEE features listed.  I didn't mean that IEEE
was bad or useless, it's just that it was architected when CISC was
the trend, and it shows.  Especially after my own efforts in an IEEE
implementation, I am glad to see from this posting and others that at
least a few users can make use of the features.  I think the RISC
implementers should have a RISC-style floating point standard, though.

dik@cwi.nl (Dik T. Winter) (04/25/89)

In article <39049@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
 > Lots of valid uses of IEEE features listed.  I didn't mean that IEEE
 > was bad or useless, it's just that it was architected when CISC was
 > the trend, and it shows.  Especially after my own efforts in an IEEE
 > implementation, I am glad to see from this posting and others that at
 > least a few users can make use of the features.  I think the RISC
 > implementers should have a RISC-style floating point standard, though.

Oh, go ahead, but make sure you have some numerical analysts around to
help you, unless you are willing to make the same mistakes that numerous
designers before you have made.

To some of your points:

Round to even gives a better overall round-off error than truncate to
zero (i.e. better in larger expressions).

Gradual underflow is, as far as I see, not really needed, but the
alternative is to trap on underflow and allow the program to recover.  This
This would be just as hard, if not harder, in my opinion.

David Hough remarked that many applications are written to work
properly on a lot of machines and that they would not benefit very
much from IEEE arithmetic.  I might say that for a number of those
applications this was achieved with much trouble.  The original
design would, in a lot of cases, have benefitted if *only* IEEE
arithmetic had to be considered.
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

dgh%dgh@Sun.COM (David Hough) (04/25/89)

In article <39049@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:
> >In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:

> >Such rounding is no more expensive than biased rounding
> >on a system that is required to provide directed rounding modes as well.  

> Having to detect EXACTLY .5 is a bottleneck in terms of transistor
> count, design time, and diagnostics.  The extra execution time may not
> affect overall cycle time, but the RISC guys say that any added
> hardware increases cycle time (they usually use it in the context of
> instruction decode).

EXACTLY .5 is no harder than correct directed rounding.  You have to
(in principle) develop all the digits, propagate carries, and remember
whether any shifted off were non-zero.  Division and sqrt are simplified
by the fact that EXACTLY .5 can't happen.
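
In guard/round/sticky terms (a sketch of the logic, not of any particular
chip): "exactly .5" is just the round bit set with the sticky bit clear, and
the same sticky information is needed anyway for the directed modes, which
is Hough's point about the cost.

    #include <stdint.h>

    /* round-to-nearest-even on a truncated significand, given the first
       bit shifted off (round) and the OR of all later bits (sticky);
       the caller renormalizes if the increment carries out */
    uint64_t round_nearest_even(uint64_t mant, int round_bit, int sticky)
    {
        if (round_bit && (sticky || (mant & 1)))
            mant += 1;
        return mant;
    }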

> Note: It's prealigning a denormalized operand before a multiplication
> that REALLY hurts.

This event is rare enough that it needn't be as fast as a normal
multiplication, so it's OK to slow down somewhat by holding the CPU,
but not so rare that you want to punt to software.  By throwing enough
hardware at the problem you can make it as fast as the normal case.
I don't advocate that but that's my understanding of what the Cydra-5 did.  

Interestingly enough, the early drafts of 754 specified that default 
handling of subnormal numbers be in a "warning mode" and that the more expensive
"normalizing mode" be an option.  This was with highly-pipelined 
implementations very much in mind.  However a gang of early implementers from
Apple managed to talk a majority of the committee into making the
normalizing mode the default.  The normalizing mode is easier to understand
and easier to implement in software.  Warning mode is a lot cheaper to
pipeline, however.  
I was part of the gang but I've since had opportunity to repent at leisure.

> Lots of valid uses of IEEE features listed.  I didn't mean that IEEE
> was bad or useless, it's just that it was architected when CISC was
> the trend, and it shows.  Especially after my own efforts in an IEEE
> implementation, I am glad to see from this posting and others that at
> least a few users can make use of the features.

Remember IEEE 754 and 854 are standards for a programming environment.
How much of that is to be provided by hardware and how much by software
is up to the implementer; in contrast RISC is a hardware design philosophy.  
The MC68881 is probably the best-known attempt
to put practically everything in the hardware so the software wouldn't
screw it up as usual.  The Weitek 1032/3 and their descendants and
competitors are examples of minimal hardware implementations that support
complete IEEE implementations once appropriate software is added.  
Evidently the first generations of such chips were too minimal; 
for instance nowadays
everybody has correctly-rounded division and sqrt in hardware, 
rather than software, on chips intended for general-purpose computation.

> I think the RISC
> implementers should have a RISC-style floating point standard, though.

There's a very minimalist floating-point standard, that of S. Cray,
which is very cheap to implement entirely in hardware (compared
to other standards at similar performance levels).  
The only hard part is writing the software that uses it.  So far
no other hardware manufacturers have seen fit to adopt Cray arithmetic.
IBM 370 architecture has been more widely imitated but not because of
any inherent wonderfulness for mathematical software.  DEC VAX
floating-point architecture
is well defined and a number of non-DEC implementations are available. 
But divide and sqrt are no easier than IEEE, and IEEE double precision
addition and multiplication are available now in one or two cycles
on some implementations. 
Does anybody still think there would be an advantage to VAX, 370, or Cray
floating-point architecture for a PC or workstation?



David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

slackey@bbn.com (Stan Lackey) (04/25/89)

In article <100891@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:
>EXACTLY .5 is no harder than correct directed rounding.  You have to
>(in principle) develop all the digits, propagate carries, and remember
>whether any shifted off were non-zero.  Division and sqrt are simplified
>by the fact that EXACTLY .5 can't happen.

OK, it's only a problem in multiplication.

>> Note: It's prealigning a denormalized operand before a multiplication
>> that REALLY hurts.
>This event is rare enough that it needn't be as fast as a normal
>multiplication, so it's OK to slow down somewhat by holding the CPU,
>but not so rare that you want to punt to software.  By throwing enough
>hardware at the problem you can make it as fast as the normal case.
>I don't advocate that but that's my understanding of what the Cydra-5 did.  

Ever design a pipelined machine?  It was probably easier in the Cydra
to make everything assume the worst case, than to deal with the
pipeline getting messed up.  The new micros (at least the i860) trap
and expect software to fix things up, which includes parsing the
instructions in the pipe, and fixing up the saved version of the
internal data pipeline.  I've seen statements in this newsgroup like
"not usable in a general purpose environment" when referring to the
i860.  Talk about debug time!  In the Alliant we wanted to get the
design done, and fit it on one board, so we shut denorms off.  (It 
sets the exception bits, though.)  After shipping for 4 years, there
have still been no complaints.

>for instance nowadays
>everybody has correctly-rounded division and sqrt in hardware, 
except Intel

Re: one or two-cycle DP IEEE mul/add exist: Alliant is the only one I
know of, but it's because 1) the cycle time is abnormally long and 2)
denorms are not supported.  I think it's valid to say that if floating
point (esp DP) ops take one cycle, your cycle time is too long.

>> I think the RISC
>> implementers should have a RISC-style floating point standard, though.
>DEC VAX floating-point architecture
>is well defined and a number of non-DEC implementations are available. 
Sounds like a good idea to me!  The IBM one is not useful, and (so it is
said) the Cray one is difficult to use.  The VAX one is accurate enough 
and has enough range for normal use, and if F or G aren't enough, there's
always H :-)
-Stan

dgh%dgh@Sun.COM (David Hough) (04/26/89)

In article <39095@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> The new micros (at least the i860) trap
> and expect software to fix things up, which includes parsing the
> instructions in the pipe, and fixing up the saved version of the
> internal data pipeline.  I've seen statements in this newsgroup like
> "not usable in a general purpose environment" when referring to the
> i860.  

I agree.  The i860 appears never to have been intended to support
an efficient implementation of IEEE 754.

> In the Alliant we wanted to get the
> design done, and fit it on one board, so we shut denorms off.
... 
> Re: one or two-cycle DP IEEE mul/add exist: Alliant is the only one I

Regardless of how fast you do the arithmetic, if

	(x != y)
and
	(x-y) != 0

are not equivalent for finite x, you don't conform to IEEE 754 or 854.
Subnormal numbers permit this equivalence.  754 committee members
were irked in advance, so to speak, by the prospect that some vendors
would claim conformance for such implementations.

> said) the Cray one is difficult to use.  The VAX one is accurate enough 
> and has enough range for normal use, and if F or G aren't enough, there's
> always H :-)

The VAX standard is D format double precision, not G.  Many people
consider it inadequate because unlike a pocket calculator it won't
accommodate 10^(+-99).

David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

cik@l.cc.purdue.edu (Herman Rubin) (04/26/89)

In article <288@ctycal.UUCP>, ingoldsb@ctycal.COM (Terry Ingoldsby) writes:
> In article <423@bnr-fos.UUCP>, schow@bnr-public.uucp (Stanley Chow) writes:
> > "Compilers can do optimizations", I hear the yelling. This is another
> > interesting phenomenon - reduce the complexity in the CPU so that the 
> > compiler must do all these other optimizations. I have also not seen any
> > indications that a compiler can do anywhere close to an optimal job of
> > scheduling code or pipelining. Even discounting the NP-completeness of
> > just about everything, theoretical indications point the other way,
> > especially when the compiler has to juggle so many conflicting constraints.
  
> I am quite naive on this subject (but that won't stop me from throwing in my
> two cents worth :^)), but it seems to me that if we still programmed mostly
> in assembler, then CISC would beat RISC.  I did a lot of programming of the
> 8086 using assembler language, and I became (painfully) aware of some of the
> unusual instructions, and the difference the choice of a particular register,
> or way of doing something, would make on overall performance.  By skillfully
> picking my instructions, I could improve performance *significantly*.  On the
> other hand, I can't imagine any compiler (apologies to the compiler writers)
> smart enough to have figured out what I wanted to do, and to choose the
> optimal instructions if I had coded the algorithm in a high level language.

No apologies are due to the compiler writers.  Rather, criticism is due to
them for the arrogance they took in leaving out the possibility for the
programmer to do something intelligent.  The HLLs are woefully inadequate,
and I would not be surprised that the ability to do intelligent coding using
the machine capabilities may be destroyed by learning such restrictive coding
procedures first.

You are far less naive than those gurus who think they know all the answers.

> In fact, I suspect that a lot of the weird instructions were never used by
> compilers at all.  This means that compilers often generate RISC code for CISC
> machines (ie. they use the simplest instructions they can).

You are so right. I have no trouble using these not very weird instructions
to do what the compiler writer did not anticipate.  Nobody can anticipate all
my needs, but nobody should say that he has given me all the tools, either.

As far as the difficulty of using machine language, I know of no machine as
complicated as a HLL, although some may be getting a little close.  The 
present assembler languages are another matter, though, and unnecessarily
so.

> On the other hand, I can see that while a RISC processor programmed in assembler
> might not be quite as quick as an expertly assembled CISC program, the compiler
> has a reasonable chance of generating a good sequence of instructions.  At
> least it doesn't have to ask questions like:
>   1) If I use register A for this operation, the next 10 instructions will
>      be quick, but
>   2) would it be better not to use A, use B (slow) and wait until a really
>      critical set of instructions comes up to use A?
> Even if your compiler is brainy enough to figure that out, there is almost
> no way it can recognize that the algorithm I'm performing is a Fast Fourier
> Transform.  It will generate the code to perform it instead of using a
> (hypothetical) FFT CISC instruction.

I have run into situations where the number of possible programs is well into
the thousands, at least.  In many cases I can see that using some operations
cannot pay unless those operations are hardware.  How is the programmer going
to write the program?  Those who want it to be machine independent make things
difficult.

A good example of using CISC versus RISC is division, with a quotient and 
remainder.  Some RISC machines do not even have this.  Now I suggested to
this group that the instruction be modified to allow the programmer to
specify which quotient and remainder are to be used as a function of the
signs.  This is trivial in hardware, and probably would not extend the
time by more than a small fraction of a cycle, although three conditional
transfers are involved.  The point is that they can be made while the
lengthy division is being carried out by the division unit.
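
The choice Rubin wants exposed can be seen even in C (my sketch): hardware
and the language give a truncating quotient and remainder, and a program
that wants the floored pair has to patch it up with exactly the kind of
sign-dependent fixups he is describing.

    /* floored division built on top of truncating / and % ;
       when / truncates toward zero, % takes the sign of the dividend */
    void floor_divmod(int n, int d, int *q, int *r)
    {
        *q = n / d;
        *r = n % d;
        if (*r != 0 && ((*r < 0) != (d < 0))) {
            *q -= 1;        /* remainder takes the sign of the divisor */
            *r += d;
        }
    }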

Another example is floating point arithmetic.  The RISCy CRAY, on problems
with rigid vectors, will run rings around the CYBER 205 in single precision
floating point (around 14 digits).  If we now change to double precision,
we now get a time factor of about 15 in favor of the CYBER.  Many problems
in which non-rigid vectors are appropriate also favor the CYBER.

Considering the cost of the CPU relative to the rest of the computer, I 
would suggest VRISC as the profitable way to go.  But we need input from
people like Terry and me about the apparently crazy instructions which 
will speed up throughput.

> My point is that since almost everything is written in high level languages
> today, they are better suited for RISC.  For applications that still use
> assembler (eg. control systems) CISC makes sense.

Must we only use tools that would appal an artist?  Programming is an art;
artists do not learn by filling in the squares with numbered colors.

> But what do I know??

More than the HLL gurus.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

peter@ficc.uu.net (Peter da Silva) (04/27/89)

Herman.

Please describe a language, higher level than forth, that will provide
all the tools you feel you need in a HLL. I am about convinced that such
a language is impossible.

Thanks,
	your partner in monomania,
		Peter da Silva.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

chuck@melmac.harris-atd.com (Chuck Musciano) (04/27/89)

Oh well, into the breach...

In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>No apologies are due to the compiler writers.  Rather, criticism is due to
>them for the arrogance they took in leaving out the possibility for the
>programmer to do something intelligent.  The HLLs are woefully inadequate,
>and I would not be surprised that the ability to do intelligent coding using
>the machine capabilities may be destroyed by learning such restrictive coding
>procedures first.

     Some misdirected flames here.  If you want to blame anyone, blame the
language designer, not the compiler writer.  Us poor compiler writers just
sit around, horrified, as the hardware guys take more and more of the hard
part and give it to us.  :-)

     If you dislike the current crop of HLLs so much, you are free to design
your own.  As one who has designed and implemented several languages on a 
variety of systems, I know how easy it is to take potshots at the language
implementors.  Go through the loop yourself, and then complain.  Designing
anything which will please some segment of the world is very difficult.
Designing a language which is elegant, orthogonal, easy to learn and use,
easy to implement on a variety of machines and that will appeal to a large
number of users is almost impossible.

Chuck Musciano			ARPA  : chuck@trantor.harris-atd.com
Harris Corporation 		Usenet: ...!uunet!x102a!trantor!chuck
PO Box 37, MS 3A/1912		AT&T  : (407) 727-6131
Melbourne, FL 32902		FAX   : (407) 727-{5118,5227,4004}

mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/28/89)

In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>Another example is floating point arithmetic.  The RISCy CRAY, on problems
>with rigid vectors, will run rings around the CYBER 205 in single precision
>floating point (around 14 digits).  If we now change to double precision,
>we now get a time factor of about 15 in favor of the CYBER.  Many problems
>in which non-rigid vectors are appropriate also favor the CYBER.

>Herman Rubin, Dept. of Statistics,hrubin@l.cc.purdue.edu

(1) What is a "rigid vector"?

(2) On 64-bit vector operations with long vectors, the Crays do not
    "run rings around" the Cyber 205.  The asymptotic speeds (MFLOPS) are:
		Cray-1		Cyber 205	Cray X/MP
		 160		   200		   235

(3) Both the X/MP and 205 perform "double precision" (128-bit) arithmetic
    in software, and experience a slow-down of close to a factor of 100
    relative to 64-bit vector operations.
-- 
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

dave@celerity.uucp (Dave Smith) (04/29/89)

In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <288@ctycal.UUCP>, ingoldsb@ctycal.COM (Terry Ingoldsby) writes:
>> My point is that since almost everything is written in high level languages
>> today, they are better suited for RISC.  For applications that still use
>> assembler (eg. control systems) CISC makes sense.
>
>Must we only use tools that would appal an artist?  Programming is an art;
>artists do not learn by filling in the squares with numbered colors.
>

	A good artist can create fine art with crayons or oil paints (or
whatever).  Assembly languages are definitely the crayons of the computer
world.  RISCs are kind of like the little box of Crayolas with 16 colors,
a CISC, like the VAX, is like the big box with 64.  Ever notice how it was
that the black, red and blue crayons always ended up the smallest in that
big box and the mauve crayon looked brand new?

	The problem I have with RISC designs is that they use up too much
memory bandwidth.  What I think would be better is something that gave you
the flexibility of a RISC (not being tied into the designer's particular
idea of how a string instruction should be implemented, for example) but 
with the memory bandwidth efficiency of a CISC.  I'm not familiar enough
with VLIW to make any good judgements on it, but it seems as though it's
a reasonable way to go.

David L. Smith
FPS Computing, San Diego
ucsd!celerity!dave
"Repent, Harlequin!," said the TickTock Man

cik@l.cc.purdue.edu (Herman Rubin) (04/29/89)

In article <1984@trantor.harris-atd.com>, chuck@melmac.harris-atd.com (Chuck Musciano) writes:
> Oh well, into the breach...
> 
> In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >No apologies are due to the compiler writers.  Rather, criticism is due to
> >them for the arrogance they took in leaving out the possibility for the
> >programmer to do something intelligent.  The HLLs are woefully inadequate,
> >and I would not be surprised that the ability to do intelligent coding using
> >the machine capabilities may be destroyed by learning such restrictive coding
> >procedures first.
> 
>      Some misdirected flames here.  If you want to blame anyone, blame the
> language designer, not the compiler writer.  Us poor compiler writers just
> sit around, horrified, as the hardware guys take more and more of the hard
> part and give it to us.  :-)

Not all blame goes to the language designer.  The implementation of asm in C
can be done to give much more benefit to the programmer.  The implementation
knows where the variable xyz is and can substitute it in the assembler
instruction.  This has nothing to do with the language.

And this atrocious use of underscores!  Why should C prefix an underscore to
all its names for systems purposes?  I know the reason, but I cannot respect
the intelligence of those who did not take the other action.  If we have
programs created by several languages, we should have no more than a calling
sequence problem, not a name problem.  If I call a buffer refill program with
the calling sequence I advocate, passing the pointer descriptor, it should
make no difference whether the subroutine is written in Fortran, C, Pascal,
APL, or anything else, and it should make no difference whether it is called
from any of those languages.  If they all prepended an underscore, and each
could use the others' names, it would not be too bad.  But some Fortrans leave
the name alone, some prepend and postpend, Pascal does not allow underscores
in the middle of a name, etc.

>      If you dislike the current crop of HLLs so much, you are free to design
> your own.  As one who has designed and implemented several languages on a 
> variety of systems, I know how easy it is to take potshots at the language
> implementors.  Go through the loop yourself, and then complain.  Designing
> anything which will please some segment of the world is very difficult.
> Designing a language which is elegant, orthogonal, easy to learn and use,
> easy to implement on a variety of machines and that will appeal to a large
> number of users is almost impossible.

I am asking for one thing you have left out: it should be possible to write
efficient code.  I will throw out elegant and orthogonal completely.  If
the machine is not orthogonally designed, why should the language be?  And
most machines are not.

The first thing needed is a macro assembler in which the "macro name" can be
a pattern.  For example, I would like    x = y - z    to be the =- macro.
Allow the user to use these ad lib.  This way the language can be extended.
The #defines, and even the user-overloaded operators in C++ do not achieve
this.  If a machine instruction is complicated, it may be necessary to have
a complicated design, as well as type overrides, etc.  An example from the
CYBER 205 is the following large class of vector instruction, where A or B
can be either vector or scalar.  There are many options in this, and I point
this out.

		C'W =t -|A'X opmod | B'Y /\ ~ W
		 11  2 34 55   666 7  88 99 a 9

Notice that I have 10 options for each opcode (someone might say that option
6 is part of the opcode, but there are natural defaults).  Since this is a
single machine instruction on this machine, I MUST be able to use it, and I
do not wish to have to use the clumsy notation provided by the "CALL8"
procedures.  I do not claim that this is optimal notation, and I would 
expect you to write it somewhat differently.  For example, you could always
leave out the /\; I put it in merely for clarity.
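
For readers who don't know the 205, here is a rough C reading of the kind of
single instruction that notation denotes (field meanings are my guess from
the description above, not a decoding of the real opcode): an element-wise
operation with optional negation and absolute value, stored only where the
control bit vector W is zero.

    #include <math.h>

    void masked_vector_op(double *c, const double *a, const double *b,
                          const unsigned char *w, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            if (!w[i])                       /* "/\ ~W": store where W is 0 */
                c[i] = -fabs(a[i] + b[i]);   /* "-|A op B|", with op = +    */
    }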

The best semi-portable procedures for generating such random variables as
normal are easily described.  Coding them on a CYBER 205 efficiently is
trivial.  Coding a slightly different version of them on an IBM 3090 is 
not difficult.  Coding them (any version) on a CRAY X-MP is an interesting
challenge.  Coding them on a CRAY 1 is a major headache, and not vectorizable
for much of the procedure.  These vector machines are that different.  So
should the code be easy to implement on a large variety of machines?  But
the programmer who does not understand the machine cannot code well on that
machine.

Some C-like language with macro augmentation is probably the answer.  But
types and operation symbols should be introduced by the user at will.  And
all manners of arrays should be included, and whatever else the user can
come up with.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

cik@l.cc.purdue.edu (Herman Rubin) (04/29/89)

In article <632@loligo.cc.fsu.edu>, mccalpin@loligo.cc.fsu.edu (John McCalpin) writes:
> In article <1262@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> 
> >Another example is floating point arithmetic.  The RISCy CRAY, on problems
> >with rigid vectors, will run rings around the CYBER 205 in single precision
> >floating point (around 14 digits).  If we now change to double precision,
> >we now get a time factor of about 15 in favor of the CYBER.  Many problems
> >in which non-rigid vectors are appropriate also favor the CYBER.
> 
> >Herman Rubin, Dept. of Statistics,hrubin@l.cc.purdue.edu
> 
> (1) What is a "rigid vector"?

Rigid vector operations are those in which the position of an element
is essentially unchanged, except by scalar shifts.  Examples of non-
rigid vector operations are removing the elements of a vector corresponding
to 0's in a bit vector with subsequent shrinking of the length of the vector,
inserting the first elements for vector a in locations in vector b selected
by a bit vector, merging under control of a bit vector, etc.
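
In scalar code, the first of those non-rigid operations looks like this (my
sketch); on the 205 it is a single compress instruction, while a machine
without it has to run the data-dependent loop below and loses most of the
benefit of its vector pipes.

    /* compress src under a bit vector; elements move position and the
       result is shorter, which is what makes the operation "non-rigid" */
    int compress(double *dst, const double *src,
                 const unsigned char *bits, int n)
    {
        int i, k = 0;
        for (i = 0; i < n; i++)
            if (bits[i])
                dst[k++] = src[i];
        return k;       /* new length */
    }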

> (2) On 64-bit vector operations with long vectors, the Crays do not
>     "run rings around" the Cyber 205.  The asymptotic speeds (MFLOPS) are:
> 		Cray-1		Cyber 205	Cray X/MP
> 		 160		   200		   235

Asymptotic speeds are much less often approximated on the CYBER, unfortunately.
The CYBER also can only do one vector operation at a time, but there is no
interference, in general, on the CYBER for vector and scalar.  I prefer the
CYBER myself, and I guess I took the most pessimistic view.  The actual
ratios depend on a lot of things.

> (3) Both the X/MP and 205 perform "double precision" (128-bit) arithmetic
>     in software, and experience a slow-down of close to a factor of 100
>     relative to 64-bit vector operations.

Double precision has 96 bits in the mantissa on the X/MP and 94 on the CYBER.*
If one is willing to lose 1-2 bits of accuracy on the CYBER, the slow-down factor
can be reduced to around 5.  The CYBER has the direct capability of getting
both the most and least significant part of the sum or product, with two 
instruction calls, but no additional overhead, but the CRAYs only get the
most significant part; this is the biggest problem, and requires that
half-precision be used to get double precision.

*No flames, please.  There is disagreement on how the number of bits is to be
counted.  This is the number of significant bits in a sign-magnitude
representation.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

aglew@mcdurb.Urbana.Gould.COM (04/30/89)

>Herman.
>
>Please describe a language, higher level than forth, that will provide
>all the tools you feel you need in a HLL. I am about convinced that such
>a language is impossible.
>
>Thanks,
>	your partner in monomania,
>		Peter da Silva.
>-- 
>Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

If I remember correctly, POP-2 had some nice mechanisms for accessing 
the underlying machine. As did Algol-68.

Myself, I'm just about happy with GNU CC style assembly function
inlining, and C++ function overloading and typing. Although I haven't
used G++ yet, to see what they would feel like together.
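
For what it's worth, the GNU CC facility aglew mentions looks roughly like
this in today's extended-asm syntax (details have changed since 1989, and
the instruction is x86 and purely illustrative): the compiler picks the
registers and substitutes them into the template, which is essentially the
"compiler knows where the variable is" behavior Herman Rubin asked for.

    static inline unsigned int add_via_asm(unsigned int a, unsigned int b)
    {
        unsigned int r = a;
        __asm__ ("addl %1, %0" : "+r"(r) : "r"(b));
        return r;
    }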

khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)

In article <423@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
> ... cogent argument deleted..
>
>I can hear it now, everyone is jumping up and down saying, "what a fool,
>doesn't he know that all those cycles are free?", "Hasn't he heard of 
>pipelining and register scoreboarding?", "but the CISC instructions are slower
>so the RISC will still run faster." 
> ... and more

>"Compilers can do optimizations", I hear the yelling. This is another
>interesting phenomenon - reduce the complexity in the CPU so that the 
>compiler must do all these other optimizations. I have also not seen any
>indications that a compiler can do anywhere close to an optimal job of
>scheduling code or pipelining. Even discounting the NP-completeness of
>just about everything, theoretical indications point the other way,
>especially when the compiler has to juggle so many conflicting constraints.
>

Cydrome and Multiflow have both demonstrated that it is possible to
move much of the analysis to the compiler (with an increase in compile
times :>). 

The original paper on the Bulldog compiler by Ellis (well, it's a book
:>) describes how the memory bandwidth problem can be dealt with, and
in many cases quite well.

The Cydra 5 compiler could, in interesting programs (but by no means
all) generate optimal code for key loops (as long as the vectors were
long; but this was a hardware constraint, not a compiler issue).

It should be noted that both Cydrome and Multiflow chose to have an
almost fully exposed pipeline, and no scoreboarding or other nastiness.

Memory bandwidth is key to delivering high performance, but the
RISCiness or CISCiness of the processor (which only affects the
instruction side of things) would seem to be a non-issue.
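
To give the flavor of what these compilers do to a simple loop, here it is
written out by hand in C; this is only an illustration, not Cydra 5 or
Multiflow output, and the loop and names are made up:

    /* z[i] = a*x[i] + y[i], software-pipelined by hand: the loads for
       iteration i+1 are issued while iteration i does its arithmetic,
       so memory latency overlaps the floating-point work instead of
       stalling the (exposed) pipeline.                                 */
    void saxpy(int n, double a, const double *x, const double *y, double *z)
    {
        int i;
        double xi, yi;
        if (n <= 0)
            return;
        xi = x[0];                    /* prologue: start the first loads */
        yi = y[0];
        for (i = 0; i < n - 1; i++) {
            double xn = x[i + 1];     /* next iteration's loads...       */
            double yn = y[i + 1];
            z[i] = a * xi + yi;       /* ...overlap this multiply-add    */
            xi = xn;
            yi = yn;
        }
        z[n - 1] = a * xi + yi;       /* epilogue: drain the pipeline    */
    }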
Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist 
I Voted for Bill &    |   Languages and Performance Tools. 
Opus            (* strange as it may seem, I do more engineering now     *)

khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)

In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:
> ... much good reading deleted..

>Not all code is compiled.  For instance, there is a great body of theory
>and practice in obtaining computational error bounds in computations
>based on interval arithmetic.  Interval arithmetic is
>efficient to implement with the directed rounding modes required by IEEE
>arithmetic, but you can't write the implementation in standard C or
>Fortran.  In integer arithmetic, the double-precise product of 
>two single-precise operands,
>and the single-precise quotient and remainder of a double-precise
>dividend and single-precise divisor, are important in a number of
>applications such as base conversion and random number generation,
>but there is no way to express the required computations in standard
>higher-level languages.
>

I believe that the next rev of Fortran (new socially approved
spelling) will allow us to write this sort of code. I also think that
it can be done in Ada.
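Until then the integer cases can at least be spelled out by hand from
half-word pieces.  A sketch in C (the function name is mine, and it assumes
unsigned long is exactly 32 bits, as on the micros under discussion):

    /* Double-precise (64-bit) product of two single-precise (32-bit)
       unsigned operands, built from 16-bit halves -- the sequence Hough
       notes cannot be expressed directly when the language offers no
       wider type.                                                      */
    void umul32x32(unsigned long a, unsigned long b,
                   unsigned long *hi, unsigned long *lo)
    {
        unsigned long a0 = a & 0xFFFF, a1 = a >> 16;
        unsigned long b0 = b & 0xFFFF, b1 = b >> 16;
        unsigned long p00 = a0 * b0, p01 = a0 * b1;
        unsigned long p10 = a1 * b0, p11 = a1 * b1;
        unsigned long mid = p01 + p10;                     /* may wrap     */
        unsigned long c   = (mid < p01) ? 0x10000UL : 0;   /* wrap -> 2^16 */
        *lo = p00 + (mid << 16);
        *hi = p11 + (mid >> 16) + c + ((*lo < p00) ? 1 : 0);
    }

The single-precise quotient and remainder of a double-precise dividend go
the same way, only with a shift-and-subtract loop instead of four multiplies.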
Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist 
I Voted for Bill &    |   Languages and Performance Tools. 
Opus            (* strange as it may seem, I do more engineering now     *)

khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)

In article <39049@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <100524@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:
>>In article <38971@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
>>> 
>>> I have a real problem with anything that includes IEEE floating point
>>> AND calls itself a RISC.  IEEE FP violates every rule of RISC; it has
............. deleted

>I agree - I was just quoting the RISC guys.

Which RISC guys?  Note dgh's (and my) corporate affiliation ... :>:>

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist 
I Voted for Bill &    |   Languages and Performance Tools. 
Opus            (* strange as it may seem, I do more engineering now     *)

khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)

In article <100891@sun.Eng.Sun.COM> dgh%dgh@Sun.COM (David Hough) writes:

>
>This event is rare enough that it needn't be as fast as a normal
>multiplication, so it's OK to slow down somewhat by holding the CPU,
>but not so rare that you want to punt to software.  By throwing enough
>hardware at the problem you can make it as fast as the normal case.
>I don't advocate that but that's my understanding of what the Cydra-5 did.  

Well, the details are probably the only things of value for the
Cydrome investors to fence (I mean sell :>), so we won't go into the
details.  The Cydra 5 was constructed so that fp mults could be _issued_
EVERY cycle, and took a fixed number of cycles to complete.
There was not enough real estate (it was only a 17-board ECL numeric
engine) to do the same for divide; this was something of a
performance bottleneck (divide may be rare, but when it happens, it
happens often! :>)
Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist 
I Voted for Bill &    |   Languages and Performance Tools. 
Opus            (* strange as it may seem, I do more engineering now     *)

yair@tybalt.caltech.edu (Yair Zadik) (05/01/89)

In article <231@celit.UUCP> dave@celerity.UUCP (Dave Smith) writes:
>	The problem I have with RISC designs are that they use up too much
>memory bandwidth.  What I think would be better was something that gave you
>the flexibility of a RISC (not being tied into the designer's particular
>idea of how a string instruction should be implemented, for example) but 
>with the memory bandwidth efficiency of a CISC.  I'm not familiar enough
>with VLIW to make any good judgements on it, but it seems as though it's
>a reasonable way to go.
>
>David L. Smith
>FPS Computing, San Diego
>ucsd!celerity!dave
>"Repent, Harlequin!," said the TickTock Man

A couple of years ago there was an article in Byte about a proposed design
which they called WISC, for Writeable Instruction Set Computer.  The idea
was to do a RISC or microcoded processor which had an on-board memory 
containing macros which behaved like normal instructions (I guess it was
on EEPROM-like memory).  That way, each compiler could optimize the 
instruction set for its language.  The end result (theoretically) is that
you get the efficiency of RISC with the memory bandwidth of CISC.  I haven't
heard anything else about it.  Is anyone out there working on such a processor,
or is it just a bad idea?

Yair Zadik
yair@tybalt.caltech.edu

stuart@bms-at.UUCP (Stuart Gathman) (05/01/89)

The GCC asm() interface gives an excellent interface to special CISC
instructions.  One can code any arbitrary assembler code with register and
address substitutions for C variables, specify input, output, and scratch
registers, and put it in a macro to disguise it as a function call (or
put it in an inline function).

Portability is maintained by proper design of the function interface; machines
that don't have a similar instruction can use a real function.

Turbo C has an inline assembler capability with similar features.  It is
geared specifically to '86 code, however.  Instead of specifying the register
environment in the asm, the compiler knows which instructions affect which
registers.  Automatic register & address substitution for C variables is
available here also.

With this capability, the only difference between custom inline CISC
instructions and standard operators is syntactic.  Using C++ can help with that, too.
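
For example (a sketch for GCC on the 386; the constraint letters are GCC's,
the wrapper name is mine, and a machine without the instruction just falls
back to the plain C version):

    #if defined(__GNUC__) && defined(__i386__)
    /* Rotate left by n bits using the 386 "roll" instruction; the count
       must be in CL, which the "c" constraint arranges.                 */
    static unsigned long rotl32(unsigned long x, unsigned int n)
    {
        __asm__("roll %%cl, %0"
                : "=r" (x)          /* output: any general register      */
                : "0" (x),          /* input starts in the same register */
                  "c" (n)           /* shift count in CL                 */
                : "cc");            /* condition codes are clobbered     */
        return x;
    }
    #else
    /* Real-function fallback for machines without a rotate instruction. */
    static unsigned long rotl32(unsigned long x, unsigned int n)
    {
        n &= 31;
        if (n == 0)
            return x;
        return ((x << n) | ((x & 0xFFFFFFFFUL) >> (32 - n))) & 0xFFFFFFFFUL;
    }
    #endif

Wrapped in a macro or an inline function like this, the caller can't tell
whether it got one instruction or a dozen.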
-- 
Stuart D. Gathman	<stuart@bms-at.uucp>
			<..!{vrdxhq|daitc}!bms-at!stuart>

bullerj@handel.colostate.edu (Jon Buller) (05/03/89)

In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
>A couple of years ago there was an article in Byte about a proposed design
>which they called WISC for Writeable Instruction Set Computer.  The idea
>was to do a RISC or microcoded processor which had an on board memory 
>containing macros which behaved like normal instructions (I guess it was
>on EEPROM like memory).  That way, each compiler could optimize the 
>instruction set for its language.  The end result (theoreticly) is that
>you get the efficiency of RISC with the memory bandwith of CISC.  I haven't
>heard else about it.  Is anyone out there working on such a processor or is
>it just a bad idea?
>
>Yair Zadik
>yair@tybalt.caltech.edu

The only problem with this is that doing a context switch is nearly impossible.
Imagine not only saving registers but having to swap out microcode and
instructions too.  Not to mention that porting a compiler to do this might be
a lot harder.  Portability would be sure to take an incredible hit: one
machine's microcode can do x and y in parallel, machine z has hardware to
do operation w...  I think it would be good for a controller, or some compute
server that does one thing, but that is probably better done with a custom chip
or a coprocessor with that particular microcode built in from the start.
Doing something like this would then lead to virtual microcode, which I heard
something about once, but I don't think I'd ever want to see it in use.  I
think about the only use for something like this would be a lab machine to test
out different machine styles (i.e., what would happen if a 68000 had instructions
to... or can we do the same thing without...).  Well, that's my $0.02 worth,
and probably wrong too.  I'd like to hear better/other ideas, but finals are in
5 days, and then this account goes away permanently...
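
Rough arithmetic makes the point; the sizes below are invented, but the shape
of the problem isn't:

    #include <stdio.h>

    struct cpu_context {                 /* ordinary machine: registers only */
        unsigned long gpr[32], fpr[32];
        unsigned long pc, psw;
    };

    struct wisc_context {                /* WISC: registers plus the writable */
        struct cpu_context regs;         /* control store travel together     */
        unsigned long control_store[4096];
    };

    int main(void)
    {
        printf("plain context: %lu bytes\n",
               (unsigned long) sizeof(struct cpu_context));
        printf("WISC context:  %lu bytes\n",
               (unsigned long) sizeof(struct wisc_context));
        return 0;
    }

Every context switch drags that control-store image through exactly the
memory path you were trying to unload.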


-------------------------------------------------------------------------------
Jon Buller                                      FROM fortune IMPORT quote;
..!ccncsu!handel!bullerj                       FROM lawyers IMPORT disclaimer;

tim@crackle.amd.com (Tim Olson) (05/03/89)

In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
| A couple of years ago there was an article in Byte about a proposed design
| which they called WISC for Writeable Instruction Set Computer.  The idea
| was to do a RISC or microcoded processor which had an on board memory 
| containing macros which behaved like normal instructions (I guess it was
| on EEPROM like memory).  That way, each compiler could optimize the 
| instruction set for its language.  The end result (theoreticly) is that
| you get the efficiency of RISC with the memory bandwith of CISC.  I haven't
| heard else about it.  Is anyone out there working on such a processor or is
| it just a bad idea?

"WISC" is just a new term for how most people build microcoded machines
(SRAMs are faster than EPROMs/ROMs).  I don't see how you can get "the
efficiency of RISC with the memory bandwidth of CISC" using such a
design.

The way CISCs attempt to reduce memory bandwidth is to make an
instruction do as much as possible, so fewer are needed to perform an
operation.  This is the antithesis of RISC, which, by using simple
"building-block" instructions, allows the compiler to perform many more
optimizations.
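
As an illustration (mine, not from any particular machine): the work a
scaled-index, memory-to-memory CISC add hides inside one opcode, and the
same loop once the building blocks are exposed and the compiler has
strength-reduced the addressing:

    /* Source-level view: one "complex" operation per element.           */
    void add_arrays(int n, double *a, const double *b)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];    /* CISC: add with two scaled-index
                                      memory operands, base + i*8 each time */
    }

    /* The same loop built from simple steps: the loads, add, store and
       pointer bumps are all visible, and the i*8 multiplies are gone.   */
    void add_arrays_rr(int n, double *a, const double *b)
    {
        double *end = a + n;
        while (a < end) {
            *a = *a + *b;
            a++;
            b++;
        }
    }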

The way to reduce memory bandwidth while maintaining performance is to
change the Writeable Control Store into an instruction cache.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

ted@nmsu.edu (Ted Dunning) (05/03/89)

In article <1827@ccncsu.ColoState.EDU> bullerj@handel.colostate.edu (Jon Buller) writes:


   In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
   >A couple of years ago there was an article in Byte about a
   >proposed design >which they called WISC for Writeable Instruction
   >Set Computer.  The idea ...

   The only problem with this is that doing a context switch is nearly
   impossible.  Imagine not only saving registers but having to swap
   out microcode and instructions too.    ...

GOLLY, do you think that maybe we could build some cool hardware that
would keep track and only swap out the parts of the microcode that
were different, or maybe even only swap in the parts that were new,
and then, why stop there?  I mean, like, let's swap parts of the user
program and data into and out of this fast control store.  And let's
make the backing store be main memory so it is easier to get to...



isn't this leading right back to a normal RISC with a cache that
allows programs to share executable segments?  

khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/04/89)

In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
......
>>"Repent, Harlequin!," said the TickTock Man
>
>A couple of years ago there was an article in Byte about a proposed design
>which they called WISC for Writeable Instruction Set Computer.  The idea
>was to do a RISC or microcoded processor which had an on board memory 
>containing macros which behaved like normal instructions (I guess it was
>on EEPROM like memory).  That way, each compiler could optimize the 
>instruction set for its language.  The end result (theoreticly) is that
>you get the efficiency of RISC with the memory bandwith of CISC.  I haven't
>heard else about it.  Is anyone out there working on such a processor or is
>it just a bad idea?
>

Honeywell and Burro... whoops, UNISYS, had mainframes like this well over
a decade ago.  Byte lives at the cutting edge....

Performance is similar to having a RISC with a separate instruction
cache (of some reasonable size).  The cache is, in fact, often better,
because different programs (we do live in a world full of context
switches) probably want different "microcode" ... 

cheers




Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

cquenel@polyslo.CalPoly.EDU (24 more school days) (05/04/89)

In article <10544@cit-vax.Caltech.Edu> (Yair Zadik) writes:
|WISC for Writeable Instruction Set Computer.  The idea ...

In article <1827@ccncsu.ColoState.EDU> (Jon Buller) writes:
|   The only problem with this is that doing a context switch is nearly
|   impossible.  Imagine not only saving registers but having to swap
|   out microcode and instructions too.    ...

In 9690 ted@nmsu.edu (Ted Dunning) sez:
|isn't this leading right back to a normal risc with a cache that
|allows programs to share executable segments?  

	Actually, no.
	The point is that micro-code is much more static over the life
	of a process.  A separate cache of already-broken-down,
	easy-to-execute micro-code would be carrying RISC to an
	extreme (simple instructions), but would get around the
	icache/bandwidth problem inherent in conventional RISCs.
	

-- 
@---@  -----------------------------------------------------------------  @---@
\. ./  | Chris (The Lab Rat) Quenelle      cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |  You can keep my things, they've come to take me home -- PG   |   \ / 
==o==  -----------------------------------------------------------------  ==o==

peter@ficc.uu.net (Peter da Silva) (05/04/89)

In article <10544@cit-vax.Caltech.Edu>, yair@tybalt.caltech.edu (Yair Zadik) writes:
> A couple of years ago there was an article in Byte about a proposed design
> which they called WISC for Writeable Instruction Set Computer.
> ...each compiler could optimize the instruction set for its language.

Sounds like a great idea for an embedded controller, but can you imagine what
context switches would be like in a general purpose environment with multiple
supported compilers...?
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

cliff@ficc.uu.net (cliff click) (05/04/89)

In article <10544@cit-vax.Caltech.Edu>, yair@tybalt.caltech.edu (Yair Zadik) writes:
> In article <231@celit.UUCP> dave@celerity.UUCP (Dave Smith) writes:
> >	The problem I have with RISC designs are that they use up too much
> >memory bandwidth.
> A couple of years ago there was an article in Byte about a proposed design
> which they called WISC for Writeable Instruction Set Computer.
> I haven't heard else about it.  Is anyone out there working on such a 
> processor or is it just a bad idea?
A couple of years ago Phil Koopman took his WISC stuff to Harris - I think
they're working with it.  He had a 32-bit CPU built from off-the-shelf TTL
logic that plugged into an IBM PC and ran at 10 MHz.  It was stack-based,
Harvard architecture, and had a completely writable micro-code store.  He had
some amazing throughput numbers for it, and had tweaked micro-code for Prolog,
C and some other stuff (Lisp?).  Anyhow, Harris is supposed to be putting
together a chip from it.

-- 
Cliff Click, Software Contractor at Large
Business: uunet.uu.net!ficc!cliff, cliff@ficc.uu.net, +1 713 274 5368 (w).
Disclaimer: lost in the vortices of nilspace...       +1 713 568 3460 (h).

henry@utzoo.uucp (Henry Spencer) (05/04/89)

In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
>... WISC for Writeable Instruction Set Computer.  The idea
>was to do a RISC or microcoded processor which had an on board memory 
>containing macros which behaved like normal instructions (I guess it was
>on EEPROM like memory).  That way, each compiler could optimize the 
>instruction set for its language.  The end result (theoreticly) is that
>you get the efficiency of RISC with the memory bandwith of CISC.  I haven't
>heard else about it.  Is anyone out there working on such a processor or is
>it just a bad idea?

Consider a well-built RISC, with an instruction cache, executing an
interpreter that fetches bytes from memory and interprets them as if
they were, say, 8086 instructions.  Assuming that the interpreter fits
in the I-cache, in what way does this differ from the WISC idea?
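
Such an interpreter is tiny.  A toy version in C (the byte-code here is
invented for the example, not 8086):

    /* A stack-machine interpreter small enough to sit in the I-cache,
       fetching "foreign" opcodes from ordinary data memory.            */
    #include <stddef.h>

    enum { OP_HALT, OP_PUSH, OP_ADD, OP_MUL };

    long interp(const unsigned char *code, size_t len)
    {
        long stack[64];
        int sp = 0;
        size_t pc = 0;
        while (pc < len) {
            switch (code[pc++]) {
            case OP_HALT: return sp ? stack[sp - 1] : 0;
            case OP_PUSH: stack[sp++] = (signed char) code[pc++]; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            }
        }
        return sp ? stack[sp - 1] : 0;
    }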

Context switching between interpreters is trivial; you can write them
in high-level languages, and if you really want *speed*, you can forget
the interpreter and just compile real code.

In short, it's an excellent idea and everyone is already doing it, but
without some of the limitations that result from thinking in terms of
microcode and EEPROM.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

news@ism780c.isc.com (News system) (05/05/89)

In article <10544@cit-vax.Caltech.Edu> yair@tybalt.caltech.edu.UUCP (Yair Zadik) writes:
>A couple of years ago there was an article in Byte about a proposed design
>which they called WISC for Writeable Instruction Set Computer.  The idea
>was to do a RISC or microcoded processor which had an on board memory 
>containing macros which behaved like normal instructions (I guess it was
>on EEPROM like memory).  That way, each compiler could optimize the 
>instruction set for its language.  The end result (theoreticly) is that
>you get the efficiency of RISC with the memory bandwith of CISC.  I haven't
>heard else about it.  Is anyone out there working on such a processor or is
>it just a bad idea?

Yes, it is a bad idea.  In the mid-60's I was at Standard Computer (no longer
in existence) and I actually built such a machine.  It was called the
Standard EX01 (EX01 was for experimental number 1).  The user could
dynamically alter the instruction set of the machine.  The machine was
microcoded and had a writable control store.  The 'basic' instruction set
provided a mechanism for writing to control storage.

In practice we found that it was impossible to make the thing work, because
any modification to the control store could affect the 'basic' instruction
behavior.  As an example, one of the problems we found was that when running
the 'FORTRAN' instructions, double-precision floating divide produced the
wrong answer if the instruction was executed at the same time as a tape unit
was reading a file mark.  I decided that there was no way to support a
machine like that in the field, so the experiment was terminated.

   Marv Rubinstein

greg@cantuar.UUCP (G. Ewing) (05/09/89)

Yair Zadik (yair@tybalt.caltech.edu.UUCP) writes:
>A couple of years ago there was an article in Byte about a proposed design
>which they called WISC for Writeable Instruction Set Computer.  

Well, maybe the performance improvement would be debatable, but what
the heck - I think it would be fun!

In fact, I'd like to go further and make the processor sort of a
big writeable PAL! Rearrange the hardware according to the task
at hand. A WHISC (Writeable Hardware Interconnection Scheme Computer)?

Greg Ewing, Computer Science Dept, Canterbury Univ., Christchurch, New Zealand
UUCP: 	  ...!{watmath,munnari,mcvax,vuwcomp}!cantuar!greg
Internet: greg@cantuar.uucp		+--------------------------------------
Spearnet: greg@nz.ac.canterbury.cantuar | A citizen of NewZealandCorp, a
Telecom:  +64 3 667 001 x6367  		| wholly-owned subsidiary of Japan Inc.