[comp.arch] Intel/MIPS Dhrystone ratio

wbeebe@rtmvax.UUCP (Bill Beebe) (03/11/89)

In article <1552@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes:

>Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
>Vicom Systems Inc  2520 Junction Ave  San Jose  CA  (408) 432-8660
>anybody remember the newsletter DTACK Grounded ?

 I remember DTACK, and your column in Programmer's Journal. I also remember
 your comments concerning RISC. Do you still feel RISC is the equivalent of
 modern snake oil? ;-)

bcase@cup.portal.com (Brian bcase Case) (03/12/89)

>> So where did Intel get the extra 22% ?
>Intel has a 64-bit bus, MIPS 32 bits.  It will take two clocks for MIPS to load
>an immediate value.  Intel can do it in one clock.  If 22% of the instruction
>mix involves immediate values, then we know where Intel 'got it.'

Unless I am missing something, Intel cannot "do it in one clock."  The
mechanism used by the i860 is essentially the same as on other RISCs:
load the low 16 bits, then overload (or add) the high 16 bits.   Even in
dual instruction mode, it takes two instructions (and usually, cycles).
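
For the record, here is a minimal C sketch of what that two-instruction
sequence computes (the constant is arbitrary and the code is only an
illustration; the real thing is a "load high"/"OR in low" instruction pair):

    /* Hypothetical illustration: building a 32-bit constant on a RISC
       whose instructions can hold only a 16-bit literal.              */
    unsigned long load_imm_example(void)
    {
        unsigned long r;
        r = 0x1234UL << 16;     /* instruction 1: set the high 16 bits  */
        r |= 0x5678UL;          /* instruction 2: OR in the low 16 bits */
        return r;               /* r == 0x12345678                      */
    }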

>anybody remember the newsletter DTACK Grounded ?
Yes, I do.  This is the newsletter that printed the *WORST* (most inaccurate)
assessment of RISC that I have yet read.  However, I think it was a letter
from a reader, and so doesn't necessarily represent the newsletter's position.

hal@vicom.COM (Hal Hardenbergh) (03/14/89)

In article <15690@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>anybody remember the newsletter DTACK Grounded ?
>Yes, I do.  This is the newsletter that printed the *WORST* (most inaccurate)
>assessment of RISC that I have yet read.  However, I think it was a letter
>from a reader, and so doesn't necessarily represent the newsletter's position.

That letter came in "over the transom," meaning it was not for attribution.
The writer worked for Big Blue at the time.

Uh, what did you think of Nick Tredennick's assessment of RISC in the Feb issue
of Microprocessor Report?  Was that the _second_ most inaccurate assessment of
RISC that you have yet read?

slackey@bbn.com (Stan Lackey) (03/14/89)

In article <1562@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes:
>Uh, what did you think of Nick Tredennick's assessment of RISC in the Feb issue
>of Microprocessor Report?  Was that the _second_ most inaccurate assessment of
>RISC that you have yet read?

I found his article to be more of an accurate assessment of microprocessor
trends in general.  I just KNEW his article would get mentioned here
sooner or later.  A lot of the stuff he said needed to be said
eventually.

RISC is indeed a technology window, driven largely by the amount of
stuff you can fit in a chip.  Look at what is being added now that you
can fit more than a simple CPU core in a chip:

   Floating Point
   29000 null-terminated-string handling instructions
   Choice of endian-ness
   Caches; in fact with extremely complex, multi-mode capabilities
   Fully associative address translation caches
   Harvard architecture
   Multiprocessor capability

The trend in computer evolution is truly toward greater hardware
complexity.  This has been demonstrated countless times.  The
reversion back to too much simplicity did happen in the late 70's, but
here we are again, back on the same curve.

There is a true need for complexity.  How many times when reading this
newsgroup do you see things like, "Yes but that chip doesn't have <my
favorite feature>" where the feature is anywhere from instruction
cache size to multiprocessor cache invalidation (see N-10 bashing)?
Hardly a RISC headset!

Pure RISC religion is to keep things as simple as possible in order to
make the cycle time as fast as possible.  This can only go so far;
real memories must be used, the chip must be interfaced to real
hardware, etc.  Clearly, the way to get past this brick wall is to do
more in parallel, either with more powerful instructions (including
VLIW/compiler technology), and/or multiprocessing.

Companies must make money.  They will do this by making not tiny
low-cost RISC micros, but the most complex thing they can fit in a
chip.  They need this so they can get product differentiation and thus
better margins.  The million transistors will NOT be used entirely for
large caches, but for more instructions, addressing modes, faster
floating point, elegant exception handling, etc.  And, just watch,
they will still find a way to call them RISC's!

I predict that the next hardware features to come back will be
auto-increment addressing and the hardware handling of unaligned data.

I am not saying that RISC is bad, but it was an interesting exercise
from which we all learned a lot.
:-) Stan

bcase@cup.portal.com (Brian bcase Case) (03/15/89)

>Uh, what did you think of Nick Tredennick's assessment of RISC in the Feb issue
>of Microprocessor Report?  Was that the _second_ most inaccurate assessment of
>RISC that you have yet read? [compared to the one in DTACK Grounded]

Well, I am a contributing editor to The Microprocessor Report.  I disagree
with nearly everything Nick says when he talks about RISC.  [E.g., arguing
that RISC architects are going to be forced to add complex instructions to
the simple sets because of pressure from compiler writers, OS guys, etc.
is ridiculous to me, but this is one of the things he says.]  However, Nick
is a bright guy and is tremendously entertaining when speaking to a crowd.
If anyone can convince me to rethink my views, it is him.  He is entitled
to his views, and his article was printed as a "Viewpoint" in uP Report.

One thing I have relearned from pointing out the letter in DTACK Grounded:
People in glass houses shouldn't throw stones.  Sigh.

henry@utzoo.uucp (Henry Spencer) (03/17/89)

In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>The trend in computer evolution is truly toward greater hardware
>complexity.  This has been demonstrated countless times.  The
>reversion back to too much simplicity did happen in the late 70's, but
>here we are again, back on the same curve...

Except, this time the complexity added will be *useful* complexity, with
any luck.  No, we are not headed back towards CISC.

>... The million transistors will NOT be used entirely for
>large caches, but for more instructions, addressing modes, faster
>floating point, elegant exception handling, etc...

Faster floating point, okay.  More instructions and addressing modes?
*Why?*  They don't gain you anything, unless you start talking about
VLIW or other such significant departures.  Elegant exception handling?
Frankly, the relatively simple exception handling on many of the current
RISCs is much more elegant than all the garbage that showed up on the
CISC machines.

>I predict that the next hardware features to come back will be
>auto-increment addressing and the hardware handling of unaligned data.

Again, why?  Auto-increment addressing is useful only if instructions
are expensive, because it sneaks two instructions into one.  However,
the trend today is just the opposite:  the CPUs are outrunning the
main memory.  Since instructions can be cached fairly effectively,
they are getting cheaper and data is getting more expensive.  Doing
the increment by hand often costs you almost nothing, because it can
be hidden in the delay slot(s) of the memory access.  Autoincrement
showed up best in tight loops, exactly where effective caching can be
expected to largely eliminate memory accesses for instructions.  Why
bother with autoincrement?
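
To make that concrete, consider a small copy loop in C (a hypothetical
example, not any particular compiler's output):

    /* Each iteration is roughly a load, a store, two pointer adds, a
       count decrement, and a branch.  On a machine with load and branch
       delay slots the adds can usually be scheduled into those slots,
       so the "missing" autoincrement costs little or nothing.          */
    void copy_words(long *dst, const long *src, int n)
    {
        while (n-- > 0)
            *dst++ = *src++;
    }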

As for hardware handling of unaligned data, this is purely a concession
to slovenly programmers.  Those of us who have lived with alignment
restrictions all our professional lives somehow don't find them a problem.
Mips has done this right:  the *compilers* will emit code for unaligned
accesses if you ask them to, which takes care of the bad programs, while
the *machine* requires alignment.  High performance has always required
alignment, even on machines whose hardware hid the alignment rules.
Again, why bother doing it in hardware?
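
For illustration, here is roughly what compiler-emitted unaligned-access
code amounts to, sketched in portable C with byte loads (the big-endian
byte order is just an assumption for the example; real compilers use the
machine's own partial-word loads, e.g. lwl/lwr on Mips):

    /* Hypothetical: fetch a 32-bit value from an address that may not
       be 4-byte aligned, using only aligned (byte) loads.              */
    unsigned long load_unaligned(const unsigned char *p)
    {
        return ((unsigned long)p[0] << 24) |
               ((unsigned long)p[1] << 16) |
               ((unsigned long)p[2] <<  8) |
                (unsigned long)p[3];
    }
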
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bcase@cup.portal.com (Brian bcase Case) (03/17/89)

 >RISC is indeed a technology window, driven largely by the amount of
 >stuff you can fit in a chip.  Look at what is being added now that you
 >can fit more than a simple CPU core in a chip:
 >
 >   Floating Point
RISCs can have floating point; floating-point is not CISCy.
 >   29000 null-terminated-string handling instructions
There's only one, and it is *QUITE* simple.  Even if it shouldn't be
there, this does not support what Nick says.
 >   Choice of endian-ness
This adds about 2 gates to the design.
 >   Caches; in fact with extremely complex, multi-mode capabilities
???  This might have some effect on the instruction set, but the effect
should not be to make the basic instructions go slow.
 >   Fully associative address translation caches
This is architecturally neutral.
 >   Harvard architecture
This is simply required for performance, regardless of the instruction set.
 >   Multiprocessor capability
I don't see how adding multiprocessor capabilities makes a RISC into a
CISC.

None of this stuff is inconsistent with RISC.  These are not CISCy
things.  If you are going to complain, make your complaints valid.

aglew@mcdurb.Urbana.Gould.COM (03/17/89)

>Elegant exception handling?
>Frankly, the relatively simple exception handling on many of the current
>RISCs is much more elegant than all the garbage that showed up on the
>CISC machines.

Bravo!
Who needs vectored interrupts? 
How often does your device know better where to interrupt to than you do?

But (a bit more) seriously:
how can interrupt (not exception) handling be made better/worse?
As an erstwhile systems programmer in a real-time OS, I know that we often
wished that interrupts could be treated exactly like processes,
going through the same priority or deadline driven scheduler.
Yet applying RISC principles to the hardware that would be needed to do
something like this, I often arrive at the conclusion that a 
simple single entry point first level handler is all that is appropriate.
Everything else seems to need sequencing.

tim@crackle.amd.com (Tim Olson) (03/18/89)

In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
| >I predict that the next hardware features to come back will be
| >auto-increment addressing and the hardware handling of unaligned data.
| 
| Again, why?  Auto-increment addressing is useful only if instructions
| are expensive, because it sneaks two instructions into one.  However,
| the trend today is just the opposite:  the CPUs are outrunning the
| main memory.  Since instructions can be cached fairly effectively,
| they are getting cheaper and data is getting more expensive.  Doing
| the increment by hand often costs you almost nothing, because it can
| be hidden in the delay slot(s) of the memory access.  Autoincrement
| showed up best in tight loops, exactly where effective caching can be
| expected to largely eliminate memory accesses for instructions.  Why
| bother with autoincrement?

Also, auto-incrementing addressing modes imply:

	- Another adder (to increment the address register in parallel)

	- Another writeback port to the register file

Unless you wish to sequence the instruction over multiple cycles :-(

I'm certain that most people can find something better to do with these
resources than auto-increment. 


| As for hardware handling of unaligned data, this is purely a concession
| to slovenly programmers.  Those of us who have lived with alignment
| restrictions all our professional lives somehow don't find them a problem.
| Mips has done this right:  the *compilers* will emit code for unaligned
| accesses if you ask them to, which takes care of the bad programs, while
| the *machine* requires alignment.  High performance has always required
| alignment, even on machines whose hardware hid the alignment rules.
| Again, why bother doing it in hardware?

The R2000/R3000 can also trap unaligned accesses and fix them up in a
trap handler.  This is what the Am29000 does, as well.  This is mainly a
backwards compatibility problem (FORTRAN equivalences, etc.).  It is
infrequent in newer code, mainly appearing in things like packed data
structures in communication protocols.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/18/89)

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>Also, auto-incrementing addressing modes imply:
>	- Another adder (to increment the address register in parallel)
>	- Another writeback port to the register file
>Unless you wish to sequence the instruction over multiple cycles :-(
>I'm certain that most people can find something better to do with these
>resources than auto-increment. 

I neither agree nor disagree with this.  But, I think it should be noted
that auto-increment/decrement addressing modes can easily be generated by
compilers and are parallelizable in hardware, and are therefore potential 
performance wins, although in practice it may not work out.  I am sure 
people have simulated these questions to death, and examined the various
possibilities for code sequences.  You can increment on compare and branch
also (e.g. IBM BXLE).  Can you fill the delay slot in a branch if you have
already incremented, etc?  Detailed simulations using a lot of different
kinds of source code are needed to determine questions like this.

Anyway, this is a different situation from the alignment problem below:
the hardware designers tell us that the performance loss for doing unaligned
data accesses is significant, and it is a separate performance hit from the
usual RISC/CISC issues.
>| As for hardware handling of unaligned data, this is purely a concession

**************

The reason that the VAX (and a few other) architectures are hard to
pipeline is that the operand specifiers require a separate decode,
and that a variable number of operands may come from memory,
not because the machine has autoincrement/decrement addressing modes.


But, really the issue is not "complexity" (usually in the eye of the
beholder anyway) but ease of pipelining (a lot easier to measure).
The VAX (always the straw man in any RISC debate) achieves its design goals of:
" 1) all instructions should have the 'natural' number of operands and
  2) all operands should have the same generality in specification. "
(see Strecker's paper in Siewiorek, Bell, and Newell).
It just so happened that these design goals, which produce a small number of 
very compact instructions (and thus overcome the problem of "most architectures"
as stated in the paper) for a given piece of source code, were the wrong
goals to pursue if another goal is PERFORMANCE.  OK, so they bet wrong on
the VAX...they bet that instruction compactness was very important.  Almost
immediately, they began to be proved wrong.  On the other hand, ten years,
and billions of dollars of sales, went by before the noise got to be
too loud, so, ...

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

csimmons@oracle.com (Charles Simmons) (03/18/89)

In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>The trend in computer evolution is truly toward greater hardware
>>complexity.  This has been demonstrated countless times.  The
>>reversion back to too much simplicity did happen in the late 70's, but
>>here we are again, back on the same curve...
>
>Except, this time the complexity added will be *useful* complexity, with
>any luck.  No, we are not headed back towards CISC.

I'd have to agree with Henry completely.  One possible alternative
way of looking at this is that CISC technology, especially the VAX,
was developed during a period of time when a lot of programs were
written in assembler.  If you look carefully at the MIPS architecture
and the output of its C compiler, you'll soon discover that on a MIPS
machine, there is absolutely no reason to program in assembler.

Since one of the fundamental tenets of RISC design is that all programs
will be written in a high-level language, we aren't going to see
instruction sets that are really complicated, for the simple fact that
the compilers can't deal with the complexity.

-- Chuck

rpw3@amdcad.AMD.COM (Rob Warnock) (03/21/89)

In article <28200290@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
+---------------
| Bravo! Who needs vectored interrupts? 
| How often does your device know better where to interrupt to than you do?
+---------------

When I first began designing with the Am29000, all my old habits
felt cramped at "only" 4 levels of external interrupt, which don't even
read a vector from the interrupting device. But I quickly realized that since
the 29k has a "count-leading-zeroes" (CLZ) instruction, all you need is a
magic external location you can read (can you spell 74F374?) which gives you
one bit per interrupting device, and an inclusive-OR to your single interrupt
line. (Who needs 4 of them, anyway?) So you load the bits, CLZ, add a table
base, and jump...

Given slow 8-bit I/O chips, that takes a lot less time than a vector fetch.
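
For the curious, here is a rough C rendering of that dispatch (all the
names are invented for illustration; on the 29k the CLZ is of course a
single instruction and the whole thing a handful of assembly instructions):

    /* Assumes 32-bit longs.  pending_bits() reads the external '374
       latch (one bit per requesting device, device 0 in bit 31), and
       count_leading_zeros() stands in for the 29k CLZ instruction.    */
    typedef void (*handler_t)(void);

    extern handler_t     handler_table[32];
    extern unsigned long pending_bits(void);
    extern int           count_leading_zeros(unsigned long w);  /* 0..31 */

    void first_level_dispatch(void)
    {
        unsigned long pending = pending_bits();

        while (pending != 0) {
            int dev = count_leading_zeros(pending);  /* highest bit set    */
            handler_table[dev]();                    /* "add a table base
                                                         and jump"         */
            pending &= ~(1UL << (31 - dev));         /* done with that one */
        }
    }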

+---------------
| But...  how can interrupt (not exception) handling be made better/worse?
| As an erstwhile systems programmer in a real-time OS, I know that we often
| wished that interrupts could be treated exactly like processes,
| going through the same priority or deadline driven scheduler.
| Yet applying RISC principles to the hardware that would be needed to do
| something like this, I often arrive at the conclusion that a 
| simple single entry point first level handler is all that is appropriate.
| Everything else seems to need sequencing.
+---------------

I agree.

[Tutorial alert. Many of you know this already. But it's worth saying once
or twice a decade, and I haven't heard it lately, so here goes...]

As has been done by many of us on a variety of machines, a useful interrupt
software "style" (good on many CISCs as well as RISCs) seems to be to split
interrupt handlers into a "first-level"/hardware-oriented/assembly-language
section, and a "second-level"/software-oriented/C-language part, with the
following characteristics:

- You leave the "real" hardware interrupts always enabled (especially during
  2nd-level handlers, system calls, etc.).

- When an interrupt occurs, all you do is clear the interrupting hardware,
  grab whatever really volatile data there might be, and queue up the
  2nd-level handler to run -- if it's really needed ("soft"-DMA can often
  just stash the data in a buffer and dismiss). If there's already a 2nd-level
  handler running at the same or higher *2nd-level* priority [see below],
  you just queue up a task block, and IRET. The trick is that the *hardware*
  interrupt is disabled only for the brief moment when a 1st-level handler
  is running.
  
- The Unix "spl??()" [Set Priority Level] routines are modified to manipulate
  a *software* notion of priority, which is respected by the 2nd-level routines
  and system-call level code (but not the hardware), and never turn off the
  hardware enables.  Necessary exclusion with 1st-level handlers is done with
  *very* short interrupt disable periods, or none at all. (Treating the
  1st-level handlers like "DMA devices", you can usually find a way to eliminate
  the IOFFs).

- The interface between 1st- & 2nd-level sections is a little "task queue",
  sort of a light-weight "real-time scheduler". You can have one, or any
  number of, interrupt task queues, not necessarily related at all to whatever
  hardware priorities you are stuck with. (A minimal C sketch of such a
  queue appears after this list.)

- Once you start running a 2nd-level routine, you continue taking tasks off
  the 2nd-level queue(s) until they are empty, before restoring the CPU state
  and dismissing. (Since *hardware* interrupts are still on, it is quite
  possible that more than one 2nd-level routine gets run per CPU state save.)

- If you *can* get by with just one 2nd-level priority, do so. It avoids
  the extra state saving that comes with preempting multi-level priorities.
  (I know, sometimes you can't avoid it. But sometimes you can. On one
  system we just used the Unix "callout" queue, just setting a zero delay
  time if the task was for an interrupt.)
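
Here is a minimal C sketch of the kind of task queue meant above (all
names are invented; a real version would protect the unlink below with a
very short interrupt-off window):

    /* Hypothetical 1st-/2nd-level interface: a tiny FIFO of deferred
       tasks.  enqueue_itask() is what a 1st-level (assembly) handler
       would call; run_itasks() is the 2nd-level drain loop.            */
    struct itask {
        struct itask *next;
        void        (*func)(void *arg);  /* 2nd-level handler, full C context */
        void         *arg;
    };

    static struct itask *itq_head, *itq_tail;

    void enqueue_itask(struct itask *t)      /* called with interrupts masked */
    {
        t->next = 0;
        if (itq_tail)
            itq_tail->next = t;
        else
            itq_head = t;
        itq_tail = t;
    }

    void run_itasks(void)                    /* keep going until empty */
    {
        struct itask *t;

        while ((t = itq_head) != 0) {        /* unlink: mask interrupts briefly */
            itq_head = t->next;
            if (itq_head == 0)
                itq_tail = 0;
            t->func(t->arg);                 /* hardware interrupts stay on */
        }
    }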

The advantages of this style are these:

1. Since hardware interrupts are never turned off for long, input data
   overruns are easy to avoid. (...unlike some Unixes which turn off the
   world whenever they are searching the buffer cache!!! No wonder so many
   people think Unix can't do 19200 baud input. At the same time, you save
   some hardware cost, since the need for real DMA hardware is lessened.)

2. The 1st-level tasks can usually be done in a few assembly instructions
   without saving very much CPU state; the 2nd-level tasks need a full
   C context, reentrant and "interruptable" -- a lot more state. Since
   interrupts are often "bursty", the two-level structure saves state
   *once* for several interrupts, a significant efficiency gain. In fact,
   interrupt handling gets more efficient the higher the interrupt rate.

3. Most interrupts from "character" devices can be handled entirely in
   the 1st-level handlers as "soft-DMA", or "pseudo-DMA", thus lessening
   further the number of full CPU state saves done. (A minimal sketch of
   such a handler appears after this list.)

4. Since hardware and software priorities now have nothing to do with
   each other, you can allocate priorities more rationally. For example,
   you may have a multi-line serial card which has one interrupt level
   for all the transmitters and receivers on the card; also in the system
   is a disk. In this case, the 1st-level serial-I/O handler will probably
   want to queue input (received) data to be processed at a *higher* 2nd-level
   priority than the disk, but queue output (transmit done) interrupts at a
   *lower* priority than the disk.
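
And a minimal C sketch of the "pseudo-DMA" handler mentioned in point 3
(the names are invented; in practice this is a few assembly instructions
using a couple of dedicated registers):

    /* Hypothetical receive interrupt: stash the character and queue
       2nd-level work only when there is something worth scheduling.   */
    extern unsigned char sio_read_data(void);  /* reading clears the interrupt  */
    extern void          queue_rx_task(void);  /* schedule a 2nd-level handler  */

    static unsigned char rx_buf[256];
    static int           rx_count;

    void sio_rx_first_level(void)
    {
        unsigned char c = sio_read_data();

        if (rx_count < (int)sizeof rx_buf)
            rx_buf[rx_count++] = c;

        if (c == '\n' || rx_count == (int)sizeof rx_buf)
            queue_rx_task();          /* otherwise just dismiss: no state save */
    }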

Applying the above to a Version 7 Unix port to a 5.5 MHz 68000 (years ago),
we were able to take a system which could hardly do a single 2400-baud UUCP
and get it to cheerfully handle three simultaneous 9600-baud UUCPs! ...and
with no change to the hardware: interrupt-per-character SIO chips.


[Note: When the 29000 takes an interrupt, volatile state (PC, PS) is "frozen"
in backup or shadow registers in the CPU, and execution continues (with some
slight restrictions). An "IRET" restores the running process's state from the
shadow registers. Instructions exist to read/write the shadow registers if a
full save/restore is to be done.

The very-light-weight "freeze mode" interrupt matches very nicely with the
above interrupt software style. You dedicate a few protected global registers
to freeze-mode processing, and *no* state has to be explicitly saved/restored
unless a 2nd-level handler needs to be started in a full "C" context.]


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

w-colinp@microsoft.UUCP (Colin Plumb) (03/21/89)

tim@amd.com (Tim Olson) wrote:
> Also, auto-incrementing addressing modes imply:
> 
> 	- Another adder (to increment the address register in parallel)
> 	- Another writeback port to the register file

Another adder?  Most RISC chips use base+offset addressing; all you need is
the ability to send the result back to the base register as well as to
the address bus.  This is almost always possible for stores, and may be
possible for loads, since the result of the load generally comes in
significantly later than the address goes out.  (The 29000 uses this cycle
to store back the result of the previous load, which had been waiting in a 
scoreboard register, but other schemes may do something else.)

In my dream chip, I added postincrement by latching the address from the ALU
input bus, and was happy.
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor

tim@crackle.amd.com (Tim Olson) (03/22/89)

In article <12@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
| tim@amd.com (Tim Olson) wrote:
| > Also, auto-incrementing addressing modes imply:
| > 
| > 	- Another adder (to increment the address register in parallel)
| > 	- Another writeback port to the register file
| 
| Another adder?  Most RISC chips use base+offset addressing; all you need is
| the ability to send the result back to the base register as well as to
| the address bus.

I must have had my architectural blinders on that day -- others have
pointed this out to me as well.

I was thinking about the other adder requirement because the offset is
typically used to supply a constant offset from the current "frame
pointer" [be it an actual register or an adjustment from the stack
pointer] for local array accesses.  However, this need not be the case
-- it can be folded into the base register at the top of the loop (like
the Am29000 does) and the offset field can be used as the increment
specifier.
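
A purely illustrative source-level analogue of that folding, in C:

    /* Indexed form: each access is conceptually base + i*sizeof(long). */
    long sum_indexed(long a[], int n)
    {
        long s = 0;
        int  i;
        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Folded form: the start address goes into a "base" once, at the
       top of the loop; each access then needs only base + a constant
       step, which is where an increment-in-the-offset-field scheme
       would put the stride.                                            */
    long sum_pointer(long a[], int n)
    {
        long *p   = a;
        long *end = a + n;
        long  s   = 0;
        while (p < end)
            s += *p++;
        return s;
    }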


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

dam@mtgzz.att.com (d.a.morano) (03/23/89)

In article <24929@amdcad.AMD.COM>, rpw3@amdcad.AMD.COM (Rob Warnock) writes:
> 
> [Tutorial alert. Many of you know this already. But it's worth saying once
> or twice a decade, and I haven't heard it lately, so here goes...]
> 
> As has been done by many of us on a variety of machines, a useful interrupt
> software "style" (good on many CISCs as well as RISCs) seems to be to split
> interrupt handlers into a "first-level"/hardware-oriented/assembly-language
> section, and a "second-level"/software-oriented/C-language part, with the
> following characteristics:
> 
>	[many characteristics of the above style deleted]

As you probably know, you have described in essence exactly what DEC did
for their RSX-11M and VMS operating systems 15 or so years ago.  DEC calls 
their second level handlers 'fork processes'.  These second level fork 
processes could execute partially in true hardware interrupt time and 
partially in scheduled light weight process time after the hardware 
interrupt has been dismissed.  The amount of time spent in either mode is
programmed in the fork process by using dispatching and light weight
scheduling primitives.  This style, as you have called it, does have the
benefits that you have sited.  This approach to interrupt handling 
better positions these OSes for hardware oriented time critical applications.
Of course, DEC would code both portions of their handlers in assembly.

Dave Morano	AT&T Bell Laboratories