[comp.arch] Wirth's challenge

baum@apple.UUCP (Allen J. Baum) (12/04/87)

--------
>However, you can buy 2 machines that are clearly 801-descendents:
>HP Precision and MIPS R2000, which, as far as I can tell, are the
>closest ones on the market to the 801.  Whenever it comes out, the
>78000 has a lot of similarities also.

The only way that the HP Precision architecture can be considered an
801 descendent is in spirit.
Although there were former IBM'ers on the project, they weren't allowed to talk
about it at all, and they didn't. To this day, I haven't talked with anyone
who would tell me any details on the 801 architecture. Some details did come
out in the papers at the ASPLOS conference, but they did not influence any
of the design decisions on the Precision.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mash@mips.UUCP (John Mashey) (12/04/87)

In article <6892@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>--------
>>However, you can buy 2 machines that are clearly 801-descendents:
>>HP Precision and MIPS R2000, which, as far as I can tell, are the
>>closest ones on the market to the 801.  Whenever it comes out, the
>>78000 has a lot of similarities also.

>The only way that the HP Precision architecture can be considered an
>801 descendent is in spirit.
>Although there were former IBM'ers on the project, they weren't allowed to talk
>about it at all, and they didn't. To this day, I haven't talked with anyone
>who would tell me any details on the 801 architecture. Some details did come
>out in the papers at the ASPLOS conference, but they did not influence any
>of the design decisions on the Precision.

Sorry, I meant absolutely no hint that proprietary info got moved,
and I did mean in spirit, especially of methodology of starting with
serious optimizing compiler technology and doing substantial analysis.
It is extremely interesting that there was no influence from the
March 1982 Radin paper at ASPLOS; I would have thought that it would have
been analyzed thoroughly at least for confirmation of direction,
but I wasn't there, so I believe you.

Certainly, it is interesting that both 801 and Spectrum used
optimizing compiler-driven design,
32 registers,
32-bit instructions,
separate I&D caches,
and no windows.
(MIPS can't be included as independent: we certainly had access to the
published 801 documents before we started.)

Either the similarities arise from the limited choices once you've
made some of those decisions, or, starting from some of the same assumptions,
and proceeding with related methodologies, you get to somewhat similar
designs.  [There are of course all sorts of little differences amongst
the 3 machines mentioned, but if you take the universe of RISC machines,
they look more similar than most.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

baum@apple.UUCP (12/04/87)

--------
[]
>In article <1047@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:

>Sorry, I meant absolutely no hint that proprietary info got moved,
>and I did mean in spirit, especially of methodology of starting with
>serious optimizing compiler technology and doing substantial analysis.
>It is extremely interesting that there was no influence from the
>March 1982 Radin paper at ASPLOS; I would have thought that it would have
>been analyzed thoroughly at least for confirmation of direction,
>but I wasn't there, so I believe you.
>
>Certainly, it is interesting that both 801 and Spectrum used
>optimizing compiler-driven design,
>32 registers,
>32-bit instructions,
>separate I&D caches,
>and no windows.
>(MIPS can't be included as independent: we certainly had access to the
>published 801 documents before we started.)
>
>Either the similarities arise from the limited choices once you've
>made some of those decisions, or, starting from some of the same assumptions,
>and proceeding with related methodologies, you get to somewhat similar
>designs.  [There are of course all sorts of little differences amongst
>the 3 machines mentioned, but if you take the universe of RISC machines,
>they look more similar than most.]

I think you hit it right on the head. We did indeed use the ASPLOS papers for
confirmation of our intuition & measurements, but they only confirmed, so 
nothing got changed. We did see some differences, but we liked our approach
(for better or worse), so no differences there either. We did have a chance
to consider register windows, since the RISC I papers were published by then,
but decided that for future implementations the large register file and its
decoding would be a bottleneck (impacting the critical path), and that smart
register allocation would be sufficient.


--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

baum@apple.UUCP (12/04/87)

--------
[]
>In article <1047@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>It is extremely interesting that there was no influence from the
>March 1982 Radin paper at ASPLOS; I would have thought that it would have
>been analyzed thoroughly at least for confirmation of direction,

A small addendum: while the 801 may not have played an enormous role,
System/370 certainly did! Many of the design decisions in Spectrum were in
reaction to problems we saw (& measured & were told about) in the 370 
architecture, from both IBM and Amdahl.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

aglew@ccvaxa.UUCP (12/12/87)

..> IBM 360, BCD, and COBOL support

I wouldn't go so far as putting packed decimal into a
modern machine, but unpacked decimal (ASCII) might be
another thing... except that it can be composed almost
as well out of masks and binary arithmetic.
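
For example, a minimal sketch in C (illustrative only, not any
particular machine's code) of such a composition: a digit-at-a-time
add of two unpacked (ASCII) decimal fields, using nothing but masks
and binary arithmetic:

    /* Add two equal-length ASCII decimal fields, most significant
       digit first, e.g. "123" + "459" -> "582" (carry out ignored).
       Illustrative only. */
    void ascii_add(const char *a, const char *b, char *sum, int n)
    {
        int carry = 0, d, i;
        for (i = n - 1; i >= 0; i--) {
            d = (a[i] & 0x0F) + (b[i] & 0x0F) + carry; /* mask off 0x30 */
            carry = d > 9;
            if (carry)
                d -= 10;
            sum[i] = (char)(d | 0x30);                 /* restore ASCII */
        }
    }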

As for COBOL support, well... I think we are about to
pass the point where a scientific computer will do better
at COBOL support than a business computer. Because, what's
a business computer? ...Well, it has BCD - see above.
It has good I/O - but scientific computers increasingly
have good I/O, since they do graphics. It handles strings well
- but most strings are short or fixed-length. And you can
move a lot of characters through a 64-bit register, and do
a lot of string operations 8 characters at a time, instead of 
one by one.
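
As an illustrative sketch (hypothetical C, assuming a 64-bit integer
type is available), here's a fixed-length field compare that goes
8 bytes per step:

    #include <string.h>
    #include <stdint.h>

    /* Compare two fixed-length fields 8 bytes at a time.
       Sketch only; not tuned for any particular machine. */
    int field_equal(const char *a, const char *b, int len)
    {
        uint64_t wa, wb;
        while (len >= 8) {
            memcpy(&wa, a, 8);          /* one wide load each...   */
            memcpy(&wb, b, 8);
            if (wa != wb)
                return 0;               /* ...and one wide compare */
            a += 8;  b += 8;  len -= 8;
        }
        while (len-- > 0)               /* leftover bytes, one by one */
            if (*a++ != *b++)
                return 0;
        return 1;
    }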

Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
aglew@mycroft.gould.com    ihnp4!uiucdcs!ccvaxa!aglew    aglew@gswd-vms.arpa
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) (12/17/87)

In article <28200075@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>As for COBOL support, well... I think we are about to
>pass the point where a scientific computer will do better
>at COBOL support than a business computer.

"About to pass the point"?  The places that used to run service
bureaus with CDC 6600's knew this years ago.  Once you get over
the fixation that your problem is so different that the hardware
designer has to tailor an instruction just for you, you realize
that what you want is something that does some basic functions
fast, and let the programmer construct the other stuff.  The
design problem is to choose the right basic operations. 

The DG Nova, PDP-8, and the CDC 6600 all gave very impressive
performances on commercial applications even though many people
claimed they were not "designed" for this type of application.

Basically this should be the "RISC" argument, but that term
seems to have been co-opted for a very narrow range of hardware
design strategies.  In particular I have seen statements that
RISC must be
  - 1 microcode cycle per clock cycle.
       Why assume a synchronous (clocked) hardware implementation?
       There have been a number of successful machines with
       asynchronous CPUs.
  - 1 microinstruction per instruction
       Why assume microcode at all?  Microcode is certainly a
       valuable hardware design technique, but again it is
       not mandatory
  - register windows with "general purpose" registers.
       A neat idea, but again, special purpose register architectures
       have done impressive things in the past.  I've often wondered
       if most of the performance gains quoted for the "RISC"
       machines can be attributed to the fact that someone decided
       the best thing to do with the registers was to use them
       for passing the first few arguments, and whether similar
       gains can be made on other machines by making the compiler
       pass the first few arguments in registers, instead of expecting
       the callee to preserve its registers.
I'm not saying any of these are bad ideas.  They clearly aren't.
It just seems that a lot of discussion is going on with assumptions
that all computers are implemented with a particular methodology,
or must have a certain architectural feature.

Those pushing the simple and fast approach must also be aware
of why machines acquire specialized fancy instructions,
such as packed decimal.  Given that one has an existing implementation
of an architecture, there will always be some commercially
important applications that the machine is "poor" at (as defined
by some customer with money).  The engineer can go back to the
drawing board and re-engineer the whole machine to make it faster,
but it is often easier to add some opcodes and some hardware to do the
job.  (I am using the term "extra hardware" loosely; this could
mean some more gates on the CPU chip.)  Sometimes no
extra hardware is needed, just an addition to the microcode
(if microcode is used in the implementation).  Of course, when
the next total re-engineering does occur, tradeoffs will be made,
and the new datapath layout will mean that some of the
instructions can't be implemented with the previous technique.
So they may be done with a long, slow microcode sequence,
and it may well be that on the new machine the sequence that was
used before the new feature was added does the job
wanted by the application faster than the feature does.  The reason the
feature was added was valid: it made the machine significantly
faster.  The reason for maintaining it is valid: it preserves
object-code compatibility.

esf00@amdahl.amdahl.com (Elliott S. Frank) (12/19/87)

In article <12181@orchid.waterloo.edu> atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
>
>                                              Once you get over
>the fixation that your problem is so different that the hardware
>designer has to tailor an instruction just for you, you realize
>that what you want is something that does some basic functions
>fast, and let the programmer construct the other stuff.  The
>design problem is to choose the right basic operations. 
>

Amen.

The Amdahl 580 (a 370-compatible [CISC] machine designed ca. 1978-79)
was designed with contemporary 'UNIX machine' features -- separate
I and D caches, etc. It turned out it ran 360/370 COBOL programs like
the proverbial 'bat out of hell'.
-- 
Elliott Frank      ...!{hplabs,ames,sun}!amdahl!esf00     (408) 746-6384
               or ....!{bnrmtv,drivax,hoptoad}!amdahl!esf00

[the above opinions are strictly mine, if anyone's.]
[the above signature may or may not be repeated, depending upon some
inscrutable property of the mailer-of-the-week.]

pds@quintus.UUCP (Peter Schachte) (12/23/87)

In article <12181@orchid.waterloo.edu>, atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> ....  Once you get over
> the fixation that your problem is so different that the hardware
> designer has to tailor an instruction just for you, you realize
> that what you want is something that does some basic functions
> fast, and let the programmer construct the other stuff.  The
> design problem is to choose the right basic operations. 

That's the issue, alright.  For symbolic languages, tagged pointer
operations are very important.  Typically, tagged dispatch and tagged
pointer following are done quite a lot, and cutting the number of
machine instructions to do these things can make quite a difference in
performance.  Take the example of following a tagged pointer.  If the
tag is kept in the high few bits of a 32-bit address, one must AND the
pointer with a 32-bit constant.  Quite a lot of overhead, when a simple
addressing mode that ignored the high, say, 4 bits of the address would
do the trick perfectly.  I know this runs contrary to the RISC ideal.
But on a CISC, this is no more arcane than some of the other addressing
modes.  
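
For concreteness, here is the sequence in C that software has to
execute today (the tag layout and names are hypothetical):

    /* Follow a pointer whose top 4 bits hold a type tag; the
       AND below is exactly the instruction that an addressing
       mode ignoring the high bits would make unnecessary. */
    #define ADDR_MASK 0x0FFFFFFFUL      /* keep the low 28 bits */

    long follow(unsigned long tagged)
    {
        long *p = (long *)(tagged & ADDR_MASK);  /* strip the tag */
        return *p;                               /* then load */
    }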

Similarly for tagged dispatch.  Imagine an instruction that takes
the top, say, 2 bits of a register, shifts them right a whole bunch of
places, adds them to a given address, and jumps to the address stored
there.  Sure, this could be done with a shift or rotate, an AND, an add,
and an indirect jump.  But wouldn't you rather do one instruction than
four?
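
In C, the multi-instruction version looks something like this (a
sketch; the table and handler names are made up, and 32-bit words
are assumed):

    /* Dispatch on the top 2 bits of a tagged word: a shift, a
       mask, a table index, and an indirect jump, where one
       tagged-dispatch instruction would do the whole thing. */
    typedef void (*handler)(unsigned);

    extern handler dispatch_table[4];   /* one entry per tag value */

    void dispatch(unsigned w)
    {
        unsigned tag = (w >> 30) & 0x3;   /* shift + mask */
        dispatch_table[tag](w);           /* index + jump */
    }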

These are some of the operations that would make Lisp and Prolog run
faster.  I'm sure each language, and each class of languages, has its
own favorite chip features.  The important questions are:  how much
will a given feature speed up a given task?  How much will it cost (in
terms of $, speed of other operations, etc.)?  And how important is
that task?  I imagine BCD probably wasn't worth it.  Perhaps the
features I've just asked for aren't worth it either.  Maybe it would be
better, on average, to have a scaled, post-incremented, memory indirect
addressing mode (0.5 :-), 0.5 :-().  Or a 48 bit one's complement
multiply instruction, or whatever.

The point machine designers should take into account is that more and
more, people are buying general-purpose hardware rather than the more
expensive specialized hardware.  Therefore, they should design their
machines taking symbolic languages, CAD, and other specialized tasks
into account.

-- 
-Peter Schachte
pds@quintus.uucp
...!sun!quintus!pds

aglew@ccvaxa.UUCP (12/25/87)

..> Peter Schachte ...!sun!quintus!pds, and others, responding and amplifying
..> my statement that scientific processors (general purpose processors)
..> can do special purpose work as fast as other processors.

Tagged Operations:

    As Peter points out, a natural way to support tags is to place them
in the high order bits, and have an architecture that ignores, say, the
top 4 bits.
    I work on such an architecture, and we have a Common LISP that takes
advantage of it; except that it was a pain to port this LISP to a new machine
that ignored fewer of the top-order bits (is that correct, Brian, Scott?).

    Of course, another way is to put the tags in the low order bits, since
tagged systems usually don't need object addresses of byte granularity
- the smallest object is usually a word or two.

    Lately, I have been thinking that the best thing to do is to implicitly
AND-mask all addresses with a loadable ADDRESS_MASK value. The AND masking can
be done by dedicated gates away from the ALU, and so shouldn't stretch your
critical path (although it is close to the critical memory address generation,
I think it would fall in the slack at the end of one pipeline stage).
    The biggest advantage of ADDRESS_MASK would be that it would let you
support applications with different ideas of the shape of the address on
the same machine; and it would provide a way for you to increase the address
space while still letting old, broken programs that rely on overflow
of a 32-bit quantity continue to work. I.e., programs that rely on addresses
being 32 bits would have an address mask of 0x0FFFFFFFF; programs that rely
on addresses being 40 bits would use 0x0FFFFFFFFFF; and so on. There are quite
a few programs that rely on 24-bit and 16-bit addresses, even now. Hardware
would, of course, limit the values that can be loaded into the ADDRESS_MASK
- 32 bits now, but tomorrow 40 bits, then 48 bits, and so on.
    Using ADDRESS_MASK for tags is obvious:
if you are using 2 bits of high-order tags on a 32-bit machine, set your mask
to 0x3FFFFFFF; if you are using 2 bits of low-order tags, and you want to
avoid a misaligned-address trap, set the mask to 0xFFFFFFFC.
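
In software terms, the effect I want from the hardware is just this
(a C sketch; ADDRESS_MASK here is an ordinary variable standing in
for the proposed loadable register, and 32-bit words are assumed):

    /* Every address is ANDed with a loadable mask before use;
       nothing else about the memory access changes. */
    static unsigned long ADDRESS_MASK = 0x3FFFFFFFUL; /* 2 high tag bits */

    long load(unsigned long addr)
    {
        return *(long *)(addr & ADDRESS_MASK);  /* the implicit AND */
    }

    /* 2 bits of low-order tags instead: ADDRESS_MASK = 0xFFFFFFFCUL; */
    /* a flat 32-bit program:            ADDRESS_MASK = 0xFFFFFFFFUL; */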

    This is not unlike the LOAD-TAGGED instructions in SPARC, SPUR, and SOAR;
the main difference is that the architecture does not force any decisions
as to the size of the tag, etc., onto the implementor of the tagged
language system.



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
aglew@mycroft.gould.com    ihnp4!uiucdcs!ccvaxa!aglew    aglew@gswd-vms.arpa
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

hank@spook.UUCP (Hank Cohen) (01/05/88)

In article <19825@amdahl.amdahl.com> esf00@amdahl.amdahl.com (Elliott S. Frank) writes:
>
>The Amdahl 580 (a 370-compatible [CISC] machine designed ca. 1978-79)
>was designed with contemporary 'UNIX machine' features -- separate
>I and D caches, etc. It turned out it ran 360/370 COBOL programs like
>the proverbial 'bat out of hell'.
>-- 
I always found the most interesting feature of the Amdahl 580 to be
the single-cycle decimal adder in the ALU.  It was (when I last knew
the details of such things) under 24 ns.  I believe that the CPU designers
were willing to use a lot of gates to achieve that speed.
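
For comparison, here is the work a software packed-decimal add does,
digit by digit (an illustrative C sketch, assuming 32-bit words):

    /* Add two 8-digit packed-BCD words, one digit per nibble;
       the 580's adder does the equivalent in a single cycle. */
    unsigned long bcd_add(unsigned long a, unsigned long b)
    {
        unsigned long sum = 0;
        int i, d, carry = 0;
        for (i = 0; i < 8; i++) {
            d = (int)(a & 0xF) + (int)(b & 0xF) + carry;
            carry = d > 9;
            if (carry)
                d -= 10;
            sum |= (unsigned long)d << (4 * i);
            a >>= 4;  b >>= 4;
        }
        return sum;
    }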

Amdahl does a lot of simulation and analysis of their instruction
mix and is willing to design the machine to optimize certain
benchmarks.  When you run a lot of COBOL and PL/I, a fast decimal adder
makes sense.  Perhaps Elliott could tell us how fast the 5890 decimal
adder is.