[comp.arch] What should be in hardware but isn't

baum@apple.UUCP (Allen J. Baum) (01/01/70)

--------
[]

A while back, somebody wrote:
>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's.  That is to consider the register
>file as a memory block and allow indexing on it...

Somebody else wrote:
>The PDP-10 also did this.  The first 16 memory locations were the registers.
>There was an option to get fast (non-core) memory for these few bits.

And Brian Utterback replied:
>Another advantage the PDP-10 had by mapping the registers to the memory space,
>other than indexing, was in execution.  You could load a short loop into the
>registers and jump to them!  The loop would run much faster, executing out 
>of the registers.  

This is not the case exactly. EMACS did, in fact, use this trick to
significantly speed up such things as 'search', but when the KL-10 &
-20 processors came out, this trick ran SLOWER than running code out
of regular RAM.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

cik@l.cc.purdue.edu (Herman Rubin) (09/21/87)

There are many instructions which are easy to implement in hardware, but
for which software implementation may even be so costly that a procedure
using the instruction may be worthless.  Some of these instructions have
been implemented in the past and have died because the ill-designed
languages do not even recognize their existence.  Others have not been
included due to the non-recognition of them by the so-called experts and
by the stupid attitude that something should not be implemented unless
99.99% of the users of the machine should be able to want the instruction
_now_.  As you can tell from this article, I consider the present CISC
computers to be RISCy.

One situation of this type which has been discussed in this newsgroup is
the proper treatment of quotient and remainder for integer division when
the numbers are not both positive.  Everyone took a stand for some specific-
ation.  I say "let the user decide."  Even if both signs are positive, 
which alternative I want for one problem may not be the one I want for
another problem.  Having 2-4 bits to specify the alternative for each
sign combination should take very little run time and little space.
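
As a rough illustration of what "let the user decide" could look like in
software (a minimal C sketch; the mode names are invented here, and the
adjustment per sign combination is only a compare and an add or two):

#include <stdio.h>

enum div_mode { DIV_TRUNC, DIV_FLOOR, DIV_CEIL, DIV_EUCLID };

/* Integer divide with a caller-selected rounding convention.
 * Starts from the truncating quotient/remainder and patches them up. */
static void divmod(long a, long b, enum div_mode m, long *q, long *r)
{
    long quot = a / b;      /* truncates toward zero (guaranteed since C99) */
    long rem  = a % b;      /* carries the sign of a */

    switch (m) {
    case DIV_TRUNC:
        break;
    case DIV_FLOOR:                 /* remainder gets the sign of b */
        if (rem != 0 && ((rem < 0) != (b < 0))) { quot--; rem += b; }
        break;
    case DIV_CEIL:                  /* remainder gets the opposite sign of b */
        if (rem != 0 && ((rem < 0) == (b < 0))) { quot++; rem -= b; }
        break;
    case DIV_EUCLID:                /* remainder always non-negative */
        if (rem < 0) { rem += (b < 0 ? -b : b); quot += (b < 0 ? 1 : -1); }
        break;
    }
    *q = quot;
    *r = rem;
}

int main(void)
{
    long q, r;
    divmod(-7, 2, DIV_FLOOR, &q, &r);
    printf("%ld %ld\n", q, r);          /* prints -4 1 */
    return 0;
}

The hardware version asked for here would fold the patch-up into the divide
itself, selected by a couple of mode bits.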

Since floating point machines first came out, the much needed instruction
to divide one floating point number by another with an integer quotient
and a floating remainder has not, to my knowledge, appeared.  If you need
to see uses of this, look at any good trigonometric or exponential subroutine.
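
For reference, what that instruction would compute can be pieced together in
C from fmod() and a second divide (a hedged sketch; fdiv_rem is a made-up
name, not any machine's actual operation).  The point is that recovering the
quotient costs a full divide the hardware already did once:

#include <math.h>
#include <stdio.h>

/* Divide x by y, returning the floating remainder and storing the integer
 * quotient.  Exact only while the quotient fits comfortably in a long. */
static double fdiv_rem(double x, double y, long *quot)
{
    double r = fmod(x, y);          /* remainder, same sign as x */
    *quot = (long)((x - r) / y);    /* the quotient the divide threw away */
    return r;
}

int main(void)
{
    long q;
    double r = fdiv_rem(10.0, 3.0, &q);
    printf("q = %ld, r = %g\n", q, r);  /* q = 3, r = 1 */
    return 0;
}

In a sine or exponential routine this is the argument-reduction step: q (or
just its low bits) picks the quadrant or table entry, and r is what the
polynomial is evaluated on.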

With the advent of floating point, fixed point operations seem to be 
vanishing.  On the early floating point machines, frequently numerical
functions would be done in fixed point for speed and accuracy.  The need
for this has not changed, but the availability has.  Also, it should be
possible to convert between fixed and floating point without the overhead
of a multiply; this was possible on the UNIVAC 1108 and 1110.  Another
operation is to multiply a floating point number by a power of 2 by 
adding to the exponent; this was on the CDC 3600.  The need for this as
a separate instruction is because of the possibility of overflow and/or
underflow.
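
The scale-by-a-power-of-two operation survives as a library call (ldexp() in
standard C) and as instructions such as the 68881's FSCALE; a small sketch,
including the overflow case mentioned above:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 3.0;

    /* adjust the exponent directly: no multiply is performed */
    printf("%g\n", ldexp(x, 10));     /* 3 * 2^10 = 3072 */

    /* the reason it needs care: the exponent field can overflow */
    printf("%g\n", ldexp(x, 2000));   /* overflows to HUGE_VAL (infinity) */
    return 0;
}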

I have run into situations in non-uniform random number generation for which
considerable time is needed to carry out tests which would be better handled
as exceptions.  One of these is to decrement an index, use the result for a
read or write instruction if non-negative, and interrupt if negative to a 
user-provided exception handler.  Another is to find the distance to the next
one in a bit stream, with an interrupt if the stream is emptied.  There are
procedures which are extremely efficient computationally, but for which the
overhead is large if this is not hardware; if a higher level language has to
be used for the instruction, I would call the cost prohibitive.  The VAXen
have in hardware (at least for some machines) a FFO instruction, but it 
requires three other operations, one of which is a conditional, to get one
result.

On many machines, even if fixed point arithmetic is in the hardware, multipli-
cation and division cannot be unsigned.  All of the multiple precision software
with which I am familiar is sign-magnitude.  An additional hardware bit to say
if signed or unsigned is to be used would be cheap.  (It is extremely difficult
to program multiple precision arithmetic in floating point.  It is difficult
on machines, of which there are many, which do not have reasonable integer
multiplication.)

I make no pretense that this list is complete.  While I might find it useful,
I would not suggest that transcendental functions (except for the CORDIC
routines) be hardware, as they would be merely encoding a software algorithm
using the existing instructions as a hardware, rather than software, series
of instructions.  What I am suggesting is that instructions manipulating the
bits in different ways, or using easy branching at nanocode time instead of
slow branching when the hardware cannot use the non-restrictive nature of the
branch, should be in hardware.  The cost of the CPU is usually a small part of the cost
of the computer.




-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

haynes@ucscc.UCSC.EDU.ucsc.edu (99700000) (09/22/87)

In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>...
>With the advent of floating point, fixed point operations seem to be 
>vanishing.  On the early floating point machines, frequently numerical
>functions would be done in fixed point for speed and accuracy.  The need
>for this has not changed, but the availability has.  Also, it should be
>possible to convert between fixed and floating point without the overhead
>of a multiply; this was possible on the UNIVAC 1108 and 1110.
Burroughs (pre-Unisys) handles this by making all numbers floating point.
Integers simply have a zero exponent.  The normalizing algorithm tries to
keep the exponent zero rather than invariably normalizing.  So fixed-to-float
takes no time; float-to-fixed may take time.
haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes

tim@amdcad.AMD.COM (Tim Olson) (09/22/87)

In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
+----- 
| There are many instructions which are easy to implement in hardware, but
| for which software implementation may even be so costly that a procedure
| using the instruction may be worthless.  Some of these instructions have
| been implemented in the past and have died because the ill-designed
| languages do not even recognize their existence.  Others have not been
| included due to the non-recognition of them by the so-called experts and
| by the stupid attitude that something should not be implemented unless
| 99.99% of the users of the machine should be able to want the instruction
| _now_.  As you can tell from this article, I consider the present CISC
| computers to be RISCy.
+-----
From the following examples, it sure appears as if you are arguing for
"letting the user decide" how certain functions are implemented.  The
easiest (and probably best) way to do this is to provide a fast, fixed
set of primitive operations, and let users build what they need from
that set (i.e. RISC).

+-----
| One situation of this type which has been discussed in this newsgroup is
| the proper treatment of quotient and remainder for integer division when
| the numbers are not both positive.  Everyone took a stand for some specific-
| ation.  I say "let the user decide."  Even if both signs are positive, 
| which alternative I want for one problem may not be the one I want for
| another problem.  Having 2-4 bits to specify the alternative for each
| sign combination should take very little run time and little space.
+-----
With the correct primitives, you can easily code these as procedures
which will run *as fast* as standard div, mod, rem.


+-----
| I have run into situations in non-uniform random number generation for which
| considerable time is needed to carry out tests which would be better handled
| as exceptions.  One of these is to decrement an index, use the result for a
| read or write instruction if non-negative, and interrupt if negative to a 
| user-provided exception handler.  
+-----
Fast (free) detection of over/underflow conditions is important,
especially to efficiently implement languages with runtime
bounds-checking and exception handling.  This is why the Am29000 (and
other RISC processors) have, in addition to the standard add and sub
instructions, adds (add signed) and addu (add unsigned) which trap on
overflow/underflow conditions.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

henry@utzoo.UUCP (Henry Spencer) (09/23/87)

> One situation of this type which has been discussed in this newsgroup is
> the proper treatment of quotient and remainder for integer division when
> the numbers are not both positive.  Everyone took a stand for some specific-
> ation.  I say "let the user decide."...

Of course, on most of the RISC machines the user *does* have the choice,
since division is generally done in software rather than hardware.

> Since floating point machines first came out, the much needed instruction
> to divide one floating point number by another with an integer quotient
> and a floating remainder has not, to my knowledge, appeared...

Although you don't get it bundled into one instruction, the pieces needed
to do this are present in any IEEE floating-point implementation, e.g. the
68881.  The remainder can be had with one instruction (on the 68881, FMOD
or FREM depending on exactly what you're doing), the quotient would take
two I think (just a divide and a convert-to-integer).

> ... Another
> operation is to multiply a floating point number by a power of 2 by 
> adding to the exponent; this was on the CDC 3600...

FSCALE on the 68881.

> ... Another is to find the distance to the next
> one in a bit stream, with an interrupt if the stream is emptied...

On most modern machines it should be possible to write a loop that will do
this at very nearly full memory bandwidth, looking at a byte or a word at
a time and using table lookup for the final bit-picking.  I am constantly
amused by people who scream for bit-flipping instructions when doing it a
byte or a word at a time, using table lookup for non-trivial functions, is
still faster.  "Work smart, not hard".
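
A concrete version of that loop, as a hedged C sketch (the names are
invented, and a real version would scan a word at a time rather than a byte,
but the shape is the same): the inner loop only skips zero bytes, and a
256-entry table does the final bit-picking.

#include <stddef.h>

static signed char first_one[256];      /* lowest set bit in each byte value */

static void init_first_one(void)        /* call once at startup */
{
    first_one[0] = -1;
    for (int b = 1; b < 256; b++) {
        int i = 0;
        while (!(b & (1 << i)))
            i++;
        first_one[b] = (signed char)i;
    }
}

/* Return the bit position (byte*8 + bit, LSB first) of the next 1 bit at or
 * after 'start', or -1 if the stream is exhausted -- the case the original
 * poster wants turned into an interrupt. */
static long next_one(const unsigned char *stream, size_t nbytes, long start)
{
    size_t byte = (size_t)start >> 3;
    if (byte >= nbytes)
        return -1;

    /* mask off the bits below 'start' in the first byte */
    unsigned head = stream[byte] & (0xFFu << (start & 7));
    if (head)
        return (long)(byte << 3) + first_one[head];

    for (byte++; byte < nbytes; byte++)
        if (stream[byte])
            return (long)(byte << 3) + first_one[stream[byte]];

    return -1;
}

The "distance to the next one" asked for is just the return value minus
start.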

> On many machines, even if fixed point arithmetic is in the hardware, multipli-
> cation and division cannot be unsigned...

Again, on the RISCs you generally get your choice, because multiply is done
in tuned software rather than hardware.  (And it's usually faster than a
CISC multiply, since most multiplies are by small integer constants that a
RISC can generate custom code for.)
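
For example (a sketch of the kind of code generated for a constant
multiplier; any machine with shift and add does this in a couple of cycles):

/* x * 10, with and without a multiply instruction */
unsigned long mul10(unsigned long x)           { return x * 10; }
unsigned long mul10_shift_add(unsigned long x) { return (x << 1) + (x << 3); }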

> I would not suggest that transcendental functions (except for the CORDIC
> routines) be hardware, as they would be merely encoding a software algorithm
> using the existing instructions as a hardware, rather than software, series
> of instructions...

Actually, there is one fairly good argument for putting the transcendentals
in hardware, to wit making a high-quality implementation available cheaply.
The transcendentals in (say) the 68881 are *better* than anything you will
come up with in software without large amounts of work.  You can buy a 68881
for far less than it would cost you to commission or license equivalent code.

> What I am suggesting is that instructions manipulating the
> bits in different ways, or using easy branching at nanocode time instead of
> slow branching when the hardware cannot use the non-restrictive nature of the
> branch, should be...

Note that many RISCs are directed quite specifically at this objective:
giving the programmer (or, more usually, compiler writer) detailed control
of the hardware, rather than putting a half-baked interpretive layer in
between.  To misquote the famous adage, "microcode stands between the user
and the hardware".
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

cik@l.cc.purdue.edu.UUCP (09/23/87)

In article <18336@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> +-----
> From the following examples, it sure appears as if you are arguing for
> "letting the user decide" how certain functions are implemented.  The
> easiest (and probably best) way to do this is to provide a fast, fixed
> set of primitive operations, and let users build what they need from
> that set (i.e. RISC).
> +-----
> +-----
> With the correct primitives, you can easily code these as procedures
> which will run *as fast* as standard div, mod, rem.
> +-----
> 	-- Tim Olson
> 	Advanced Micro Devices
> 	(tim@amdcad.amd.com)

Olson greatly underestimates the number of RISC instructions needed to do
even a fair job.  If the user is going to be able to do the things needed
efficiently, the combining of instructions must be done at the "nanocode"
level.  Frankly, I think that having thousands of instructions, arranged 
so that decoding patterns can be used, is much easier. 

One of the reasons for the problem is that such things as arranging which
way the quotient and remainder are formed depending on the signs of the
arguments is extremely clumsy in software unless at least the adjustment
procedure is hardware.  I cannot think of any reasonable method even in
"microcode" to do the trivial operations to achieve this for the four 
combinations of signs, especially if the choices change at run time. 

I have read on this net of hardware which enables the user to specify
_sometimes_ that a particular branch should be assumed, with an exception
otherwise; I have not seen such.  I have not seen any remotely efficient
bit-handling hardware on any machine.  (BTW, I am interested in seeing
what modifications I must make to my procedures based on the architecture
of particular machines; if you know of one which is sufficiently different,
I would be interested.)  Of the machines I know, only the 680xx, 8086
and similar (although otherwise I consider its architecture horrible),
and 16/320xx have (I believe) the right integer operations.  To do unsigned
multiplication with only signed multiplication available requires that
2 conditional additions must be done after the multiplication; as machines
get faster conditional operations are bad except in nanocode.  Unsigned
division is so complicated that one introduces other inefficiencies instead.
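
The two conditional additions referred to look like this (a sketch in
present-day C, with 64-bit types for clarity; umul32 is a made-up helper).
The low half of the product needs no fixing, only the high half does:

#include <stdint.h>
#include <stdio.h>

/* Full 32x32 -> 64-bit unsigned product built from a signed multiply. */
static uint64_t umul32(uint32_t a, uint32_t b)
{
    /* what a signed-only multiplier returns */
    uint64_t p = (uint64_t)((int64_t)(int32_t)a * (int32_t)b);

    /* two conditional additions to undo the sign interpretation */
    if ((int32_t)a < 0) p += (uint64_t)b << 32;
    if ((int32_t)b < 0) p += (uint64_t)a << 32;
    return p;
}

int main(void)
{
    uint32_t a = 0xFFFFFFF0u, b = 0x12345678u;
    printf("%d\n", umul32(a, b) == (uint64_t)a * b);   /* prints 1 */
    return 0;
}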

BTW, there is an address modification procedure which is missing on all
machines I have seen except the UNIVAC's.  That is to consider the register
file as a memory block and allow indexing on it.  Another missing procedure
is to enable the register file to be treated as a block of memory so that
bytes or short words can be addressed.  These two operations can be combined
on a byte-addressable machine.

I am definitely not the person to run this, but maybe there should be a mailing
list to communicate suggestions about the manifold instructions which would
be profitable in hardware.  Also, I know a little about microcode (enough to
know that what I think should be done cannot be done that way) and very little
about nanocode.  I find it relatively easy to read the manuals describing the
machine instructions.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

ccplumb@watmath.waterloo.edu (Colin Plumb) (09/24/87)

In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's.  That is to consider the register
>file as a memory block and allow indexing on it.  Another missing procedure
>is to enable the register file to be treated as a block of memory so that
>bytes or short words can be addressed.  These two operations can be combined
>on a byte-addressable machine.

The PDP-10 also did this.  The first 16 memory locations were the registers.
There was an option to get fast (non-core) memory for these few bits.
(I think this is interesting, since these days you'd implement the registers
on-chip (on the CPU board, at least), and handle memory accesses to them as
a special case.)

	-Colin (ccplumb@watmath)

P.S. No, I'm not that old - I just read the manual today, thinking I should
know about the first (as far as I know) machine to have a register file.

alverson@decwrl.UUCP (09/24/87)

In article <8646@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Although you don't get it bundled into one instruction, the pieces needed
>to do this are present in any IEEE floating-point implementation, e.g. the
>68881.  The remainder can be had with one instruction (on the 68881, FMOD
>or FREM depending on exactly what you're doing), the quotient would take
>two I think (just a divide and a convert-to-integer).

Careful now.  FREM gets you the remainder you want.  However, getting the
integer quotient is actually harder.  The problem occurs when the
quotient is too large to fit in an integer.  Often you want the low few bits of
the integer quotient when using FREM to do range reduction.  The last time
I looked the IEEE standard did not provide for these.  However, most chips
give 3 or so of the low end bits, since the designers have actually
thought about why you want FREM.
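
A software form of exactly this facility exists in modern C: remquo()
returns the IEEE-style remainder and, through its third argument, the sign
and at least the low three bits of the integer quotient -- enough to pick
the quadrant in range reduction.  A minimal sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int quo;
    double r = remquo(10.0, 3.0, &quo);
    printf("rem = %g, low quotient bits = %d\n", r, quo);
    /* rem = 1; quo holds the low bits of the quotient 3 */
    return 0;
}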

Overall though, I agree with Henry.  The main reason most of the
complicated instructions mentioned do not show up in RISC's is that
there is no way to express the action in C or Pascal such that the
compiler can reasonably determine to select the complicated instruction
over a sequence of simpler one's.

Bob

daveb@geac.UUCP (Brown) (09/24/87)

In article <8646@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
|[discussion of operations in the MC 68881]
||  ... Another is to find the distance to the next
||  one in a bit stream, with an interrupt if the stream is emptied...
| 
| On most modern machines it should be possible to write a loop that will do
| this at very nearly full memory bandwidth, looking at a byte or a word at
| a time and using table lookup for the final bit-picking.  I am constantly
| amused by people who scream for bit-flipping instructions when doing it a
| byte or a word at a time, using table lookup for non-trivial functions, is
| still faster.  "Work smart, not hard".
| 

The distance-to-next bit instruction is, for operands of about 2-4
words in length, called "floating normalize".  A chess program
(Johnathon Schaefer's) I once worked on used this...
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

mash@mips.UUCP (09/24/87)

In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
....general discussion of wishlists of things that should be in hardware;
more support for various flavors of multiply/divide, bit manipulation, etc.
>Olson greatly underestimates the number of RISC instructions needed to do
>even a fair job.  If the user is going to be able to do the things needed
>efficiently, the combining of instructions must be done at the "nanocode"
>level.  Frankly, I think that having thousands of instructions, arranged 
>so that decoding patterns can be used, is much easier. ...

Earlier, Herman had written:
>There are many instructions which are easy to implement in hardware, but
>for which software implementation may even be so costly that a procedure
>using the instruction may be worthless.  Some of these instructions have
>been implemented in the past and have died because the ill-designed
>languages do not even recognize their existence.  Others have not been
>included due to the non-recognition of them by the so-called experts and
>by the stupid attitude that something should not be implemented unless
>99.99% of the users of the machine should be able to want the instruction
>_now_.  As you can tell from this article, I consider the present CISC
>computers to be RISCy....

Sigh.  There is a useful point embedded here, but it sounds like a topic
I'd thought beaten to death in this newsgroup has to be reviewed
one more time.

Legitimate point: 
	a) If an operation is not supported in hardware, AND IF
	b) Doing it in software takes a lot longer, AND IF
	c) Programs use that operation with high dynamic frequency, THEN
	d) Providing that operation in hardware might be worthwhile.
For example, there have been some ludicrous statements in the press about
"RISC machines can't do floating point": if floating point is important
to you, you'd better include it in the instruction set.

However, this general (anecdotal) approach is not the way people design
computers these days, and for good reason.
As noted before here, a plausible way to design a computer is:

1) Pick a REPRESENTATIVE set of benchmarks.
2) Do a first-cut architecture, based on past experience.
3) Do compilers.
4) Add or delete features, measuring the impact by running compiled/assembled
code through architectural simulators.
5) Iterate until you can't find anything else to add that actually
improves performance by some noticeable amount, or until you run out of time.

This is not a perfect recipe, of course.  For example, if the benchmark
set is chosen poorly, bad surprises will happen. 

However, what people don't do these days, is design architectures by
saying "I remember code where it would have been handy to have operation X,
which was stupidly not provided.  Let's add it."

What's needed to be useful, is NOT a list of anecdotes about 
features that might be useful in some cases [and indeed, they might],
or that look interesting when one reads the instruction manuals,
or that look like they save a few cycles here or there in the context
of small code sequences, but hard DATA about the benefits of including them.

For example, much more useful to people who design computers is
reasoning like:

1) Here is a specific application program, or even better, one
known to be typical of an important class of applications.
2) Running on a computer that lacks feature X, with appropriate
instrumentation, we've found that the addition of X would reduce the
runtime by Y%.

One of the other comments wished to have back the UNIVAC-style addressing
of registers, with no backup for why this would be good. As it happens:
	a) This can be costly in hardware, especially in single-chip
	implementations, unless the whole architecture is built around
	it (like CRISP).
	b) It can complexify life for optimizing compilers.
Show us some data why this feature is worth more than it hurts.

Data, not anecdotes.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

baum@apple.UUCP (09/24/87)

--------
[]
>In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>
>Olson greatly underestimates the number of RISC instructions needed to do
>even a fair job.

I don't think that Olson is underestimating anything. Most RISC architectures
have a divide step instruction, which is precisely what underlying microcode
would use. Furthermore, in order to get signed/unsigned variations, microcode
has to do the same kinds of conditional operations that a RISC would have to
do. It is a mistake to assume that a RISC would be slower to do these than
a microcoded engine; some RISC machines (Acorn ARM, HP Spectrum) have support
for conditional operations. Furthermore, any hardware support in excess of this
will inevitably slow the basic cycle down (I've been through the exercise).
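
For concreteness, a divide-step instruction develops one quotient bit per
step; here is a C model of the restoring version (udiv32 is a made-up name,
and real divide-step instructions differ in detail such as non-restoring
operation and carry handling, so treat it as a sketch):

#include <stdint.h>
#include <stdio.h>

/* 32/32 unsigned divide by iterating a one-bit "divide step" 32 times. */
static void udiv32(uint32_t dividend, uint32_t divisor,
                   uint32_t *quot, uint32_t *rem)
{
    uint32_t r = 0, q = dividend;

    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (q >> 31);   /* shift remainder:dividend left one bit */
        q <<= 1;
        if (r >= divisor) {         /* trial subtract succeeds: bit is 1 */
            r -= divisor;
            q |= 1;
        }
    }
    *quot = q;
    *rem  = r;
}

int main(void)
{
    uint32_t q, r;
    udiv32(1000000007u, 97u, &q, &r);
    printf("%u %u\n", q, r);        /* 10309278 and 41 */
    return 0;
}

Signed and unsigned variants differ only in how the operands are conditioned
before the loop and how the results are fixed up afterward, which is the
point being made above.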

> I have not seen any remotely efficient bit-handling hardware on any machine.

Check out the HP Spectrum.

>  To do unsigned
>multiplication with only signed multiplication available requires that
>2 conditional additions must be done after the multiplication; as machines
>get faster conditional operations are bad except in nanocode.  Unsigned
>division is so complicated that one introduces other inefficiencies instead.

Again, you make the mistake of believing that for some reason
nanocode is somehow magically faster or more efficient than a well
designed instruction set. Wrong. Microcode, or nanocode, has to go
through all the same operations that assembly level code does. While
special purpose data paths can be included to make the sign
correction run faster, it is just that: special purpose. It can't be
used for anything else, it may have the effect of making everything
else run slower, and making division run a cycle or two faster will
have no noticeable effect on performance. It's VERY difficult to make
fixed point division run faster than a bit per cycle, without a LOT
of hardware. By leaving out the special purpose speedup stuff, you can afford
to include some VERY useful general purpose speedup stuff: More registers,
perhaps, or branch folding logic a la the AT&T CRISP.

>
>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's.  That is to consider the register
>file as a memory block and allow indexing on it.  Another missing procedure
>is to enable the register file to be treated as a block of memory so that
>bytes or short words can be addressed.  These two operations can be combined
>on a byte-addressable machine.

The original PDP-10 from DEC allowed that, originally because
registers were real expensive, so that hardware registers were an
expensive (but effective) speedup option; otherwise, they went to
real memory. Registers were the first  16 locations in memory. This
came back to bite them in the later KL models, because instructions could
be put into the registers and executed from them. While this was a real 
speedup hack on the older models, it slowed down the newer ones.

The AT&T CRISP doesn't have any registers. But, by caching the top of
the local frame, references to locals are effectively turned into
register references, and you get register windows as well. You can
index into these 'registers', byte access them, and reference them
with short 5-bit fields in the instruction.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

earl@mips.UUCP (09/24/87)

In article <8646@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Actually, there is one fairly good argument for putting the transcendentals
> in hardware, to wit making a high-quality implementation available cheaply.
> The transcendentals in (say) the 68881 are *better* than anything you will
> come up with in software without large amounts of work.  You can buy a 68881
> for far less than it would cost you to commission or license equivalent code.

The 68881 transcendentals are not implemented in hardware; they are
implemented in microcode.  I believe the extra 0.5-1.5ulp of accuracy
of the 68881 is due to the use of extended precision calculations, not
to either hardware or algorithm (simple rational approximations are
very accurate too when evaluated in extended precision).  This is
likely one reason why the numerical analysts put extended precision
in the IEEE standard.

When someone says "X should be in hardware", it's usually
because they haven't thought very much about how to solve the problem.
Usually the easiest way to solve a problem without thinking about it
is to say "someone else should do it".  In this case "the hardware
designer should solve it".

If an extra half bit of accuracy for transcendentals is important (I'm
not sure it is), then the right way to accomplish this is to add IEEE
extended precision hardware, not transcendental instructions.  In some
ways, this is the RISC approach: when someone says "I need X to do Y",
first ignore X, and then figure out the right way to provide
general-purpose building blocks to accomplish Y.

A final note: implementing transcendentals in 68881 microcode did
nothing to make them fast.  The cycle counts for sin, cos, tan, atan,
log, exp, etc. average about 3.5 times longer for 68881 instructions than
for MIPS R2000 libm subroutines.

alan@pdn.UUCP (Alan Lovejoy) (09/25/87)

In article <705@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>extended precision hardware, not transcendental instructions.  In some
>ways, this is the RISC approach: when someone says "I need X to do Y",
>first ignore X, and then figure out the right way to provide
>general-purpose building blocks to accomplish Y.

Interesting.  Sounds like my philosophy for programming language design:
the best way to provide a feature is to build the proper abstraction
mechanisms and primitive operations into the language that will provide
the most general solution to the problem which makes the feature
desirable.  The analogy with 'primitive operations' in hardware is
clear, but what's the hardware equivalent of an abstraction mechanism?
Perhaps some of the features of the Smalltalk Virtual Machine represent
hardware abstraction mechanisms.  I think user-definable microcode routines
would also qualify.  Hmmm...

--alan@pdn

greg@xios.XIOS.UUCP (Greg Franks) (09/25/87)

cik@l.cc.purdue.edu (Herman Rubin) writes about the need for all sorts
of fancy instructions done in hardware.  Might I suggest the good-ole
780 and certain Burroughs machines which let you load your own
microcode.  The details of this operation are left as an exercise for
the reader.... (especially because a: I don't have a VAX to play with,
and b: I don't know the details of this operation myself :-) ).



-- 
Greg Franks             XIOS Systems Corporation, 1600 Carling Avenue,
(613) 725-5411          Ottawa, Ontario, Canada, K1Z 8R8
uunet!mnetor!dciem!nrcaer!xios!greg        "Vermont ain't flat!"

esf00@amdahl.amdahl.com (Elliott S. Frank) (09/25/87)

>In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>BTW, there is an address modification procedure which is missing on all
>>machines I have seen except the UNIVAC's.  That is to consider the register
>>file as a memory block and allow indexing on it.  Another missing procedure
>>is to enable the register file to be treated as a block of memory so that
>>bytes or short words can be addressed.  These two operations can be combined

In article <14750@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:

> ...PDP-10...
>           the first (as far as I know) machine to have a register file.

The feature was part of the UNIVAC SS-90 and the 110x machines (all of which
[architecturally] predate the founding of DEC). It may be in the SS-80,
too (but that machine was before my time :-)).

It is one of the ultimate cases of providing a hardware feature where
the implementation ignores the architecture. (Can you say `dependent
upon a side effect'?)  The UNIVAC 1108 assumed (if memory serves me
right) that indirect addressing would reference the memory
`masked' by the registers.  Direct references to the memory locations
accessed the registers.  Two different access modes referring to the
same location referenced two different objects!
-- 

Elliott S Frank    ...!{hplabs,ames,sun}!amdahl!esf00     (408) 746-6384
               or ....!{bnrmtv,drivax,hoptoad}!amdahl!esf00

[the above opinions are strictly mine, if anyone's.]
[the above signature may or may not be repeated, depending upon some
inscrutable property of the mailer-of-the-week.]

thomsen@trwspf.TRW.COM (Mark Thomsen) (09/26/87)

In article <704@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>However, the general approach (anecdotal) is not the way people design
>computers, these days, and for good reason.
>As noted before here, a plausible way to design a computer is:
>
>1) Pick a REPRESENTATIVE set of benchmarks.
>2) Do a first-cut architecture, based on past experience.
>3) Do compilers.
>4) Add or delete features, measuring the impact by running compiled/assembled
>code through architectural simulators.
>5) Iterate until you can't find anything else to add that actually
>improves performance by some noticeable amount, or until you run out of time.

I second this -- a processor design in the abstract space of "features that
are useful in current processors, that would seem to be useful, or that every
other processor has" creates the rut that the breakthrough designs have been
exploiting.  I am familiar with the MIPS, Transputer, and Lilith processors
and all three seem very efficiently designed for their role.  By efficiency
I mean speed of execution of the applications run relative to clock speeds -
as compared to competitive processors.  The Lilith for example (Wirth's
desktop marvel) is a legitimate 1 MIP workstation with a clock speed of
6.67 MHz.  Now, if you want to use it for numerical analysis it practically
grinds to a halt.  But for programming support, document production, and such
it is exceptional.  It also supports bit-map operations at 30 Mb/s.  Not bad
for its age.

Please don't go bashing the good processors of our time.  The bad processors
are so easy to pick out.  However, after weeding them out the remainders
generally have some domain of support (Unix/C, Lisp, Modula-2, assembly
language, CAD, program development, document production, numerical analysis)
that they are quite good for.  Learn from them and sally forth to the next
generation.

This is self-serving.  I get to use the processors and I am delighted to see
what is happening.  Better and better processors are becoming available.


                                          Mark R. Thomsen

peter@sugar.UUCP (Peter da Silva) (09/27/87)

In article <14750@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
> In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >BTW, there is an address modification procedure which is missing on all
> >machines I have seen except the UNIVAC's.  That is to consider the register
> >file as a memory block and allow indexing on it...
> The PDP-10 also did this.  The first 16 memory locations were the registers.
> There was an option to get fast (non-core) memory for these few bits.

The TI 99 processor has something like an address base register, and uses the
next X words of memory as the registers. A standard trick (apparently) is to
map the registers into the I/O page. I think the subroutine call mechanism
involves copying the PC and assigning a new register file. Sort of like a
slow RISC. A very interesting design, anyway.
-- 
-- Peter da Silva `-_-' ...!hoptoad!academ!uhnix1!sugar!peter
--                 'U`  Have you hugged your wolf today?
-- Disclaimer: These aren't mere opinions... these are *values*.

henry@utzoo.UUCP (Henry Spencer) (09/29/87)

> > The transcendentals in (say) the 68881 are *better* than anything you will
> > come up with in software without large amounts of work...
> 
> The 68881 transcendentals are not implemented in hardware; they are
> implemented in microcode.  I believe the extra 0.5-1.5ulp of accuracy
> of the 68881 is due to the use of extended precision calculations, not
> to either hardware or algorithm (simple rational approximations are
> very accurate too when evaluated in extended precision)...

Nope, sorry, you have misunderstood slightly.  I wasn't saying "the 68881
is more accurate than carefully-implemented double-precision software such
as one would expect from e.g. MIPSco"; I was saying "the 68881 is more
accurate than the sloppy first-cut software that one confidently expects
XYZ Vaporboxes Inc. to ship as its `production' release".  The point is not
that the 68881 has inherent advantages over software, but that it represents
a *cheap* *prepackaged* high-quality solution.  In principle one could find
the same thing in software, but commercial realities make this unlikely
unless it comes from a university:  the 68881 can be cheaply and widely
sold at a profit because *it cannot be pirated easily*.

I agree that the right way to do transcendentals is in software, with help
(e.g. extended-precision arithmetic) in the hardware when appropriate.
But how much carefully-written software can you buy for the price of one
68881?
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

stachour@umn-cs.UUCP (09/29/87)

> In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> ...
> 
> As noted before here, a plausible way to design a computer is:
> 
> 1) Pick a REPRESENTATIVE set of benchmarks.
> 2) Do a first-cut architecture, based on past experience.
> 3) Do compilers.
I've NEVER seen anyone design compilers for a machine that is only
being simulated, and choose the architecture of the hardware based
on measurement, and build the machine later. (Well, one exception,
Multics many years ago, but that design set goals seldom met now.)
> 4) Add or delete features,
>       measuring the impact by running compiled/assembled
> code through architectural simulators.
> 5) Iterate until you can't find anything else to add that actually
> improves performance by some noticeable amount,
> or until you run out of time.
> 
> This is not a perfect recipe, of course.  For example, if the benchmark
> set is chosen poorly, bad surprises will happen. 


However, the biggest problem in getting support is finding relevant
benchmarks.  For example, Bill Young's article about design &
implementation showed that unix security was broken, and all unix
systems vulnerable, because of a 'bug'.  The cause of the bug was
over-running an array bound.  But people don't write code that checks
array bounds.  They even choose ugly languages like C that don't check,
rather than ones (like PL/I or Pascal) that do, because they can't stand
the inefficiency (or so they say). Instead they write buggy programs!
It is these buggy programs that get used for the benchmarks; they're what's there...
But try to find a benchmark with lots of good array-checking in it.
Unless the program was written in Algol for a B55xx or PL/I for a
GE6xxx, you probably won't.  So putting array-checking into hardware
(to make it reasonable for all programmers to do something which
most of them know they should, but don't) will not happen, because
the benchmarks will not contain code that checks array bounds.
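
For reference, here is roughly what a checking compiler inserts around each
indexed access, done by hand in C (a hedged sketch; checked_read is an
invented name).  Descriptor hardware such as the B5500's performs the same
comparison as part of the memory access itself:

#include <stdio.h>
#include <stdlib.h>

/* A checked array read: the test a Pascal or PL/I compiler emits for you. */
static int checked_read(const int *a, size_t len, size_t i)
{
    if (i >= len) {
        fprintf(stderr, "index %lu out of bounds (length %lu)\n",
                (unsigned long)i, (unsigned long)len);
        abort();
    }
    return a[i];
}

int main(void)
{
    int a[4] = { 1, 2, 3, 4 };
    printf("%d\n", checked_read(a, 4, 2));  /* fine */
    /* checked_read(a, 4, 9) would abort */
    return 0;
}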


BENCHMARKS PREVENT ONE FROM REPEATING ERRORS OF THE PAST, *BUT*
THEY ARE NOT VERY HELPFUL IN GUIDING THE FUTURE.

Paul Stachour
Honeywell SCTC (Stachour@HI-Multics)
UMinn. Computer Science (stachour at umn-cs.edu)

eugene@pioneer.arpa (Eugene Miya N.) (09/29/87)

It's pretty obvious to put vector and floating point hardware
in Silicon with products like the Weitek, but I was having a discussion
with a colleague about LISP machines, Intellicorp and all those
companies doing "AI."  What about putting CDR hardware into machines?
The colleague pointed out that SUN is the only company doing well in
this arena.  Agree or disagree?  Aren't Symbolics, TI, LMI doing okay?

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  On second thought, don't send me follow ups on this one.

chuck@amdahl.amdahl.com (Charles Simmons) (09/30/87)

In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>It's pretty obvious to put vector and floating point hardware
>in Silicon with products like the Weitek, but I was having a discussion
>with a colleague about LISP machines, Intellicorp and all those
>companies doing "AI."  What about putting CDR hardware into machines?
>The colleague pointed out that SUN is the only company doing well in
>this arena.  Agree or disagree?  Aren't Symbolics, TI, LMI doing okay?
>
>From the Rock of Ages Home for Retired Hackers:
>
>--eugene miya
>  {hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

John Hennessy was giving a talk on RISC architectures in Santa Clara
today.  You should take a look at the performance ratios between a
MIPS processor running LISP and dedicated LISP architectures running
LISP.  MIPS seems to win big.

(Hopefully, John Mashey or some other knowledgeable person at MIPS
will correct me if I misunderstood the slide.  It may be the case
that MIPS was only comparing LISP performance against general purpose
processors like a Cray, Vax, and a couple of small LISP boxes.)

-- Chuck

lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)

In article <15393@amdahl.amdahl.com> chuck@amdahl.amdahl.com (Charles Simmons) writes:

>In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:

>>It's pretty obvious to put vector and floating point hardware
>>in Silicon with products like the Weitek, but I was having a discussion
>>with a colleague about LISP machines, Intellicorp and all those
>>companies doing "AI."  What about putting CDR hardware into machines?
>>this arena.  Agree or disagree?  Aren't Symbolics, TI, LMI doing okay?

>John Hennessy was giving a talk on RISC architectures in Santa Clara
:
>MIPS processor running LISP and dedicated LISP architectures running
>LISP.  MIPS seems to win big.
:

On a lot of benchmarks, a general purpose (e.g. MIPS, 68020, etc.) processor
will run faster than a special purpose LISP machine.  However, the speed of
garbage collection, and a machine's behavior while it is running, are not
always so well advertised.  CDR support is easily added to any general purpose
machine, even a "RISC" machine, though it is probably not necessary.  Still,
it is a good idea (actually, I would like to see more machines with a
"descriptor" hardware data
type that could be used for lists, pointers, and vectors).  Hardware garbage
collection support is an entirely different question.  However, the
performance advantage of the LISP machines has been eroded for the simple
reason that general purpose machines, being used in a much bigger marketplace,
have typically gone through new generations much more quickly and tend to use
current technology.  It is difficult to finance the level of R&D necessary to
do that for a special purpose processor.  AI machines are not unique in this
respect.





  Hugh LaMaster, m/s 233-9,  UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")

eugene@pioneer.arpa (Eugene Miya N.) (09/30/87)

In summary, as requested (I had not originally intended to post this):
From: johnl@ima.ISC.COM (John R. Levine)

In article <2917@ames.arpa> you write:
>The colleague pointed out that SUN is the only company doing well in
>this arena.  Agree or disagree?  Aren't Symbolics, TI, LMI doing okay?

No, actually they're not.  LMI essentially went bankrupt, the empty shell
of the company was picked up by some Canadians real cheap.  Symbolics is
doing marginally well, but my impression is that's due to software more
than hardware.  Their future may well be in moving their software to
platforms like MIPSco machines or even 386's.

I was around when the T version of Scheme was under construction at Yale,
and got the strong impression that for practically anything you want to
do in Lisp, a little cleverness lets you do it on a conventional processor
with little performance loss compared to a microcoded version in similar
technology.  But for the same money you can certainly buy a lot faster
high volume conventional machine than a special purpose machine like the
Symbolics box.  Looks to me like the RISC philosophy wins again.

John

From: mips!mash@ames (John Mashey)

This must have been sarcastic.
LMI is out of that business.
Symbolics is hurting.
I don't know how the AI part of TI is doing.

Sun's SPARC architecture has only the slightest caterings to LISP,
and I hear from friends that most of the AI folks do NOT use those features,
because it turns out that you can't get at them well thru the O.S.

reiter@pandora.UUCP (10/01/87)

In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>It's pretty obvious to put vector and floating point hardware
>in Silicon with products like the Weitek, but ...
>  what about putting CDR hardware into machines?
>The colleague pointed out that SUN is the only company doing well in
>this arena.  Agree or disagree?  Aren't Symbolics, TI, LMI doing okay?

The last I heard, the various LISP companies were in trouble because of
stagnating sales (i.e. sales are constant - they're not decreasing),
which could be as much due to market saturation as anything else.

Technically, CDR is just a pointer operation and is trivial to implement
on any machine.  The special hardware that LISP machines tend to have are:

   1) Tagged data.  The tags give data type.  A word in memory might, for
example, consist of 32 data bits and a 4 bit type field.  Note that in LISP,
variables are not typed, so the types of data elements in an operation may not
be known at compile time.

   2) Memory management.  Support for CONS and for garbage collection.

   3) Special caches, instruction sets, etc., which are geared towards LISP.

[This list is by no means exhaustive]
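
A rough software analogue of item (1), for concreteness (a hedged sketch;
the tag values and names are invented, and real LISP systems differ): steal
the low bits of a word for the type tag and rely on alignment, paying a mask
or compare on every operation where tagged hardware checks for free.

#include <stdint.h>
#include <stdio.h>

enum { TAG_FIXNUM = 0, TAG_CONS = 1, TAG_OTHER = 2, TAG_MASK = 3 };

typedef uintptr_t lispval;              /* a tagged machine word */

static lispval   make_fixnum(uintptr_t n) { return (n << 2) | TAG_FIXNUM; }
static uintptr_t fixnum_value(lispval v)  { return v >> 2; }
static int       tag_of(lispval v)        { return (int)(v & TAG_MASK); }

int main(void)
{
    lispval v = make_fixnum(42);
    printf("tag %d value %lu\n", tag_of(v), (unsigned long)fixnum_value(v));
    return 0;
}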

Whether LISP machines give any price/performance advantage over conventional
machines is unclear.  I once looked into this, and received highly
contradictory data.  In any case, since LISP tends to be a "research"
language (as opposed to a "production" language), most LISP people are
more interested in good software development environments than in hardware
speed.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

rbbb@acornrc.UUCP (10/02/87)

In article <2207@umn-cs.UUCP>, stachour@umn-cs.UUCP (Paul Stachour) writes:
> However, the biggest problem getting support is finding relevent
> benchmarks.
  ...
> But try to find a benchmark with lots of good array-checking in it.
> Unless the program was written in Algol for a B55xx or PL/I for a
> GE6xxx, you probably won't.  So putting array-checking into hardware
> (to make it reasonable for all programmers to do something which
> most of them know they should, but don't) will not happen, because
> the benchmarks will not contain code that checks array bounds.

If you check some of the IBM 801 and 801-related literature, I believe you
will find a discussion of optimizations that (safely) remove checking code,
so at least someone has thought about this problem.  Running a language
with array bounds checking, the 801 (in combination with the PL8 compiler)
ran very well.  Again, RISC wins (with a clever enough compiler).

David Chase

blu@hall.cray.com (Brian Utterback) (10/02/87)

In article <826@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>In article <14750@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>> In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> >BTW, there is an address modification procedure which is missing on all
>> >machines I have seen except the UNIVAC's.  That is to consider the register
>> >file as a memory block and allow indexing on it...
>> The PDP-10 also did this.  The first 16 memory locations were the registers.
>> There was an option to get fast (non-core) memory for these few bits.

Another advantage the PDP-10 had by mapping the registers to the memory space, 
other than indexing, was in execution.  You could load a short loop into the
registers and jump to them!  The loop would run much faster, executing out 
of the registers.  
Brian Utterback
Cray Research Inc.
(603) 888-3083

franka@mmintl.UUCP (Frank Adams) (10/02/87)

In article <2913@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
|The special hardware that LISP machines tend to have are:
|
|   1) Tagged data.  The tags give data type.  A word in memory might, for
|example, consist of 32 data bits and a 4 bit type field.
|
|   2) Memory management.  Support for CONS and for garbage collection.
|
|   3) Special caches, instruction sets, etc., which are geared towards LISP.
|
|[This list is by no means exhaustive]

This looks like much the same sort of stuff that one would want for
Smalltalk.  Has anyone looked at implementing Smalltalk on Lisp machines?

(Of course, if Lisp machines really *don't* give better Lisp performance for
the price than conventional architectures, it is unlikely that they would do
better for Smalltalk.)
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

lyang%scherzo@Sun.COM (Larry Yang) (10/08/87)

In article <8668@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Nope, sorry, you have misunderstood slightly.  I wasn't saying "the 68881
>is more accurate than carefully-implemented double-precision software such
>as one would expect from e.g. MIPSco"; I was saying "the 68881 is more
>accurate than the sloppy first-cut software that one confidently expects
>XYZ Vaporboxes Inc. to ship as its `production' release".  The point is not
>that the 68881 has inherent advantages over software, but that it represents
>a *cheap* *prepackaged* high-quality solution.  In principle one could find
>the same thing in software, but commercial realities make this unlikely
>unless it comes from a university:  the 68881 can be cheaply and widely
>sold at a profit because *it cannot be pirated easily*.
>
>I agree that the right way to do transcendentals is in software, with help
>(e.g. extended-precision arithmetic) in the hardware when appropriate.
>But how much carefully-written software can you buy for the price of one
>68881?

How much more time would Motorola buy if they didn't do transcendentals
in micro/nanocode and had software engineers write libraries that they
could sell to customers?  Could the 881 be fit onto a smaller die (i.e.,
easier layout, better yield)?  What's wrong with Motorola saying:
"Here's this wonderful fp chip we've made.  It does all the basic
fp operations really fast.  If you want to do sin, cos, and stuff, then
here are the software library routines that are guaranteed to work."
Are there no competent software engineers at these IC houses?

I'll have to admit that I haven't designed any floating point arithmetic,
so if I'm way off base, someone please correct me. (Of course, I didn't
have to request this... :-) But it would seem that much would be gained
from the chip design/fab/test area if the sweating over complex functions
would be moved to the software realm.

*************************************************************************

--Larry Yang [lyang@sun.com,{backbone}!sun!lyang]|   A REAL _|> /\ |  _   _   _ 
  Sun Microsystems, Inc., Mountain View, CA      | signature |   | | / \ | \ / \
    Hobbes: "Why do we play war and not peace?"  |          <|_/ \_| \_/\| |_\_|
    Calvin: "Too few role models."               |                _/          _/

oconnor@sunray.steinmetz (Dennis Oconnor) (10/09/87)

In article <30382@sun.uucp> lyang@sun.UUCP (Larry Yang) writes:
>How much more time would Motorola buy if they didn't do transcendentals
>in micro/nanocode and had software engineers write libraries that they
>could sell to customers?  Could the 881 be fit onto a smaller die (i.e.,
>easier layout, better yield)?  What's wrong with Motorola saying:
>"Here's this wonderful fp chip we've made.  It does all the basic
>fp operations really fast.  If you want to do sin, cos, and stuff, then
>here are the software library routines that are guaranteed to work."
>Are there no competent software engineers at these IC houses?

Anyone who can write microcode for the chip considered the STANDARD
( as in, if you and the '881 disagree on a result , you are wrong )
for IEEE floating point MUST be a competent software engineer.

>I'll have to admit that I haven't designed any floating point arithmetic,

Actually, I could tell this without you admitting it, so you needn't have.

>so if I'm way off base, someone please correct me. (Of course, I didn't
>have to request this... :-) But it would seem that much would be gained
>from the chip design/fab/test area if the sweating over complex functions
>would be moved to the software realm.
>
>--Larry Yang [lyang@sun.com,{backbone}!sun!lyang]
>  Sun Microsystems, Inc., Mountain View, CA      

Right, great idea (sarcasm). For every operation currently done
in the '881 in microcode, make the user fetch an instruction from
i-mem. Boy, that'll improve performance! (heavier sarcasm).
Hey, and don't forget to use plenty of the available user
registers in these routines.

The 68000, '010 and '020 family are NOT RISCs. Putting a RISCy
FP Coprocessor on them makes as much sense as putting a
telescopic sight on a sawed-off shotgun : different design
contexts bring about different solutions.

--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

nerd@percival.UUCP (Michael Galassi) (10/10/87)

In article <30382@sun.uucp> lyang@sun.UUCP (Larry Yang) writes:
...
>I'll have to admit that I haven't designed any floating point arithmetic,
>so if I'm way off base, someone please correct me.

I've designed some floating point routines for the 68k and can testify that
it is not overly difficult, the hard part being determining what ranges of
inputs are valid and verifying that the function is well behaved over all
that range.

>... But it would seem that much would be gained
>from the chip design/fab/test area if the sweating over complex functions
>would be moved to the software realm.

The biggest advantage I see in using microcode to do the FP is that you
save the memory references while the computation is being done, freeing
up the bus for the likes of dma, concurrent cpu operations, and any other
bus master's intervention.
-- 
If my employer knew my opinions he would probably look for another engineer.

	Michael Galassi, Frye Electronics, Tigard, OR
		...!tektronix!reed!percival!nerd

lyang%scherzo@Sun.COM (Larry Yang) (10/13/87)

In article <7587@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>
>Anyone who can write microcode for the chip considered the STANDARD
>( as in, if you and the '881 disagree on a result , you are wrong )
>for IEEE floating point MUST be a competent software engineer.
>
Hmm.  I'll concede this point.  Microcoders *are* software people, in
some sense.  And in order to understand the mystical IEEE standard,
one would have to be pretty competent.
>
>The 68000, '010 and '020 family are NOT RISCs. Putting a RISCy
>FP Coprocessor on them makes as much sense as putting a
>telescopic sight on a sawed-off shotgun : different design
>contexts bring about different solutions.
>
Good point, although it was hard to extract out of all that sarcasm. :-)
Making the 881 'RISCY' would have been incredibly foolish, given the
'CISCY' design of the 68000 family.  The analogy with the shotgun
is very appropriate.


--Larry Yang [lyang@sun.com,{backbone}!sun!lyang]|   A REAL _|> /\ |  _   _   _ 
  Sun Microsystems, Inc., Mountain View, CA      | signature |   | | / \ | \ / \
    Hobbes: "Why do we play war and not peace?"  |          <|_/ \_| \_/\| |_\_|
    Calvin: "Too few role models."               |                _/          _/