[comp.arch] Not-so RISCy

jhallen@wpi.wpi.edu (Joseph H Allen) (02/10/89)

Reduction of instruction set size/complexity is the main area of design which
enhances speed in RISC processors.  Another area which I'm wondering about is
data size handling.  Modern RISC processors handle 8, 16, 32 and 64 bit words.
Some even handle data which crosses "word" boundaries (and on some (well, one)
the byte order can be changed).  The logic that must be dedicated to this must
be incredible, plus this logic is in the memory data path and therefore might be
a speed constraint (especially if the data goes through the ALU before being
presented to the registers).  Would it be a terrible hardship to only have two
data sizes (perhaps character and word) and not allow words to cross word
boundaries?  Certainly it would require that people don't use "bad"
programming techniques, similar to what has to be done on 68000 or IBM 360.
But would not the improvement in speed (by freeing up chip space to allow for
more registers or to simply reduce data path delay time) be worth it?

cik@l.cc.purdue.edu (Herman Rubin) (02/10/89)

In article <732@wpi.WPI.EDU>, jhallen@wpi.wpi.edu (Joseph H Allen) writes:

			............................

>                               Would it be a terrible hardship to only have two
> data sizes (perhaps character and word) and not allow words to cross word
> boundaries?  Certainly it would require that people don't use "bad"
> programming techniques, similar to what has to be done on 68000 or IBM 360.
> But would not the improvement in speed (by freeing up chip space to allow for
> more registers or to simply reduce data path delay time) be worth it?

It would be a real nuisance.  For numerical problems, it would be a good idea
to have _at least_ 32, 64, and 128, for both fixed point and floating point.
There are very good reasons for editing, etc., to have 8 and 16 also.  I can
also think of good uses for individual bits.  As for crossing word boundaries,
this can be very convenient, but not as important.  BTW, what is a word?  Is
it 16, 32, or 64 bits?  And if a word is 32 bits, does a 64-bit quantity have
to start on an address divisible by 64?

We can get more registers by using more chips.  Data path delay is not likely
to be reduced by having fewer types.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

mash@mips.COM (John Mashey) (02/15/89)

In article <732@wpi.WPI.EDU> jhallen@wpi.wpi.edu (Joseph H Allen) writes:
>
>Reduction of instruction set size/complexity is the main area of design which
>enhances speed in RISC processors.  Another area which I'm wondering about is
>data size handling.  Modern RISC processors handle 8, 16, 32 and 64 bit words.
>Some even handle data which crosses "word" boundaries (and on some (well, one)
>the byte order can be changed).  The logic that must be dedicated to this must
>be incredible, plus this logic is in the memory data path and therefore might be
>a speed constraint (especially if the data goes through the ALU before being
>presented to the registers).  Would it be a terrible hardship to only have two
>data sizes (perhaps character and word) and not allow words to cross word
>boundaries?  Certainly it would require that people don't use "bad"
>programming techniques, similar to what has to be done on 68000 or IBM 360.
>But would not the improvement in speed (by freeing up chip space to allow for
>more registers or to simply reduce data path delay time) be worth it?

1) Automatic handling of unaligned data is indeed expensive, which is why
RISC machines generally omit it.
2) You certainly need word & character operations [to match the statistics
of user programs.]  If you have to materialize halfword ops, UNIX kernel
code will suffer, for three reasons:
	a) There are many densely-encoded structures.  Some of those might
	convert shorts to ints, but that doesn't do anything about:
	b) Networking code has 16-bit things all over the place, and you have
	NO CHOICE about the sizes, and
	c) When dealing with arbitrary devices, across things like VME buses,
	you'd better be able to generate indivisible 16-bit loads/stores,
	or your choice of peripheral controllers will be impacted.  Some must
	be exactly 16-bits to match the semantics of the devices.
Although MOST user programs don't use 16-bit quantities a lot, some do, a lot.

3) Once you have load word, load byte [signed|unsigned], and load half
[signed|unsigned], all of which you really want to have, it doesn't take
much more logic to do the unaligned operations (as separate instructions,
NOT as an automatic thing that happens for unaligned operations).
4) Once you have all of that, it actually takes very little logic to do
the byte-ordering swapping: in fact, what really happened was that the
alignment network that shuffles bytes around anyway just got more complete.
Oddly enough, I don't think it ended up taking any more silicon space,
as the width was the same (32 bits), and the height was already forced by
other constraints.
5) As usual, most of this has to be determined scientifically, by simulation
of the impact of omitting the partial-word instructions.  It is interesting
that at least {HP, MIPS, Sun, Motorola} all came to the same conclusions on
this (include the partial-word load/stores).   In our case, we had some
heritage of word+byte only (Stanford MIPS); I wouldn't put UNIX on a machine
that didn't have 16-bit operations, even though many user-level statistics
wouldn't justify their presence.
6) The unaligned load/store operations have proved absolutely invaluable.
People may be able to clean up their act on new code, but sometimes they
have huge databases that have alignment problems.  The unaligned operations
turn out to be useful for C strings, COBOL+PL/1, and for porting large
FORTRAN programs that have COMMON+EQUIVALENCE combinations that effectively
prohibit "correct" alignment, especially if these came from the IBM or DEC
worlds...which a few programs do.  If you own a 2-million line CAD program,
which you didn't write, and which contains code thru which the armies have
marched thru the years, you do NOT want to be told that you must rework the
program before you can get it to work the very first time.  It's a lot easier
to turn on a compiler switch that uses the unaligned instructions, typically
losing 10-15% of performance, and either tune it later, or not bother at all,
but at least get the application working....
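
As a rough illustration of what code compiled for possibly-unaligned data
has to accomplish, here is a minimal sketch in C that reads and writes a
32-bit value at any byte address using byte accesses only.  The function
names and the little-endian layout are assumptions made for the example;
it does not model any particular machine's unaligned load/store
instructions, which do the same job in fewer steps.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value starting at any byte address.
   Only byte loads are issued, so p need not be a multiple of 4. */
uint32_t load_u32_le_unaligned(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value in little-endian order at any byte address,
   again using byte stores only. */
void store_u32_le_unaligned(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}
```

Four byte accesses plus shifts in place of one word access is the kind of
overhead behind the performance loss mentioned above; the exact figure
depends on how often such accesses occur.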

Anyway, it's a good question: it's always good to question why features are
included.  In this case, there are good reasons.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rcbaps@eutrc3.UUCP (Pieter Schoenmakers) (02/16/89)

In article <13259@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>[...] I wouldn't put UNIX on a machine
>that didn't have 16-bit operations, even though many user-level statistics
>wouldn't justify their presence. [...]

Just for your information: it has been done: the Acorn Archimedes R140,
which is to be released officially this month, runs Unix BSD on the ARM,
a load/store RISC processor, supporting only 32 bit operations on registers
and having word (32bit) and signed char (8bit) load/store operations. 
   I don't have any benchmarks on the Unix version, but the C compiler I have
on my Archimedes warns about the use of shorts (ansi! :), but is _very_ fast
for a desktop computer running at a mixture of 4 and 8 MHz (Dhrystone results
put it just below an IBM PS2/80).

---Tiggr

mash@mips.COM (John Mashey) (02/20/89)

In article <483@eutrc3.UUCP> rcbaps@eutrc3.UUCP (Pieter Schoenmakers) writes:
>In article <13259@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>[...] I wouldn't put UNIX on a machine
>>that didn't have 16-bit operations, even though many user-level statistics
>>wouldn't justify their presence. [...]
>
>Just for your information: it has been done: the Acorn Archimedes R140,
>which is to be released officially this month, runs Unix BSD on the ARM,
>a load/store RISC processor, supporting only 32 bit operations on registers
>and having word (32bit) and signed char (8bit) load/store operations. 
>   I don't have any benchmarks on the Unix version, but the C compiler I have
>on my Archimedes warns about the use of shorts (ansi! :), but is _very_ fast
>for a desktop computer running at a mixture of 4 and 8 MHz (Dhrystone results
>put it just below an IBM PS2/80).

Oops, I should have been more specific.  Not having anything but 32-bit
arithmetic doesn't bother me (or most of the other RISC types), but I care
about 16-bit loads and stores both for performance reasons, and for the
structural reason of dealing cleanly with 16-bit device registers from
arbitrarily-chosen peripheral boards.  UNIX certainly can be put on a machine
without 16-bit load/stores, and has been put on far uglier machines,
and for some implementations it might well be the least of evils to leave
out halfword operations.  [Note, of course, that Dhrystone doesn't use
halfwords in any significantly noticeable numbers...]

Anyway, I didn't mean to imply that it was impossible to put UNIX on such
a machine, merely that *I* wouldn't do it!
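
To make the device-register concern concrete, here is a small sketch in C.
The register layout and function names are hypothetical, and an ordinary
struct stands in for real hardware.  With a halfword store available, the
first function can compile to a single 16-bit bus cycle; the second shows
the byte-by-byte fallback, during which a real device would see its
register half written.

```c
#include <stdint.h>

/* Hypothetical 16-bit device register block (not a real peripheral). */
struct dev_regs {
    volatile uint16_t status;
    volatile uint16_t command;
};

/* One 16-bit store: a single indivisible access if the machine has a
   halfword store instruction. */
void dev_send_command(struct dev_regs *dev, uint16_t cmd)
{
    dev->command = cmd;
}

/* Fallback built from two byte stores.  The bytes are copied in memory
   order, so the result is the same on either byte order, but the device
   sees two separate 8-bit accesses instead of one 16-bit access. */
void dev_send_command_bytewise(struct dev_regs *dev, uint16_t cmd)
{
    volatile uint8_t *dst = (volatile uint8_t *)&dev->command;
    const uint8_t *src = (const uint8_t *)&cmd;

    dst[0] = src[0];   /* register is half updated at this point */
    dst[1] = src[1];
}

/* One 16-bit load of the status register. */
uint16_t dev_read_status(const struct dev_regs *dev)
{
    return dev->status;
}
```

In ordinary memory both versions leave the same value behind; the
difference only matters when something on the bus reacts to each access.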

andrew@frip.gwd.tek.com (Andrew Klossner) (02/21/89)

[]

	"Would it be a terrible hardship to only have two data sizes
	(perhaps character and word) and not allow words to cross word
	boundaries?"

Data sizes: you need to do atomic 8-bit, 16-bit, and 32-bit loads and
stores in order to deal with all the sorts of device registers you
might meet.

No crossing boundaries:  absolutely.  My favorite machines prohibit
this.  I find it useful for detecting garbaged pointers while
debugging.  Also, it's quite convenient for the kernel if an
instruction can't reference more than one page, which wouldn't be so if
a load could refer to a word that starts in one page and ends in
another.
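
The page argument can be checked with a little arithmetic.  The sketch
below, in C, assumes a 4096-byte page (the size is an assumption; any
power of two at least as large as the access works the same way) and
power-of-two access sizes.  A naturally aligned access never straddles a
page, while an unaligned one can.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size, for illustration only */

/* True if addr is a multiple of size; size must be a power of two. */
bool is_aligned(uintptr_t addr, unsigned size)
{
    return (addr & (uintptr_t)(size - 1u)) == 0;
}

/* True if the access [addr, addr + size) touches more than one page. */
bool crosses_page(uintptr_t addr, unsigned size)
{
    uintptr_t first_page = addr / DEMO_PAGE_SIZE;
    uintptr_t last_page  = (addr + size - 1u) / DEMO_PAGE_SIZE;
    return first_page != last_page;
}
```

For example, a 4-byte load at 0x0FFC stays inside the first page, while
the same load at 0x0FFE is unaligned and spills into the second.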

On the down side, this breaks a lot of existing code.  A pathological
case is the Fortran program that uses EQUIVALENCE statements to do its
own memory allocation (there being no such facility in the language),
and which "knows" that a double can be equivalenced to any integer.
This gives rise to double references at addresses that are not
multiples of eight bytes.  One workaround is to have the Fortran
compiler fall back to using two single-word operations to fetch or
store a double.
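
A sketch of that workaround, written in C rather than as compiler output:
the double is moved as two 32-bit words, so the storage only needs to be
word aligned.  The function names are made up for the example, and an
8-byte double is assumed.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == 2 * sizeof(uint32_t),
               "sketch assumes an 8-byte double");

/* Fetch a double that lives at a word-aligned (not necessarily
   doubleword-aligned) position, using two single-word loads. */
double load_double_from_words(const uint32_t *w)
{
    uint32_t halves[2];
    double d;

    halves[0] = w[0];               /* first single-word load  */
    halves[1] = w[1];               /* second single-word load */
    memcpy(&d, halves, sizeof d);   /* reassemble without aliasing tricks */
    return d;
}

/* Store a double at a word-aligned position using two single-word stores. */
void store_double_to_words(uint32_t *w, double d)
{
    uint32_t halves[2];

    memcpy(halves, &d, sizeof d);
    w[0] = halves[0];
    w[1] = halves[1];
}
```

Used on an integer array standing in for a COMMON block,
`store_double_to_words(&common[1], x)` works whether or not `&common[1]`
happens to be a multiple of eight bytes.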

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

aglew@mcdurb.Urbana.Gould.COM (02/21/89)

Alignment and selection of bytes are two different things.

<Deliberately obscure comment. I can't give away *all* my
research topics>

rcbaps@eutrc3.UUCP (Pieter Schoenmakers) (02/22/89)

In article <11040@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>[]
>
>	"Would it be a terrible hardship to only have two data sizes
>	(perhaps character and word) and not allow words to cross word
>	boundaries?"
>
>Data sizes: you need to do atomic 8-bit, 16-bit, and 32-bit loads and
>stores in order to deal with all the sorts of device registers you
>might meet.

On the Archimedes (not only the Unix machine), which has only 32-bit and
8-bit load/store operations (both aligned), the I/O bus is 16 bits wide.
Reading from I/O space puts the data in the low 16 bits of the databus;
writing into I/O space puts the top 16 bits of the databus onto the I/O bus.
Both 16-bit and 8-bit devices are no problem; all are accessed using 32-bit
operations.
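
In C terms, the convention described above amounts to a shift on the way
out and a mask on the way in.  The sketch below is only a model: the
helper names and the simulated register are assumptions, and what the
upper half of the databus carries on a read is not stated here, so the
model simply masks it off.

```c
#include <stdint.h>

/* Build the 32-bit word to store so that v appears on the 16-bit I/O bus
   (writes take the top 16 bits of the databus). */
uint32_t io_word_for_write(uint16_t v)
{
    return (uint32_t)v << 16;
}

/* Extract the device value from a 32-bit load
   (reads deliver data in the low 16 bits of the databus). */
uint16_t io_value_from_read(uint32_t word)
{
    return (uint16_t)(word & 0xFFFFu);
}

/* Simulated 16-bit device register, standing in for real hardware. */
struct sim_io_reg {
    uint16_t latch;
};

/* Model of a 32-bit store to I/O space: the device latches bits 31..16. */
void sim_io_store32(struct sim_io_reg *r, uint32_t word)
{
    r->latch = (uint16_t)(word >> 16);
}

/* Model of a 32-bit load from I/O space: data arrives in bits 15..0. */
uint32_t sim_io_load32(const struct sim_io_reg *r)
{
    return r->latch;
}
```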

---Tiggr

mo@prisma (02/24/89)

The Acorn Risc Machine (ARM) is a very interesting beast.
All of its instructions are conditional in that they look
at the condition codes.  Further, since the machine does NOT
do delayed branches and it has a simple pipe, branches are
a bit more expensive than might be expected.  Hence, it makes
sense in many cases to do an if-then-else as

	condition set
	true: instr
	true: instr
	true: instr
	true: instr
	true: instr
	true: instr
	false: instr
	false: instr
	false: instr
	false: instr
	false: instr
	false: instr

where the processor just falls through the untaken arm at noop speed, which
is one cycle per noop.  I don't remember exactly when the
tradeoff occurs, but it is surprisingly effective.
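
One way to reason about where the tradeoff falls is a toy cycle count,
sketched below in C.  Everything in it is an assumption apart from the
one-cycle cost of a skipped instruction: every executed instruction is
taken to cost one cycle, an untaken branch one cycle, and the cost of a
taken branch is left as a parameter.  The names are invented for the
example.

```c
/* Estimated cycles for each outcome of an if-then-else. */
struct ite_cost {
    unsigned then_path;   /* cycles when the condition is true  */
    unsigned else_path;   /* cycles when the condition is false */
};

/* Conditional-execution form: both arms are always issued, and each
   instruction costs one cycle whether it executes or is skipped. */
struct ite_cost cost_conditional(unsigned n_then, unsigned n_else)
{
    struct ite_cost c;
    c.then_path = n_then + n_else;
    c.else_path = n_then + n_else;
    return c;
}

/* Branching form, assumed to be laid out as:
       b<false> else     ; untaken on the then path, taken on the else path
       <then arm>
       b end             ; always taken
   else:
       <else arm>
   end:
   taken_cost is the assumed cycle cost of a taken branch. */
struct ite_cost cost_branching(unsigned n_then, unsigned n_else,
                               unsigned taken_cost)
{
    struct ite_cost c;
    c.then_path = 1u + n_then + taken_cost;
    c.else_path = taken_cost + n_else;
    return c;
}
```

Under this model the conditional form wins on the then path when the else
arm is shorter than `1 + taken_cost`, and on the else path when the then
arm is shorter than `taken_cost`, so it favors short arms; the real
break-even depends on the actual branch timing.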

Further, the ARM folks took a slightly different view of RISC.
To paraphrase the conversation I had with them:

	The usual RISC folks ask the question: what's the best
	way to use 200k transistors in building a VLSI cpu.
	We (Acorn) asked: How do we build the simplest, dirt-cheapest
	cpu possible in 20K transistors that still gets good
	performance?

Well, the ARM is pretty spiffy.  It is already being used by
at least one peripheral controller company because "where else
can you get 6 mips for $35 with a decent instruction set and
large address space that already has a good C compiler and a
decent (Unix-based) development environment?"

All in all, a very tasty piece of work.  

	-Mike