[comp.arch] 16 & 32 bit vs 32 bit only instructions for RISC.

rajiv@im4u.UUCP (Rajiv N. Patel) (02/26/88)

>
>All you 32-bit instruction advocates : how many of your 32-bits of
>instruction are usually wasted ( like by leading zeroes or ones, or
>unused register specifications ) ? If it sounds like I'd welcome a
>debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
>what comp.arch is for ? And I said a DEBATE, not a fire-fight :-)
>
>--
>	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
>				ARPA: OCONNORDM@ge-crd.arpa
>    "Nuclear War is NOT the worst thing people can do to this planet."


   	I agree with Dennis, this is a great topic of DEBATE on comp.arch.  I
   am not sure whether it has already been discussed, but it's worthwhile
   starting again. 
   	In recent years many new(?) RISC architectures have been proposed, and
   each of them seems to have differences in its instruction set.
   Some have only 32 bit instructions whereas others have both 16 and 32
   bit instructions. Then there are some, like the Bell CRISP, which have
   16, 48 and 80 (believe it or not!) bit instructions. With these variations
   it seems only fair to question how the variable length instruction 
   architectures can compete with the so called fixed length (faster decode)
   architectures.
   	I have done some assembly coding for an ongoing RISCy project here
   where we have 16 and 32 bit instructions available. In my experience most
   of the programs I have coded (<2K instructions) have about 70-90% of their
   instructions from the 16 bit category. This tells me that 16 bit
   instructions are indeed very useful, and that the additional time
   one may incur in decoding mixed 16 and 32 bit instructions could be offset by
   the time saved in fetching instructions from memory/cache, assuming a 32
   bit wide bus. The fixed instruction architectures never seem to talk
   about the memory traffic involved in fetching all the leading zeros and
   the unnecessary third register name. Even the CRISP people, who had the
   courage to use an 80 bit instruction, have 16 bit instructions and claim a
   very high percentage of these.
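   	A quick back-of-the-envelope check on the fetch-traffic side (purely
   illustrative; the 80% figure is just an assumed point in the 70-90% range):

    /* Average bytes fetched per instruction with an assumed 80/20 mix of
     * 16 and 32 bit instructions, versus a fixed 32 bit encoding.
     */
    #include <stdio.h>

    int main(void)
    {
        double mixed = 0.8 * 2.0 + 0.2 * 4.0;   /* 2.4 bytes/instruction */
        double fixed = 4.0;

        printf("mixed 16/32: %.1f bytes/instruction\n", mixed);
        printf("fixed 32:    %.1f bytes/instruction\n", fixed);
        printf("fetch traffic saved: %.0f%%\n", 100.0 * (1.0 - mixed / fixed));
        return 0;
    }
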
	I would like to know from other, more experienced people (compared to
   me, of course) what the different reasons might be for using a fixed 32 bit
   instruction architecture.


   Rajiv N Patel.
   (rajiv@im4u)
   University of Texas at Austin.

  

haynes@ucscc.UCSC.EDU (99700000) (02/26/88)

In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:
>
>   	I agree with Dennis, this is a great topic of DEBATE on comp.arch . I
>   am not sure if it has already been discussed already but its worth while
>   starting again. 
I'm glad somebody brought this up, and while we're also at it let's debate
the value of including 16 bit data types in RISC machines.  A variety
of data sizes slows a machine down almost as much as a variety of
instruction sizes.  I was rather surprised to see that Sun included
16-bit data in SPARC in this day and age of cheap memory.  8-bit data
has the obvious utility for character strings; but do we really need
16-bit integers anymore?

haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes

bfb@mtund.ATT.COM (Barry Books) (02/27/88)

I wrote code for CRISP a while ago, including machine language ( no tools ),
and was impressed by the opcodes.  CRISP had only a few instructions, but
calling it RISC may be stretching things since it had three-operand instructions
and no registers.  However, it had a 32 word stack cache, which it seems to me
is better than registers ( or even register windows, which at least allow future
versions to have more registers without recompiling to take advantage ).  Most
of the 16 bit instructions had 5 bit offsets so they could be used to access
the stack cache with small instructions.  This allowed instructions working on 
local variables to be small.  The 32 bit instructions I believe had 10 bit
offsets, so as the stack cache is made larger it still doesn't take a large
address field.  These are also useful for short jumps.  The 80 bit instructions
could take two addresses ( useful for memcpy, which seems to happen all too 
frequently ).  The thing that impressed me the most was that it didn't take
a killer optimizing compiler ( read big, buggy and slow ) to make the
machine run fast.  It seems for stack based languages a stack cache is
much more sensible than having a bunch of registers that the compiler
has to figure out what to do with.  Also, since the subject of jumps seems
to be popular: CRISP used a branch prediction bit which, if correct, caused
jumps to take no cycles.  Branch prediction is fairly easy for loops; the
compiler I used always set it so the branch was taken.  This simple
approach works for loops, assuming you go through them more than once.
All in all a nice design.

Barry Books
mtund!bfb		better is bigger

elg@killer.UUCP (Eric Green) (02/27/88)

in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
>>All you 32-bit instruction advocates : how many of your 32-bits of
>>instruction are usually wasted ( like by leading zeroes or ones, or
>>unused register specifications ) ? If it sounds like I'd welcome a
>>debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
>    of the programs I have coded (<2K instructions) have about 70-90% of the
>    instructions from the 16 bit category. This only tells me that indeed 16
>    bit instructions are very useful and the additional amount of time which
>    one may incur in decoding 16 and 32 bit instructions could be offset by
>    the time saved in fetching instructions from memory/cache assuming a 32
>    bits wide bus. The fixed instruction architectures never seem to talk
>    about the memory traffic involved for getting all the leading zeros and
>    unnecessary third register name.

The operative parameter here is: is the bus width n (n > 1) times greater than
the instruction width? If so, then it doesn't matter how large the
instructions are -- you'll always be able to fetch multiple instructions
faster than you can execute them. For example, what you mentioned -- a 32 bit
bus, with 16 bit instructions. Or a 64 bit bus, with 32 bit instructions.
Execution-time wise, it doesn't matter either way, unless there's additional
decoding overhead (such as variable length instructions for the first, but
not for the second). 
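
A toy model of that claim (all the numbers below are assumptions picked for
illustration, not measurements):

    /* Bring in bus_bits per cycle, consume one instruction per cycle, and
     * count the cycles on which the fetch side couldn't keep up.
     */
    #include <stdio.h>

    int main(void)
    {
        int bus_bits  = 64;   /* assumed fetch-path width           */
        int insn_bits = 32;   /* fixed instruction width            */
        int buffered  = 0;    /* bits waiting in the fetch buffer   */
        int stalls    = 0;
        int cycle;

        for (cycle = 0; cycle < 100; cycle++) {
            buffered += bus_bits;         /* one bus transfer per cycle */
            if (buffered >= insn_bits)
                buffered -= insn_bits;    /* issue one instruction      */
            else
                stalls++;                 /* fetch fell behind          */
        }
        printf("stalls in 100 cycles: %d\n", stalls); /* 0 when bus >= insn width */
        return 0;
    }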

And then there is the case of cache. Whenever you fetch an opcode out of cache
memory, you have no delays anyhow.  Someone from ?Pyramid? posted about their
architecture a long time ago. The cache is a very important part of that
machine, and is integrated with a bus that's wider than the instruction/data
width. For example, 32-bit instructions & data, with a 192-bit-wide memory
bus. Considering the locality of data, that means the next 5 instructions will
already be in cache, instantly available (nearabouts). I fail to see how this
can slow the machine down any, except for the I/O overhead of loading it into
main memory in the first place (which is not a CPU delay, but, rather, a
response time delay during which some other process is running).

So, while 16-bit variable-length instructions with a 32-bit data bus may be a
win on a machine with slow memory access time and no cache (i.e. your typical
microcomputer, at the moment), 32-bit fixed length instructions with a 64-bit
instruction-fetch memory interface will blow it into the weeds come
ultimate-performance time, because of the lack of instruction decode overhead.
All a matter of keeping the memory interface bigger than the instruction
size.... 

I also have some papers here from the original RISC guys at UC-Berkeley, and
AMD's design team, which discuss the issue at great length. Their basic
conclusion is that the locality of reference of large caches means that 32-bit
fixed length instructions are a big win even in the absence of a big memory
interface. Of course, their basic problem was justifying the larger flow of
instructions in a RISC machine versus a CISC machine, instead of specifically
addressing variable-length vs. fixed-length instructions, but their
conclusions still apply. At least, if you accept the basic premise of RISC,
that is. 

--
Eric Lee Green  elg@usl.CSNET     Asimov Cocktail,n., A verbal bomb
{cbosgd,ihnp4}!killer!elg              detonated by the mention of any
Snail Mail P.O. Box 92191              subject, resulting in an explosion
Lafayette, LA 70509                    of at least 5,000 words.

mash@mips.COM (John Mashey) (02/28/88)

In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:

>I'm glad somebody brought this up, and while we're also at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has the obvious utility for character strings; but do we really need
>16-bit integers anymore?

I encourage all potentially competitive RISC machines to avoid 16-bit
integer support. :-)

More seriously:
-UNIX user-level programs don't use 16-bit quantities very much.

-A few applications use halfwords a lot: Berkeley timberwolf,
for example, has 11% of its instructions (on MIPS R2000) as
load/store halfword.

-The UNIX kernel uses halfwords frequently, because it has large
arrays of structs whose data must be tightly packed.  You might argue
that some of these could be expanded (and they can), but others
are painful.

-If you want to do data communications, and write device drivers,
both cases where data structuring is beyond your control, you will be
very sorry if you don't have halfwords: both of these applications
use them frequently.

-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

frazier@oahu.cs.ucla.edu (Greg Frazier) (02/28/88)

In article <3508@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
>in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
>>    bit instructions are very useful and the additional amount of time which
>>    one may incur in decoding 16 and 32 bit instructions could be offset by
>>    the time saved in fetching instructions from memory/cache assuming a 32
>>    bits wide bus. The fixed instruction architectures never seem to talk
>>    about the memory traffic involved for getting all the leading zeros and
>>    unnecessary third register name.
>
>The operative parameter here is: is the bus width n (n > 1) times greater than
>the instruction width? If so, then it doesn't matter how large the
>instructions are -- you'll always be able to fetch multiple instructions
>faster than you can execute them. For example, what you mentioned -- a 32 bit
>
>And then there is the case of cache. Whenever you fetch an opcode out of cache
>memory, you have no delays anyhow.  Someone from ?Pyramid? posted about their
[cut explanation]
>
>So, while 16-bit variable-length instructions with a 32-bit data bus may be a
>win on a machine with slow memory access time and no cache (i.e. your typical
>microcomputer, at the moment), 32-bit fixed length instructions with a 64-bit
>instruction-fetch memory interface will blow it into the weeds come
[cut some more]
>
>--
>Eric Lee Green  elg@usl.CSNET     Asimov Cocktail,n., A verbal bomb
>{cbosgd,ihnp4}!killer!elg              detonated by the mention of any
>Snail Mail P.O. Box 92191              subject, resulting in an explosion
>Lafayette, LA 70509                    of at least 5,000 words.

Eric has a good point about the cache.  However, if one is using
both 16 and 32 bit instructions, and 70-90% of the instructions
are the shorter ones, then you are going to be able to fit almost
twice as many instructions into the instruction cache.  Indeed,
when the RISC group went on to implement a RISC instruction cache
("Architecture of a VLSI Instruction Cache for a RISC", Patterson
et al., 10th Annual Symposium on Comp. Arch.), they went to
the mixed-instruction-length format in order to improve its
performance.  They did expand the instructions on the back end of
the cache, partly in order to keep the same CPU hardware.

If one states the RISC philosophy as optimizing those aspects of the
CPU/instruction set which are used the most often, then it makes
very good sense to make the instructions executed 70-90% of the time
16 bits.

In any case, if one is going to have a dedicated, off-chip instruction
cache, then the benefits of having 16-bit instructions are rather
minimal.  However, if one is going to have a combined instruction/data
cache, or if one is going to have an on-chip instruction cache,
then the amount of memory required to achieve a good instruction
hit ratio is very important, and having 16-bit instructions should
be a real win.
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ihnp4,ucbvax,sdcrdcf,trwspp,randvax,ism780}
	       ----^/----					!ucla-cs!frazier
		   /

radford@calgary.UUCP (Radford Neal) (02/28/88)

In article <2116@saturn.ucsc.edu>, haynes@ucscc.UCSC.EDU (99700000) writes:

> ... while we're also at it let's debate
> the value of including 16 bit data types in RISC machines.  A variety
> of data sizes slows a machine down almost as much as a variety of
> instruction sizes.  I was rather surprised to see that Sun included
> 16-bit data in SPARC in this day and age of cheap memory.  8-bit data
> has the obvious utility for character strings; but do we really need
> 16-bit integers anymore?

I don't understand this. Given that you support 8-bit bytes, and thus
need the extra two bits in addresses, plus the complications of more than
one data size on the bus, why does supporting 16-bit data cost a lot?
As far as instructions go, it looks like somewhere between two and
four extra instructions for a load/store architecture.

I agree that 16-bit data will be used less and less. It's not that
there isn't a lot of data that would fit in 16 bits, and it's
not that this wouldn't sometimes be a significant savings; it's because
of the popularity of C. If Pascal (which of course had other faults)
had become dominant, lots of people would be declaring variables
like "var i : 1..Max_widgets", and if Max_widgets were 1000 this would
be held as 16 bits. As it is, declaring a C variable to be "short" is
an invitation to future problems.
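
For instance (purely illustrative, with a hypothetical limit of 1000 widgets):

    /* The Pascal subrange documents the intended range; the nearest C
     * declaration just picks a width and hopes the limit never grows.
     */
    #define MAX_WIDGETS 1000            /* hypothetical limit */

    short widget_count;                 /* fits today; silently breaks if
                                           MAX_WIDGETS is ever raised past 32767 */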

   Radford Neal

root@mfci.UUCP (SuperUser) (02/29/88)

....
>frequently ).  The thing that impressed me the most was it didn't take
>a killer optimizing compiler ( read big, buggy and slow ) to make the
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>machine run fast.  It seems for stack based langauges a stack cache is
>
>Barry Books
>mtund!bfb		better is bigger

I disagree with this attitude.  If you want to run fast, you pay the
price.  Whatever performance you got out of whatever compiler you had,
if it wasn't doing a good job at optimizing, then you could have gotten
more.  If you let that extra performance dribble away, you may have
your reasons, but it's likely that your competitors won't, so you may
not be able to keep doing it for long.  And anyway, "killer optimizing
compilers" don't have to be buggy.  Did you have some example in mind?

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/29/88)

In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>In article <2574@im4u.UUCP> rajiv@im4u.UUCP (Rajiv N. Patel) writes:
>I'm glad somebody brought this up, and while we're also at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has the obvious utility for character strings; but do we really need
>16-bit integers anymore?

To us Japanese, a 16-bit unsigned integer also has obvious utility for 
character strings.  We use a 16-bit/char. code for Kanji + Kanas + alphanumeric
characters.  

There are a few variations for encoding Kanji (unbounded, but about 7k-8k for
business use), Kanas (2 different sets of about 60 each), and alphanumerics
intermixed. Scanning these mixed 8-bit (ASCII) / 16-bit (JIS; Japanese Industry 
Standard) strings is a pain (a simple FSM, sometimes).  You need a new
set of C string libraries. Also, in these codes bit 7 (the 8th bit) is used, 
which is messy with some UNIX utilities.  Want to try Boyer-Moore string
matching with an alphabet size of 10K?  It would be an interesting exercise.  
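
A rough sketch of the kind of scanning loop involved (this is only an
illustration, not a real JIS library, and it assumes the simplified convention
that a byte with bit 7 set begins a two-byte character):

    #include <stdio.h>

    static int count_chars(const unsigned char *s)
    {
        int n = 0;

        while (*s != '\0') {
            if (*s & 0x80) {            /* lead byte of a 16-bit character */
                if (s[1] == '\0')
                    break;              /* truncated: missing trailing byte */
                s += 2;
            } else {
                s += 1;                 /* plain 8-bit (ASCII) character */
            }
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* "AB" followed by one two-byte character (bytes chosen arbitrarily) */
        static const unsigned char text[] = { 'A', 'B', 0x88, 0x40, '\0' };

        printf("%d characters\n", count_chars(text));   /* prints 3 */
        return 0;
    }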

Chinese and several other languages need larger character sets, too.  
If you want to export computers and operating systems to these countries, 
you had better not forget this (along with I/O devices, of course -- who would
buy a business computer which prints bills in unreadable characters?).

==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi	"Life's rock."   "Climb now, work later." and, now,
			"Life's snow."   "Ski now, work later."
ohbuchi@cs.unc.edu	<on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/29/88)

In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has the obvious utility for character strings; but do we really need
>16-bit integers anymore?
Yes, because no matter how cheap them bits are, using twice as many as you
need will always cost twice as much!

jesup@pawl18.pawl.rpi.edu (Randell E. Jesup) (02/29/88)

In article <3508@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
>The operative parameter here is: is the bus width n (n > 1) times greater than
>the instruction width? If so, then it doesn't matter how large the
>instructions are -- you'll always be able to fetch multiple instructions
>faster than you can execute them. For example, what you mentioned -- a 32 bit
>bus, with 16 bit instructions. Or a 64 bit bus, with 32 bit instructions.

	Who ever said you had to run the bus at the same speed as the CPU?
The RPM-40 uses fixed-length 16-bit instructions, and fetches 2 every other
cycle.  This has several advantages: cheaper memory, more time for external
caches to respond, and improved efficiency of caches since the same amount of
memory holds twice the number of instructions.

	I do agree that variable length instructions ARE a big drag on
performance in most cases.  There are ways to get some of the effects of
variable length instructions, such as the RPM-40 PREFIX instruction, in a
fixed-length architecture.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jesup@pawl19.pawl.rpi.edu (Randell E. Jesup) (03/01/88)

In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>I'm glad somebody brought this up, and while we're also at it let's debate
>the value of including 16 bit data types in RISC machines.  A variety
>of data sizes slows a machine down almost as much as a variety of
>instruction sizes.  I was rather surprised to see that Sun included
>16-bit data in SPARC in this day and age of cheap memory.  8-bit data
>has the obvious utility for character strings; but do we really need
>16-bit integers anymore?

	One reason should be enough: compatibility.  If you want another:
RAM space.  RAM, especially for RISC computers, is less than cheap.  Priced
any 20-ns SRAMs lately?  Also note that DRAM prices are going UP lately, in
contradiction to all established precedent.

	And no, being able to load and store halfwords doesn't really
slow a machine down.  All internal operations can be done in natural word
size.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/02/88)

An article by elg@killer.UUCP (Eric Green) says:
] in article <2574@im4u.UUCP>, rajiv@im4u.UUCP (Rajiv N. Patel) says:
] >>All you 32-bit instruction advocates : how many of your 32-bits of
] >>instruction are usually wasted ( like by leading zeroes or ones, or
] >>unused register specifications ) ? If it sounds like I'd welcome a
] >>debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
] 
] The operative parameter here is: is the bus width n (n > 1) times greater than
] the instruction width? If so, then it doesn't matter how large the
] instructions are -- you'll always be able to fetch multiple instructions
] faster than you can execute them. For example, what you mentioned -- a 32 bit
] bus, with 16 bit instructions. Or a 64 bit bus, with 32 bit instructions.
] Execution-time wise, it doesn't matter either way, unless there's additional
] decoding overhead (such as, variable length instructions for the first, but
] not for the second). 

The real question isn't bus WIDTH, but rather bus BANDWIDTH, usually
measured in megabytes per second. This is NOT an infinitely available
resource for single-chip CMOS processors. The package limits you to
a finite number of pins, and CMOS can only drive those
pins at a certain (technology-dependent) rate.

The comparison, then, is between your INTERNAL execution rate,
measured in MIPS, and your available instruction-fetch bandwidth,
measured in MB/sec. The ratio of (MB/s) over MIPS yields the
number of BYTES/INSTRUCTION. First order, of course.
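
A tiny worked example of that ratio (both numbers are assumed, purely for
illustration):

    /* First-order check: fetch bandwidth over execution rate gives the
     * bytes per instruction the package can sustain.
     */
    #include <stdio.h>

    int main(void)
    {
        double fetch_mb_per_s = 80.0;   /* assumed pin-limited fetch bandwidth */
        double mips           = 40.0;   /* assumed internal execution rate     */

        /* 2.0 here -> 16-bit instructions fit, 32-bit ones would starve */
        printf("sustainable bytes/instruction = %.1f\n", fetch_mb_per_s / mips);
        return 0;
    }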
 
] And then there is the case of cache. Whenever you fetch an opcode out of cache
] memory, you have no delays anyhow.  Someone from ?Pyramid? posted about their
                   ^^^^^^^^^^^^^^^^ uh, really ? Don't you mean LESS delay ?

] architecture a long time ago. The cache is a very important part of that
] machine, and is integrated with a bus that's wider than the instruction/data
] width. For example, 32-bit instructions & data, with a 192-bit-wide memory
] bus. Considering the locality of data, that means the next 5 instructions will
] already be in cache, instantly available (nearbouts). I fail to see how this
] can slow the machine down any, except for the i/o overhead for loading it into
] main memory in the first place (which is not a cpu delay, but, rather, a
] response time delay during which some other process is running).

I'm just guessing, but it sounds like the Pyramid machine is NOT a
single-chip microprocessor. Different horses for different courses,
and all that. Using 192 pins on a micro for the memory bus would
be pushing package technology, I think, once you added the other signals:
address bus, power and ground.

] So, while 16-bit variable-length instructions with a 32-bit data bus may be a
] win on a machine with slow memory access time and no cache (i.e. your typical
] microcomputer, at the moment), 32-bit fixed length instructions with a 64-bit
] instruction-fetch memory interface will blow it into the weeds come
] ultimate-performance time, because of the lack of instruction decode overhead.

There is NO intrinsic reason 16-bit instructions would decode slower than
32-bit instructions. In fact, they can ultimately decode FASTER :
the fewer bits your decoder has to look at, the faster it can be.
Barring other complications of course.

I think the assumption you're making is that a smaller instruction set
has to be more complex to get the job done. There are plenty
of examples of this, but it's NOT an immutable law.

] All a matter of keeping the memory interface bigger than the instruction
] size.... 

Dedicating 64 pins purely to instruction fetch (assuming a Harvard
architecture) is quite a lot of a rather scarce resource. Sure
you wanna do this on a micro ?

] I also have some papers here from the original RISC guys at UC-Berkeley, and
] AMD's design team, which discuss the issue at great length. Their basic
] conclusion is that the locality of reference of large caches means that 32-bit
] fixed length instructions are a big win even in the absence of a big memory
] interface. Of course, their basic problem was justifying the larger flow of
] instructions in a RISC machine as vs. a CISC machine, instead of specifically
] addressing variable-length vs. fixed-length instructions, but their
] conclusions still apply. At least, if you accept the basic premise of RISC,
] that is. 

The appropriate measure of cache size, IMHO, is in INSTRUCTIONS.
Given you have some limited number of transistors to put into
a cache, then the smaller your instructions are, the "bigger"
your cache will be.

Also, instruction size affects several "second-order" performance
factors, like how quickly a program loads from a "low-speed" (like
disk) I/O device and how often you page-fault. This effect
is of course due to the fact that programs written in a 32-bit
RISC instruction set will be (according to our data) 65% larger
than the same program in a 16-bit RISC instruction set.

Sorry, we haven't published our data yet. It's just an analysis
(using information-theory) of existing data anyway.

] Eric Lee Green  elg@usl.CSNET     Asimov Cocktail,n., A verbal bomb


--
    Dennis O'Connor			      UUNET!steinmetz!sunset!oconnor
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

bcase@Apple.COM (Brian Case) (03/02/88)

In article <449@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>In article <2116@saturn.ucsc.edu> haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
>>I'm glad somebody brought this up, and while we're also at it let's debate
>>the value of including 16 bit data types in RISC machines.
>	One reason should be enough: compatibility.  If you want another:
>Ram space.  Ram, especially for RISC computers, is less than cheap.  Priced
>any 20-ns SRAMs lately?  Also note that DRAM prices are goin UP lately, in
>contradiction to all established precedent.
>
>	And no, being able to load and store halfwords doesn't really
>slow a machine down.  All internal operations can be done in natural word
>size.

Not all RISC machines require 20ns SRAMs in order to have good performance.
One thing is for sure though: code size is almost always much less important
than data size.
DRAM prices are going up instead of down because of increased demand (and
it is starting to get a little tight everywhere, so I hear).  Have you
tried to get megabit DRAMs lately?  Yikes!
"Being able to load and store halfwords doesn't really slow a machine down"
is a statement that does not take into consideration certain reasonable
implementations.  The necessary alignment network can indeed slow a
machine's cycle.  Doing internal operations in natural word size does not
solve the whole (main?) problem.

lm@arizona.edu (Larry McVoy) (03/02/88)

In article <1697@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>-The UNIX kernel uses halfwords frequently, because it has large
>arrays of structs whose data must be tightly packed.  You might argue
>that some of these could be expanded (and they can), but others
>are painful.

You bet it's painful.  I recently worked on the Unix port to the ETA-10.
That machine does not support 16 bit ints.  Can you say OUCH?!?!  I knew
you could.
-- 

Larry McVoy	lm@arizona.edu or ...!{uwvax,sun}!arizona.edu!lm

bcase@Apple.COM (Brian Case) (03/03/88)

In article <9740@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>There is NO intrinsic reason 16-bit instructions would decode slower than
>32-bit instructions. In fact, they can ultimately decode FASTER :
>the fewer bits your decoder has to look at, the faster it can be.
>Barring other complications of course.

The first statement is true.  However, 32-bit instructions typically have
more bits dedicated to the opcode than 16-bit instructions.  This allows
32-bit instructions to have less-dense encodings and therefore faster
decodings.  For a real lesson in this aspect, see the Stanford MIPS-X and
Original Berkeley RISC instruction encodings.  Yeow!  The instruction decode
logic is almost impossible to see on the MIPS-X.

>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>architecture) is quite a lot of a rather scarce resource. Sure
>you wanna do this on a micro ?

Sounds like it might be a good idea.  Note that instruction-only-bus pins
are INPUT-only; thus their corresponding pads are much "faster" and simpler
than bidirectional bus pads.

>The appropriate measure of cache size, IMHO, is in INSTRUCTIONS.
>Given you have some limited number of transistors to put into
>a cache, then the smaller your instructions are, the "bigger"
>your cache will be.

This is quite true.  However, see the following.

>Also, instruction size affects several "second-order" performance
>factors, like how quickly a program loads from a "low-speed" (like
>disk) I/O device and how often you page-fault. This effect
>is of course due to the fact that programs written in a 32-bit
>RISC instruction set will be (according to our data) 65% larger
>than the same program in a 16-bit RISC instruction set.

Yes, the second order effects (not affects) can be very important in
certain environments.  As to code size differences:  I have a very large
program (30K lines of C code) that compiles to about 300K bytes of Am29000
code.  On the VAX, it compiles to about 225K bytes (yes, that's with PCC
and -O turned on and the 29K compiler is a wonderful thing from MetaWare).
That's roughly 33%.  That seems typical, although code size ratios can be
anywhere from 1.1 to over 2.0.  However, these kinds of percentages are not
terribly important unless they are much closer to 2 to 3 times as big.  The
question is "what is the cache miss ratio in real life?"  This is NOT
necessarily directly related to general code size!  One kind of
optimization that will be very important (and I hope commonplace) is loop
unrolling.  Yes, the code size ratios will still roughly scale, but the
point is that we are talking about, to some degree, space/time tradeoffs.
If you want a little less time, you can usually get it by giving up a little
space.  This works to a point.  Clearly, you wouldn't want your instruction
format to have one bit per register.

>Sorry, we haven't published our data yet. It's just an analysis
>(using information-theory) of existing data anyway.

I do hope you guys publish such data.  The more information the better!

hansen@mips.COM (Craig Hansen) (03/03/88)

In article <9740@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>There is NO intrinsic reason 16-bit instructions would decode slower than
>32-bit instructions. In fact, they can ultimately decode FASTER :
>the fewer bits your decoder has to look at, the faster it can be.
>Barring other complications of course.

>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>architecture) is quite a lot of a rather scarce resource. Sure
>you wanna do this on a micro ?

I guess I've been kicking around the RISC world too long, because I've
heard these arguments all before, when people were saying RISC
machines weren't going to be any faster than CISC machines, and they
were just as wrong then as they are now. Instruction bandwidth is
important, but not so important that you should go back to compacted
instructions. 32-bit instructions aren't much larger than 16-bit
instructions, particularly when a register-allocating compiler is
used, and the benefit to permitting parallel decoding of instructions
with register fetching is a tremendous win.

Generally, optimized MIPS code is about 10% to 50% larger than "optimized"
VAX code, as generated by 4.3 UNIX, and is often equal to or smaller in
size than optimized 68k code, as generated by the Sun compilers.

Here's some data on text size for these machines, noting that the
same source code is used for all machines. However, since the libraries
may be different, and the MIPS machine has a large page size,
the data on small benchmarks is a little distorted:

		mips	sun3	vax
	awk	 61k	106k	 34k
	ccom	193k	156k	120k
	compress 29k	 25k	 14k
	diff	 45k	 33k	 25k
	hspice	860k	729k	635k
	wolf	266k	303k	212k

...those big 32-bit instructions don't look so bad next to
the machines designed for compact encodings...

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com   408-991-0234

zs01+@andrew.cmu.edu (Zalman Stern) (03/03/88)

I believe that 16 bit instructions can be a win. Good code density saves disk 
space as well as memory space. Besides, the more instructions you get in N 
bits, the more performance your cache and bus bandwidth buy you. The trick is 
to keep from designing a CISC in the process. Mainly, multiple instruction 
lengths and hairy bit encodings should be avoided like the plague.

I actually tried to design an architecture using 16 bit instruction words. 
Being a complete amateur, I ended up with a complete mess. I did learn the 
following though:

    You lose on 32 bit immediate operands. This can be a performance loss.
    You do not have room for multiple forms of load (i.e. signed/unsigned half 
word and byte)
    More than 16 registers is hard.
    Addressing modes don't fit. (As the RISC advocates say, "So what?")

The RPM40 gets around these by using prefix instructions to extend immediates, 
a separate size register to specify non-word format, and an asymmetric 
register set to give you 21 registers. (Only 16 of the registers are 
accessible from every register operand slot.) Addressing modes are unnecessary 
since you can use some of the instruction bits you save for address 
calculation instructions. Basically, slick solutions to the above problems.

A good argument for 32 bit instructions is the MIPS R2000. Given a really 
tense compiler, having a full 32 entry register file may be worth the extra 
bits. Also, I recall seeing some statistics that a significant number of their 
load instructions (measured dynamically) use 16 bit offsets. (Notably to index 
off the global variable pointer.) There are also some special privileged 
instructions on the R2000 that make things like software page fault resolution 
go like hell. I don't see much room in a 16 bit instruction for stuff like 
this. (Although you could steal some bits from the coprocessor instruction on 
the RPM40.) In any event, the R2000 seems geared for large, reasonably fast, 
memory subsystems. In this case, speed is worth some wasted bits. 

One thing that is really useful is to be able to take a 32 bit constant and do 
something with it in 64 bits worth of instruction stream. For example, loading 
a value from an absolute (32 bit) address. The IBM RT, the MIPS R2000, and the 
RPM40 can all do this. The RT and the R2000 use one 32 bit instruction to load 
the high half of a register and a load with offset instruction to finish the 
job. The RPM40 uses three prefix instructions and a load. As near as I can 
tell, the AMD 29000 requires 3 32 bit instructions to do this. I find this 
useful because compilers should (almost) always generate 32 bit addresses. 
When we first compiled Scribe on an RT, it died during the link phase. Turned 
out that there was more than a megabyte of executable code and the compiler 
had chosen an instruction with a 20 bit absolute address... (This has long 
since been fixed.)
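
To make the two-instruction pattern concrete, here is the split written out in 
C rather than assembler (the address is arbitrary; whether the low half is 
sign-extended, and hence whether the high half needs adjusting, varies by 
machine and is glossed over here):

    #include <stdio.h>

    static void split32(unsigned long addr)
    {
        unsigned long hi = (addr >> 16) & 0xFFFFUL; /* immediate for "load high half" */
        unsigned long lo = addr & 0xFFFFUL;         /* offset field of the load/store */

        printf("hi=0x%04lx lo=0x%04lx rebuilt=0x%08lx\n",
               hi, lo, (hi << 16) + lo);            /* rebuilt == addr */
    }

    int main(void)
    {
        split32(0x00123ABCUL);                      /* hypothetical absolute address */
        return 0;
    }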

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890

gwu@clyde.ATT.COM (George Wu) (03/05/88)

In article <929@mtund.ATT.COM> bfb@mtund.UUCP (Barry Books) writes:
> . . .  However it had a 32 word stack cache which it seems to me is better
>than registers ( or even register windows which at least allow future
>versions to have more registers without recompiling to take advantage ).

> . . . The thing that impressed me the most was it didn't take
>a killer optimizing compiler ( read big, buggy and slow ) to make the
>machine run fast.  It seems for stack based langauges a stack cache is
>much more sensible than having a bunch of registers that the compiler
>has to figure out what to do with. . . .

     Both of these segments are based on the assumption that compilation takes
up a significant amount of time compared with other tasks. And this assumption
certainly seems natural, especially to developers.

     However, I find that hard to believe. Once a program is correctly
compiled, it will be run many more times than it was compiled, and in the
long run, you are better off optimizing the application code, instead of the
compiler.

     Even in a developer's environment, how often is your machine extended
such that code needs to be recompiled? Remember, your compiler had to be
compiled too. I would expect a good optimizing compiler would yield better
overall system performance here too.

     Overall, I'd expect that sacrificing your compiler's ability to optimize
code in return for a smaller, faster compiler would be a lose.

>
>Barry Books
>mtund!bfb		better is bigger

     Like you said, "better is bigger." :-)


-- 
					George J Wu	

UUCP: {ihnp4,ulysses,cbosgd,allegra}!clyde!gwu
ARPA: gwu%clyde.att.com@rutgers.edu or gwu@faraday.ece.cmu.edu

jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) (03/05/88)

In article <7538@apple.Apple.Com> bcase@apple.UUCP (Brian Case) writes:
>>Dedicating 64 pins purely to instruction fetch (assuming a Harvard
>>architecture) is quite a lot of a rather scarce resource. Sure
>>you wanna do this on a micro ?

>Sounds like it might be a good idea.  Note that instruction-only-bus pins
>are INPUT-only; thus their corresponding pads are much "faster" and simpler
>than bidirectional bus pads.

	Pads aren't the limiting factor, but PINS (and bandwidth) are.
The RPM-40 has 144 'pins' (not PGA, but leadless chip carrier), which was the
biggest certified package we could find.  We would KILL for more pins!
(Of course, by now there are probably larger certified packages.)

	Also, if this is NOT an embedded system, you have to worry about
bus bandwidth.  How many 40 MHz busses are there?  Not many.  By using
16-bit instructions we cut the bandwidth required (at least at the chip
edge, with corresponding drops farther down) for instructions from 160 megabytes
per sec to 80 MB/sec.  That means that on-board caches will fill twice
as fast (for the same number of instructions).

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/06/88)

An article by beowulf!lunge!jesup@steinmetz.UUCP says:
] 	Pads aren't the limiting factor, but PINS (and bandwidth) are.
] The RPM-40 has 144 'pins' (not PGA, but leadless chip carrier), which was the
] biggest certifed package we could find.  We would KILL for more pins!
] (Of course, by now there are probably larger certified packages.)
] 
]      //	Randell Jesup		      Lunge Software Development

Sorry, Randell, not quite right. The RPM40 CPU and FPU are packaged
in ** 132-pin ** leadless ceramic chip carriers, both for speed and so
that they could eventually be put in 132-pin surface-mount packages.

The impression of PGAs we had back at design time was that they were
slow, huge and NOT compatible with surface mounting. You wouldn't
want to put one in your space-borne BM/CCC processor, we thought.
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

bcase@Apple.COM (Brian Case) (03/07/88)

In article <479@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>	Also, if this is NOT an embedded system, you have to worry about
>bus bandwidth.  How many 40 Mhz busses are there?  Not many.  By using

There aren't many 80 MB/sec system buses either.  Even without an embedded
environment, you'll have to have dedicated memory paths or caches.

>16-bit instructions we cut the bandwidth required (at least at the chip
>edge, corresponding drops farther down) for instructions from 160 megabytes
>per sec to 80 MB/sec.  That means that on-board caches will fill twice
>as fast (for the same number of instructions).

Yes, but my guess is that the number of instructions isn't the same; that is,
there are more of your shorter instructions.  You're right, there won't be
twice as many, but, as I have said before, I believe the solution is not
to cut the bandwidth requirements but to provide the required bandwidth.
Note that since the RPM40 has a TIB, you could have interleaved your external
instruction memory two ways.  This would have:  loosened the
requirement for 16-bit instructions, required twice as many external
instruction chips, increased the size of your TIB (but not doubled it, since
the tag store doesn't increase in size), and required that the I-bus run
at 40 MHz instead of 20 MHz.  I suspect that twice the external RAMs would
be the problem.

jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (03/07/88)

In article <1757@mips.mips.COM> hansen@mips.COM (Craig Hansen) writes:
> Instruction bandwidth is
>important, but not so important that you should go back to compacted
>instructions. 32-bit instructions aren't much larger than 16-bit
>instructions, particularly when a register-allocating compiler is
>used, and the benefit to permitting parallel decoding of instructions
>with register fetching is a tremendous win.

	Who said we were going back to compacted instructions to get to 16
bits?  What we have is a lot FEWER instructions, and a minimal set of
formats for instructions.  With 32 bit instructions, we could have had fewer
formats (2 or 3 instead of 5 or 6), but since we can do a decode in a
single pipe stage, what does it matter?  The decoder does not determine
the critical path and cycle time; the ALU does.  If the decoder had slowed
us up, we would have made it faster and/or reduced the number of formats.
(Keeping the number of formats down and the fields aligned did play a role
in our architecture design.)

	I don't see how having a register allocating compiler affects
instruction size.

	Concerning parallel decode with register fetch, is this anything
unusual?  Our pipeline looks like this:

	<IF> Instruction fetch - doesn't really exist per se.
	<ID> Instruction Decode/register fetch
	<ALU> ALU operation
	<WB> WriteBack - write Alu result to register file
	[ I'm ignoring the extra load stages here ]

I don't see what we're losing here.

>Generally, optimized MIPS code about 10% to 50% larger than "optimized"
>VAX code, as generated by 4.3 UNIX, and is often equal or smaller in
>size than optimized 68k code, as generated by Sun compilers.

[ many figures deleted ]

>...those big 32-bit instructions don't look so bad next to
>the machines design for compact encodings...

	68020? compact?  Surely you jest!  :-)  I know, it actually is fairly
compact, at least the 68000 part of it.  It just has SO many instructions and
addressing modes, it ends up larger than one would suspect.

	The proper comparison is not to CISCs, but to a 16-bit version of the
same general architecture, or at least the same class (RISCs).

	I agree that if cost is no object, a 32-bit RISC can probably run
faster (effective throughput (VIPS), not MIPS) than a 16-bit.  However, the
costs mentioned include a higher-bandwidth bus, more disk space for code,
more memory space for code, larger (expensive) caches, more power draw (more
pins being driven), etc, etc.  The typical current solution for the
bus bandwidth problem is to throw MUCH bigger caches onto the CPU board, to
try to increase hit rates, and reduce bandwidth required of the bus.

>Craig Hansen
>Manager, Architecture Development
>MIPS Computer Systems, Inc.

Glad to see you in the conversation.  I'm interested in hearing your opinions.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

haynes@ucscc.UCSC.EDU (99700000) (03/08/88)

In article <22746@clyde.ATT.COM> gwu@clyde.UUCP (George Wu) writes:
>
>     However, I find that hard to believe. Once a program is correctly
>compiled, it will be run many more times than it was compiled, and in the
>long run, you are better off optimizing the application code, instead of the
>compiler.

Well, now, in the educational environment a program is compiled (not
correctly, of course) more times than it is run.  I guess this could
be an excuse for having separate "checkout" and "production" compilers,
or for having optimization be optional.
haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes

dave@onfcanim.UUCP (Dave Martindale) (03/13/88)

Another place where 16-bit data is important: computer graphics.

1) Although 8 bits/pixel/colour is most common for "full colour" images,
   an 8-bit linear encoding just isn't good enough for critical work.
   For relatively dark portions of the image, the quantization error
   due to 8 bits causes visible banding (Mach bands) or other errors
   (a rough numeric sketch follows after this list).

   For image storage, you can use better encoding strategies that make
   better use of 8 bits, but then you have to encode and decode the
   pixels every time you touch them (for compositing, etc.).  But just
   using 12 or 16 bits with linear coding is faster.  Pixar uses
   12 bits/colour internally.

   Also, when you interface to the physical world, the ADCs and DACs
   are always linear, so you are forced to work with more than 8 bits
   if you care about quality.  Our digital camera and film recorder
   are both 12 bits per colour per pixel.

2) 16 bits is a good word width for storing screen coordinates in
   display lists.  You always want to store as many of these as will
   fit in the physical memory available.
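
On the quantization point in (1), a rough back-of-the-envelope check (the pixel
value is just an assumed example):

    /* Relative step between adjacent codes in a dark region: this is what
     * the eye notices, and it shrinks as the number of bits grows.
     */
    #include <stdio.h>

    int main(void)
    {
        int dark = 10;      /* an assumed dark pixel value on an 8-bit scale */

        printf("8-bit  linear: %.1f%% jump to the next code\n", 100.0 / dark);
        /* the same intensity on a 12-bit scale is code 160 (10 * 16) */
        printf("12-bit linear: %.1f%% jump to the next code\n", 100.0 / (dark * 16));
        return 0;
    }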


Any machine that stores a C "short" in 4 bytes is going to make
my image data files twice as big, and allow only half the length of
display lists in memory.  I wouldn't buy it.

Any machine that stores a "short" in 2 bytes but requires multiple
instructions (shift/mask) to access the data is going to have a
difficult time executing the inner loops of some algorithms fast
enough to benchmark comparably to another machine of similar cost
which does support 16-bit data types.

I don't use 16-bit words for many types of data, but I use them in
quantities of millions when I do.

bcase@Apple.COM (Brian Case) (03/15/88)

In article <15580@onfcanim.UUCP> dave@onfcanim.UUCP (Dave Martindale) writes:
>Any machine that stores a C "short" in 4 bytes is going to make
>my image data files twice as big, and allow only half the length of
>display lists in memory.  I wouldn't buy it.

It's the compiler making the decision, but I definitely do understand
that the machine plus compiler make a pretty tightly coupled unit.  If
only one compiler is available, any compiler decisions look like architecture
decisions.

>Any machine that stores a "short" in 2 bytes but requires multiple
>instructions (shift/mask) to access the data is going to have a
>difficult time executing the inner loops of some algorithms fast
>enough to benchmark comparably to another machine of similar cost
>which does support 16-bit data types.

Unless the second machine is slower *because* it supports 16-bit
data types directly (or slower for other reasons).  It should not be
assumed that accessing small data types (16-bit or 8-bit types) with
a sequence of instructions is always slower.  Sometimes it will be,
sometimes it won't.  I won't argue that it's faster (unless the basic
cycle is faster).

bron@olympus.SGI.COM (Bron C. Nelson) (03/15/88)

This topic has gone on for a while, but I haven't noticed (or managed
to miss) the answer to the question I consider important: i.e., how
hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit
only?

Several respondents have said "it's expensive" or "it takes more time
to decode the different formats" and even "fetching the registers in
parallel with doing the instruction decode is a big win."  (None of
these are probably exact quotes.)  I sorta wonder at this.  Is the
instruction decode/register fetch a critical path?   From what I
can gather, the register fetch probably IS.  Can more hardware be
thrown at the problem to allow multiple formats?  If not, how
expensive (time-wise) is it really to provide?  It seems that if done
"right" (oh no! not that word!) it would only add 1 gate delay
(see below).  This is maybe 10%?  (What the heck IS the length of the
critical path (in gates) of your favorite CPU?)  At worst it would
add another stage to the pipeline (plus the associated hardware to
support an extra stage).  How expensive is that?  Maybe 10%??

I'm just trying to get a feel for how much cpu performance people think
would have to be given up to get the more compact encoding.
-----------------------------------------------------------------------
Bron Nelson   bron@sgi.com
Don't blame my employers for my opinions.

p.s.  If we only have 2 formats, we can specify which one by using the
first bit in the instruction (much like CRISP uses the first bit(s)).
This should (?) let us select between the 2 possible register
encodings with only a single additional gate delay (and some more
silicon devoted to doing it).  What we buy is a 25%+ reduction in
program code size.  This seems like a good tradeoff to me, since most
programs I run take longer to load off disk than they do to execute.
I admit my experience may not be typical, and taking a performance hit
may not be a smart marketing decision for a CPU house, but it seems
like a good system tradeoff.
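
A minimal sketch of what that decode might look like (the field positions below
are invented, just to show that one leading bit steers the register-field
selection):

    #include <stdio.h>

    static void decode(unsigned long insn)
    {
        unsigned long op, rd;

        if (insn & 0x80000000UL) {          /* format bit set: 32-bit form   */
            op = (insn >> 26) & 0x1FUL;
            rd = (insn >> 21) & 0x1FUL;
            printf("32-bit form: op=%lu rd=%lu\n", op, rd);
        } else {                            /* format bit clear: 16-bit form */
            op = (insn >> 27) & 0x0FUL;
            rd = (insn >> 22) & 0x1FUL;
            printf("16-bit form: op=%lu rd=%lu\n", op, rd);
        }
    }

    int main(void)
    {
        decode(0x84200000UL);               /* arbitrary example words */
        decode(0x10400000UL);
        return 0;
    }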

marc@oahu.cs.ucla.edu (Marc Tremblay) (03/16/88)

In article <12705@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>
>Several respondents have said "its expensive" or "it takes more time
>to decode the different formats"  and even "fetching the registers in
>parallel with doing the instruction decode is a big win."  (None of
>these are probably exact quotes.)  I sorta wonder at this.  Is the
>instruction decode/register fetch a critical path?   From what I
>can gather, the register fetch probably IS.  

The processor that we are designing here at UCLA has a large register
file (64 locals + 10 globals) with overlapping windows. Because of the
fault-tolerance aspect of the chip, we have to decode the full 
address (7 bits) of the register operands every time an access is made
to the register file. The window pointer cannot be pre-decoded.
This means that our critical path is composed of a fairly long decode
and of a quite long register fetch. It turned out that by using an
efficient decoder (dedicating some extra area), the decode part
represents about 12% of the critical path.
The fact is that using a simple instruction encoding we are able
to overlap the register decoding/fetching with the instruction
decoding done by a separate unit. Adding complexity to the register
address decoding would affect the critical path directly.

>p.s.  If we only have 2 formats, we can specify which one by using the
>first bit in the instruction (much like CRISP uses the first bit(s)).

I assume that this "first" bit is available for encoding... 

>This should (?) let us select between the 2 possible register
>encodings with only a single additional gate delay (and some more
>silicon devoted to doing it).  

Using a tighter encoding, the time spent decoding the rest of the 
instruction (opcode) may turn out to shift the critical path to 
the control unit instead of the register file.

					Marc Tremblay
					marc@CS.UCLA.EDU
					...!(ihnp4,ucbvax)!ucla-cs!marc
					Computer Science Department, UCLA

henry@utzoo.uucp (Henry Spencer) (03/23/88)

> how hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit only?

Of course, if you cache decoded instructions, as some machines do, the
decoding overhead becomes considerably less important.

bcase@Apple.COM (Brian Case) (03/24/88)

In article <1988Mar22.202304.3077@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>> how hard/expensive is it to decode 16 & 32 bit instructions vs 32 bit only?
>
>Of course, if you cache decoded instructions, as some machines do, the
>decoding overhead becomes considerably less important.

But miss penalties increase.  This is not trivial.