[comp.arch] unaligned accesses and words-versus-bytes (long, much data)

mash@mips.UUCP (04/12/87)

In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>
>If you are saying "Computers must have memory systems which allow arbitrary
>accesses, including unaligned and smaller-than-natural" then I disagree
>completely.  If you are simply saying that there must be some way to get
>at smaller-than-natural and unaligned things, then clearly, almost
>tautologically, you are right.  While the Am29000's ability to support a
>word-oriented external memory system and still allow byte and half-word
>accessibility isn't perfect (things still get tough (probably have to use
>the on-chip funnel shifter) when words are spanned), it is a step in the
>right direction (and, BTW, not really original:  A very similar thing was
>proposed (and I suppose implemented) for the original Stanford MIPS
>processor).
>
>Ok, I think maybe it *is* time to start the discussion of word-addressed
>vs. byte-addressed memory systems.  John Mashey, care to start us off?
>
Well, I've been staying away from this, but since you asked, OK.
But Brian, I'm sad to say you may not be happy when you're finished
reading this.

First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
analysis of the issues, so I won't repeat that, except:
"Re bus width, byte extraction, unaligned operands, and memory speed:
  1) Byte extraction from words should be free in time; it'll cost a few gates.
	Basically this requires one or more cross-couplings in the memory path.

Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
penalty for having load-byte, load-byte unsigned, load-half (signed and
unsigned), and even load-partial-word (left and right) for dealing with
unaligned operands.  It takes some silicon, but it didn't add to the
critical path. [Don't ask me how this is done, but I assure you it is
possible.]

Now, for some history: Brian earlier cited the 1982 paper "Hardware/Software
Tradeoffs for Increased Performance" by Hennessy, et al, which argued
fairly heavily for word-addressed memory with byte extract/insert.
Now, there are the following facts:
	a) Although Stanford MIPS did this, MIPSco didn't, even though
	it certainly had some people in common. There were specific
	reasons that we did this, including some observations I'd
	made while tuning C compilers for 68Ks. Also, the Stanford MIPS
	crew was mostly VLSI & compiler folks; the MIPSco crew included
	more board designers and operating systems people....
	b) Another offshoot of all of this was word-addressed DEC Titans
	[again, with some Stanford-related personnel].  Most of them have been
	converted since to byte-addressed ones...
	c) There is a "final MIPS evaluation report", which has been
	submitted to TOCS, I think.  Briefly, here's what it concludes on
	this topic:

		1. Memory should be byte-addressed.  Word-addressing [as
		opposed to word-paths to memory] is just too painful
		for software.
		2. Newer statistics seem to say that in the choice of
		word+extract/insert operations versus byte/half operations,
		the (previous) belief that the word+extract/insert choice
		is superior is probably wrong, especially because current
		compiler technology just isn't able to take advantage of
		those instructions well enough to make up the difference.
	I.e., they have changed their minds, at least somewhat.

The remainder of this will deal with some structural reasons why word
addressing with extract/insert is painful in certain environments,
followed by a bunch of statistical analyses that describe the performance
loss MIPS machines would suffer if we did it that way.

A. Structural reasons: (this is mostly systems, maybe not some controllers)
1) Memory system design:
	a) Memory system designs with dual-porting of memory, including
	an I/O bus, usually have to respond to partial-word operations
	to keep I/O controllers happy.  Having done that, it is perfectly
	feasible to deal with byte writes, either cheaply, with parity,
	or somewhat more expensively with ECC.
	b) Some systems use block-oriented buses, often with write-back
	caches.  If the system is doing write-back for you, doing
	load-word [causing WB-cache to fetch the cache line, if needed]
	insert byte
	store-word
	VERSUS JUST:
	store-byte [causing cache line to be fetched, if necessary]
	sure looks like there is at least a 2-cycle hit, maybe a 3-cycle
	hit, if you don't have 1-cycle cache accesses.
	c) Some systems use write-thru caches with write-buffers
	[VAX-780, I assume 8700s, etc, although not 8600/8650].
	Sometimes the write-buffers gather contiguous bytes, then send
	a whole word to memory. Again, having code that does lw/insert/sw
	just adds cycles.

2) I/O system design.  This is clearly not true of all systems, but
you run into it often enough. IT IS OFTEN EXCEEDINGLY INCONVENIENT
TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
DEVICE CONTROLLERS.  Device controllers and interface chips often
have all sorts of weird layouts of their control registers [read
Lyon & Skudlarek's wonderful "All the Chips That Fit" article, in
The Feb 86 UNIX Review, or earlier USENIX proceedings, for example].
Unfortunately, such things do NOT have the semantics of normal memory:
sometimes you cannot do extra fetches or stores without screwing things
up.  Here's a simple example:
	struct ... {
		ushort	silo;
		ushort	status;
		....
The silo gives you a new character each time it's read.  Now, if the
natural C compiler behavior, when you write:
	if (status & FLAG)
is to fetch the word that contains both silo and status, you are
in trouble.  Sometimes there are ways to code around this junk, and sometimes
not, depending on the device.  In any case, devices are already awful
enough to deal with, without making one drop into assembler, rather than
C.  Remember, the layouts of these are whatever they are, and if you want
to use what may be otherwise very desirable controllers, this is what
you get.  Is this a weird example? I found several in 5 minutes.
Other OS folks might comment on whether we just happened to pick
controllers that are like this, or not.  In any case, people want to
describe register layouts in C structures, and be done with it.

As I read the 29000 specs, maybe it would be possible to use both modes,
where main memory uses word+insert/extract, and the I/O path has the
alignment network, and uses the load/store control fields to yield
partial-word ops.  It will be interesting to see how the C compiler
compiles a device driver that uses both memory and I/O addresses...
There's probably some way around it, but I do believe that it's more than
picking up an off-the-shelf controller and its associated driver,
making a few tweaks, and running it.

B. Performance reasons.
Domain: running UNIX and UNIX programs well.

1. Some qualitative observations:

When I was at CT, I spent a bunch of time tuning 68K C compilers.
In particular, I looked at the prevalence of code like:
	move byte to register, extend, extend
	move byte to register, and with 255 to get byte alone OR
	clear register, move byte to register
I was able to get noticeable improvements in at least some programs
by optimizing away some of the unnecessary cases.  It was sure clear that
a lot of cycles were burnt by the extends, or the and/clear, i.e.,
one really wished for load byte [signed or unsigned].

If simulations are based only on user-level programs, you can get
some horrible surprises when you see what UNIX kernels do.  For example,
are halfword operations really necessary?
ANS: not if you look at their frequency in most UNIX C programs.
ANS: if you look at kernel: you bet! many kernel structures are packed
for efficiency, some are packed for necessity (you should see the pile
of halfword operations in Ethernet code... and you CANNOT sanely get
rid of them without rewriting everything).

2. Some quantitative observations.  As most people in this newsgroup
know, we do a huge amount of simulation on very large programs
to analyze performance, look at different board designs and future
chip tradeoffs.  We get complete instruction traces, so we get outputs
that look like:
Summary:
15179647 cycles (1.9s @ 8MHz)
14912988 instructions
2950828 basic blocks
176361 calls
2514780 loads
1521285 stores
900335 byte references
266659 multiply/divide interlock cycles
0 flops (0 mflop/s @ 8MHz)
0 floating point interlocks
0 1 cycle interlocks that become 2 cycle stalls
0 overlapped floating point cycles

0.17 loads per instruction
0.1 stores per instruction
0.38 stores per memory reference
0.22 byte references per reference
0.15 branches per instruction
0.2 backward branches per branch
5.1 instructions per basic block
85 instructions per call
0.12 nops per instruction

    spec    5529383        37%
      lw    1783177        12%
      sw    1299709       8.7%
    addu    1170190       7.8%
	......
(plus a lot of stats)
Thus, we have really precise statistics on what's going on, at least on
our machines, at the user-level, for anything from typical UNIX programs
(like nroff), to large simulators [spice, espresso],
parts of the compiler system [assembler, optimizer, debugger],
to benchmarks like whetstone, dhrystone, linpack.
I think one can find a gross cost [to us, in our architecture, with no
necessary applicability to others] in user programs, as follows,
if we had done byte extract/insert, instead of what we did:

For each partial-word load, add 1 cycle. (for the extract)
For each partial-word store, add 2+N cycles (where you have a load and
an insert added, and where N (which might be 0) is the extra actual cycle
cost to get data from the cache, noting that some of the cost might be
taken care of by pipeline scheduling).

So here are a few examples: I'll give the % of instruction cycles
for each instruction, and compute the penalty by using N=0.
I'll ignore numbers too small to matter much.

as1 (assembler 1st pass)
opcode	%	penalty (dynamic)
lbu	4.6%	4.6%
sb	1.5%	3.0%
lh	0.27%	0.27%
lhu	0.07%	0.07%
sh	0.02%	0.02%
TOT		8%  penalty in instruction cycles, assuming N=0 (best case)
There is also a static code-size penalty; I'll only do one since I don't
think this is a major issue, but it is interesting:
opcode	%	penalty (static)
lbu	4.7%	4.7%
sb	3.2%	6.4%
lh	0.27%	0.27%
sh	0.14%	0.28%
TOT		11.6%
Note the significance of the static numbers: the byte operations are all over
the place, i.e., the dynamic counts aren't substantial just because they're
in strcpy or something like that [actually we have tuned routines anyway],
but because there's partial word code all over the place.

Now, this is an ultra-simplistic analysis, because there are things like:
write-buffer effects, cache effects, memory system interference,
pipeline scheduling, etc, etc.  Consider this a first approximation.

Now, a few more examples:

Dhrystone:
lbu	6.9%	6.9%
lb	4.7%	4.7%
lwl	1.2%	1.2%	(unaligned word stuff)
lwr	1.2%	1.2%
sb	0.43%	0.8%
swl	0.14%	0.3%
TOT		15.1%
static:
lbu	1.6%	1.6%
lb	1.1%	1.1%
sb	1.1%	2.2%
lh	0.54%	1.1%
lwl	0.27%	0.5%
lwr	0.27%	0.5%

Note what's going on: Dhrystone does a lot of character manipulation
(dynamically), but statically, it has very few partial word instructions,
i.e., unlike more normal programs, it does NOT have character-pushing
sprinkled around.

Observations on a few other benchmarks:
most of the floating-point ones didn't do much partial-word computing,
although I ran into one that had 9.9% (!) lh's, plus 2.2% sh's,
i.e., for at least a 15% penalty.
Most of those that did any character-pushing at all would take a 5-10%
hit. [ccom, compress, terse (a simulator)]
nroff was only about a 2% hit, surprisingly.

In general, I'd expect the UNIX kernel to have a reasonably high
(both static and dynamic) use of partial-word ops. We'll see soon.

(This has nothing to do with word-vs-byte, but I ran across it in
looking at these numbers).
QUIZ: how many load/stores use 0-displacements off the base register,
rather than non-zero ones?

ANS: a few were around 50%.
	most were in the 10-20% range.
	some were down in the 5-10% range.
	Dhrystone: 50%
I.e., Dhrystone uses zero-offset addressing considerably more than
most programs, although not more than all programs. [Relevant to 29000
discussion, if you remember how they did things.]

WHEW.  That was a lot of info.  Sorry about that, but architectural
arguments cannot be settled by intuition.  Note again that these are
the numbers we get, and you cannot analyze choices in a vacuum,
so they may or may not be relevant to other architectures and software.
In our case, this does say:
	a) Byte instructions are a substantial win on many real programs.
	b) Non-zero offsets are frequently used.
and finally, for everybody:
	c) Be very, very careful on WHICH benchmarks you use to tune
	your architecture.  DON'T use Dhrystone.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

howard@cpocd2.UUCP (04/15/87)

In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
>  1) Byte extraction from words should be free in time; it'll cost a few gates.
>	Basically this requires one or more cross-couplings in the memory path.
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
>penalty for having load-byte, load-byte unsigned, load-half (signed and
>unsigned), and even load-partial-word (left and right) for dealing with
>unaligned operands.  It takes some silicon, but it didn't add to the
>critical path.

I don't wish to dispute John's excellent (and voluminous) analysis, but it
is worth mentioning that Geoff seems to be suffering from the normal logic
designer's misconception that you measure complexity/cost in gates.  On ICs,
gates are extremely cheap and small; it is wires (communication over distance)
that are expensive.  Byte extraction can frequently be performed with the same
circuitry that does shifts, but fast shifters are area-expensive (expansive?)
in 2 dimensions.  You don't want to use them without serious consideration.
Circular shifts are particularly expensive (they can double shifter size), but
fortunately neither byte extraction nor C requires them.
-- 
Copyright (c) 1987 by Howard A. Landman.  You may copy this material for
any non-commercial purpose, or transmit this material to others and charge
for such transmission, as long as this notice is retained and you place no
additional restrictions on retransmission of the material by the recipients.

hansen@mips.UUCP (Craig Hansen) (04/17/87)

In article <576@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
> >First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
> >analysis of the issues, so I won't repeat that, except:
> >"Re bus width, byte extraction, unaligned operands, and memory speed:
> >  1) Byte extraction from words should be free in time; it'll cost a few gates.
> >	Basically this requires one or more cross-couplings in the memory path.

> I don't wish to dispute John's excellent (and voluminous) analysis, but it
> is worth mentioning that Geoff seems to be suffering from the normal logic
> designer's misconception that you measure complexity/cost in gates.  On ICs,
> gates are extremely cheap and small; it is wires (communication over distance)
> that are expensive.  Byte extraction can frequently be performed with the same
> circuitry that does shifts, but fast shifters are area-expensive (expansive?)
> in 2 dimensions.  You don't want to use them without serious consideration.

As it turns out, the load aligner didn't pay an area penalty for the
horizontal shift wires, as the two relevant busses cross each other, for
reasons unrelated to alignment, right next to the load aligner.  The store
alignment is accomplished by the shift unit already, so it didn't cost extra
area either. (It did cost a few gates, but it was well worth it.) Howard's
statement is true in general: wires are expensive, and gates are relatively
cheap, but clever layout makes the best possible use of expensive resources.

I
really
enjoy
trying
to
be
wordy
in
my 
responses
to
articles
posted.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen