[comp.arch] RISC vs unaligned data

andrew@frip.wv.tek.com (Andrew Klossner) (04/01/89)

[]

	"the historical trend is to be progressively more tolerant of
	misalignment, e.g. IBM /360 /370, Motorola 68K families. All
	the "tolerant" machines always attach a *penalty* to
	misalignment. It is only the very recent crop of so-called RISC
	chips that is requiring alignment again."

Many contributors to this discussion seem to hold the opinion that, if
unaligned access isn't supported by hardware, it isn't supported at all.  But
one of the points of RISC is to move complexity from hardware to
software.  Why not just let the compiler do it?

If the compiler knows the alignment of a word (the low two bits of the
address are a compile-time constant, as for an unaligned word within an
aligned structure), it can do a (slightly) better job than if it is
totally clueless about the runtime address.  PL/I provided the
"UNALIGNED" attribute, used to advantage on the 360/370 machines.  A
system supplier willing to extend their C language could add a similar
construct to C.
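One vendor-neutral way such a construct can be spelled in C is a copy
through memcpy, letting the compiler pick whatever load sequence the
known alignment allows.  A sketch of the idea only; load_u32 is my own
name, not any vendor's actual UNALIGNED extension:

```c
#include <stdint.h>
#include <string.h>

/* Fetch a word that may be unaligned: copy the bytes and let the
 * compiler choose the load sequence.  (An illustration of the idea,
 * not a real vendor extension.) */
static uint32_t load_u32(const void *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);   /* byte-wise copy; no alignment assumed */
    return w;
}
```

The result is in native byte order; a compiler that can prove p is
aligned may reduce the copy to a single load.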

For example, on the 88k, an architecture that doesn't have particularly
good support for unaligned data, the compiler might generate code like
this to fetch a word from an address that it knows will be odd:

				; address of unaligned word to fetch is in r10
	ld.bu	r1,r10,0
	ld.hu	r2,r10,1
	ld.bu	r3,r10,3
	mak	r1,r1,8<24>
	mak	r2,r2,16<8>
	or	r1,r1,r2
	or	r1,r1,r3
				; word is in r1

If the word is in the data cache, this takes seven cycles and wastes
two scratch registers (r2 and r3).  (The code to fetch from an even but
unaligned address takes five cycles.)  With hardware support it could
do a better job ... but is it necessary to fetch an unaligned word in
fewer than seven cycles?  That fetch takes fewer nanoseconds than it
does on the modern, unalignment-forgiving CISC machine that I'm typing
this on, which after all is the bottom line in RISC vs CISC.
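For readers without an 88k manual at hand, the byte/halfword/byte
composition above can be sketched in portable C.  This is my own
illustration, not compiler output; fetch_odd_word is a made-up name,
and big-endian byte order is assumed, as on the 88k:

```c
#include <stdint.h>

/* Compose a big-endian 32-bit word starting at an odd address,
 * mirroring the 88k sequence: byte, (now aligned) halfword, byte,
 * then shift-and-or.  The halfword here is built from two byte
 * reads so the sketch stays portable C; on the 88k itself, p+1 is
 * halfword-aligned, so ld.hu fetches it in one instruction. */
static uint32_t fetch_odd_word(const uint8_t *p)   /* p assumed odd */
{
    uint32_t b0 = p[0];                          /* ld.bu r1,r10,0 */
    uint32_t h1 = ((uint32_t)p[1] << 8) | p[2];  /* ld.hu r2,r10,1 */
    uint32_t b3 = p[3];                          /* ld.bu r3,r10,3 */
    return (b0 << 24) | (h1 << 8) | b3;          /* mak, mak, or, or */
}
```

Given the bytes 12 34 56 78 starting at an odd address, this yields
0x12345678.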

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

mdr@reed.UUCP (Mike Rutenberg) (04/01/89)

Andrew Klossner writes:
>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
	[code example]
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3).  (The code to fetch from an even but
>unaligned address takes five cycles.)  With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles?  That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.


The main problem is that the seven-instruction sequence you gave takes
up i-cache space and, as you indicated, needs registers.  If the
processor can deal with unaligned accesses directly, even with the
performance penalty that implies relative to aligned data, you may get
better i-cache performance.  This becomes a bigger deal with larger
programs and more unaligned data.

The Intel 80960KA has a nice memory interface that is fast and allows
unaligned data references.  Among the features that assist unaligned
accesses is a three-entry FIFO of outstanding memory write requests,
each 1-8 bytes (a similar FIFO handles read requests).

Mike
-- 
Mike Rutenberg      Reed College, Portland Oregon     (503)239-4434 (home)
BITNET: mdr@reed.bitnet      UUCP: uunet!tektronix!reed!mdr
Note: These are personal remarks and represent no known organization --mdr

mash@mips.COM (John Mashey) (04/02/89)

In article <11222@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew Klossner) writes:
>[]

>Many contributors to this discussion seem to hold the opinion that, if
>unaligned access isn't supported by hardware, it isn't supported at all.  But
>one of the points of RISC is to move complexity from hardware to
>software.  Why not just let the compiler do it?

This is a reasonable generic argument.  As usual, whether it's a good
idea or not depends on the numbers.

>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
....
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3).  (The code to fetch from an even but
>unaligned address takes five cycles.)  With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles?  That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.

It would be useful to cite: 1) the cycle counts for stores,
2) the cycle counts for loads/stores where the compiler has no idea
(FORTRAN call-by-reference, for example).

I don't have much data on this, but I've heard that we've seen FORTRAN
programs take a 10-15% hit when compiled with the
"unaligned-forgiving" attribute (I don't know which one; there are
several).  Let's try some back-of-the-envelope numbers.
Let's suppose it costs us 1-2 extra cycles per load or store.
This means that if we're seeing 10-15%, then somewhere between
5% and 15% (i.e., 10/2 to 15/1, to get the maximum range) of the
instructions are incurring this penalty, which is about 16-50% of the
load/store instructions typical for such programs.
Then, if the average penalty is N cycles, one is looking at a
first-order estimate of additional run time (beyond the optimal base of
1.0) of .05*N to .15*N.  Suppose N == 7, which gives a range of .35 to
1.05 extra time.  The code size would also expand, although probably
not as much.

Of course, if N == 100 (if you were doing it with exceptions, perhaps),
you now get +5 to +15, i.e., 6-16X slower, which is clearly ungood, and
only survivable for debugging.
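The back-of-the-envelope model above can be replayed in a few lines of
C.  Nothing new here, just the post's own arithmetic; run_time is my
name, and the inputs are the post's guesses, not measurements:

```c
/* Mashey's model: if a fraction f of all instructions pays an
 * average penalty of N cycles, run time grows from an optimal base
 * of 1.0 to 1.0 + f*N. */
static double run_time(double f, double N)
{
    return 1.0 + f * N;
}
```

The 10-15% observed slowdown at 1-2 cycles per penalized instruction
brackets f between 10/2 = 5% and 15/1 = 15%; run_time(.05, 7) through
run_time(.15, 7) gives 1.35 to 2.05, and the same fractions at
N == 100 give 6.0 to 16.0, the 6-16X above.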

What this says is:
1) As Andrew says, you can maybe survive by generating the extra code
to do this.
2) Depending on your code sequences, and the frequency of this problem,
you might get away with a 10-15% hit (as on MIPS, with unaligned-access
instructions), or, in more typical cases, looking at these numbers, I'd
guess that a 50% hit would be typical, if I had to pick a single number.
A 50% hit is either a) irrelevant, or b) Very Important, depending on
what you're doing.
50% hits in big CAD crunchers are sometimes considered Bad....

Since this is done from rather minimal input, maybe somebody with real
data might choose to post it?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

cquenel@polyslo.CalPoly.EDU (46 more school days) (04/04/89)

In article <16407@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Of course, if N == 100 (if you were doing it with exceptions, perhaps),
>you now get + 5-15, i.e., 6-16X slower, which is clearly ungood, and
>only survivable for debugging.

	Wait a sec, aren't you comparing the proverbial
	mis-matched fruits here ?

	The hit you take from compiler-generated "safe" code
	is paid whether the specific reference is actually
	aligned or not.  As you mention, in FORTRAN, things
	in a common block can have arbitrary alignment in
	THEORY, but in practice the words are often aligned.

	If you are "doing it with exceptions" then the exception
	is only taken when the word is, in fact, mis-aligned.

	Using exceptions to handle mis-aligned data isn't
	quite as bad as your (admittedly back-of-the-envelope)
	figure of 6-16X would suggest.
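	One way to make that concrete, using the thread's rough
	figures (assumptions, not measurements; the function names
	are mine): the inline sequence costs ~6 extra cycles on
	EVERY suspect reference, while a trap costs ~100 cycles but
	only on the fraction p that really is misaligned.

```c
/* Expected extra cycles per possibly-misaligned reference,
 * under the thread's assumed costs. */
static double inline_extra(double p) { (void)p; return 6.0; }  /* always paid */
static double trap_extra(double p)   { return p * 100.0; }     /* paid on misses only */
```

	Break-even falls at p = 6/100: when fewer than about 6% of
	the suspect references are actually misaligned, taking the
	exception is cheaper on average.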

	--chris
-- 
@---@  ------------------------------------------------------------------  @---@
\. ./  | Chris Quenelle (The First Lab Rat) cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |                + good; ++ good;     -- Annie Lennox            |   \ / 
==o==  ------------------------------------------------------------------  ==o==

mash@mips.COM (John Mashey) (04/04/89)

In article <9931@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU (46 more school days) writes:
>In article <16407@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Of course, if N == 100 (if you were doing it with exceptions, perhaps),
>>you now get + 5-15, i.e., 6-16X slower, which is clearly ungood, and
>>only survivable for debugging.
>
>	Wait a sec, aren't you comparing the proverbial
>	mis-matched fruits here ?
....
>	Using exceptions to handle mis-aligned data isn't
>	quite as bad as your (admittedly back-of-the-envelope)
>	figure of 6-16X would suggest.

Good point!  It's certainly not that bad in practice.  I can't remember
exactly how bad it really was, except that we had a lot of pressure
to get the compiler support in, AFTER the handle-by-exception code was
already in use.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086