andrew@frip.wv.tek.com (Andrew Klossner) (04/01/89)
[] "the historical trend is to be progressively more tolerant of misalignment, e.g. IBM /360 /370, Motorola 68K families. All the "tolerant" machines always attach a *penalty* to misalignment. It is only the very recent crop of so-called RISC chips that is requiring alignment again." Many contributors to this discussion seem to hold the opinion that, if alignment isn't supported by hardware, it isn't supported at all. But one of the points of RISC is to move complexity from hardware to software. Why not just let the compiler do it? If the compiler knows the alignment of a word (the low two bits of the address are a compile-time constant, as for an unaligned word within an aligned structure), it can do a (slightly) better job than if it is totally clueless about the runtime address. PL/I provided the "UNALIGNED" specifier to advantage on the 360/370 machines. A system supplier willing to extend their C language could add a similar construct to C. For example, on the 88k, an architecture that doesn't have particularly good support for unaligned data, the compiler might generate code like this to fetch a word from an address that it knows will be odd: ; address of unaligned word to fetch is in r10 ld.bu r1,r10,0 ld.hu r2,r10,1 ld.bu r3,r10,3 mak r1,r1,8<24> mak r2,r2,16<8> or r1,r1,r2 or r1,r1,r3 ; word is in r1 If the word is in the data cache, this takes seven cycles and wastes two scratch registers (r2 and r3). (The code to fetch from an even but unaligned address takes five cycles.) With hardware support it could do a better job ... but is it necessary to fetch an unaligned word in fewer than seven cycles? That fetch takes fewer nanoseconds than it does on the modern, unalignment-forgiving CISC machine that I'm typing this on, which after all is the bottom line in RISC vs CISC. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
mdr@reed.UUCP (Mike Rutenberg) (04/01/89)
Andrew Klossner writes:
>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
	[code example]
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3).  (The code to fetch from an even but
>unaligned address takes five cycles.)  With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles?  That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.

The main problem is that the seven-instruction sequence you gave takes
up i-cache space and, as you indicated, needs registers.  If the
processor can deal with unaligned accesses in hardware, even with the
performance hit that clearly implies relative to aligned data, you may
get better i-cache performance.  This becomes a bigger deal with larger
programs and lots of unaligned data.

The Intel 80960KA has a nice memory interface that is fast and allows
unaligned data references.  Among the things that assist unaligned
accesses is a three-entry FIFO of outstanding 1-8 byte memory write
requests (a similar FIFO holds read requests).

Mike
-- 
Mike Rutenberg    Reed College, Portland Oregon    (503)239-4434 (home)
BITNET: mdr@reed.bitnet        UUCP: uunet!tektronix!reed!mdr
Note: These are personal remarks and represent no known organization --mdr
mash@mips.COM (John Mashey) (04/02/89)
In article <11222@tekecs.GWD.TEK.COM> andrew@frip.wv.tek.com (Andrew
Klossner) writes:
>[]
>Many contributors to this discussion seem to hold the opinion that, if
>alignment isn't supported by hardware, it isn't supported at all.  But
>one of the points of RISC is to move complexity from hardware to
>software.  Why not just let the compiler do it?

This is a reasonable generic argument.  As usual, whether it's a good
idea or not depends on the numbers.

>For example, on the 88k, an architecture that doesn't have particularly
>good support for unaligned data, the compiler might generate code like
>this to fetch a word from an address that it knows will be odd:
....
>If the word is in the data cache, this takes seven cycles and wastes
>two scratch registers (r2 and r3).  (The code to fetch from an even but
>unaligned address takes five cycles.)  With hardware support it could
>do a better job ... but is it necessary to fetch an unaligned word in
>fewer than seven cycles?  That fetch takes fewer nanoseconds than it
>does on the modern, unalignment-forgiving CISC machine that I'm typing
>this on, which after all is the bottom line in RISC vs CISC.

It would be useful to cite:
1) the cycle counts for stores,
2) the cycle counts for loads/stores where the compiler has no idea
   (FORTRAN call-by-reference, for example).

I don't have much data on this, but I've heard that we've seen FORTRAN
programs take a 10-15% hit when compiled with the "unaligned-forgiving"
option (I don't know which one; there are several).

Let's try some back-of-the-envelope numbers.  Suppose it costs us 1-2
extra cycles per load or store.  If we're seeing 10-15%, then somewhere
between 5-15% (i.e., 10/2, 15/1 to get the maximum range) of the
instructions are incurring this penalty, which is about 16-50% of the
load/store instructions typical for such programs.

Then, if the average penalty is N cycles, one is looking at a
first-order estimate of additional run time (beyond the optimal base of
1.0) of .05*N - .15*N.  Suppose N == 7, which gives a range of .35-1.05
extra time.  The code size would also expand, although probably not as
much.  Of course, if N == 100 (if you were doing it with exceptions,
perhaps), you now get +5-15, i.e., 6-16X slower, which is clearly
ungood, and only survivable for debugging.

What this says is:
1) As Andrew says, you can maybe survive by generating the extra code
   to do this.
2) Depending on your code sequences and the frequency of this problem,
   you might get away with a 10-15% hit (as on MIPS, with unaligned
   instructions); in more typical cases, looking at these numbers, I'd
   guess a 50% hit would be typical, if I had to pick a single number.

A 50% hit is either a) irrelevant, or b) Very Important, depending on
what you're doing.  50% hits in big CAD crunchers are sometimes
considered Bad....

Since this is done from rather minimal input, maybe somebody with real
data might choose to post it?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
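[Editor's note: John's first-order arithmetic tabulates as below.  The
5-15% fraction and the penalties N == 7 and N == 100 are his own
back-of-the-envelope estimates from the article, not measurements.]

```c
/* First-order slowdown model from the article: if a fraction f of all
 * instructions pays an N-cycle misalignment penalty, run time grows
 * from a base of 1.0 to 1.0 + f*N.
 *
 *   N == 7   (inline byte-merge):  1.35X - 2.05X total run time
 *   N == 100 (trap every access):  6X - 16X total run time
 *
 * The fractions 0.05-0.15 are estimates, not measured data. */
double slowdown(double f, int n)
{
    return 1.0 + f * (double)n;
}
```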
cquenel@polyslo.CalPoly.EDU (46 more school days) (04/04/89)
In article <16407@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Of course, if N == 100 (if you were doing it with exceptions, perhaps),
>you now get + 5-15, i.e., 6-16X slower, which is clearly ungood, and
>only survivable for debugging.

Wait a sec, aren't you comparing the proverbial mis-matched fruits
here?  The hit you take from compiler-generated "safe" code is paid
whether the specific reference is actually aligned or not.  As you
mention, in FORTRAN, things in a common block can have arbitrary
alignment in THEORY, but in practice the words are often aligned.  If
you are "doing it with exceptions", then the exception is only taken
when the word is, in fact, mis-aligned.

Using exceptions to handle mis-aligned data isn't quite as bad as your
(admittedly back-of-the-envelope) figure of 6-16X would suggest.

--chris
-- 
@---@ ------------------------------------------------------------------ @---@
\. ./ | Chris Quenelle (The First Lab Rat)  cquenel@polyslo.calpoly.edu | \. ./
 \ /  | + good; ++ good;  -- Annie Lennox                               |  \ /
==o== ------------------------------------------------------------------ ==o==
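[Editor's note: Chris's objection can be made concrete with a tiny
expected-cost model.  The values of p and T below are illustrative
guesses, not measurements from any machine.]

```c
/* Expected cycles per memory reference when misalignment is handled
 * by trapping: aligned references cost 1 cycle, and only the fraction
 * p that is actually misaligned pays the trap cost T.
 * Illustrative model, not measured data. */
double trap_avg_cost(double p, double t)
{
    return (1.0 - p) * 1.0 + p * t;
}
/* With p = 0.01 and T = 100 the average is 1.99 cycles -- cheaper than
 * paying an unconditional 7-cycle inline sequence on every reference.
 * The break-even point is p = 6/99, about 6% misaligned references. */
```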
mash@mips.COM (John Mashey) (04/04/89)
In article <9931@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU (46
more school days) writes:
>In article <16407@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Of course, if N == 100 (if you were doing it with exceptions, perhaps),
>>you now get + 5-15, i.e., 6-16X slower, which is clearly ungood, and
>>only survivable for debugging.
>
>Wait a sec, aren't you comparing the proverbial mis-matched fruits
>here?
....
>Using exceptions to handle mis-aligned data isn't quite as bad as your
>(admittedly back-of-the-envelope) figure of 6-16X would suggest.

Good point!  It's certainly not as bad as that in practice.  I can't
remember how bad it really was in practice, except that we had a lot of
pressure to get the compiler support in AFTER the handle-by-exception
code was already in use.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086