[comp.arch] endian etc

ex2mike@cybaswan.UUCP ( m overton) (05/11/91)

A simple answer to all the problems with byte order etc., to the
author's mind, is to have a duplicate set of load and store instructions.
Since most RISC machines have very few of them, adding a set for the
opposite sense would surely be very easy. It wouldn't work with 
instructions, of course, but I assume that is not a problem. 

On a completely different subject, there have been a number of articles
recently on the subject of caches. One of my particular ideas on these
is as follows...

	Wouldn't it be very easy on a machine with a write-back cache
to copy words simply by changing the internal cached address (a little
like a form of cache aliasing)? A lot of time is spent in most code
just copying things around. Would this not improve things (you gain
immediately on cache occupancy)?

Mike Overton - ex2mike@pyr.swan.ac.uk

zalman@mips.com (Zalman Stern) (05/11/91)

In article <2496@cybaswan.UUCP> ex2mike@cybaswan.UUCP ( m overton) writes:
>
>A simple answer to all the problems with byte order etc, to the
>authors mind, is to have a duplicate set of load and store instructions.
>Since most RISC machines have very few of them, adding a set for the
>opposite sense would surely be very easy. It wouldn't work with 
>instructions, of course, but I assume that is not a problem. 
>

No, it doesn't help at all. Current RISC chips which support bi-endian
operation simply xor a constant with the low order address bits on non-word
operations. The constant changes depending on the byte-order bit. Storing a
word, changing the byte order bit, and loading the same word will get
exactly the same value (it will not be byte swapped).  Words have a
constant format in memory and byte addresses are modified appropriately.
Hence a buffer full of bytes cannot be accessed by processes of different
byte sex without word swapping the buffer. Moving the byte order bit into
the opcode accomplishes nothing. (Except wasting a lot of opcode space. See
below.)

A real solution would be to add byte lane swapping hardware to the chip.
This hardware would actually swap the bytes on every load and store
depending on the byte order bit. (it can stay in the status register, it
doesn't matter.) That way, memory is laid out so that bytes are always in
the same place and word operations swap them around as necessary. If this
were the case, buffers would not need to be byte swapped at all.

So why don't we do this? The word over here in software is that the
hardware is expensive in terms of space and time. It's very likely to end up
on the critical path for loads and stores. Any impact on cycle time to
support bi-endianness is a lose. (This makes sense to me, but I write code
for a living. The guys on the other side of the building are probably
falling out of their chairs laughing as they read this.)

Another point about architecture: I wouldn't say RISC machines have
relatively few load and store instructions, but either way, these
instructions tend to have large (~16 bit) immediate offsets. Every such
instruction will take one major opcode.  A quick glance at an opcode table
for MIPS-I indicates that putting the byte order bit into the opcode field
would add 17 new opcodes. There are 24 free opcodes in MIPS-I. In MIPS-II,
the 17 goes up and the 24 goes down such that there would not be enough
opcodes. (And believe me, the opcode space got used for something a hell of
a lot more useful than making the instruction set byte-wise bisexual.) It's
even worse for a machine like the RS/6000 which has update and indexed
variants of its load/store instructions.

>	Wouldn't it be very easy on a machine with a write back cache
>to copy words simply by changing the internal cached address ( a little
>like a form of cache aliasing). A lot of time is spent in most code
>just copying things around. Would this not improve things ( you gain
>immediately on cache occupancy).

This is one of many cache tricks you can play to make memory copies (bcopy)
and memory fills (bzero) go fast. Mostly these sorts of things are only
used inside the operating system because they aren't safe for unprivileged
code to use. One can imagine hardware designed to provide this
functionality for user processes.
-- 
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman     (408) 524 8395
  "Never rub another man's rhubarb" -- the Joker via Pop Will Eat Itself

tim@proton.amd.com (Tim Olson) (05/12/91)

In article <3407@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:
| In article <2496@cybaswan.UUCP> ex2mike@cybaswan.UUCP ( m overton) writes:
| >
| >A simple answer to all the problems with byte order etc, to the
| >authors mind, is to have a duplicate set of load and store instructions.
| >Since most RISC machines have very few of them, adding a set for the
| >opposite sense would surely be very easy. It wouldn't work with 
| >instructions, of course, but I assume that is not a problem. 
| >
| 
| No, it doesn't help at all. Current RISC chips which support bi-endian
| operation simply xor a constant with the low order address bits on non-word
| operations. The constant changes depending on the byte-order bit. Storing a
| word, changing the byte order bit, and loading the same word will get
| exactly the same value (it will not be byte swapped).  Words have a
| constant format in memory and byte addresses are modified appropriately.

Hmmm... I was under the impression that the 88K actually did swap
bytes around when switching between big and little endian.  The way
they do it, byte0 is always in the same location in 32-bit memory,
whereas the way the 29K (and apparently MIPS) does it, bit0 is always in
the same location, but the numbering of bytes changes:

	29K, MIPS				88K

        31              0                       31              0
	+---+---+---+---+			+---+---+---+---+
	| 0 | 1 | 2 | 3 |	Big Endian	| 0 | 1 | 2 | 3 |
	+---+---+---+---+			+---+---+---+---+

        31              0                        7-0    23-16
	+---+---+---+---+			+---+---+---+---+
	| 3 | 2 | 1 | 0 |     Little Endian	| 0 | 1 | 2 | 3 |
	+---+---+---+---+			+---+---+---+---+
						    15-8    31-24


Can someone from Mot confirm this?  What about some of the other
bi-endian processors out there (i960, i860)?

--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (05/13/91)

In article <3407@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:

>A real solution would be to add byte lane swapping hardware to the chip.
>This hardware would actually swap the bytes on every load and store
>depending on the byte order bit. (it can stay in the status register, it
>doesn't matter.) That way, memory is laid out so that bytes are always in
>the same place and word operations swap them around as necessary. If this
>were the case, buffers would not need to be byte swapped at all.
>
>So why don't we do this?

Surely, it is a real solution to have a bi-endian OS. By doing so, the
external bus (and OS and I/O) can have some fixed endianness which may or
may not be different from user processes.

But, with R3000, it is a bad idea to do so, because the R3000 was designed
so that the byte sex of the whole system (including the bus) is changeable.
The byte flipping of the R3000 is done right.

The R3000A, which allows dynamic changing of byte sex, should have been
designed to allow byte-sex flipping of the CPU and of the bus independently.

>The word over here in software is that the
>hardware is expensive in terms of space and time. Its very likely to end up
>on the critical path for loads and stores.

I don't think so. A byte flipping multiplexer (for non-word access)
already exists on the path for loads and stores. So, it is only
necessary to add one more mode bit (to determine the byte sex of the bus) and
two more flipping patterns.

						Masataka Ohta

glew@pdx007.intel.com (Andy Glew) (05/13/91)

>>	Wouldn't it be very easy on a machine with a write back cache
>>to copy words simply by changing the internal cached address ( a little
>>like a form of cache aliasing). A lot of time is spent in most code
>>just copying things around. Would this not improve things ( you gain
>>immediately on cache occupancy).
>
>This is one of many cache tricks you can play to make memory copies (bcopy)
>and memory fills (bzero) go fast. Mostly these sorts of things are only
>used inside the operating system because they aren't safe for unprivelleged
>code to use. One can imagine hardware designed to provide this
>functionality for user processes.

What sort of associativity?
--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

abaum (Allen Baum) (05/13/91)

In article <3407@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:

>A real solution would be to add byte lane swapping hardware to the chip.
.....
>So why don't we do this? The word over here in software is that the
>hardware is expensive in terms of space and time. Its very likely to end up
>on the critical path for loads and stores. Any impact on cycle time to
>support bi-endianess is a lose.

I've thought about this a bit, and I'm not sure it's true. All the byte lane
switching hardware already exists for the low order byte; you've got to be
able to select any of the four bytes to be placed in the low byte of the
register for load_bytes (and move the low byte to any of the four for 
store_bytes). The next byte has to select from only its own position and the
MSByte for load_halfword (& reverse for store_halfword). The upper halfword
doesn't get mucked with at all, except for getting sign extended or cleared.

So, the critical path already exists. The muxing isn't terribly symmetrical-
all the work goes into the LS byte, almost none into the MS byte. That means
the layout probably has holes in it, which could be filled with the rest of
the byte lane logic, at (here I'm speculating some) no extra cost in space, or
time.

Not quite true, of course. It will take some extra time to generate the
mux control signals, and some extra time to buffer them to cover all four
bytes. All this can be done in parallel with the load, however, and should
cost no extra time. So there. (Can anyone who actually knows what they're 
talking about please refute?)

By the way, note that while the buffering, etc. can go on in parallel, if
sign extension is required, it can't - you can have muxes set up for sign
selection, but you have to wait until it gets there, buffer it so it can
drive 24 loads, and then stick it into all those upper bit positions. 

As a sweeping generalization, the path from cache to registers/forwarding
path is THE critical path (if it isn't, you've probably done the design
wrong, or have a CISC architecture) (This is just a sweeping generalization,
mind you, not truth). So, this extra bit of gate delay can conceivably have
quite an impact on your cycle time. Which is why HP-PA doesn't have sign
extending loads.

>
>>	Wouldn't it be very easy on a machine with a write back cache
>>to copy words simply by changing the internal cached address ( a little
>>like a form of cache aliasing). A lot of time is spent in most code
>>just copying things around. Would this not improve things ( you gain
>>immediately on cache occupancy).
>

This can work, but only on entire lines, and only for line aligned addresses.
As noted, this is very cache organization sensitive, and you might want to 
restrict it to stuff in the kernel. It might work very well for frame buffer
and entire page moves.
                                                                              

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/13/91)

The real solution to a bi-endian environment is to have a good
typed file system, like Apollo Domain or the Mac Finder.  Then,
a file could have header information that tells what its
endian-ness is.  Further, for data files that are to be
shared with multiple environments, the data could then
either be converted to the native form "on the fly" (a
performance hit), or the data could be maintained in the
file in two copies with different endian-ness (a disk hog,
but disk space is cheap, what?).

I think most people who need dual-endian-ness want to share
data files, not binaries.  At least that is my experience.
What we do to get around this is to use transportable
self-describing file formats, notably HDF and netCDF.  I'd
like to see these built into an OS someday, as they're so
terribly useful.

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (05/14/91)

In article <166@armltd.uucp> abaum (Allen Baum) writes:

>I've thought about this a bit, and I'm not sure its true. All the byte lane
>switching hardware already exists for the low order byte;

Yes.

>The upper halfword
>doesn't get mucked with at all, except for getting sign extended or cleared.

And, thus, multiplexers already exist for the high order bytes.

>So, the critical path already exists. The muxing isn't terribly symmetrical-
>all the work goes into the LS byte, almost none into the MS byte. That means
>the layout probably has holes in it, which could be filled with the rest of
>the byte lane logic, at (here I'm speculating some) no extra cost in space, or
>time.

The possible problem is that, if things go to 64 bits, it will become a
little more complex.

>By the way, note that while the buffering, etc. can go on in parallel, if
>sign extension is required, it can't - you can have muxes set up for sign
>selection, but you have to wait until it gets their, buffer it so it can
>drive 24 loads, and then stick it into all those upper bit positions. 

With a 32 bit word, a 4 to 1 MUX is enough for the upper bit positions, including
sign-extended/non-sign-extended half-word/byte loading, which is not so
different from the current 3 to 1 MUXing.

>As a sweeping generalization, the path from cache to registers/forwarding
>path is THE critical path (if it isn't, you've probably done the design
>wrong, or have a CISC architecture)

No. THE critical path is the register/ALU loop, which determines the maximum
clock speed (see Jouppi). Transfer between cache and a register involves
TLB lookup and cache access, requiring a little more time. Thus, if it
requires 2.5 (or 1.5 or 2.3 or...) clock cycles, it is not on the critical
path.

							Masataka Ohta

abaum (Allen Baum) (05/16/91)

In article <186@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

>In article <166@armltd.uucp> abaum (Allen Baum) writes:
>

>>As a sweeping generalization, the path from cache to registers/forwarding
>>path is THE critical path (if it isn't, you've probably done the design
>>wrong, or have a CISC architecture)
>
>No. THE critical path is on register/ALU loop, which determines the maximuum
>clock speed (see Jouppi). Transfer between cache and a register involves
>TLB look up and cache access requiring a little more time. Thus, if it
>requires 2.5 (or 1.5 or 2.3 or...) clock cycles, it in not on the critical
>path.

Hmm, we have a difference of opinion here. In a sense, you can take as
many cycles for each 'step' in the pipe as you like. In fact, the R4000
did just as you suggested: an ALU loop is a single cycle, and the cache
access has 2 cycle latency, (but 1 cycle initiation, i.e. it's pipelined).
They have stated that they worked real hard to make sure that their ALU
cycle fit into one clock. They didn't have to do that, of course- it could
have taken 2 as well (in which case they would be superpipelined in the
now 'classical' sense of Jouppi).

Some SPARC implementations have gone in the other direction: cache access is
a single cycle, but they've merged the ALU+reg. writeback stages (or access
& ALU stages, or both, I don't quite remember). In effect, they've lengthened
the ALU cycle.

So, another way of saying what I meant to say is: if both ALU and Cache access
are going to fit into a cycle, the cache access will be the limiting case.
Perhaps not true with a trivial cache, but true for sizes that are useful.

Again, assuming that you intend everything to run in a single cycle, anything
that you put into that path will therefore lengthen your minimum cycle time.
Which was the point I was trying to make about sign extension.

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (05/17/91)

In article <173@armltd.uucp> abaum (Allen Baum) writes:

>>>As a sweeping generalization, the path from cache to registers/forwarding
>>>path is THE critical path
>>>wrong, or have a CISC architecture)

First, I have noticed that the R3000 has LWL and LWR instructions, which means
a MUX for arbitrary flipping of bytes already exists on your "THE critical
path".

So, the comments below have nothing to do with endian.

>>No. THE critical path is on register/ALU loop, which determines the maximuum
>>clock speed (see Jouppi).

>So, another way of saying what I meant to say is: if both ALU and Cache access
>are going to fit into a cycle, the cache access will be the limiting case.

I think cache access is much slower than an ALU operation.

>Perhaps not true with a trivial cache, but true for sizes that are useful.

Are you sure?

Don't you know that access time increases as cache size increases?

It should also be noted that for a large cache to be useful (fast context
switch etc.), it should be tagged with physical address. So, you must also
do TLB lookup.

For example, on the R4000, an ALU operation takes only 1 internal cycle
(10ns) but a cache access takes three internal cycles including the tag check.

>Again, assuming that you intend everything to run in a single cycle, anything
>that you put into that path will therefore lengthen your minimum cycle time.
>Which was the point I was trying to make about sign extension.

With today's technology, it is a bad idea to access cache in a single cycle.

							Masataka Ohta

abaum (Allen Baum) (05/20/91)

In article <199@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

>In article <173@armltd.uucp> abaum (Allen Baum) writes:
>
>>>>As a sweeping generalization, the path from cache to registers/forwarding
>>>>path is THE critical path
>
>First, I have noticed that R3000 has LWL and LWR instruction which means
>MUX of arbitrary flipping of bytes already exists on your "THE ciritical
>path".

I agree. They do. They can't be avoided if you have load_byte instructions.
That was my point. On the other hand you shouldn't do anything to make them
worse either.

>>>No. THE critical path is on register/ALU loop, which determines the maximuum
>>>clock speed (see Jouppi).
>>So, another way of saying what I meant to say is: if both ALU and Cache access
>>are going to fit into a cycle, the cache access will be the limiting case.

>I think cache access is much slower than ALU operation.

That is my point!!!!!!!

>>Perhaps not true with a trivial cache, but true for sizes that are useful.

>Are you sure?
>Don't you know that access time increases with the size increase of cache?

Of course I know! Again, that was my point!!!

>It should also be noted that for a large cache to be useful (fast context
>switch etc.), it should be tagged with physical address. So, you must also
>do TLB lookup.
>For example, on R4000, an ALU operation takes only 1 internal cycle
>(10ns) but cache access takes three internal cycle including tag check.

Which proves my point (and yours, I guess)

>>Again, assuming that you intend everything to run in a single cycle, anything
>>that you put into that path will therefore lengthen your minimum cycle time.
>>Which was the point I was trying to make about sign extension.

>With todays technology, it is a bad idea to access cache in a single cycle.
                                 ^^^ ^^^     ^^^^^^ ^^^^       ^^^^^^ ^^^^^
Ah, now this is a very interesting position, and not one that I can dismiss out
of hand, either. All my arguments were based on single cycle cache access
(not including address formation, so most of today's RISCs have a 2 internal
cycle cache access, by your definition).

In order for this approach (taken by the R4000) to be effective,
you have to be able to schedule loads sufficiently far in advance so they
don't stall. With an N-cycle cache access, you need to schedule N-1 cycles in
advance. So, there is a law of diminishing returns there. Furthermore, you
can't block on a cache access in progress, so your cache has to be pipelined.

None of this is impossible- witness the R4000; they did it. They feel that the
gain in cycle time outweighs the extra stalls, and the extra design complexity.

So it's probably time for a new thread - costs of multi-cycle cache access in
design complexity and number of stalls vs. the benefits. Any takers?