[comp.arch] RT/PC Unaligned Accesses

pgc+@andrew.cmu.edu (Paul G. Crumley) (04/02/89)

Hello,

For those who are not familiar with the design of the RT/PC processor, I will
describe what is done for unaligned access to memory for data accesses and
instruction fetches.  All of this is described in detail in the "IBM RT PC
Hardware Technical Reference Volume I", part number 75X0232.

Instructions on the RT/PC must be halfword aligned.  (Each word contains four
8-bit bytes.)  The Instruction Address Register (IAR) has bit 0, the LSB, forced
to zero.  If you execute a branch to an absolute location (most branches are
relative and contain the displacement in halfwords), thus loading the whole IAR,
the LSB is silently forced to zero.  If one tries to load the IAR using the
privileged instruction MTS (Move To SCR (System Control Register)), the LSB is
also forced to zero.  These absolute branches and the MTS instruction are
the only ways that a program can attempt to get an odd value in the IAR.  In
both cases the resulting state of the CPU is completely defined.

The processor has instructions that load and store bytes, halfwords and words.
For halfword accesses (both load and store) the LSB is silently forced to a zero
and for word accesses the two LSBs are silently forced to be zeroes.  Load and
store instructions have no effect on the condition bits and no exceptions are
generated when the lower order bit or bits are forced to zero.
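
The address forcing described above can be modelled in a few lines of C.  This
is only a sketch: the function name rt_effective_address and the encoding of the
access size as 1, 2 or 4 bytes are assumptions made for illustration, not
something taken from the RT/PC documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the silent address forcing: size is the access
   width in bytes (1 = byte, 2 = halfword, 4 = word). */
uint32_t rt_effective_address(uint32_t addr, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    /* size - 1 is 0, 1 or 3, i.e. exactly the low-order bits to clear. */
    return addr & ~(uint32_t)(size - 1);
}
```

Under this model a word access to 0x1003 goes to 0x1000 and a halfword access
to 0x1003 goes to 0x1002, with nothing to tell the program that bits were
dropped.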

There are two instructions which are used to move a group of registers to or
from storage as a unit.  These are the Load Multiple (LM) and Store Multiple
(STM) instructions.  These instructions move N registers to or from an area of
memory 4*N bytes in size.  This area must start on a word-aligned boundary.

For both instruction fetches and data loads and stores it is possible to
generate an exception from the MMU (Memory Management Unit) or other storage
controllers on the RSC (ROMP (RT's CPU) Storage Channel).  It is possible for
instructions, and the source or target area for an LM or STM instruction, to span
a page boundary.  If such a condition causes a page fault exception, all of the
information needed to restart the instruction fetch or the data access is
available to the processor.

Now that we all know what is being discussed, let's see if there really are
problems with this scheme.  I will attempt to discuss some of Don's complaints
as these are fairly representative of what I have heard from other people in the
past about the way the RT/PC handles unaligned accesses.

"How can one write a correct program on a machine that drops the bits and
doesn't report any exceptions?"  If you have a program that runs correctly on
some other processor and you have used language supported ways to create new
objects, this program will run correctly on an RT/PC.  If your program assumes
you can stuff an integer into an array of floating point numbers you might have
problems.  If the language doesn't understand type coercions or otherwise support
the allocation of new objects, you are out of luck.  Anything you do is not
portable, so it has to be isolated from the rest of the program anyway.

"OK, what about importing pointers to things?  How can one write correct code in
that case?"  Pointers that one gets from an untrusted object always have to be
checked.  They have to be checked to make sure they won't page fault.  They have
to be checked to be certain they are pointing to a valid instance of the class
of object to which they claim to point.  In most cases, this checking process is
non-trivial.  The work of checking the low order bits is simple and quick
compared to the work required to determine if the pointer you just got is really
a pointer to a valid object.
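
The low-order-bit check mentioned above might look like this in C.  The helper
name pointer_is_aligned is hypothetical, and the sketch covers only alignment
and the null case, not the harder question of whether the pointer refers to a
valid object.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if p is non-null and a multiple of alignment.
   alignment must be a power of two (2 for halfwords, 4 for words). */
bool pointer_is_aligned(const void *p, size_t alignment)
{
    if (p == NULL || alignment == 0 || (alignment & (alignment - 1)) != 0)
        return false;
    /* For a power of two, alignment - 1 masks exactly the low-order bits. */
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

The test is a single AND and compare, which is the sense in which it is cheap
next to validating the object itself.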

"Doesn't the need to pad all the data structures waste a lot of space?"  For
most languages, the compilers are free to rearrange the order of the elements of
a structure any way they desire.  With smart compilers, a worst case of 3 bytes
per structure is used for padding.  In compensation, programs operate faster
when the data is loaded and stored in single memory system cycles rather than
requiring two such cycles.  If a particular language (COBOL?) really agrees to
order the elements of structures in exactly the way the programmer desires,
there are instructions in the RT/PC's repertoire that allow those elements to be
moved correctly.
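
The effect of member ordering on padding can be seen in a small C sketch.  C is
one of the languages that keeps members in declaration order, so the reordering
here is done by hand to stand in for what a reordering compiler might do.  The
struct and function names are invented for illustration, and the exact sizes
depend on the ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Members in the order a programmer might write them. */
struct as_written {
    uint8_t  flag;
    uint32_t count;
    uint8_t  kind;
};

/* Same members, widest first. */
struct reordered {
    uint32_t count;
    uint8_t  flag;
    uint8_t  kind;
};

/* Padding is the total size minus the 6 bytes the members occupy. */
size_t padding_as_written(void) { return sizeof(struct as_written) - 6; }
size_t padding_reordered(void)  { return sizeof(struct reordered) - 6; }
```

On a typical ABI that aligns 32-bit integers to 4 bytes, the first layout is
usually 12 bytes and the second 8, so the reordered form stays within the
3-byte worst case.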

"Oh, great...now we have to rewrite all our test cases."  No comment.

"Come on, there must be SOME case where a piece of code that works on a machine
that fixes up unaligned memory access doesn't work on the RT/PC and the RT/PC
doesn't complain!"  Well, yes, there are such cases.  One can imagine a program
that grabs an arbitrary instruction stream, smashes that instruction stream into
memory starting at an arbitrary location, and jumps to the instructions.
Another example is a simple-minded malloc that hands back arbitrary addresses.
Such programs might work on some machine but fail on the RT/PC.  Though it is
simple to imagine such program fragments, in practice, the code that does these
types of functions was isolated in language constructs, library routines and
operating system services long ago.

"Look, Paul, how can you possibly defend such an inelegant piece of garbage?
This seems so out of character for you!"  Well, what can I say?  The RT/PC
executes correct programs correctly.  I have never had a problem with the way
the RT/PC accesses memory and I have written lots of code that manipulates
imported pointers, implements device drivers, whips together instruction streams
on the fly, and performs other non-trivial functions.  I believe the silicon saved
by eliminating the need for the shifters and more complex microcode was put to better use in a
variety of ways.  Some of these uses include:

-- Programs execute more reliably on an RT/PC than on many other processors.
The RT/PC implements parity generation and checking with automatic
retransmission on inter-module busses.  This causes transient errors that would
go undetected on other processors to be detected and fixed on the RT/PC.

-- The RT/PC's MMU allows access to storage to be regulated with very fine
granularity.  This allows options for software designs that were not feasible on
many other machines.  Debuggers can allow programs to execute at full speed and
trap on accesses to an arbitrarily large number of data objects.  It is possible
to allow multiple processes to share data areas and to use new ways of synchronizing
changes to that data.  Mapped files that must be consistent across multiple data
spaces or processors are possible.

-- The RT/PC's MMU provides many facilities that greatly enhance the system's
performance.  The MMU defines a very large virtual address space (2**40).  This
space can be used in a variety of ways by programs and operating systems.  There
are plenty of TLB (Translation Look-aside Buffer) entries, thus providing good
virtual memory performance.  Hardware is provided that allows quick process
switches.  DMA can use real or virtual addresses.  Memory accesses on the RT/PC
are interleaved thus providing faster data loads and stores.

Please note that all of the above listed items refer to the MMU.  It turns out
that the CPU generates all the address bits for memory system access.  (The CPU
does force the low order bit of the IAR to zero thus limiting instruction
fetches to halfwords.)  It is the MMU chip that drops the low order bits for
unaligned accesses.  I have ignored this fact till now since it is an
implementation detail.  The system looks the same to the programmer no matter
where the bits are dropped.  Still, I want to point out that the way the RT/PC
implements unaligned accesses is not a product of the RISC processor, but of the
MMU.  Other devices on the RSC are free to support unaligned accesses in any
manner they like.

As with all implementations there are trade-offs.  The silicon saved in the MMU
by not supporting arbitrary alignment was put to good use.  The RT/PC operates
more reliably and with higher performance than many similar systems.  When good
compilers are used, the current RT/PCs, models 12X & 13X, perform very well.

In conclusion, I hope this note helps to clear up the issues involving the
manner in which the RT/PC implements unaligned accesses.  Though the dropping of
low order address bits may seem to prevent any hope of producing correct object
code, we see that such concerns are not a problem.  If there are problems with
the way a program accesses data on the RT/PC, there are problems with that code
on other machines too.

Best regards,

Paul

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (04/02/89)

In article <oYBLMhy00jaRA1ZlFZ@andrew.cmu.edu> pgc+@andrew.cmu.edu 
	(Paul G. Crumley) writes:
>For halfword accesses (both load and store) the LSB is silently forced
>to a zero and for word accesses the two LSBs are silently forced to be
>zeroes.  Load and store instructions have no effect on the condition
>bits and no exceptions are generated when the lower order bit or bits
>are forced to zero.
>
>The RT/PC executes correct programs correctly.  I believe the silicon
>saved by eliminating the need for the shifters and more complex microcode was put
>to better use in a variety of ways.

You are probably right. However, when the RT/PC detects an incorrect
program, it performs actions that the programmer clearly did not
intend. In my previous posting, I argued that this leads to bugs which
are very hard to deal with. The amount of silicon required to raise an
interrupt is probably LESS THAN the amount required to zero the correct
number of address lines. The interrupt would make bugs much easier to
find, and the chip would be just as RISCish.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
--

mash@mips.COM (John Mashey) (04/02/89)

In article <oYBLMhy00jaRA1ZlFZ@andrew.cmu.edu> pgc+@andrew.cmu.edu (Paul G. Crumley) writes:
....good tutorial on way RT/PC works....
>The processor has instructions that load and store bytes, halfwords and words.
>For halfword accesses (both load and store) the LSB is silently forced to a zero
>and for word accesses the two LSBs are silently forced to be zeroes....
...
>"How can one write a correct program on a machine that drops the bits and
>doesn't report any exceptions?"  If you have a program that runs correctly on
>some other processor and you have used language supported ways to create new
>objects, this program will run correctly on an RT/PC...
...
>As with all implementations there are trade-offs.  The silicon saved in the MMU
>by not supporting arbitrary alignment was put to good use.  The RT/PC operates
>more reliably and with higher performance than many similar systems.  When good
>compilers are used, the current RT/PCs, models 12X & 13X, perform very well.

Is there any chance of posting some data in support of 1) "more reliably",
and 2) "with higher performance"  or at least perhaps specifying what
"many similar systems" refers to?  or what "performs very well" means?
(I'm sure it wasn't meant this way, but this reads like: "It's better to do it
this way because it's better than (unspecified) Brand X or Y."  This sort
of dataoid is hard to evaluate.)
Note that at least some other RISCs support extensive use of parity or ECC,
for example.  Are there published numbers for 12X & 13X?

>.....  Though the dropping of
>low order address bits may seem to prevent any hope of producing correct object
>code, we see that such concerns are not a problem.  If there are problems with
>the way a program accesses data on the RT/PC, there are problems with that code
>on other machines too.

I think the discussion got inverted.  I can't imagine there being any problem
running CORRECT, portable code on an RT/PC that ran correctly on machines that
enforced alignment.  But that isn't the problem; the problem lies in
debugging programs.  The RT/PC design is the only one I can think of offhand
that made this particular choice.  Most machines:
	a) Support unaligned data in general.
	OR
	b) Trap on unaligned references
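
The two common choices, plus the RT/PC behaviour under discussion, can be put
side by side in a small C model of a word load.  This is a sketch only: the
names load_word and align_policy are invented for illustration, the big-endian
byte order is an assumption of the model, and a returned false stands in for a
trap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum align_policy {
    POLICY_FIXUP,   /* a) support unaligned data */
    POLICY_TRAP,    /* b) trap on unaligned references */
    POLICY_FORCE    /* silently clear the low-order address bits */
};

/* Loads a big-endian 32-bit word from mem at addr under the given policy.
   Returns false to model a trap; otherwise stores the value in *out.
   The caller must ensure that bytes addr .. addr + 3 lie within mem. */
bool load_word(const uint8_t *mem, uint32_t addr, enum align_policy policy,
               uint32_t *out)
{
    if ((addr & 3u) != 0) {
        if (policy == POLICY_TRAP)
            return false;       /* the error is reported at once */
        if (policy == POLICY_FORCE)
            addr &= ~3u;        /* the error is hidden */
        /* POLICY_FIXUP uses the address exactly as given. */
    }
    *out = ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
    return true;
}
```

With memory holding the bytes 00 11 22 33 44 55 66 77, a word load from address
1 yields 0x11223344 under the first policy, fails under the second, and yields
0x00112233 under the third without any sign that the address was altered.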

I observe in my experience that a trap caused by an unaligned reference
is one of the commonest first indications of error in a program.  For what it's
worth, as an example, let me illustrate this with a story from a long time
ago.  At school, we used to have 360/xx computers, which used approach b).
We then got a 370/xxx, which used a).  Our computer center actually
requested a price from IBM for a mode that would retain the 360 behavior,
because they went thru their list of APARs (trouble reports), and discovered
that something like 50% of them were first discovered by encountering
an alignment error.  [The price was too high, unfortunately.]

Anyway, it seems that either a) or b) is viable; at least if you have a),
although you lose the debugging assist, some kinds of code are certainly
more convenient.  In this particular area, it seems that the RT PC
made an unusual tradeoff, very different from the rest of the RISCs,
for no reason that is yet obvious.  It's truly hard to believe
that, with 2 whole VLSI chips to implement CPU + MMU, a few gates
couldn't be found to implement the alignment check....so please say more,
maybe I've missed something.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) (04/03/89)

One advantage of silently ignoring the low bits in unaligned accesses is
that you can use the low order bits as free tag bits -- without runtime
overhead when dereferencing the data if you don't care to type-check.  This
contrasts with the use of alignment traps for run time error checking.

I don't know if the CMU RT Common Lisp actually does this (I'm not a Lisp
hacker), but it would be a reasonable thing to do.
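
As a rough illustration of the idea, here is how low-order tag bits look in
portable C.  The names tag_pointer, pointer_tag and untag_pointer are
hypothetical, and this is not a description of what any particular Lisp does.
Portable code has to mask the tag off before dereferencing; hardware that
ignores the two low bits of a word address would make that masking step
unnecessary for word loads and stores.

```c
#include <assert.h>
#include <stdint.h>

/* The two low bits are free in any word-aligned pointer. */
#define TAG_MASK ((uintptr_t)3)

/* Packs a 2-bit tag into the low bits of a word-aligned pointer. */
uintptr_t tag_pointer(const void *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= 3);
    return (uintptr_t)p | tag;
}

/* Extracts the tag without touching the object. */
unsigned pointer_tag(uintptr_t tagged)
{
    return (unsigned)(tagged & TAG_MASK);
}

/* Strips the tag; this is the step that costs an instruction when the
   hardware does not drop the low bits itself. */
void *untag_pointer(uintptr_t tagged)
{
    return (void *)(tagged & ~TAG_MASK);
}
```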

-bsy
-- 
Internet:  bsy@cs.cmu.edu		Bitnet:	 bsy%cs.cmu.edu%smtp@interbit
CSnet:     bsy%cs.cmu.edu@relay.cs.net	Uucp:    ...!seismo!cs.cmu.edu!bsy
USPS:      Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice:     (412) 268-7571
--

jas@ernie.Berkeley.EDU (Jim Shankland) (04/04/89)

The case for faulting on unaligned instructions and data accesses
(e.g., MIPS, 68010, et al.) rather than dealing with them correctly
in hardware (e.g., VAX) is stronger than the case for silently
forcing the 1 or 2 LSB's to 0, as the RT/PC does.  Sure, correct
programs will run correctly; but correct programs of any respectable
size are almost unheard of.  Consider the following reductio ad
absurdum:

My new processor does not fault on data references beyond the bounds
of a process's address space; instead, it silently reads random data.
This saved me a hunk of silicon.  Correct programs still run correctly;
only programs that dereference wild pointers get into trouble, and
those are broken anyway.

Jim Shankland
jas@ernie.berkeley.edu

"Blame it on the lies that killed us, blame it on the truth that ran us down"

jhallen@wpi.wpi.edu (Joseph H Allen) (04/04/89)

In article <28664@ucbvax.BERKELEY.EDU> jas@ernie.Berkeley.EDU (Jim Shankland) writes:
>The case for faulting on unaligned instructions and data accesses
>(e.g., MIPS, 68010, et al.) rather than dealing with them correctly
>in hardware (e.g., VAX) is stronger than the case for silently
>forcing the 1 or 2 LSB's to 0, as the RT/PC does.  Sure, correct
>programs will run correctly; but correct programs of any respectable
>size are almost unheard of.  Consider the following reductio ad
>absurdum:
.
.
.
>Jim Shankland
>jas@ernie.berkeley.edu
>
>"Blame it on the lies that killed us, blame it on the truth that ran us down"

Also, consider the problem in the other direction:  Some programs may work
correctly on the RT only because the RT silently zeros the low-order address
bits.  Such programs will not, of course, work on a "non-braindamaged" machine.
On the other hand, the RT's doing this is in sync with the general large-company
philosophy of getting you "locked into using our machine forever, otherwise
you're gonna pay a lot of money for new software."

"That's not a bug; it's a feature!"

sloan@june.cs.washington.edu (Kenneth Sloan) (04/05/89)

Doesn't anyone remember the good old IBM 1130? [now THERE was a personal
computer!]  As I (dimly) recall, loads(stores) of 16-bit "words" proceeded
by loading(storing) the high-order byte from(at) the specified address, and
then loading(storing) from(at) the address ORed with 1.

I found it a marvelous way to save on the storage of 0 & -1...
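
One way to read the remark about 0 and -1 is sketched below in C.  The model
follows the behaviour as recalled in the post (high byte from the given
address, low byte from the address ORed with 1) and has not been checked
against IBM 1130 documentation; the function name is invented.

```c
#include <stdint.h>

/* 16-bit load as recalled above: the high byte comes from addr and the
   low byte from addr | 1.  For an odd addr both are the same byte. */
uint16_t load16_as_recalled(const uint8_t *mem, uint32_t addr)
{
    return (uint16_t)(((uint16_t)mem[addr] << 8) | mem[addr | 1u]);
}
```

Under this model a load from an odd address reads one byte twice, so a single
byte of 0x00 comes back as 0x0000 and a single byte of 0xFF comes back as
0xFFFF, which is -1 as a 16-bit two's complement value.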

-Ken Sloan

bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) (04/05/89)

In article <28664@ucbvax.BERKELEY.EDU> jas@ernie.Berkeley.EDU (Jim Shankland) writes:
]....  Consider the following reductio ad
]absurdum:
]
]My new processor does not fault on data references beyond the bounds
]of a process's address space; instead, it silently reads random data.
]This saved me a hunk of silicon.  Correct programs still run correctly;
]only programs that dereference wild pointers get into trouble, and
]those are broken anyway.

I'm not arguing that the RT is a great machine... but you might as well
use your reductio argument for the BSD Vax implementation -- NULL pointers
are too often dereferenced:  correct programs still run correctly; only
programs that dereference NULL get into trouble (when ported to a Sun, for
example), and those are broken anyway....

-bsy
--

firth@sei.cmu.edu (Robert Firth) (04/05/89)

In article <4646@pt.cs.cmu.edu> bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) writes:

>I'm not arguing that the RT is a great machine... but you might as well
>use your reductio argument for the BSD Vax implementation -- NULL pointers
>are too often dereferenced:  correct programs still run correctly; only
>programs that dereference NULL get into trouble (when ported to a Sun, for
>example), and those are broken anyway....

In my opinion, the argument is still valid.  Dereferencing a null pointer
should cause a trap; if an implementation silently does the wrong thing
it is seriously stupid.

When I first started porting to Vax/VMS I found one or two null-pointer
dereferences in almost every major systems program.  Every case was an
accident waiting to happen.  An editor had three examples, one of which
would certainly have cost the hapless user her entire edit session.

It is so easy to map out page zero of the virtual address space.  Why
be so stupid for so little gain?

aglew@mcdurb.Urbana.Gould.COM (04/07/89)

>One advantage of silently ignoring the low bits in unaligned accesses is
>that you can use the low order bits as free tag bits -- without runtime
>overhead when dereferencing the data if you don't care to type-check.  This
>contrasts with the use of alignment traps for run time error checking.
>
>I don't know if the CMU RT Common Lisp actually does this (I'm not a Lisp
>hacker), but it would be a reasonable thing to do.
>
>-bsy
>-- 
>Internet:  bsy@cs.cmu.edu		Bitnet:	 bsy%cs.cmu.edu%smtp@interbit
>CSnet:     bsy%cs.cmu.edu@relay.cs.net	Uucp:    ...!seismo!cs.cmu.edu!bsy
>USPS:      Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
>Voice:     (412) 268-7571

Time yet again for expression of one of my favorite architectural "gimmicks":
provide a mask that is implicitly ANDed with any {instruction,data} address
before use -- this way you could place tag bits in the low bits, in the
high bits, or in any bit you wanted (and not have to rely on features like
"The 68000 only implements 24 address bits", etc.).

Yes, this is a gimmick...
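
A software model of such a mask might look like the following C sketch.  The
addr_unit type and the masked_address function are hypothetical and describe
no real processor; they only show that one AND covers both the low-bit and the
high-bit tagging cases.

```c
#include <stdint.h>

/* Hypothetical address unit: bits set in address_mask take part in
   addressing, bits cleared are free for tags. */
typedef struct {
    uint32_t address_mask;
} addr_unit;

/* The implicit AND applied to every address before use. */
uint32_t masked_address(const addr_unit *u, uint32_t raw)
{
    return raw & u->address_mask;
}
```

A mask of 0xFFFFFFFC frees the two low bits, as on the RT/PC for word accesses,
while a mask of 0x00FFFFFF frees the top byte, much like a 24-bit address bus.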


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

mac@uvacs.cs.Virginia.EDU (Alex Colvin) (04/14/89)

> >that you can use the low order bits as free tag bits -- without runtime

> provide a mask that is implicitly ANDed with any {instruction,data} address
> before use -- this way you could place tag bits in the low bits, in the
> high bits, or in any bit you wanted (and not have to rely on features like

Most instruction addressing modes provide an adder, some also will do
limited shifting.  As long as we want AND and OR, let's just put in the
whole ALU.  This is sort of what the WM design does with its two opcodes.
Think of one as the addressing mode.  You want auto-increment by 17?

aglew@mcdurb.Urbana.Gould.COM (04/22/89)

>> >that you can use the low order bits as free tag bits -- without runtime
>
>> provide a mask that is implicitly ANDed with any {instruction,data} address
>> before use -- this way you could place tag bits in the low bits, in the
>> high bits, or in any bit you wanted (and not have to rely on features like
>
>Most instruction addressing modes provide an adder, some also will do
>limited shifting.  As long as we want AND and OR, let's just put in the
>whole ALU.  This is sort of what the WM design does with its two opcodes.
>Think of one as the addressing mode.  You want auto-increment by 17?

Pass transistors.