pgc+@andrew.cmu.edu (Paul G. Crumley) (04/02/89)
Hello,

For those that are not familiar with the design of the RT/PC processor, I will describe what is done for unaligned access to memory for data accesses and instruction fetches. All of this is described in detail in the "IBM RT PC Hardware Technical Reference Volume I", part number 75X0232.

Instructions on the RT/PC must be halfword aligned (each word contains four 8-bit bytes). The Instruction Address Register (IAR) has bit 0, the LSB, forced to a zero. If you execute a branch to an absolute location (most branches are relative and contain the displacement in halfwords), thus loading the whole IAR, the LSB is silently forced to a zero. If one tries to load the IAR using the privileged instruction MTS (Move To SCR (System Control Register)) the LSB is forced to a value of zero. These absolute branches and the MTS instruction are the only ways that a program can attempt to get an odd value in the IAR. In both cases the resulting state of the CPU is completely defined.

The processor has instructions that load and store bytes, halfwords and words. For halfword accesses (both load and store) the LSB is silently forced to a zero and for word accesses the two LSBs are silently forced to be zeroes. Load and store instructions have no effect on the condition bits and no exceptions are generated when the lower order bit or bits are forced to zero.

There are two instructions which are used to move a group of registers to or from storage as a unit. These are the Load Multiple (LM) and Store Multiple (STM) instructions. These instructions move N registers to an area of memory 4*N in size. This area must start on a word aligned boundary.

For both instruction fetches and data loads and stores it is possible to generate an exception from the MMU (Memory Management Unit) or other storage controllers on the RSC (ROMP (RT's CPU) Storage Channel). It is possible for instructions, and the source or target area for a LM or STM instruction, to span a page boundary.
If such a condition causes a page fault exception, all of the information needed to restart the instruction fetch or the data access is available to the processor.

Now that we all know what is being discussed, let's see if there really are problems with this scheme. I will attempt to discuss some of Don's complaints as these are fairly representative of what I have heard from other people in the past about the way the RT/PC handles unaligned accesses.

"How can one write a correct program on a machine that drops the bits and doesn't report any exceptions?"

If you have a program that runs correctly on some other processor and you have used language supported ways to create new objects, this program will run correctly on an RT/PC. If your program assumes you can stuff an integer into an array of floating point numbers you might have problems. If the language doesn't understand type coercions or otherwise support the allocation of new objects you are out of luck. Anything you do is not portable so it has to be isolated from the rest of the program anyway.

"OK, what about importing pointers to things? How can one write correct code in that case?"

Pointers that one gets from an untrusted object always have to be checked. They have to be checked to make sure they won't page fault. They have to be checked to be certain they are pointing to a valid instance of the class of object to which they claim to point. In most cases, this checking process is non-trivial. The work of checking the low order bits is simple and quick compared to the work required to determine if the pointer you just got is really a pointer to a valid object.

"Doesn't the need to pad all the data structures waste a lot of space?"

For most languages, the compilers are free to rearrange the order of the elements of a structure any way they desire. With smart compilers, a worst-case of 3 bytes per structure is used for padding.
In compensation, programs operate faster when the data is loaded and stored in single memory system cycles rather than requiring two such cycles. If a particular language (COBOL?) really agrees to order the elements of structures in exactly the way the programmer desires, there are instructions in the RT/PC's repertoire that allow those elements to be moved correctly.

"Oh, great...now we have to rewrite all our test cases."

No comment.

"Come on, there must be SOME case where a piece of code that works on a machine that fixes up unaligned memory access doesn't work on the RT/PC and the RT/PC doesn't complain!"

Well, yes, there are such cases. One can imagine a program that grabs an arbitrary instruction stream, smashes that instruction stream into memory starting at an arbitrary location, and jumps to the instructions. Another example is a simple-minded malloc that hands back arbitrary addresses. Such programs might work on some machine but fail on the RT/PC. Though it is simple to imagine such program fragments, in practice, the code that does these types of functions was isolated in language constructs, library routines and operating system services long ago.

"Look, Paul, how can you possibly defend such an inelegant piece of garbage? This seems so out of character for you!"

Well, what can I say? The RT/PC executes correct programs correctly. I have never had a problem with the way the RT/PC accesses memory and I have written lots of code that manipulates imported pointers, implements device drivers, whips together instruction streams on the fly, and performs other non-trivial functions. I believe the silicon saved by not needing the shifters and more complex microcode was put to better use in a variety of ways. Some of these uses include:

-- Programs execute more reliably on an RT/PC than on many other processors. The RT/PC implements parity generation and checking with automatic retransmission on inter-module busses.
This causes transient errors that would be undetected on other processors to be detected and fixed on the RT/PC.

-- The RT/PC's MMU allows access to storage to be regulated with very fine granularity. This allows options for software designs that were not feasible on many other machines. Debuggers can allow programs to execute at full speed and trap on accesses to an arbitrarily large number of data objects. It is possible to allow multiple processes to share data areas, with new ways of synchronizing changes to that data. Mapped files that must be consistent across multiple data spaces or processors are possible.

-- The RT/PC's MMU provides many facilities that greatly enhance the system's performance. The MMU defines a very large virtual address space (2**40). This space can be used in a variety of ways by programs and operating systems. There are plenty of TLB (Translation Look-aside Buffer) entries, thus providing good virtual memory performance. Hardware is provided that allows quick process switches. DMA can use real or virtual addresses. Memory accesses on the RT/PC are interleaved, thus providing faster data loads and stores.

Please note that all of the above listed items refer to the MMU. It turns out that the CPU generates all the address bits for memory system access. (The CPU does force the low order bit of the IAR to zero thus limiting instruction fetches to halfwords.) It is the MMU chip that drops the low order bits for unaligned accesses. I have ignored this fact till now since it is an implementation detail. The system looks the same to the programmer no matter where the bits are dropped. Still, I want to point out that the way the RT/PC implements unaligned accesses is not a product of the RISC processor, but of the MMU. Other devices on the RSC are free to support unaligned accesses in any manner they like.

As with all implementations there are trade-offs. The silicon saved in the MMU by not supporting arbitrary alignment was put to good use.
The RT/PC operates more reliably and with higher performance than many similar systems. When good compilers are used, the current RT/PCs, models 12X & 13X, perform very well.

In conclusion, I hope this note helps to clear up the issues involving the manner in which the RT/PC implements unaligned accesses. Though the dropping of low order address bits may seem to prevent any hope of producing correct object code, we see that such concerns are not a problem. If there are problems with the way a program accesses data on the RT/PC, there are problems with that code on other machines too.

Best regards,
Paul
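
The address forcing described in the post above can be modelled in a few lines. This is only a rough sketch, not RT/PC code: the function names, the flat `bytes` object standing in for storage, and the big-endian byte order are all assumptions made for the illustration. It shows that a halfword access loses one low-order address bit, a word access loses two, and nothing is reported.

```python
# Toy model of the silent address forcing described in the post above.
# Assumptions (not from the post): big-endian byte order, a flat `bytes`
# object standing in for storage, and every name used here.

ACCESS_SIZES = {"byte": 1, "halfword": 2, "word": 4}


def effective_address(addr: int, kind: str) -> int:
    """Clear the low-order address bits that the access size ignores."""
    size = ACCESS_SIZES[kind]
    return addr & ~(size - 1)


def load(memory: bytes, addr: int, kind: str) -> int:
    """Load a value, silently forcing the address down to its natural boundary."""
    size = ACCESS_SIZES[kind]
    start = effective_address(addr, kind)
    return int.from_bytes(memory[start:start + size], "big")


memory = bytes(range(0x10, 0x20))  # sixteen bytes: 0x10, 0x11, ..., 0x1F

# The same odd address is used differently depending on the access size.
for kind in ("byte", "halfword", "word"):
    print(f"{kind:8s} at 5 -> address {effective_address(5, kind)}, "
          f"value {load(memory, 5, kind):#x}")
```

In this model a word load from address 5, 6 or 7 returns exactly what a word load from address 4 returns, which is the behaviour the rest of the thread argues about.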
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (04/02/89)
In article <oYBLMhy00jaRA1ZlFZ@andrew.cmu.edu> pgc+@andrew.cmu.edu (Paul G. Crumley) writes:
>For halfword accesses (both load and store) the LSB is silently forced
>to a zero and for word accesses the two LSBs are silently forced to be
>zeroes. Load and store instructions have no effect on the condition
>bits and no exceptions are generated when the lower order bit or bits
>are forced to zero.
>
>The RT/PC executes correct programs correctly. I believe the silicon
>saved by not needing the shifters and more complex microcode was put
>to better use in a variety of ways.

You are probably right. However, when the RT/PC detects an incorrect program, it performs actions that the programmer clearly did not intend. In my previous posting, I argued that this leads to bugs which are very hard to deal with.

The amount of silicon required to raise an interrupt is probably LESS THAN the amount required to zero the correct number of address lines. The interrupt would make bugs much easier to find, and the chip would be just as RISCish.
--
Don D.C.Lindsay Carnegie Mellon School of Computer Science
--
mash@mips.COM (John Mashey) (04/02/89)
In article <oYBLMhy00jaRA1ZlFZ@andrew.cmu.edu> pgc+@andrew.cmu.edu (Paul G. Crumley) writes:
....good tutorial on the way the RT/PC works....
>The processor has instructions that load and store bytes, halfwords and words.
>For halfword accesses (both load and store) the LSB is silently forced to a zero
>and for word accesses the two LSBs are silently forced to be zeroes....
...
>"How can one write a correct program on a machine that drops the bits and
>doesn't report any exceptions?" If you have a program that runs correctly on
>some other processor and you have used language supported ways to create new
>objects, this program will run correctly on an RT/PC...
...
>As with all implementations there are trade-offs. The silicon saved in the MMU
>by not supporting arbitrary alignment was put to good use. The RT/PC operates
>more reliably and with higher performance than many similar systems. When good
>compilers are used, the current RT/PCs, models 12X & 13X, perform very well.

Is there any chance of posting some data in support of
1) "more reliably", and
2) "with higher performance"
or at least perhaps specifying what "many similar systems" refers to? or what "performs very well" means?
(I'm sure it wasn't meant this way, but this read like: "It's better to do it this way because it's better than (unspecified) Brand X or Y." This sort of dataoid is hard to evaluate.) Note that at least some other RISCs support extensive use of parity or ECC, for example. Are there published numbers for 12X & 13X?

>..... Though the dropping of
>low order address bits may seem to prevent any hope of producing correct object
>code, we see that such concerns are not a problem. If there are problems with
>the way a program accesses data on the RT/PC, there are problems with that code
>on other machines too.

I think the discussion got inverted. I can't imagine there being any problem running CORRECT, portable code, on an RTPC, that ran correctly on machines that enforced alignment.
But, that isn't the problem, which is in debugging programs. The RT/PC design is the only one I can think of offhand that made this particular choice. Most machines:
	a) Support unaligned data in general.
	OR
	b) Trap on unaligned references

I observe in my experience that a trap caused by an unaligned reference is one of the commonest first indications of error in a program.

For what it's worth, as an example, let me illustrate this with a story from a long time ago. At school, we used to have 360/xx computers, which used approach b). We then got a 370/xxx, which used a). Our computer center actually requested a price from IBM for a mode that would retain the 360 behavior, because they went thru their list of APARs (trouble reports), and discovered that something like 50% of them were first discovered by encountering an alignment error. [The price was too high, unfortunately.]

Anyway, it seems that either a) or b) is viable; at least if you have a), although you lose the debugging assist, some kinds of code are certainly more convenient.

In this particular area, it seems that the RT PC made an unusual tradeoff, very different from the rest of the RISCs, for no reason that is yet obvious. It's truly hard to believe that with 2 whole VLSI chips to implement CPU + MMU, that a few gates couldn't be found to implement the alignment check....so please say more, maybe I've missed something.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
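
To make the debugging argument concrete, here is a rough sketch that feeds one misaligned "pointer" to a word load under the two approaches listed above, a) and b), plus the silent truncation the RT/PC is described as doing. It is a toy model under stated assumptions: the policy names, the `AlignmentTrap` exception and the big-endian byte order are invented for the illustration and do not describe any particular machine.

```python
# Toy comparison of three ways a machine might treat an unaligned word load.
# All names here are hypothetical; big-endian byte order is an assumption.


class AlignmentTrap(Exception):
    """Stands in for the hardware exception raised by a trapping machine."""


def load_word(memory: bytes, addr: int, policy: str) -> int:
    if policy not in ("support", "trap", "truncate"):
        raise ValueError(f"unknown policy: {policy}")
    if addr % 4 != 0:
        if policy == "trap":
            raise AlignmentTrap(f"unaligned word load at {addr:#x}")
        if policy == "truncate":
            addr &= ~3  # silently drop the two low-order address bits
        # "support": fall through and read four bytes from the odd address
    return int.from_bytes(memory[addr:addr + 4], "big")


memory = bytes(range(0x10, 0x20))
bad_pointer = 5  # a buggy, misaligned "pointer to a word"

for policy in ("support", "truncate", "trap"):
    try:
        print(policy, "->", hex(load_word(memory, bad_pointer, policy)))
    except AlignmentTrap as exc:
        print(policy, "->", exc)
```

Only the trapping policy reports the bad pointer at the point of use. The other two return a plausible-looking value, and the truncating one returns data from an address the program never asked for.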
bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) (04/03/89)
One advantage of silently ignoring the low bits in unaligned accesses is that you can use the low order bits as free tag bits -- without runtime overhead when dereferencing the data if you don't care to type-check. This contrasts with the use of alignment traps for run time error checking.

I don't know if the CMU RT Common Lisp actually does this (I'm not a Lisp hacker), but it would be a reasonable thing to do.

-bsy
--
Internet: bsy@cs.cmu.edu Bitnet: bsy%cs.cmu.edu%smtp@interbit
CSnet: bsy%cs.cmu.edu@relay.cs.net Uucp: ...!seismo!cs.cmu.edu!bsy
USPS: Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice: (412) 268-7571
--
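
A minimal sketch of the tag-bit idea, assuming word-aligned objects so that the two low-order address bits are spare. The tag values and all function names are hypothetical, and nothing here is taken from any actual Lisp implementation. The point it illustrates is that a word load which ignores the low bits can use the tagged pointer directly, with no masking step before the dereference.

```python
# Toy model of low-order tag bits in word-aligned pointers.
# Tag values and names are invented for this example.

TAG_BITS = 2                      # a word-aligned address has two spare low bits
TAG_MASK = (1 << TAG_BITS) - 1

TAG_CONS, TAG_SYMBOL, TAG_STRING = 1, 2, 3   # hypothetical type tags


def make_tagged(addr: int, tag: int) -> int:
    """Pack a type tag into the spare low bits of a word-aligned address."""
    if addr & TAG_MASK:
        raise ValueError("address must be word aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the spare bits")
    return addr | tag


def tag_of(pointer: int) -> int:
    """Recover the tag, for code that does want to type-check."""
    return pointer & TAG_MASK


def load_word_truncating(memory: bytes, addr: int) -> int:
    """Word load that ignores the two low address bits (big-endian assumed)."""
    start = addr & ~TAG_MASK
    return int.from_bytes(memory[start:start + 4], "big")


memory = bytes(range(0x10, 0x20))
p = make_tagged(8, TAG_SYMBOL)

# The tagged pointer fetches the same word as the clean address would.
print(tag_of(p), hex(load_word_truncating(memory, p)))
```

On a machine that traps on unaligned references, the same scheme would need an explicit mask (or a compensating displacement) before each dereference.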
jas@ernie.Berkeley.EDU (Jim Shankland) (04/04/89)
The case for faulting on unaligned instructions and data accesses (e.g., MIPS, 68010, et al.) rather than dealing with them correctly in hardware (e.g., VAX) is stronger than the case for silently forcing the 1 or 2 LSB's to 0, as the RT/PC does. Sure, correct programs will run correctly; but correct programs of any respectable size are almost unheard of. Consider the following reductio ad absurdum:

My new processor does not fault on data references beyond the bounds of a process's address space; instead, it silently reads random data. This saved me a hunk of silicon. Correct programs still run correctly; only programs that dereference wild pointers get into trouble, and those are broken anyway.

Jim Shankland
jas@ernie.berkeley.edu

"Blame it on the lies that killed us, blame it on the truth that ran us down"
jhallen@wpi.wpi.edu (Joseph H Allen) (04/04/89)
In article <28664@ucbvax.BERKELEY.EDU> jas@ernie.Berkeley.EDU (Jim Shankland) writes:
>The case for faulting on unaligned instructions and data accesses
>(e.g., MIPS, 68010, et al.) rather than dealing with them correctly
>in hardware (e.g., VAX) is stronger than the case for silently
>forcing the 1 or 2 LSB's to 0, as the RT/PC does. Sure, correct
>programs will run correctly; but correct programs of any respectable
>size are almost unheard of. Consider the following reductio ad
>absurdum:
. . .
>Jim Shankland
>jas@ernie.berkeley.edu
>
>"Blame it on the lies that killed us, blame it on the truth that ran us down"

Also, consider the problem in the other direction: Some programs may work correctly on the RT because the RT silently zeros the low-order address bits. Such a program will not, of course, work on a "non-braindamaged" machine.

On the other hand, the RT's doing this is in sync with the general large-company philosophy of getting you "locked into using our machine forever otherwise you're gonna pay a lot of money for new software."

"That's not a bug; it's a feature!"
sloan@june.cs.washington.edu (Kenneth Sloan) (04/05/89)
Doesn't anyone remember the good old IBM 1130? [now THERE was a personal computer!] As I (dimly) recall, loads(stores) of 16-bit "words" proceeded by loading(storing) the high-order byte from(at) the specified address, and then loading(storing) from(at) the address ORed with 1. I found it a marvelous way to save on the storage of 0 & -1... -Ken Sloan
bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) (04/05/89)
In article <28664@ucbvax.BERKELEY.EDU> jas@ernie.Berkeley.EDU (Jim Shankland) writes:
].... Consider the following reductio ad
]absurdum:
]
]My new processor does not fault on data references beyond the bounds
]of a process's address space; instead, it silently reads random data.
]This saved me a hunk of silicon. Correct programs still run correctly;
]only programs that dereference wild pointers get into trouble, and
]those are broken anyway.
I'm not arguing that the RT is a great machine... but you might as well
use your reductio argument for the BSD Vax implementation -- NULL pointers
are too often dereferenced: correct programs still run correctly; only
programs that dereference NULL get into trouble (when ported to a Sun, for
example), and those are broken anyway....
-bsy
--
firth@sei.cmu.edu (Robert Firth) (04/05/89)
In article <4646@pt.cs.cmu.edu> bsy@PLAY.MACH.CS.CMU.EDU (Flexi-thumbs) writes:
>I'm not arguing that the RT is a great machine... but you might as well
>use your reductio argument for the BSD Vax implementation -- NULL pointers
>are too often dereferenced: correct programs still run correctly; only
>programs that dereference NULL get into trouble (when ported to a Sun, for
>example), and those are broken anyway....

In my opinion, the argument is still valid. Dereferencing a null pointer should cause a trap; if an implementation silently does the wrong thing it is seriously stupid.

When I first started porting to Vax/VMS I found one or two null-pointer dereferences in almost every major systems program. Every case was an accident waiting to happen. An editor had three examples, one of which would certainly have cost the hapless user her entire edit session.

It is so easy to map out page zero of the virtual address space. Why be so stupid for so little gain?
aglew@mcdurb.Urbana.Gould.COM (04/07/89)
>One advantage of silently ignoring the low bits in unaligned accesses is
>that you can use the low order bits as free tag bits -- without runtime
>overhead when dereferencing the data if you don't care to type-check. This
>contrasts with the use of alignment traps for run time error checking.
>
>I don't know if the CMU RT Common Lisp actually does this (I'm not a Lisp
>hacker), but it would be a reasonable thing to do.
>
>-bsy
>--
>Internet: bsy@cs.cmu.edu Bitnet: bsy%cs.cmu.edu%smtp@interbit
>CSnet: bsy%cs.cmu.edu@relay.cs.net Uucp: ...!seismo!cs.cmu.edu!bsy
>USPS: Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
>Voice: (412) 268-7571

Time yet again for expression of one of my favorite architectural "gimmicks": provide a mask that is implicitly ANDed with any {instruction,data} address before use -- this way you could place tag bits in the low bits, in the high bits, or in any bit you wanted (and not have to rely on features like "The 68000 only implements 24 address bits", etc.).

Yes, this is a gimmick...

Andy "Krazy" Glew aglew@urbana.mcd.mot.com uunet!uiucdcs!mcdurb!aglew
Motorola Microcomputer Division, Champaign-Urbana Design Center
1101 E. University, Urbana, Illinois 61801, USA.

My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
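
The mask "gimmick" proposed above can be sketched as a tiny memory model in which every address is ANDed with a programmable mask before use. This is only an illustration: the `MaskedMemory` class, its method name and the 256-byte address space are assumptions, and no real machine is being described.

```python
# Toy model of an address mask that is implicitly ANDed with every address.
# The class and the mask values are hypothetical.


class MaskedMemory:
    def __init__(self, data: bytes, address_mask: int) -> None:
        self.data = data
        self.address_mask = address_mask

    def load_byte(self, addr: int) -> int:
        """Apply the mask first, then index storage with what is left."""
        return self.data[addr & self.address_mask]


data = bytes(range(256))  # data[i] == i, so a load returns the address used

# Mask 0x0FF: only the low eight bits address storage, so any higher bits
# are free for tags (compare the 24-bit 68000 remark in the post).
high_tags = MaskedMemory(data, 0x0FF)

# Mask 0x0FC: the two low-order bits are ignored as well, so tags can live
# there instead, as in the scheme quoted at the top of the post.
low_tags = MaskedMemory(data, 0x0FC)

print(hex(high_tags.load_byte((0xA << 8) | 0x42)))
print(hex(low_tags.load_byte(0x43)))
```

Because the mask is data rather than wiring, software could choose where the tag bits go, which is the flexibility the post is asking for.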
mac@uvacs.cs.Virginia.EDU (Alex Colvin) (04/14/89)
> >that you can use the low order bits as free tag bits -- without runtime

> provide a mask that is implicitly ANDed with any {instruction,data} address
> before use -- this way you could place tag bits in the low bits, in the
> high bits, or in any bit you wanted (and not have to rely on features like

Most instruction addressing modes provide an adder, some also will do limited shifting. As long as we want AND and OR, let's just put in the whole ALU. This is sort of what the WM design does with its two opcodes. Think of one as the addressing mode. You want auto-increment by 17?
aglew@mcdurb.Urbana.Gould.COM (04/22/89)
>> >that you can use the low order bits as free tag bits -- without runtime
>
>> provide a mask that is implicitly ANDed with any {instruction,data} address
>> before use -- this way you could place tag bits in the low bits, in the
>> high bits, or in any bit you wanted (and not have to rely on features like
>
>Most instruction addressing modes provide an adder, some also will do
>limited shifting. As long as we want AND and OR, let's just put in the
>whole ALU. This is sort of what the WM design does with its two opcodes.
>Think of one as the addressing mode. You want auto-increment by 17?

Pass transistors.