patrick@convex.com (Patrick F. McGehearty) (07/18/90)
For machines which are pushing the limits of technology, there are clearly some advantages to not handling off-alignment accesses efficiently in hardware. Let us assume that some mechanism is provided for handling off-alignment accesses (like 16 bit accesses on a byte boundary, or 32 bit accesses on a 16 bit boundary, or whatever), but that letting it be significantly less efficient could reduce the number of gate delays in the primary path to memory for aligned accesses. The question I would be interested in seeing some discussion on is how much penalty is allowable for off-aligned accesses before the performance cost requires reprogramming to avoid the off-alignments from occurring. Trivial example: consider the std libc bcopy which takes two pointers and a count. Most machine specific implementations move the data in units larger than a character at a time. Under what conditions should the implementor of this commonly used library worry about checking the alignment of the pointers before starting the copy?
henry@zoo.toronto.edu (Henry Spencer) (07/19/90)
In article <104037@convex.convex.com> patrick@convex.com (Patrick F. McGehearty) writes: >Trivial example: consider the std libc bcopy which takes two pointers and a >count. Most machine specific implementations move the data in units larger >than a character at time. Under what conditions should the implementor of >this commonly used library worry about checking the alignment of the >pointers before starting the copy? Essentially always, unless the count is very small. Even on machines that handle misalignment, if the alignment on the two areas is compatible, it is better to copy enough initial bytes to align the pointers and then do an aligned copy for the bulk of the data. (Also, a quibble: bcopy may be "std", but the *standard* routine for doing this is memcpy. :-)) -- NFS: all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology and its performance and security too. | henry@zoo.toronto.edu utzoo!henry
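In outline, the scheme Henry describes looks something like the sketch below. It is only a sketch, not any real libc's bcopy or memcpy: it assumes flat pointers that can be cast to unsigned long for the alignment test, does not handle overlapping regions, and uses unsigned long as the transfer unit.

#include <stddef.h>

#define WORD    sizeof(unsigned long)
#define WMASK   (WORD - 1)

void
copy_aligned(char *dst, const char *src, size_t n)
{
    unsigned long *wd;
    const unsigned long *ws;

    /* Only bother when the two alignments are compatible and the
     * count is not very small. */
    if (n >= 2 * WORD &&
        ((unsigned long)dst & WMASK) == ((unsigned long)src & WMASK)) {
        /* Head: a few byte moves bring both pointers onto a word
         * boundary at the same time. */
        while (((unsigned long)dst & WMASK) != 0) {
            *dst++ = *src++;
            n--;
        }
        /* Bulk: aligned word-at-a-time copy. */
        wd = (unsigned long *)dst;
        ws = (const unsigned long *)src;
        while (n >= WORD) {
            *wd++ = *ws++;
            n -= WORD;
        }
        dst = (char *)wd;
        src = (const char *)ws;
    }
    /* Tail, tiny counts, or incompatible alignments: byte copy. */
    while (n-- > 0)
        *dst++ = *src++;
}

The n >= 2 * WORD test is the "unless the count is very small" escape hatch; the exact threshold is a per-machine tuning knob.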
patrick@convex.com (Patrick F. McGehearty) (07/19/90)
In article <1990Jul18.190750.7282@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >In article <104037@convex.convex.com> patrick@convex.com (Patrick F. McGehearty) writes: >>Trivial example: consider the std libc bcopy which takes two pointers and a >>count. ... > >Essentially always, unless the count is very small. Even on machines that >handle misalignment, if the alignment on the two areas is compatible, it >is better to copy enough initial bytes to align the pointers and then do >an aligned copy for the bulk of the data. > >(Also, a quibble: bcopy may be "std", but the *standard* routine for >doing this is memcpy. :-)) I agree with everything henry says above. I should have said "well known" instead of "std" bcopy :-) :-) I also see I did not make the question I wanted answered clear. Let me try again. I was assuming (but should have said, as henry points out) that the source pointer was initially aligned with a partial word move before the block transfer began. What I intended to ask was when is it worth the trouble to load/shift/store or load partials or whatever it takes to avoid off-alignment accesses when the src and dest do not match alignment? But what I really was interested in is the following question: what guidelines would you give an architect about how slow he can get away with making off-aligned accesses before it starts:
(1) causing internal library people to rewrite code to avoid the problem?
(2) causing compiler code generation oddities to avoid the problem?
(3) causing customer application code people to recode assembly libraries to avoid the problem?
(4) causing a general stink in the marketplace because of the problem?
We already have some evidence that an infinite time for off-alignment does cause at least some customers to be unhappy (like Sparc :-) :-)). However, such evidence shows that it is not fatal either. There are a lot of Sparcs out there. Part 2 of the question: does the answer depend on market segment? Are software performance expectations for a workstation different from those for a supercomputer? (I suspect they are.)
dgr@hpfcso.HP.COM (Dave Roberts) (07/21/90)
Okay, with all this talk about off alignment, here's something that nobody has mentioned yet. How do processors that handle off alignment deal with getting a page fault in between the multiple transfers? Couldn't this get really hairy? I mean, consider this:
    $A     : $XDDD  <- Page 1
    $A + 4 : $DXXX  <- Page 2
Where "A" is the last word address on page 1, "D" represents the misaligned word that you want to load, and "X" is don't care. What if page 2 is paged out? When does the CPU notice that it's paged out and what does it do about it? Does it abort the transaction entirely, bring page 2 back in, and restart? If so, what if, because of the page replacement algorithm, page 1 gets blown away by bringing in page 2? If you do bring in the first three bytes and put them into one of the registers, what kind of state does that leave the paging code to deal with, since paging isn't handled by the hardware? This has nasty implications for instruction fetch of variable length instructions also. In this case, what if the misaligned transfer is an instruction fetch? I mean, ick! I've gotta think that all these corner cases have got to add a lot of checking to the CPU design which has to slow it down, not to mention the possibility of having a case that you didn't think of and introducing a bug. I think I'd rather deal with the alignment, have my CPU run fast, and have a greater assurance that it was designed correctly. Dave Roberts dgr@hpfcla.fc.hp.com
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/21/90)
In article <1990Jul18.190750.7282@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: In article <104037@convex.convex.com> patrick@convex.com (Patrick F. McGehearty) writes: >Trivial example: consider the std libc bcopy which takes two pointers and a >count. Most machine specific implementations move the data in units larger >than a character at a time. Under what conditions should the implementor of >this commonly used library worry about checking the alignment of the >pointers before starting the copy? Essentially always, unless the count is very small. Even on machines that handle misalignment, if the alignment on the two areas is compatible, it is better to copy enough initial bytes to align the pointers and then do an aligned copy for the bulk of the data. Doing aligned moves of aligned blocks of storage is a win on most machines, as Spencer says. Not only should a memory copy routine detect and exploit the (hopefully fairly common) case where the source and destination are already naturally aligned, it should also, on machines that make it easy, try to artificially align the bulk of the copy operation. One problem is that when destination and source are aligned differently you have to choose whether to align the copy w.r.t. the source or the destination. It is best to align the destination, especially on write thru cache machines, and sometimes by a spectacular margin. Example: if we have to copy 73 bytes from address 102 to address 245, we should (assuming 4 bytes is the optimal block copy word size):
    split the 73 bytes in three segments, of 3, 17x4=68, 2 bytes.
    copy 3 bytes from address 102 to 245
    copy 17 words from address 105 to address 248
    copy 2 bytes from address 173 to address 316
Note that the word by word copy has a source that is not word aligned, but the destination is. Many machines can cope with unaligned fetches fairly well, but unaligned stores are usually catastrophic. My usual example is the VAX-11/780, which had an 8 byte buffer between the CPU and the system bus leading to memory, and write thru. Each byte written could cause an 8 byte read from memory, and an 8 byte write back to memory, ... As Spencer and myself have already remarked, this means that a suitable sw memory copy operation can easily beat hardware memory copies, for suitably large copy sizes, and by a large margin. Yet another reason for having simple CPUs and avoiding microprograms (if you can afford the instruction fetch bandwidth, or use compact instruction encodings, e.g. a stack architecture). -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
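A sketch of the destination-aligned case Piercarlo describes, again not a tuned routine: the source bytes are gathered one at a time through a union so no byte-order assumption is needed, while every store is a full, aligned word. A real implementation on a machine with cheap unaligned fetches, or one using explicit load/shift/merge of aligned source words, would fetch wider than a byte.

#include <stddef.h>

#define WORD    sizeof(unsigned long)
#define WMASK   (WORD - 1)

void
copy_dst_aligned(char *dst, const char *src, size_t n)
{
    union {
        unsigned long w;
        char b[WORD];
    } t;
    unsigned long *wd;
    size_t i;

    /* Head: byte moves until the *destination* is word aligned
     * (102 -> 245 becomes 105 -> 248 in the example above). */
    while (n > 0 && ((unsigned long)dst & WMASK) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Bulk: aligned word stores; the source side stays misaligned
     * and is gathered a byte at a time. */
    wd = (unsigned long *)dst;
    while (n >= WORD) {
        for (i = 0; i < WORD; i++)
            t.b[i] = *src++;
        *wd++ = t.w;
        n -= WORD;
    }
    /* Tail: whatever bytes are left. */
    dst = (char *)wd;
    while (n-- > 0)
        *dst++ = *src++;
}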
henry@zoo.toronto.edu (Henry Spencer) (07/22/90)
In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: >How do processors that handle off alignment deal with getting a page >fault in between the multiple transfers? Couldn't this get really >hairy? ... In a word, yes. This is one of the major reasons why essentially all RISC processors insist on "natural" alignments: so that no operand can cross a page boundary. If you want *real* fun, consider that unaligned operands can overlap. Think about the implications of overlapping operands that span a page boundary between a normal page and a paged-out read-only page in a machine with two levels of virtual-address cache and a deep pipeline... with an exception-handling architecture that was nailed down in detail on much slower and simpler implementations, and can't be changed. This is the sort of problem that makes chip designers quietly start sending their resumes to RISC manufacturers... :-) -- NFS: all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology and its performance and security too. | henry@zoo.toronto.edu utzoo!henry
mo@messy.bellcore.com (Michael O'Dell) (07/22/90)
Once again Henry demonstrates that his powers of subtle understatement are exceeded only by his occasional correctness. Regrettably, simply choosing to build a RISC doesn't obviate all those awful problems. In machines of any flavor (RISC or CISC) where the back end which detects the page faults can't talk to the front end which needs to stop fetching the wrong instructions because, say, the speed of light on real circuit boards is so pokey, traps are nightmares of astounding proportions. Interrupts aren't as bad because they are not required to be particularly synchronized with the instruction stream. But those atomic, synchronous traps are genuine gut busters in both hardware and software. -Mike
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/23/90)
In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: | What if page 2 is paged out? When does the CPU notice that it's paged | out and what does it do about it? Does it abort the transaction | entirely, bring page 2 back in, and restart? If so, what if, because | of the page replacement algorithm, page 1 gets blown away by bringing | in page 2? You simply restart the instruction. Using an LRU scheme the 1st page would not get paged out unless the physical mapping was 1 page/process. As a reasonable constraint you want 8 pages/process anyway, so you avoid this. Note 1: yes "simply" is a relative term in this case, relative to doing half the instruction and then handling the fault. Note 2: pages for code, stack, source of copy instruction, dest of copy instruction. If you allow unaligned access you need two pages in each area to handle access over a boundary, that totals eight. No, I wouldn't want to actually run a system with that little memory. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Stupidity, like virtue, is its own reward" -me
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/23/90)
In article <1990Jul22.001925.8979@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: If you want *real* fun, consider that unaligned operands can overlap. Think about the implications of overlapping operands that span a page boundary between a normal page and a paged-out read-only page in a machine with two levels of virtual-address cache and a deep pipeline... with an exception-handling architecture that was nailed down in detail on much slower and simpler implementations, and can't be changed. This is the sort of problem that makes chip designers quietly start sending their resumes to RISC manufacturers... :-) For the interested, the MU5 supercomputer from Manchester was virtually like this (except that they were designing it from scratch). They solved the problem by strict design discipline, as you can find in "The MU5 computer system", Ibbett & Morris (MacMillan). Basically they found out that the only sensible way out is to have restartable, not continuable, instructions. Saving processor state on a fault is a very bad idea if it is complicated. You substitute for this the easier problem of idempotency. -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
dgr@hpfcso.HP.COM (Dave Roberts) (07/24/90)
Henry Spencer writes: > In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: > >How do processors that handle off alignment deal with getting a page > >fault in between the multiple transfers? Couldn't this get really > >hairy? ... > > In a word, yes. This is one of the major reasons why essentially all RISC > processors insist on "natural" alignments: so that no operand can cross > a page boundary. > > If you want *real* fun, consider that unaligned operands can overlap. > Think about the implications of overlapping operands that span a page > boundary between a normal page and a paged-out read-only page in a > machine with two levels of virtual-address cache and a deep pipeline... > with an exception-handling architecture that was nailed down in detail > on much slower and simpler implementations, and can't be changed. This > is the sort of problem that makes chip designers quietly start sending > their resumes to RISC manufacturers... :-) Sorry guys. I didn't mean to sound so dumb when I wrote that. :-) I work with RISC products at a board level, so I already knew the answer was "yes", I was just looking for a somewhat quantitative explanation of what a designer has to go through to make this work and what techniques were used. How much of a slow down does this cause with the processor? The original argument was that not supporting misalignment was a misfeature (or bug in the original author's vocabulary). Anyway, I was just wondering how much faster you could go with a given design if you didn't have to support this stuff. What I was trying to convey was the idea that it isn't as easy as just supporting multiple accesses to memory but that the problem actually can go all the way down to the memory management and exception handling level, and that stuff seems to be the most gut wrenching and error prone level of a design. It also tends to impact performance pretty heavily the more complicated you make it. Anyway, I was just wondering if there was any CISC designer out there that could say something like, "Yea, well, we could have made that 25 MHz part run at 40 MHz if we hadn't had to support all the misalignment gunk." Some people will still want to have the misalignment, but it's nice to see what price you're paying to get it so that you can evaluate whether you really need it. Dave Roberts Hewlett-Packard Co. dgr@hpfcla.fc.hp.com
dgr@hpfcso.HP.COM (Dave Roberts) (07/25/90)
Bill writes: > In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: > > | What if page 2 is paged out? When does the CPU notice that it's paged | out and what does it do about it? Does it abort the transaction | entirely, bring page 2 back in, and restart? If so, what if, because | of the page replacement algorithm, page 1 gets blown away by bringing | in page 2? > > You simply restart the instruction. Using an LRU scheme the 1st page > would not get paged out unless the physical mapping was 1 page/process. > As a reasonable constraint you want 8 pages/process anyway, so you avoid > this. > > Note 1: yes "simply" is a relative term in this case, relative to doing > half the instruction and then handling the fault. > > Note 2: pages for code, stack, source of copy instruction, dest of > copy instruction. If you allow unalligned access you need two pages in > each area to handle access over a boundary, that totals eight. No, I > wouldn't want to actually run a system with that little memory. > > -- > bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) > "Stupidity, like virtue, is its own reward" -me > ---------- Yea, I agree with you if LRU is the strategy, but how do you know what page replacement strategy is being used? This isn't usually something that the hardware specifies. It is usually (read always :-) left to the O/S to implement. The O/S could implement a FIFO strategy and page 1 could have been the first page in. I'm not disagreeing, just raising some questions. I understand that the problem can be solved and many of the techniques for solving it. What I'm interested in is: how much you pay for the solution? How much time is spent trying to solve it? How much does it cost in terms of overall performance? How often is it used? If it's more expensive than doing aligned transactions even for processors that support it, do users tend to try and make all their transactions aligned? If so, why have it and thereby slow down everything? Why not leave software to deal with the case when a user has to do unaligned transactions? Do users really want to pay the price all the time for this support or would they rather take a big hit every so often? Obviously some will and some won't but that's the case for any architectural decision. These are the questions I'm really trying to answer. Dave Roberts dgr@hpfcla.fc.hp.com
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/25/90)
Please note that I have trimmed the quotes in the previous posting heavily, I'm not trying to distort the meaning of the original poster, just keep the size reasonable. In article <8840016@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: | Yea, I agree with you if LRU is the strategy, but how do you know what | page replacement strategy is being used? This isn't usually something | that the hardware specifies. A fair question. Just as the hardware spec doesn't say the compilers have to align things on word boundaries, if the hardware is such that certain software practices are needed, then they *are* implicitly given in the spec. | It is usually (read always :-) left to | the O/S to implement. The O/S could implement a FIFO strategy and | page 1 could have been the first page in. In which case the restart would cause a page fault, the first page would come back in, and the second restart would complete. | | I understand that the problem can be solved and many of the techniques | for solving it. What I'm interested in is: how much you pay for the | solution? How much time is spent trying to solve it? How much does | it cost in terms of overall performance? How often is it used? If | it's more expensive than doing aligned transactions even for | processors that support it, do users tend to try and make all their | transactions aligned? If so, why have it and thereby slow down | everything? Why not leave software to deal with the case when a user | has to do unaligned transactions? If you believe that when writing systems programs you will sometimes have to access data which is not aligned, and I do, then the question is only whether it should be done in hardware or software. This arbitrary data can come from another machine (not always even a computer), or be packed to keep volume down. If it is being done in software the source code has to contain a check for misalignment, which in turn means that the format of a pointer *on that machine* must be known, as well as the alignment requirements. Bad and non-portable. Or, you can simply access every data item larger than a byte using the "fetch a byte and shift" method. This requires that the byte order of the data, rather than the machine, be known. I think that's probably the only portable way. Alternatively the hardware can support unaligned fetch. It doesn't have to be efficient, because you would have to make an effort to make the fetch logic slower than software, it just has to work. This makes the program a bit smaller, and assuming that the chip logic is right, it prevents everyone from implementing their own try at access code. If the hardware could produce a clear trap for unaligned access (not the general bus fault, etc) the o/s could do software emulation. From the user's view that would look like a hardware solution. This is like emulating f.p. instructions in the o/s when the FPU is not present, and does not represent a major change in o/s technology. Note that this is not a RISC issue, in that the bus interface unit already may be doing things like cache interface, multiplexing lines, controlling status lines, etc. The BIU is not really RISC in that sense, it functions like a coprocessor if you draw a logic diagram, whose function is to provide data, which can go in the pipeline or into the CPU. | Do users really want to pay the | price all the time for this support or would they rather take a big | hit every so often? Obviously some will and some won't but that's | the case for any architectural decision. These are the questions I'm | really trying to answer.
You assume that there is a price all the time, and I'm pretty well convinced that the BIU in processors which have this capability, such as the 80486, doesn't have a greater latency for aligned access than the equivalent unit in SPARC or 88000. Not having the proprietary on chip timing I can't be totally certain, obviously. I think the real question is "should unaligned access be provided outside the user program?" I think the answer is yes. Obviously it can be done better in hardware, but if a chip is so tight on gates that it can't be without compromising performance elsewhere, then just a separate trap for quick identification of the problem by the o/s would be a reasonable alternative. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Stupidity, like virtue, is its own reward" -me
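For what it's worth, the "fetch a byte and shift" method Bill describes above is only a few lines of portable C; only the byte order of the data (here assumed big-endian, as in a typical network or file format) has to be known, not anything about the machine or its alignment rules. A sketch:

unsigned long
get_be32(const unsigned char *p)        /* p may have any alignment */
{
    return ((unsigned long)p[0] << 24) |
           ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8)  |
            (unsigned long)p[3];
}

void
put_be32(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}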
davec@nucleus.amd.com (Dave Christie) (07/26/90)
In article <2370@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > > Alternatively the hardware can support unaligned fetch. It doesn't >have to be efficient, because you would have to make an effort to make >the fetch logic slower than software, it just has to work. This makes >the program a bit smaller, and assuming that the chip logic is right, it >prevents everyone from implementing their own try at access code. . . > Note that this is not a RISC issue, in that the bus interface unit >already may be doing things like cache interface, multiplexing lines, >controlling status lines, etc. The BIU is not really RISC in that sense, >it functions like a coprocessor if you draw a logic diagram, whose >function is to provide data, which can go in the pipeline or into the >CPU. Note that there are two degrees of misalignment:
 1) within a word, and
 2) crossing a word (& possible page) boundary.
For 1): If the realignment hardware is not in your main fetch path because it would impact your cycle time, then it will likely mean an extra stage of processing for instructions which use it, which can add various bits of complexity. Considering that, plus the facts that 1) a 4-way mux isn't a serious time sink, and 2) how much, or even whether, it influences the cycle time is technology and implementation dependent, you are likely just going to stick it in the main fetch path and do it efficiently, w.r.t. layout, etc. Now, if the end user does pay for this, it isn't likely going to be in performance: even though it might influence the cycle time in principle, in practice it won't. Chips come in "standard" operating frequencies these days (e.g. 16, 20, 25, 30, 40, 50); the difference that a 4-way mux might make would tend to be taken care of by the process tweaking that's done to get to the desired frequency. In this case, the realignment hardware influences yield rather than cycle time, hence cost rather than performance. I can't think of any processor that doesn't support this degree of realignment (some better than others).
For 2): This, IMHO, is one of the more significant things that differentiates "RISC" from "CISC". The notion of one instruction making multiple references to memory tends to make RISC designers get red in the face and jump up and down. (Yes, I'm well aware of the 29K's load and store multiple instructions, and while I'm not fond of them, there are some significant differences between that and handling unaligned accesses.) The extra control complexity this introduces is a significant increment, especially considering all the nightmarish end cases that have already been described in this thread. The added complexity is dependent on architecture and implementation, and tends to be worse for stores, but at any rate it tends to increase design/debug time, and more importantly can cause much hair pulling and resume writing when one attempts really high performance implementations. (I've known people who thrive on such complexity, for complexity's sake - they should be removed from the gene pool. 0.5 :-) With the real estate one has to play with these days, you can find room for the complexity to keep the performance up, but it still influences the cost (and the number of errata after release). I don't know of any "new" architecture chips with decent performance that support realignment across words in one instruction. Why do the common CISC chips support it?
 1) it's not as big an increment in complexity (no smiley)
 2) backwards compatibility (i.e. they have no choice)
In summary, the cost you will tend to see will be $ more than performance, although at the high end of the performance spectrum you might pay in performance as well - that's hard to say, since processors which support word-crossing accesses tend to have a lot of other complexities which influence cost/performance as well. What makes sense depends on the intended applications, of course. It may indeed make some network software run significantly faster, for instance. But if that network software consumed 5% of all the cycles of all the processors I had sold, and such hardware support would *double* the n/w sfw performance, I still wouldn't risk screwing up everything else to go for an aggregate 2.5% performance improvement. ---------------------------- Dave Christie My humble opinions only.
baum@Apple.COM (Allen J. Baum) (07/26/90)
[] >In article <1990Jul25.223437.15301@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes: >Note that there are two degrees of misalignment: > 1) within a word, and > 2) crossing a word (& possible page) boundary. > >For 1): >If the realignment hardware is not in your main fetch path because it >would impact your cycle time, then it will likely mean an extra stage >of processing for instructions which use it, which can add various bits >of complexity. Considering that, plus > 1) a 4-way mux isn't a serious time sink, and > 2) how much, or even whether, it influences the cycle time is > technology and implementation dependent ... >"standard" operating frequencies these days (e.g. 16,20,25,30,40,50); >The difference that a 4-way mux might make would tend to be taken care I've wondered about that. Getting naturally aligned operands requires a 4:1 mux on the low byte, a 2:1 on the next byte, and nothing on the upper bytes. To get all alignments requires a 4:1 mux on all bytes. It merely makes all bytes be just as bad as the low byte. In theory. In practice, loading and wire delays probably have as much impact as logic. -- baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
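Baum's byte-lane counting can be checked with a few lines of C. The toy program below models no particular machine (little-endian byte numbering is chosen purely for concreteness); it enumerates, for each byte of a 32-bit load result, how many of the four memory byte lanes (address mod 4) can feed it. With natural alignment only, the answers are 4, 2, 1, 1 - a 4:1 mux, a 2:1 mux, and two wires; allowing every alignment makes all four bytes need a 4:1 mux.

#include <stdio.h>

int
main(void)
{
    static const int width[] = { 1, 2, 4 };
    int natural, byte, w, a, l, lanes, count;

    for (natural = 1; natural >= 0; natural--) {
        printf("%s alignment:\n", natural ? "natural" : "arbitrary");
        for (byte = 0; byte < 4; byte++) {      /* result byte position */
            lanes = 0;              /* bitmask of memory lanes that can feed it */
            for (w = 0; w < 3; w++)
                for (a = 0; a < 4; a++) {       /* operand offset in the word */
                    if (natural && a % width[w] != 0)
                        continue;               /* skip misaligned operands */
                    if (byte < width[w])
                        lanes |= 1 << ((a + byte) & 3);
                }
            for (count = 0, l = 0; l < 4; l++)
                if (lanes & (1 << l))
                    count++;
            printf("  result byte %d: fed from %d lane(s)\n", byte, count);
        }
    }
    return 0;
}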
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/27/90)
No one has mentioned the solution used by the MIPS R3000. It takes very little hardware, runs as fast as a middling amount of hardware, and can be easily emitted by the compiler whenever it doesn't know a datum's alignment. Quite simply, they have two "partial load" instructions: between them they constitute an unaligned load. If unaligned data is untypical, this sounds like an ideal compromise. Besides, maybe Herman Rubin can use it in his assembler programs ... -- Don D.C.Lindsay
richard@aiai.ed.ac.uk (Richard Tobin) (07/27/90)
In article <10018@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: >No one has mentioned the solution used by the MIPS R3000. >Quite simply, they have two "partial load" instructions: between them >they constitute an unaligned load. I hate to sound like Herman Rubin, but it would be nice if they provided a way to access these instructions from C. "#pragma misaligned" perhaps. -- Richard -- Richard Tobin, JANET: R.Tobin@uk.ac.ed AI Applications Institute, ARPA: R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk Edinburgh University. UUCP: ...!ukc!ed.ac.uk!R.Tobin
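One largely portable way to tell a C compiler "this pointer may be misaligned" is to go through a byte copy into an aligned temporary; a compiler targeting the R3000 is then free, at least in principle, to open-code the copy as the partial-load pair. Whether any particular compiler of the day actually does so is not claimed here; this sketch (which assumes a 32-bit unsigned long) just shows the source-level idiom:

#include <string.h>

unsigned long
fetch32_any_alignment(const void *p)
{
    unsigned long v;

    memcpy(&v, p, sizeof v);    /* byte copy into an aligned temporary */
    return v;                   /* result is in host byte order */
}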
chris@mimsy.umd.edu (Chris Torek) (08/06/90)
In article <2357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes (in response to `what if the word being loaded spans a page boundary and the second page is invalid'): >... restart the instruction. Using an LRU scheme the 1st page >would not get paged out unless the physical mapping was 1 page/process. >... pages for code, stack, source of copy instruction, dest of >copy instruction. If you allow unaligned access you need two pages in >each area to handle access over a boundary, that totals eight. No, I >wouldn't want to actually run a system with that little memory. Actually, it is (or can be) even worse than that. Consider the following VAX gem:
    addl3   *0(r1),*0(r2),*0(r3)
Assume that the instruction itself (which is 7 bytes long) crosses a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff respectively, that the longword at 0x1ff..0x203 contains 0x7ff, that the longword at 0x3ff..0x403 contains 0x9ff, and that the longword at 0x5ff..0x603 contains 0xbff. Then we need:
    2 pages for the instruction
    2 pages for 0(r1)  (0x1ff..0x203)
    2 pages for 0(r2)  (0x3ff..0x403)
    2 pages for 0(r3)  (0x5ff..0x603)
    2 pages for *1ff   (0x7ff..0x803)
    2 pages for *3ff   (0x9ff..0xa03)
    2 pages for *5ff   (0xbff..0xc03)
-- total 14 pages for one `simple' `addl3' instruction. (Imagine the fun with a six-argument instruction like `index'!) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris (New campus phone system, active sometime soon: +1 301 405 2750)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/06/90)
In article <25900@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: | Actually, it is (or can be) even worse than that. Consider the | following VAX gem: | | addl3 *0(r1),*0(r2),*0(r3) | | Assume that the instruction itself (which is 7 bytes long) crosses | a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff | [ details ] | total 14 pages for one `simple' `addl3' instruction. I suspect that you could actually get something like this in actual programs. It happens on machines which don't force the instructions to be an even size. This is probably a worst worst case, but I could believe that it would happen. There seem to be enough advantages and penalties to unaligned access to prevent a definitive good or bad decision. Certainly it will take a few gates on the chip, but probably will not slow aligned access. It will be a lot faster than doing the same thing in software, but it's not a common thing to do. There doesn't seem to be a portable way to decide if an address is aligned or not, since pointer formats vary, so a software solution has to assume all accesses are unaligned. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Stupidity, like virtue, is its own reward" -me
gideony@microsoft.UUCP (Gideon YUVAL) (08/07/90)
In article <25900@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: >In article <2357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM >(Wm E Davidsen Jr) writes: ... > Consider the >following VAX gem: > addl3 *0(r1),*0(r2),*0(r3) ... > total 14 pages for one `simple' `addl3' instruction. >(Imagine the fun with a six-argument instruction like `index'!) I'm told this is why the VAX has a 512-byte page -- a 128KB box was planned (but never made), and in the worst case (an instruction straddling 40 pages) it would have deadlocked with <40 pages available. The page size thus had to be less than one-fortieth of the page list of a bare-minimum (1977!) system. -- Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)
mcgrath@homer.Berkeley.EDU (Roland McGrath) (08/08/90)
On machines that "don't handle misaligned accesses", what do they do when one happens anyway? Assuming they don't just go up in a puff of blue smoke (which would be alright with me, as long as the color of smoke is documented), or cease functioning and need to be power-cycled (which would be alright too, if documented), the least useful thing I can think of for them to do (within the bounds of reason) is to do the access wrong (which seems likely since if they want the low-order two bits of the address to always be zero, they might well not pay attention to them). Is this what is done? The second least useful thing I can think of is to cause a hardware trap to software, which would presumably be horrendously slow. I would be completely happy with that. Write a software trap handler to deal with it, and write your compiler such that it never happens if avoidable. -- Roland McGrath Free Software Foundation, Inc. roland@ai.mit.edu, uunet!ai.mit.edu!roland
jkenton@pinocchio.encore.com (Jeff Kenton) (08/08/90)
From article <MCGRATH.90Aug7220232@homer.Berkeley.EDU>, by mcgrath@homer.Berkeley.EDU (Roland McGrath): > On machines that "don't handle misaligned accesses", what do they do when one > happens anyway? On the Motorola 88000 a misaligned access causes a trap into the kernel. There is a bit in the status word which can override this, in which case the CPU assumes zero for the appropriate number of low order bits. Kernels can either signal the offending process on a fault, or complete the offending access and continue. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - jeff kenton --- temporarily at jkenton@pinocchio.encore.com --- always at (617) 894-4508 --- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/08/90)
In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes: | documented), the least useful thing I can think of for them to do (within the | bounds of reason) is to do the access wrong (which seems likely since if they | want the low-order two bits of the address to always be zero, they might well | not pay attention to them). Is this what is done? On some machines, yes. On others a trap results, allowing the kernel to complete the access in software if desired. I believe that the 88K has a flag to trap or just zero the low bits of the address; I know the Honeywell DPS series zeroed the low bit of the address on a doubleword access. This resulted in many amusing "learning experiences." -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Stupidity, like virtue, is its own reward" -me
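What "complete the access in software" comes down to, in outline, is something like the fragment below. Every name in it (trapframe, decode_load, fetch_user_byte) is hypothetical rather than any real kernel's interface, a little-endian target with fixed 4-byte instructions is assumed, and only the load case is shown; it is meant only to suggest why the fixup is straightforward once the hardware delivers a clean misalignment trap.

/* Hypothetical kernel fixup for a misaligned 32-bit load trap. */

struct trapframe {                      /* hypothetical saved user state */
    unsigned long regs[32];
    unsigned long pc;
};

/* Hypothetical helpers, each returning 0 on success:
 * decode the faulting load; fetch one byte from user space. */
extern int decode_load(unsigned long insn, int *rd, unsigned long *ea);
extern int fetch_user_byte(unsigned long ea, unsigned char *bp);

int
fixup_misaligned_load(struct trapframe *tf, unsigned long insn)
{
    int rd, i;
    unsigned long ea, val;
    unsigned char b;

    if (decode_load(insn, &rd, &ea) != 0)
        return -1;                      /* not a load we can emulate */
    val = 0;
    for (i = 0; i < 4; i++) {
        /* Each byte access is aligned, so any page fault is taken
         * (and serviced) one byte at a time. */
        if (fetch_user_byte(ea + i, &b) != 0)
            return -1;                  /* genuine access error: signal the process */
        val |= (unsigned long)b << (8 * i);     /* little-endian assumed */
    }
    tf->regs[rd] = val;                 /* deposit the result */
    tf->pc += 4;                        /* step past the emulated instruction */
    return 0;
}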
jputnam@raptor.Eng.Sun.COM (James M. Putnam) (08/08/90)
Roland McGrath (in <MCGRATH.90Aug7220232@homer.Berkeley.EDU>, I think) writes: >On machines that "don't handle misaligned accesses", what do they do when one >happens anyway? Assuming they don't just go up in a puff of blue smoke (which >would be alright with me, as long as the color of smoke is documented), or >cease functioning and need to be power-cycled (which would be alright too, if >documented), the least useful thing I can think of for them to do (within the >bounds of reason) is to do the access wrong (which seems likely since if they >want the low-order two bits of the address to always be zero, they might well >not pay attention to them). Is this what is done? The VAX does this for quad word loads. VAX LISP takes advantage of this feature to get two free tag bits on LISP pointers that don't need to be masked before being dereferenced. Seems useful to me. >The second least useful thing I can think of is to cause a hardware trap to >software, which would presumably be horrendously slow. I would be completely >happy with that. I'd like this behavior to be defined by the user process. Perhaps as a signal? Seems similar to SIGBUS. Anyway, set a bit someplace so you trap or not, based on what kind of process you are. You probably always want to take the trap in system mode, and panic (at least in UN*X) since it indicates a bug of some kind. >Write a software trap handler to deal with it, and write your compiler such >that it never happens if avoidable. If your compiler is doing the right thing, the only time you'll take an interrupt is if your code is buggy. This sort of thing can be caught statically by lint, but there may be some situations in which dynamic behaviour can produce the same event, like in programs that generate their own code. Interesting idea. Jim
cprice@mips.COM (Charlie Price) (08/09/90)
In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes: >On machines that "don't handle misaligned accesses", what do they do when one >happens anyway? MIPS processors generate a trap with an Address Error Exception. In RISC/os, this is turned into a SIGBUS signal. -- Charlie Price cprice@mips.mips.com (408) 720-1700 MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086
henry@zoo.toronto.edu (Henry Spencer) (08/09/90)
In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes: >On machines that "don't handle misaligned accesses", what do they do when one >happens anyway?... Mostly they trap. There are probably one or two design groups that were foolish enough to just ignore the low bits. -- The 486 is to a modern CPU as a Jules | Henry Spencer at U of Toronto Zoology Verne reprint is to a modern SF novel. | henry@zoo.toronto.edu utzoo!henry
mash@mips.COM (John Mashey) (08/09/90)
In article <12409@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes: > >On the Motorola 88000 a misaligned access causes a trap into the kernel. There >is a bit in the status word which can override this, in which case the CPU >assumes zero for the appropriate number of low order bits. Kernels can either >signal the offending process on a fault, or complete the offending access and >continue. Just out of curiosity, can anyone give some live examples where software takes advantage of the mode where the CPU just zeroes the low-order bits and continues, as in the 88K? (or, I think(?), in the RT/PC). -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jkenton@pinocchio.encore.com (Jeff Kenton) (08/09/90)
From article <40711@mips.mips.COM>, by mash@mips.COM (John Mashey): > > Just out of curiosity, can anyone give some live examples where software > takes advantage of the mode where the CPU just zeroes the low-order > bits and continues, as in the 88K? (or, I think(?), in the RT/PC). The only case I've seen is a low level, PROM based, debugger on the 88k which runs in this mode. Presumably, this is done to save error checking or trap handling when the user types a misaligned address. I don't see that this gains much, and it runs the risk of masking real bugs in the program itself. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - jeff kenton --- temporarily at jkenton@pinocchio.encore.com --- always at (617) 894-4508 --- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/09/90)
In article <40711@mips.mips.COM> mash@mips.COM (John Mashey) writes: >Just out of curiosity, can anyone give some live examples where software >takes advantage of the mode where the CPU just zeroes the low-order >bits and continues, as in the 88K? (or, I think(?), in the RT/PC). The optimizing compilers I used to work on did all of their dynamic memory management through IDL (Interface Description Language). Our IDL implementation gave us nice things - garbage collection, debug support, and the ability to move any rooted data structure to/from a file. IDL objects were tagged, and tags contained a two-bit code which was stored in the low end of a pointer. The IDL runtimes had to pack and unpack them. (Guy Steele determined which code was commonest, and represented it as 00.) Would we have used the hardware mode? No. The IDL runtimes initially consumed about half the cycles of a big compile, but I fixed that in the conventional way (amortization, inlining, caching, etc). After the fix, our cycles were elsewhere, and a special hardware mode would have made no difference. Plus, I would not want to turn off the hardware checks while the rest of the code was running. If this meant constantly turning a mode on and off, then forget it. -- Don D.C.Lindsay
news@ism780c.isc.com (News system) (08/10/90)
In article <1990Aug8.212255.3555@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes: >>On machines that "don't handle misaligned accesses", what do they do when one >>happens anyway?... > >Mostly they trap. There are probably one or two design groups that were >foolish enough to just ignore the low bits. There was at least one design group (an SEL machine) that noticed that they could encode the operand width using the low order address bits. In this way they were able to save a bit in the instruction, thus providing an additional address bit. Of course, there was no such concept as a misaligned access using this scheme. Marv Rubinstein
aglew@dual.crhc.uiuc.edu (Andy Glew) (08/10/90)
>There was at least one design group (an SEL machine) that noticed that they >could encode the operand width using the low order address bits. In this >way they were able to save a bit in the instruction, thus providing an >additional address bit. Of course, there was no such concept as a misaligned >access using this scheme. > > Marv Rubinstein SEL became Gould CSD, and then... The last Gould machines used the low order bits of the offset field (which was present on all memory access instructions) as part of the width encoding. They did not, however, use the low order bits of the registers, and they did produce misaligned traps if the low order bits of the final address, ignoring the low order bits of the offset literal, were incorrect. This meant that you could not have an odd address in a register and add 1 to it via the addressing mode to make a correctly aligned even address. It never caused me any problems - it even found a few bugs. -- Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
davecb@yunexus.YorkU.CA (David Collier-Brown) (08/10/90)
In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes: | On machines that "don't handle misaligned accesses", what do they do when one | happens anyway?... henry@zoo.toronto.edu (Henry Spencer) writes: | Mostly they trap. There are probably one or two design groups that were | foolish enough to just ignore the low bits. Ah yes, many older mainframes... I remember DPS-8s with affection, but not for all the decisions they made. A doubleword register load from an odd address used the floor(address), causing an optimization that used double loads to load the wrong two variables. **MY** that was hard to spot. --dave -- David Collier-Brown, | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or 72 Abitibi Ave., | {toronto area...}lethe!dave Willowdale, Ontario, | "And the next 8 man-months came up like CANADA. 416-223-8968 | thunder across the bay" --david kipling
steves@sv001.SanDiego.NCR.COM (Steve Schlesinger x3711) (08/11/90)
From article <40711@mips.mips.COM>, by mash@mips.COM (John Mashey): > > Just out of curiosity, can anyone give some live examples where software > takes advantage of the mode where the CPU just zeroes the low-order > bits and continues, as in the 88K? (or, I think(?), in the RT/PC). > When emulating other (older) architectures that permit non-aligned accesses, there are two basic choices with architectures of this type:
1. examine the nonaligned memory access and read ALIGNED bytes, ALIGNED halfwords and/or ALIGNED words as appropriate and shift and mask them together.
2. do two NON-ALIGNED reads of the two memory words the nonaligned word straddles (non-aligned-addr and non-aligned-addr+4) and then shift and mask. The low order bits are ignored by the read, but are used by the masking and shifting code.
My memory of this is fuzzy, but in most cases, (2) is more efficient. =============================================================================== Steve Schlesinger, NCR/Teradata Joint Development 619-597-3711 11010 Torreyana Rd, San Diego, CA 92121 steve.schlesinger@sandiego.ncr.com ===============================================================================
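Choice (2) can be sketched in C by masking the address explicitly, which is what a machine that silently clears the low address bits does for you: the reads at addr and addr+4 return the two aligned words that straddle the datum, and the discarded low bits drive the shift-and-mask merge. The sketch assumes 32-bit longs, a flat byte-addressed space, and a little-endian word layout; swap the two shifts for big-endian.

unsigned long
load_unaligned32(unsigned long addr)    /* addr treated as a flat byte address */
{
    unsigned long lo, hi, s;

    lo = *(unsigned long *)(addr & ~3UL);       /* aligned word holding addr */
    hi = *(unsigned long *)((addr + 4) & ~3UL); /* the next aligned word     */
    s = (addr & 3) * 8;                         /* displacement, in bits     */
    if (s == 0)
        return lo;                              /* it was aligned after all  */
    return ((lo >> s) | (hi << (32 - s))) & 0xffffffffUL;
}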
jon@hitachi.uucp (Jon Ryshpan) (08/12/90)
jkenton@pinocchio.encore.com (Jeff Kenton) writes: >mash@mips.COM (John Mashey) writes: >> Just out of curiosity, can anyone give some live examples where software >> takes advantage of the mode where the CPU just zeroes the low-oorder >> bits and conitnues, as in the 88K? (or, I think(?), in the RT/PC). >The only case I've seen is a low level, PROM based, debugger on the 88k which >runs in this mode. Presumably, this is done to save error checking or trap >handling when the user types a misaligned address. I can't imagine a *worse* place to turn off a trap detecting possible errors than in a debugger. Jonathan Ryshpan <...!uunet!hitachi!jon>
colin@array.UUCP (Colin Plumb) (08/12/90)
>>mash@mips.COM (John Mashey) writes: >>> Can anyone give some live examples where software takes advantage of the >>> mode where the CPU just zeroes the low-order bits and conitnues... >jkenton@pinocchio.encore.com (Jeff Kenton) writes: >> The only case I've seen is a low level, PROM based, debugger on the 88k >> which runs in this mode. In article <467@hitachi.uucp> jon@hitachi.UUCP (Jon Ryshpan) writes: > I can't imagine a *worse* place to turn off a trap detecting possible > errors than in a debugger. I think we can assume that it restores the state while running the debugged code; it just turns off the trap for internal use. I don't think detecting errors in the PROM monitor does you much good, anyway, so why worry? :-} -- -Colin
jkenton@pinocchio.encore.com (Jeff Kenton) (08/12/90)
From article <488@array.UUCP>, by colin@array.UUCP (Colin Plumb): >>>mash@mips.COM (John Mashey) writes: >>>> Can anyone give some live examples where software takes advantage of the >>>> mode where the CPU just zeroes the low-order bits and conitnues... > >>jkenton@pinocchio.encore.com (Jeff Kenton) writes: >>> The only case I've seen is a low level, PROM based, debugger on the 88k >>> which runs in this mode. > > In article <467@hitachi.uucp> jon@hitachi.UUCP (Jon Ryshpan) writes: >> I can't imagine a *worse* place to turn off a trap detecting possible >> errors than in a debugger. > > I think we can assume that it restores the state while running the debugged > code; it just turns off the trap for internal use. I don't think detecting > errors in the PROM monitor does you much good, anyway, so why worry? :-} Unfortunately, it doesn't. As I said in my (partially quoted) posting, I'm not convinced that the small saving in debugger code is worth the loss of general error checking it costs you. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - jeff kenton --- temporarily at jkenton@pinocchio.encore.com --- always at (617) 894-4508 --- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
chris@mimsy.umd.edu (Chris Torek) (09/05/90)
(Someone is recirculating old news through the net. The last time this came around, I did not comment, but this time I cannot resist :-) ) [Someone asks what machines that require alignment do with unaligned addresses in loads. His `least desirable' scenario is that the processor completely ignores the low address bits.] In article <140356@sun.Eng.Sun.COM> jputnam@raptor.Eng.Sun.COM (James M. Putnam) writes: >The VAX does this for quad word loads. VAX LISP takes advantage of >this feature to get two free tag bits on LISP pointers that don't >need to be masked before being dereferenced. Since there is no mention of this in the VAX architecture handbook, I tried it out to make sure. An unaligned movq reads the unaligned quadword. If r0 contains 0x1003 and memory at 0x1000 looks like this:
        0x1000: 0x00
        0x1001: 0x01
        0x1002: 0x02
r0 ->   0x1003: 0x03
        0x1004: 0x04
        0x1005: 0x05
        0x1006: 0x06
        0x1007: 0x07
        0x1008: 0x08
        0x1009: 0x09
        0x100a: 0x0a
then after a `movq (r0),r0', r0 contains the value 0x06050403 and r1 contains the value 0x0a090807. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
jputnam@raptor.Eng.Sun.COM (James M.) (09/06/90)
In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: >(Someone is recirculating old news through the net. The last time this >came around, I did not comment, but this time I cannot resist :-) ) > >[Someone asks what machines that require alignment do with unaligned >addresses in loads. His `least desirable' scenario is that the >processor completely ignores the low address bits.] > >In article <140356@sun.Eng.Sun.COM> jputnam@raptor.Eng.Sun.COM >(James M. Putnam) writes: Something he shouldn't have about VAX quad-word loads ignoring low-order bits, and this being a useful feature for LISP. >Since there is no mention of this in the VAX architecture handbook, I >tried it out to make sure. An unaligned movq reads the unaligned quadword. >-- >In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750) >Domain: chris@cs.umd.edu Path: uunet!mimsy!chris I stand corrected and thanks to Chris for pointing it out. During a conversation I had with Barry Margolin some time ago about this very topic (non-lispm architecture implementations of LISP pointers), I got the impression that VAX quad-word loads ignored bits in the address. Sincere apologies for the misinformation. Does anyone know of a machine that has this "feature", or did I construct this out of whole cloth? Jim
jkenton@pinocchio.encore.com (Jeff Kenton) (09/06/90)
From article <141881@sun.Eng.Sun.COM>, by jputnam@raptor.Eng.Sun.COM (James M.): > > Something he shouldn't have about VAX quad-word loads ignoring > low-order bits, and this being a useful feature for LISP. > > Sincere apologies for the misinformation. Does anyone know of a machine > that has this "feature", or I did construct this out of whole cloth? > The 88000 will let you run in this mode. It still seems like laziness to me. ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ----- ----- jeff kenton --- temporarily at jkenton@pinocchio.encore.com ----- ----- --- always at (617) 894-4508 --- ----- ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
bengsig@oracle.nl (Bjorn Engsig) (09/07/90)
Article <141881@sun.Eng.Sun.COM> by jputnam@raptor.Eng.Sun.COM (James M.) says: | | Does anyone know of a machine that has this "feature" [ignoring low bits | in unaligned stores and loads] Yes, the IBM RT does this. As far as I remember there is a well hidden switch to make this give an exception on unaligned addressing, but the default is simply to ignore the last one (two) bits of the address on halfword (fullword) stores and loads. -- Bjorn Engsig, Domain: bengsig@oracle.nl, bengsig@oracle.com Path: uunet!mcsun!orcenl!bengsig
srg@quick.com (Spencer Garrett) (09/09/90)
In article <141881@sun.Eng.Sun.COM>, jputnam@raptor.Eng.Sun.COM (James M.) writes: > Something he shouldn't have about VAX quad-word loads ignoring > low-order bits, and this being a useful feature for LISP. > > In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: > >Since there is no mention of this in the VAX architecture handbook, I > >tried it out to make sure. An unaligned movq reads the unaligned quadword. Jim, What may well have happened is that some early LISP implementer just "tried it" and found that on his vax the low order bits were ignored. This makes a dandy feature if you're trying to fake a tagged architecture, so maybe he went ahead and used it. *BUT* since it isn't in the manual, DEC is free to change it whenever they feel the need, and they *do* feel that need from time to time, so the code in question wouldn't even be portable within the vax line. You may both have been right. :-)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/10/90)
In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett) writes: | Jim, | What may well have happened is that some early LISP implementer just | "tried it" and found that on his vax the low order bits were ignored. With the wonders of loadable control store, every VAX can have its own instruction set. For a while we had one with an FFT hardware instruction. I don't think they worked this way from the factory, though; the bits were used on the early 780s we had. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
przemek@liszt.helios.nd.edu (Przemek Klosowski) (09/11/90)
In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett) writes: >In article <141881@sun.Eng.Sun.COM>, jputnam@raptor.Eng.Sun.COM (James M.) writes: >> Something he shouldn't have about VAX quad-word loads ignoring >> low-order bits, and this being a useful feature for LISP. >> >> In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: >> >Since there is no mention of this in the VAX architecture handbook, I > > What may well have happened is that some early LISP implementer just >"tried it" and found that on his vax the low order bits were ignored. > >so maybe he went ahead and used it. *BUT* since it isn't in the manual, But, IT IS IN THE MANUAL! The second page of the chapter on data representation of the 1981 edition of the 'VAX architecture handbook' prominently says that
    o Quadword
      A quadword is eight contiguous bytes starting on an arbitrary byte boundary
And it really took me about 60 sec to find it ... przemek -- przemek klosowski (przemek@ndcva.cc.nd.edu) Physics Dept University of Notre Dame IN 46556
chris@mimsy.umd.edu (Chris Torek) (09/12/90)
>[misstatement about VAX quad-word loads ignoring low-order bits] In article <26376@mimsy.umd.edu> I wrote: >Since there is no mention of this in the VAX architecture handbook ... >In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett) suggested: >> What may well have happened is that some early LISP implementer just >>"tried it" and found that on his vax the low order bits were ignored. >>so maybe he went ahead and used it. *BUT* since it isn't in the manual, In article <410@news.nd.edu>, przemek@liszt.helios.nd.edu (Przemek Klosowski) writes: >But, IT IS IN THE MANUAL! [see page 33 of the VAX architecture book] I probably should have followed up to Spencer Garrett's posting myself. What I meant in <26376@mimsy.umd.edu> was `no mention of ignoring low order bits on movq', not `no mention of alignment requirements'. It is well-known that the VAX architecture (and therefore its handbook :-) ) allows arbitrary alignment for word and longword operations. Note, however, that there *are* some (exactly four, as far as I know) instructions that do require strict alignment, namely the interlocked queue instructions:
    insqhi  remqhi  insqti  remqti
All of these require that their queue be on a quadword boundary (and, further, that the relative offsets that make these objects into queues be multiples of 8 as well). If the address handed to one of these instructions is not valid, or if the queue offsets are invalid, you get a reserved operand fault (again, the bits are not ignored). Incidentally, these instructions are excruciatingly slow---about 150% of the time for an integer divide with FPA, or twice as long as a subroutine call, on an 11/780. (NB: I have no explanation as to why an interlocked instruction should be faster on a 750.)

[begin text from an article in net.unix that I saved back in 1983]

The following VAX instruction timings were obtained from a former DEC employee. I cannot vouch for their accuracy and have no idea how they were obtained.

                VAX-11/780 vs. VAX-11/750 vs. VAX-11/730 WITH FPA

INSTRUCTION                    <EXECUTION TIME MICROSECS>   <TIMES 780>
                                 780     750     730        750    730
INTERLOCKED INSERT + REMOVE    30.43   26.43   41.02       1.151  0.742
versus, e.g.,
MOVL Reg, Reg                   0.40    0.93    1.69       0.430  0.237
MOVL mem, Reg                   0.84    1.67    4.94       0.503  0.170
MOVL Reg, mem                   1.31    2.28    4.88       0.575  0.268
CMPL AND BLEQ                   1.16    2.32    4.26       0.500  0.272
CMPL mem, Reg AND BLEQ          1.88    3.24    7.31       0.580  0.257
TSTL AND BLEQ                   1.00    2.42    4.25       0.413  0.235
BRW                             0.80    2.01    2.57       0.398  0.311
MULL2 Reg, Reg                  1.85    5.68   12.05       0.326  0.154
MULL2 mem, Reg                  2.50    6.55   15.14       0.382  0.165
MULL2 Reg, mem                  2.48    6.41   15.11       0.387  0.164
DIVL3 Reg, Reg, Reg             9.64    8.88   16.15       1.086  0.597
CALLS #0, ROUTINE + RET        14.75   20.87   36.61       0.707  0.403

-- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris