[comp.arch] Is handling off-alignment important?

patrick@convex.com (Patrick F. McGehearty) (07/18/90)

For machines which are pushing the limits of technology, there are clearly
some advantages to not handling off-alignment accesses efficiently in
hardware.  Let us assume that some mechanism is provided for handling
off-alignment accesses (like 16 bit accesses on a byte boundary, or 32 bit
accesses on a 16 bit boundary, or whatever), and that making it significantly
less efficient could reduce the number of gate delays in the primary path to
memory for aligned accesses.  The question I would be interested in seeing
some discussion on is how much penalty is allowable for off-aligned accesses
before the performance cost forces reprogramming to keep the
off-alignments from occurring.

Trivial example: consider the std libc bcopy which takes two pointers and a
count.  Most machine specific implementations move the data in units larger
than a character at a time.  Under what conditions should the implementor of
this commonly used library worry about checking the alignment of the
pointers before starting the copy?

henry@zoo.toronto.edu (Henry Spencer) (07/19/90)

In article <104037@convex.convex.com> patrick@convex.com (Patrick F. McGehearty) writes:
>Trivial example: consider the std libc bcopy which takes two pointers and a
>count.  Most machine specific implementations move the data in units larger
>than a character at a time.  Under what conditions should the implementor of
>this commonly used library worry about checking the alignment of the
>pointers before starting the copy?

Essentially always, unless the count is very small.  Even on machines that
handle misalignment, if the alignment on the two areas is compatible, it
is better to copy enough initial bytes to align the pointers and then do
an aligned copy for the bulk of the data.
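
In C, the scheme amounts to something like the sketch below.  It is only
an illustration (not any particular libc's bcopy/memcpy), it assumes a
32-bit word, and the word loop only fires when the two pointers have the
same alignment modulo the word size -- the mismatched case is taken up
later in this thread.

    #include <stddef.h>

    typedef unsigned int word;                 /* assumed 32-bit machine word */
    #define WMASK (sizeof(word) - 1)

    void *copy_aligned(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if (((size_t)d & WMASK) == ((size_t)s & WMASK)) {
            /* head: byte copies until both pointers are word aligned */
            while (n > 0 && ((size_t)d & WMASK) != 0) {
                *d++ = *s++;
                n--;
            }
            /* bulk: aligned word copies */
            while (n >= sizeof(word)) {
                *(word *)d = *(const word *)s;
                d += sizeof(word); s += sizeof(word); n -= sizeof(word);
            }
        }
        /* tail, or incompatible alignment: byte copies */
        while (n-- > 0)
            *d++ = *s++;
        return dst;
    }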

(Also, a quibble:  bcopy may be "std", but the *standard* routine for
doing this is memcpy. :-))
-- 
NFS:  all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology
and its performance and security too.  |  henry@zoo.toronto.edu   utzoo!henry

patrick@convex.com (Patrick F. McGehearty) (07/19/90)

In article <1990Jul18.190750.7282@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <104037@convex.convex.com> patrick@convex.com (Patrick F. McGehearty) writes:
>>Trivial example: consider the std libc bcopy which takes two pointers and a
>>count. ...
>
>Essentially always, unless the count is very small.  Even on machines that
>handle misalignment, if the alignment on the two areas is compatible, it
>is better to copy enough initial bytes to align the pointers and then do
>an aligned copy for the bulk of the data.
>
>(Also, a quibble:  bcopy may be "std", but the *standard* routine for
>doing this is memcpy. :-))

I agree with everything henry says above.
I should have said "well known" instead of "std" bcopy :-) :-)

I also see I did not make the question I wanted answered clear.  Let me try
again.  I was assuming (but should have said as henry points out) that the
source pointer was initially aligned with a partial word move before the
block transfer began.  What I intended to ask was when is it worth the
trouble to load/shift/store or load partials or whatever it takes to avoid
off-alignment accesses when the src and dest do not match alignment?

But what I really was interested in is the following question:
What guidelines would you give an architect about how slow he can
get away with making off-aligned accesses before it starts:
(1) causing internal library people to rewrite code to avoid the problem?
(2) causing compiler code generation oddities to avoid the problem?
(3) causing customer application code people to recode assembly libraries
	to avoid the problem?
(4) causing a general stink in the marketplace because of the problem?

We already have some evidence that an infinite time for off-alignment
does cause at least some customers to be unhappy (like Sparc :-) :-)).
However, such evidence shows that it is not fatal either.  There are a lot
of Sparcs out there.

Part 2 of the question: does the answer depend on market segment?
Are software performance expectations for a workstation different from
those for a supercomputer?  (I suspect they are)

dgr@hpfcso.HP.COM (Dave Roberts) (07/21/90)

Okay, with all this talk about off alignment, here's something that nobody
has mentioned yet.

How do processors that handle off alignment deal with getting a page
fault in between the multiple transfers?  Couldn't this get really
hairy?  I mean, consider this:


$A	: $XDDD  <- Page 1
$A + 4	: $DXXX  <- Page 2

Where "A" is the last address on page 1.  "D" represents the
misaligned word that you want to load and "X" is don't care.

What if page 2 is paged out?  When does the CPU notice that it's paged
out and what does it do about it?  Does it abort the transaction
entirely, bring page 2 back in, and restart?  If so, what if, because
of the page replacement algorithm, page 1 gets blown away by bringing
in page 2?  If you do bring in the first three bytes and put them into
one of the registers, what kind of state does that leave the paging
code to deal with, since paging isn't handled by the hardware?  This
has nasty implications for instruction fetch of variable length
instructions also.  In this case, what if the misaligned transfer is
an instruction fetch?  I mean, ick!

I've gotta think that all these corner cases have got to add a lot of
checking to the CPU design which has to slow it down, not to mention
the possibility of having a case that you didn't think of and
introducing a bug.  I think I'd rather deal with the alignment, have
my CPU run fast, and have a greater assurance that it was designed
correctly.

Dave Roberts
dgr@hpfcla.fc.hp.com

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/21/90)

In article <1990Jul18.190750.7282@zoo.toronto.edu> henry@zoo.toronto.edu
(Henry Spencer) writes:

   In article <104037@convex.convex.com> patrick@convex.com (Patrick F.
   McGehearty) writes:

   >Trivial example: consider the std libc bcopy which takes two pointers and a
   >count.  Most machine specific implementations move the data in units larger
   >than a character at a time.  Under what conditions should the implementor of
   >this commonly used library worry about checking the alignment of the
   >pointers before starting the copy?

   Essentially always, unless the count is very small.  Even on machines that
   handle misalignment, if the alignment on the two areas is compatible, it
   is better to copy enough initial bytes to align the pointers and then do
   an aligned copy for the bulk of the data.

Doing aligned moves of aligned blocks of storage is a win on most
machines, as Spencer says. Not only should a memory copy routine detect
and exploit the (hopefully fairly common) case where the source and
destination are already naturally aligned, it should also, on machines
that make it easy, try to artificially align the bulk of the copy
operation.

One problem is that when destination and source are aligned differently
you have to choose whether to align the copy w.r.t. the source or the
destination. It is best to align the destination, especially on write-thru
cache machines, and sometimes by a spectacular margin.

Example: if we have to copy 73 bytes from address 102 to address 245, we
should (assuming 4 bytes is the optimal block copy word size):

	split the 73 bytes into three segments, of 3, 17x4=68, 2 bytes.

	copy 3 bytes from address 102 to 245
	copy 17 words from address 105 to address 248
	copy 2 bytes from address 173 to address 316

Note that the word by word copy has a source that is not word aligned,
but the destination is. Many machines can cope with unaligned fetches
fairly well, but unaligned stores are usually catastrophic.
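
On a machine that will not do the unaligned source fetches at all, the
same destination-aligned loop can still be written with nothing but
aligned loads, merging each pair of adjacent aligned source words by
shifting.  A rough sketch, assuming a 32-bit word and little-endian byte
order (big-endian reverses the two shifts), with the head and tail byte
copies omitted:

    #include <stddef.h>

    /* dst is word aligned, src is NOT (misalignment 1..3), nwords whole words */
    void copy_shift(unsigned int *dst, const unsigned char *src, size_t nwords)
    {
        size_t off = (size_t)src & 3;                /* source misalignment, 1..3 */
        const unsigned int *s = (const unsigned int *)(src - off);
        unsigned int lo = *s++;                      /* first aligned source word */
        unsigned int sh = 8 * off;

        while (nwords-- > 0) {
            unsigned int hi = *s++;                  /* next aligned source word */
            *dst++ = (lo >> sh) | (hi << (32 - sh)); /* splice the two halves */
            lo = hi;
        }
    }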

My usual example is the VAX-11/780, which had an 8-byte buffer
between the CPU and the system bus leading to memory, and a write-thru
cache.  Each byte written could cause an 8-byte read from memory, and an
8-byte write back to memory, ...

As Spencer and I have already remarked, this means that a suitable
sw memory copy operation can easily beat hardware memory copies, for
suitably large copy sizes, and by a large margin.

Yet another reason for having simple CPUs and avoiding microprograms (if
you can afford the instruction fetch bandwidth, or use compact
instruction encodings, e.g. a stack architecture).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

henry@zoo.toronto.edu (Henry Spencer) (07/22/90)

In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
>How do processors that handle off alignment deal with getting a page
>fault in between the multiple transfers?  Couldn't this get really
>hairy?  ...

In a word, yes.  This is one of the major reasons why essentially all RISC
processors insist on "natural" alignments:  so that no operand can cross
a page boundary.

If you want *real* fun, consider that unaligned operands can overlap.
Think about the implications of overlapping operands that span a page
boundary between a normal page and a paged-out read-only page in a
machine with two levels of virtual-address cache and a deep pipeline...
with an exception-handling architecture that was nailed down in detail
on much slower and simpler implementations, and can't be changed.  This
is the sort of problem that makes chip designers quietly start sending
their resumes to RISC manufacturers... :-)
-- 
NFS:  all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology
and its performance and security too.  |  henry@zoo.toronto.edu   utzoo!henry

mo@messy.bellcore.com (Michael O'Dell) (07/22/90)

Once again Henry demonstrates that his powers of subtle understatement
are exceeded only by his occasional correctness.

Regrettably, simply choosing to build a RISC doesn't obviate all those
awful problems. In machines of any flavor (RISC or CISC) where the backend
which detects the pagefaults can't talk to the front-end which needs
to stop fetching the wrong instructions because, say, the speed of light
on real circuit boards is so pokey, traps are nightmares of astounding
proportions.  Interrupts aren't as bad because they are not required
to be particularly synchronized with the instruction stream.
But those atomic, synchronous traps are genuine gut busters in both
hardware and software.

	-Mike

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/23/90)

In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:

| What if page 2 is paged out?  When does the CPU notice that it's paged
| out and what does it do about it?  Does it abort the transaction
| entirely, bring page 2 back in, and restart?  If so, what if, because
| of the page replacement algorithm, page 1 gets blown away by bringing
| in page 2?  

  You simply restart the instruction. Using an LRU scheme the 1st page
would not get paged out unless the physical mapping was 1 page/process.
As a reasonable constraint you want 8 pages/process anyway, so you avoid
this.

  Note 1: yes "simply" is a relative term in this case, relative to doing
half the instruction and then handling the fault.

  Note 2: pages for code, stack, source of copy instruction, dest of
copy instruction. If you allow unaligned access you need two pages in
each area to handle access over a boundary, that totals eight. No, I
wouldn't want to actually run a system with that little memory.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/23/90)

In article <1990Jul22.001925.8979@zoo.toronto.edu> henry@zoo.toronto.edu
(Henry Spencer) writes:

   If you want *real* fun, consider that unaligned operands can overlap.
   Think about the implications of overlapping operands that span a page
   boundary between a normal page and a paged-out read-only page in a
   machine with two levels of virtual-address cache and a deep pipeline...
   with an exception-handling architecture that was nailed down in detail
   on much slower and simpler implementations, and can't be changed.  This
   is the sort of problem that makes chip designers quietly start sending
   their resumes to RISC manufacturers... :-)

For the interested, the MU5 supercomputer from Manchester was virtually
like this (except that they were designing it from scratch). They solved
the problem by strict design discipline, as you can find in "The MU5
computer system", Ibbett & Morris (MacMillan).  Basically they found out
that the only sensible way out is to have restartable, not continuable,
instructions. Saving processor state on a fault is a very bad idea if it
is complicated. You substitute for this the easier problem of
idempotency.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

dgr@hpfcso.HP.COM (Dave Roberts) (07/24/90)

Henry Spencer writes:

> In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
> >How do processors that handle off alignment deal with getting a page
> >fault in between the multiple transfers?  Couldn't this get really
> >hairy?  ...
> 
> In a word, yes.  This is one of the major reasons why essentially all RISC
> processors insist on "natural" alignments:  so that no operand can cross
> a page boundary.
> 
> If you want *real* fun, consider that unaligned operands can overlap.
> Think about the implications of overlapping operands that span a page
> boundary between a normal page and a paged-out read-only page in a
> machine with two levels of virtual-address cache and a deep pipeline...
> with an exception-handling architecture that was nailed down in detail
> on much slower and simpler implementations, and can't be changed.  This
> is the sort of problem that makes chip designers quietly start sending
> their resumes to RISC manufacturers... :-)

Sorry guys.  I didn't mean to sound so dumb when I wrote that.  :-) I work
with RISC products at a board level, so I already knew the answer was
"yes"; I was just looking for a somewhat quantitative explanation of
what a designer has to go through to make this work and what
techniques were used.  How much of a slowdown does this cause for
the processor?  The original argument was that not supporting
misalignment was a misfeature (or a bug, in the original author's
vocabulary).  Anyway, I was just wondering how much faster you could
go with a given design if you didn't have to support this stuff.  What
I was trying to convey was the idea that it isn't just as easy as
supporting multiple accesses to memory but that the problem actually
can go all the way down to the memory management and exception
handling level, and that stuff seems to be the most gut wrenching and
error prone level of a design.  It also tends to impact performance
pretty heavily the more complicated you make it.

Anyway, I was just wondering if there was any CISC designer out there
that could say something like, "Yea, well, we could have made that 25
MHz part run at 40 MHz if we hadn't had to support all the
misalignment gunk."  Some people will still want to have the
misalignment, but it's nice to see what price you're paying to get it
so that you can evaluate whether you really need it.

Dave Roberts
Hewlett-Packard Co.
dgr@hpfcla.fc.hp.com

dgr@hpfcso.HP.COM (Dave Roberts) (07/25/90)

Bill writes:

> In article <8840014@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
> 
> | What if page 2 is paged out?  When does the CPU notice that it's paged
> | out and what does it do about it?  Does it abort the transaction
> | entirely, bring page 2 back in, and restart?  If so, what if, because
> | of the page replacement algorithm, page 1 gets blown away by bringing
> | in page 2?  
> 
>   You simply restart the instruction. Using an LRU scheme the 1st page
> would not get paged out unless the physical mapping was 1 page/process.
> As a reasonable constraint you want 8 pages/process anyway, so you avoid
> this.
> 
>   Note 1: yes "simply" is a relative term in this case, relative to doing
> half the instruction and then handling the fault.
> 
>   Note 2: pages for code, stack, source of copy instruction, dest of
> copy instruction. If you allow unaligned access you need two pages in
> each area to handle access over a boundary, that totals eight. No, I
> wouldn't want to actually run a system with that little memory.
> 
> -- 
> bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
>             "Stupidity, like virtue, is its own reward" -me
> ----------


Yea, I agree with you if LRU is the strategy, but how do you know what
page replacement strategy is being used?  This isn't usually something
that the hardware specifies.  It is usually (read always :-) left to
the O/S to implement.  The O/S could implement a FIFO strategy and
page 1 could have been the first page in.  I'm not disagreeing, just
raising some questions.

I understand that the problem can be solved, and I know many of the
techniques for solving it.  What I'm interested in is: how much do you pay
for the solution?  How much time is spent trying to solve it?  How much does
it cost in terms of overall performance?  How often is it used?  If
it's more expensive than doing aligned transactions even for
processors that support it, do users tend to try and make all their
transactions aligned?  If so, why have it and thereby slow down
everything?  Why not leave software to deal with the case when a user
has to do unaligned transactions?  Do users really want to pay the
price all the time for this support or would they rather take a big
hit every so often?  Obviously some will and some won't, but that's
the case for any architectural decision.  These are the questions I'm
really trying to answer.

Dave Roberts
dgr@hpfcla.fc.hp.com

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/25/90)

Please note that I have trimmed the quotes in the previous posting
heavily, I'm not trying to distort the meaning of the original poster,
just keep the size reasonable.

In article <8840016@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:

| Yea, I agree with you if LRU is the strategy, but how do you know what
| page replacement strategy is being used?  This isn't usually something
| that the hardware specifies.

  A fair question. Just as the hardware spec doesn't say the compilers
have to align things on word boundaries, if the hardware is such that
certain software practices are needed, then they *are* implicitly given
in the spec.

|                               It is usually (read always :-) left to
| the O/S to implement.  The O/S could implement a FIFO strategy and
| page 1 could have been the first page in.

  In which case the restart would cause a page fault, the first page
would come back in, and the second restart would complete.
| 
| I understand that the problem can be solved and many of the techniques
| for solving it.  What I'm interested in is: how much you pay for the
| solution?  How much time is spent trying to solve it?  How much does
| it cost in terms of overall performance?  How often is it used?  If
| it's more expensive than doing aligned transactions even for
| processors that support it, do users tend to try and make all their
| transactions aligned?  If so, why have it and thereby slow down
| everything?  Why not leave software to deal with the case when a user
| has to do unaligned transactions?  

  If you believe, as I do, that when writing systems programs you will
sometimes have to access data which is not aligned, then the question is
only whether it should be done in hardware or software. This arbitrary
data can come from another machine (not always even a computer), or be
packed to keep volume down.

  If it is being done in software the source code has to contain a check
for misalignment, which in turn means that the format of a pointer *on
that machine* must be known, as well as the alignment requirements. Bad
and non-portable. Or, you can simply access every data item larger than
a byte using the "fetch a byte and shift" method. This requires that the
byte order of the data, rather than the machine, be known. I think
that's probably the only portable way.
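
For a 32-bit item stored big-endian, the "fetch a byte and shift" method
looks like this (the name is just for illustration); only the byte order
of the data has to be known, never the host's alignment rules:

    /* portable unaligned fetch of a 32-bit big-endian value */
    unsigned long fetch32_be(const unsigned char *p)
    {
        return ((unsigned long)p[0] << 24)
             | ((unsigned long)p[1] << 16)
             | ((unsigned long)p[2] <<  8)
             |  (unsigned long)p[3];
    }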

  Alternatively the hardware can support unaligned fetch. It doesn't
have to be efficient, because you would have to make an effort to make
the fetch logic slower than software; it just has to work. This makes
the program a bit smaller, and, assuming that the chip logic is right, it
prevents everyone from implementing their own try at access code.

  If the hardware could produce a clear trap for unaligned access (not
the general bus fault, etc.) the o/s could do software emulation. From
the user's view that would look like a hardware solution. This is like
emulating f.p. instructions in the o/s when the FPU is not present, and
does not represent a major change in o/s technology.

  Note that this is not a RISC issue, in that the bus interface unit
already may be doing things like cache interface, multiplexing lines,
controlling status lines, etc. The BIU is not really RISC in that sense:
it functions like a coprocessor, if you draw a logic diagram, whose
function is to provide data, which can go into the pipeline or into the
CPU.

|                                      Do users really want to pay the
| price all the time for this support or would they rather take a big
| hit every so often?  Obviously some will and some won't but that's
| the case for any architectural decision.  These are the questions I'm
| really trying to answer.

  You assume that there is a price all the time, and I'm pretty well
convinced that the BIU in processors which have this capability, such as
the 80486, doesn't have a greater latency for aligned access than the
equivalent unit in SPARC or 88000. Not having the proprietary on-chip
timing I can't be totally certain, obviously.

  I think the real question is "should unaligned access be provided
outside the user program?" I think the answer is yes. Obviously it can
be done better in hardware, but if a chip is so tight on gates that it
can't be without compromising performance elsewhere, then just a
separate trap for quick identification of the problem by the o/s would
be a reasonable alternative.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

davec@nucleus.amd.com (Dave Christie) (07/26/90)

In article <2370@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>
>  Alternatively the hardware can support unaligned fetch. It doesn't
>have to be efficient, because you would have to make an effort to make
>the fetch logic slower than software, it just has to work. This makes
>the program a bit smaller, and assuming that the chip logic is right, it
>prevents everyone from implementing their own try at access code.
   .
   .
>  Note that this is not a RISC issue, in that the bus interface unit
>already may be doing things like cache interface, multiplexing lines,
>controlling status lines, etc. The BIU is not really RISC in that sense,
>it functions like a coprocessor if you draw a logic diagram, whose
>function is to provide data, which can go in the pipeline or into the
>CPU.

Note that there are two degrees of misalignment: 
	1) within a word, and
	2) crossing a word (& possible page) boundary. 

For 1):
If the realignment hardware is not in your main fetch path because it
would impact your cycle time, then it will likely mean an extra stage
of processing for instructions which use it, which can add various bits
of complexity.  Considering that, plus
	1) a 4-way mux isn't a serious time sink, and
	2) how much, or even whether, it influences the cycle time is 
	   technology and implementation dependent
then you are likely just going to stick it in the main fetch path
and do it efficiently, w.r.t. layout, etc.  Now, if the end user
does pay for this, it isn't likely going to be in performance, because
even though it might influence the cycle time in principle, in practice
it won't.  Chips come in "standard" operating frequencies these days
(e.g. 16, 20, 25, 30, 40, 50 MHz); the difference that a 4-way mux might
make would tend to be taken care of by the process tweaking that's done
to get to the desired frequency.  In this case, the realignment hardware
influences yield rather than cycle time, hence cost rather than
performance.  I can't think of
any processor that doesn't support this degree of realignment (some
better than others).
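
In software terms, the within-word realignment path is nothing more than
a byte select.  A little C model of it, assuming a 32-bit word and
little-endian byte numbering (purely an illustration, not anybody's
actual data path):

    unsigned char load_byte(const unsigned int *mem, unsigned long addr)
    {
        unsigned int word = mem[addr >> 2];          /* aligned word fetch   */
        unsigned int lane = addr & 3;                /* low bits pick a byte */
        return (unsigned char)(word >> (8 * lane));  /* the "4-way mux"      */
    }

The shift plays the role of the mux; a halfword select works the same way
across two lanes.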

For 2):
This, IMHO, is one of the more significant things that differentiates
"RISC" from "CISC".  The notion of one instruction making multiple
references to memory tends to make RISC designers get red in the face
and jump up and down.  (Yes, I'm well aware of the 29K's load and store
multiple instructions, and while I'm not fond of them, there are some
significant differences between that and handling unaligned accesses.)
The extra control complexity this introduces is a significant increment,
especially considering all the nightmarish endcases that have already
been described in this thread.  The added complexity is dependent on
architecture and implementation, and tends to be worse for stores, but
at any rate it tends to increase design/debug time, and more importantly
can cause much hair pulling and resume writing when one attempts really
high performance implementations.  (I've known people who thrive on such
complexity, for complexity's sake - they should be removed from the gene
pool (0.5 :-).  With the real estate one has to play with these days,
you can find room for the complexity to keep the performance up, but it
still influences the cost (and the number of errata after release).

I don't know of any "new" architecture chips with decent performance
that support realignment across words in one instruction. Why do
the common CISC chips support it?
	1) it's not as big an increment in complexity (no smiley)
	2) backwards compatibility (i.e. they have no choice)

In summary, the cost you will tend to see will be $ more than performance,
although at the high end of the performance spectrum you might pay in
performance as well - that's hard to say, since processors which support
word-crossing accesses tend to have a lot of other complexities which
influence cost/performance as well.

What makes sense depends on the intended applications, of course.  It 
may indeed make some network software run significantly faster, for 
instance.  But if that network software consumed 5% of all the cycles 
of all the processors I had sold, and such hardware support would 
*double* the n/w sfw performance, I still wouldn't risk screwing
up everything else to go for an aggregate 2.5% performance improvement.

----------------------------
Dave Christie
My humble opinions only.

baum@Apple.COM (Allen J. Baum) (07/26/90)

[]
>In article <1990Jul25.223437.15301@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>Note that there are two degrees of misalignment: 
>	1) within a word, and
>	2) crossing a word (& possible page) boundary. 
>
>For 1):
>If the realignment hardware is not in your main fetch path because it
>would impact your cycle time, then it will likely mean an extra stage
>of processing for instructions which use it, which can add various bits
>of complexity.  Considering that, plus
>	1) a 4-way mux isn't a serious time sink, and
>	2) how much, or even whether, it influences the cycle time is 
>	   technology and implementation dependent
...
>"standard" operating frequencies these days (e.g. 16,20,25,30,40,50);
>The difference that a 4-way mux might make would tend to be taken care

I've wondered about that. Getting naturally aligned operands requires a
4:1 mux on the low byte, a 2:1 on the next byte, and nothing on the
upper bytes. To get all alignments requires a 4:1 mux on all bytes. It
merely makes all bytes just as bad as the low byte, in theory.
In practice, loading and wire delays probably have as much impact as the logic.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/27/90)

No one has mentioned the solution used by the MIPS R3000.  It takes
very little hardware, runs as fast as a middling amount of hardware,
and can be easily emitted by the compiler whenever it doesn't know
a datum's alignment.

Quite simply, they have two "partial load" instructions: between them
they constitute an unaligned load.

If unaligned data is untypical, this sounds like an ideal compromise.
Besides, maybe Herman Rubin can use it in his assembler programs ...
-- 
Don		D.C.Lindsay

richard@aiai.ed.ac.uk (Richard Tobin) (07/27/90)

In article <10018@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

>No one has mentioned the solution used by the MIPS R3000.  

>Quite simply, they have two "partial load" instructions: between them
>they constitute an unaligned load.

I hate to sound like Herman Rubin, but it would be nice if they provided
a way to access these instructions from C.  "#pragma misaligned" perhaps.
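
Short of a pragma, one way to tell the compiler "this access may be
misaligned" in portable C is to funnel it through memcpy into a
temporary; a compiler that knows about the partial-load pair (lwl/lwr on
the R3000) could, at least in principle, turn this into those two
instructions -- whether a given compiler actually does so is another
matter:

    #include <string.h>

    /* hypothetical helper, not from any vendor's library; assumes int is 32 bits */
    unsigned int load_unaligned32(const void *p)
    {
        unsigned int x;
        memcpy(&x, p, sizeof x);   /* memcpy has no alignment requirement */
        return x;
    }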

-- Richard

-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

chris@mimsy.umd.edu (Chris Torek) (08/06/90)

In article <2357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
(Wm E Davidsen Jr) writes:
(in response to `what if the word being loaded spans a page boundary and
the second page is invalid?')
>... restart the instruction. Using an LRU scheme the 1st page
>would not get paged out unless the physical mapping was 1 page/process.
>... pages for code, stack, source of copy instruction, dest of
>copy instruction. If you allow unaligned access you need two pages in
>each area to handle access over a boundary, that totals eight. No, I
>wouldn't want to actually run a system with that little memory.

Actually, it is (or can be) even worse than that.  Consider the
following VAX gem:

	addl3	*0(r1),*0(r2),*0(r3)

Assume that the instruction itself (which is 7 bytes long) crosses
a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff
respectively, that the longword at 0x1ff..0x203 contains 0x7ff, that
the longword at 0x3ff..0x403 contains 0x9ff, and that the longword at
0x5ff..0x603 contains 0xbff.  Then we need:

	2 pages for the instruction
	2 pages for 0(r1) (0x1ff..0x203)
	2 pages for 0(r2) (0x3ff..0x403)
	2 pages for 0(r3) (0x5ff..0x603)
	2 pages for *1ff  (0x7ff..0x803)
	2 pages for *3ff  (0x9ff..0xa03)
	2 pages for *5ff  (0xbff..0xc03)
       --
 total 14 pages for one `simple' `addl3' instruction.

(Imagine the fun with a six-argument instruction like `index'!)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/06/90)

In article <25900@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:

| Actually, it is (or can be) even worse than that.  Consider the
| following VAX gem:
| 
| 	addl3	*0(r1),*0(r2),*0(r3)
| 
| Assume that the instruction itself (which is 7 bytes long) crosses
| a page boundary, that r1, r2, and r3 contain 0x1ff, 0x3ff, and 0x5ff
|     [ details ]
|  total 14 pages for one `simple' `addl3' instruction.

  I suspect that you could actually get something like this in actual
programs. It happens on machines which don't force the instructions to
be an even size. This is probably a worst worst case, but I could
believe that it would happen.

  There seem to be enough advantages and penalties to unaligned access
to prevent a definitive good or bad decision. Certainly it will take a
few gates on the chip, but probably will not slow aligned access. It
will be a lot faster than doing the same thing in software, but it's not
a common thing to do. There doesn't seem to be a portable way to decide
if an address is aligned or not, since pointer formats vary, so a
software solution has to assume all accesses are unaligned.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

gideony@microsoft.UUCP (Gideon YUVAL) (08/07/90)

In article <25900@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>In article <2357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
>(Wm E Davidsen Jr) writes:
>[...]  Consider the
>following VAX gem:

>	addl3	*0(r1),*0(r2),*0(r3)
...
> total 14 pages for one `simple' `addl3' instruction.

>(Imagine the fun with a six-argument instruction like `index'!)

I'm told this is why the VAX has a 512-byte page -- a 128KB box
was planned (but never made), and in the worst case (an instruction
straddling 40 pages) it would have deadlocked with <40 pages available.

The page size thus had to be less than one-fortieth of the page list
of a bare-minimum (1977!) system.
-- 
Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)

mcgrath@homer.Berkeley.EDU (Roland McGrath) (08/08/90)

On machines that "don't handle misaligned accesses", what do they do when one
happens anyway?  Assuming they don't just go up in a puff of blue smoke (which
would be alright with me, as long as the color of smoke is documented), or
cease functioning and need to be power-cycled (which would be alright too, if
documented), the least useful thing I can think of for them to do (within the
bounds of reason) is to do the access wrong (which seems likely since if they
want the low-order two bits of the address to always be zero, they might well
not pay attention to them).  Is this what is done?

The second least useful thing I can think of is to cause a hardware trap to
software, which would presumably be horrendously slow.  I would be completely
happy with that.  Write a software trap handler to deal with it, and write your
compiler such that it never happens if avoidable.
--
	Roland McGrath
	Free Software Foundation, Inc.
roland@ai.mit.edu, uunet!ai.mit.edu!roland

jkenton@pinocchio.encore.com (Jeff Kenton) (08/08/90)

From article <MCGRATH.90Aug7220232@homer.Berkeley.EDU>, by mcgrath@homer.Berkeley.EDU (Roland McGrath):
> On machines that "don't handle misaligned accesses", what do they do when one
> happens anyway?

On the Motorola 88000 a misaligned access causes a trap into the kernel.  There
is a bit in the status word which can override this, in which case the CPU
assumes zero for the appropriate number of low order bits.  Kernels can either
signal the offending process on a fault, or complete the offending access and
continue.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/08/90)

In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes:

| documented), the least useful thing I can think of for them to do (within the
| bounds of reason) is to do the access wrong (which seems likely since if they
| want the low-order two bits of the address to always be zero, they might well
| not pay attention to them).  Is this what is done?

  On some machines, yes. On others a trap results, allowing the kernel
to complete the access in software if desired. I believe that the 88K
has a flag to either trap or just zero the low bits of the address; I know
the Honeywell DPS series zeroed the low bit of the address on a doubleword
access. This resulted in many amusing "learning experiences."
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

jputnam@raptor.Eng.Sun.COM (James M. Putnam) (08/08/90)

Roland McGrath (in <MCGRATH.90Aug7220232@homer.Berkeley.EDU>, I think) writes:
>On machines that "don't handle misaligned accesses", what do they do when one
>happens anyway?  Assuming they don't just go up in a puff of blue smoke (which
>would be alright with me, as long as the color of smoke is documented), or
>cease functioning and need to be power-cycled (which would be alright too, if
>documented), the least useful thing I can think of for them to do (within the
>bounds of reason) is to do the access wrong (which seems likely since if they
>want the low-order two bits of the address to always be zero, they might well
>not pay attention to them).  Is this what is done?

	The VAX does this for quad word loads. VAX LISP takes advantage of
	this feature to get two free tag bits on LISP pointers that don't 
	need to be masked before being dereferenced. Seems useful to me.

>The second least useful thing I can think of is to cause a hardware trap to
>software, which would presumably be horrendously slow.  I would be completely
>happy with that.

	I'd like this behavior to be defined by the user process. Perhaps as
	a signal? Seems similar to SIGBUS. Anyway, set a bit someplace so you 
	trap or not, based on what kind of process you are.  You probably 
	always want to take the trap in system mode, and panic (at least 
	in UN*X) since it indicates a bug of some kind.

>Write a software trap handler to deal with it, and write your compiler such 
>that it never happens if avoidable.

	If your compiler is doing the right thing, the only time you'll take
	an interrupt is if your code is buggy. This sort of thing can be caught
	statically by lint, but there may be some situations in which
	dynamic behaviour can produce the same event, like in programs 
	that generate their own code.

	Interesting idea.
	
	Jim

cprice@mips.COM (Charlie Price) (08/09/90)

In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes:
>On machines that "don't handle misaligned accesses", what do they do when one
>happens anyway?

MIPS processors generate a trap with an Address Error Exception.

In RISC/os, this is turned into a SIGBUS signal.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

henry@zoo.toronto.edu (Henry Spencer) (08/09/90)

In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes:
>On machines that "don't handle misaligned accesses", what do they do when one
>happens anyway?...

Mostly they trap.  There are probably one or two design groups that were
foolish enough to just ignore the low bits.
-- 
The 486 is to a modern CPU as a Jules  | Henry Spencer at U of Toronto Zoology
Verne reprint is to a modern SF novel. |  henry@zoo.toronto.edu   utzoo!henry

mash@mips.COM (John Mashey) (08/09/90)

In article <12409@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>
>On the Motorola 88000 a misaligned access causes a trap into the kernel.  There
>is a bit in the status word which can override this, in which case the CPU
>assumes zero for the appropriate number of low order bits.  Kernels can either
>signal the offending process on a fault, or complete the offending access and
>continue.

Just out of curiosity, can anyone give some live examples where software
takes advantage of the mode where the CPU just zeroes the low-order
bits and continues, as in the 88K?  (or, I think(?), in the RT/PC).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jkenton@pinocchio.encore.com (Jeff Kenton) (08/09/90)

From article <40711@mips.mips.COM>, by mash@mips.COM (John Mashey):
> 
> Just out of curiosity, can anyone give some live examples where software
> takes advantage of the mode where the CPU just zeroes the low-order
> bits and continues, as in the 88K?  (or, I think(?), in the RT/PC).

The only case I've seen is a low level, PROM based, debugger on the 88k which
runs in this mode.  Presumably, this is done to save error checking or trap
handling when the user types a misaligned address.

I don't see that this gains much, and it runs the risk of masking real bugs
in the program itself.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/09/90)

In article <40711@mips.mips.COM> mash@mips.COM (John Mashey) writes:

>Just out of curiosity, can anyone give some live examples where software
>takes advantage of the mode where the CPU just zeroes the low-order
>bits and continues, as in the 88K?  (or, I think(?), in the RT/PC).

The optimizing compilers I used to work on did all of their dynamic
memory management through IDL (Interface Description Language). Our
IDL implementation gave us nice things - garbage collection, debug
support, and the ability to move any rooted data structure to/from a
file.

IDL objects were tagged, and tags contained a two-bit code which was
stored in the low end of a pointer.  The IDL runtimes had to pack and
unpack them. (Guy Steele determined which code was commonest, and
represented it as 00.)
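
The pack and unpack operations themselves are just a mask and an OR on
the low two bits, something like this (names invented for illustration,
not the actual IDL runtime; it relies on every object being at least
4-byte aligned so the low two pointer bits are free):

    typedef unsigned long tagged;    /* assumes a pointer fits in a long */

    tagged   pack(void *p, unsigned tag) { return (tagged)p | (tag & 3); }
    unsigned tag_of(tagged t)            { return (unsigned)(t & 3); }
    void    *ptr_of(tagged t)            { return (void *)(t & ~(tagged)3); }

The hardware mode Mashey asks about would only let you skip the mask in
ptr_of when dereferencing.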

Would we have used the hardware mode?  No. The IDL runtimes initially
consumed about half the cycles of a big compile, but I fixed that in
the conventional way (amortization, inlining, caching, etc.).  After the
fix, our cycles were elsewhere, and a special hardware mode would have
made no difference.

Plus, I would not want to turn off the hardware checks while the rest
of the code was running.  If this meant constantly turning a mode on
and off, then forget it.
-- 
Don		D.C.Lindsay

news@ism780c.isc.com (News system) (08/10/90)

In article <1990Aug8.212255.3555@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes:
>>On machines that "don't handle misaligned accesses", what do they do when one
>>happens anyway?...
>
>Mostly they trap.  There are probably one or two design groups that were
>foolish enough to just ignore the low bits.

There was at least one design group (an SEL machine) that noticed that they
could encode the operand width using the low order address bits.  In this
way they were able to save a bit in the instruction, thus providing an
additional address bit.  Of course, there was no such concept as a misaligned
access under this scheme.

    Marv Rubinstein

aglew@dual.crhc.uiuc.edu (Andy Glew) (08/10/90)

>There was at least one design group (an SEL machine) that noticed that they
>could encode the operand width using the low order address bits.  In this
>way they were able to save a bit in the instruction thus providing an
>additional address bit.  Of course, there was no such concept as misaligned
>access reference using this scheme.
>
>    Marv Rubinstein

SEL became Gould CSD, and then...

The last Gould machines used the low order bits of the offset field
(which was present on all memory access instructions) as part of the
width encoding.  They did not, however, use the low order bits of the
registers, and they did produce misaligned traps if the low order bits
of the final address, ignoring the low order bits of the offset
literal, were incorrect.
    This meant that you could not have an odd register, and add 1 to
it via the addressing mode to make a correctly aligned even address.
It never caused me any problems - it even found a few bugs.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

davecb@yunexus.YorkU.CA (David Collier-Brown) (08/10/90)

In article <MCGRATH.90Aug7220232@homer.Berkeley.EDU> mcgrath@homer.Berkeley.EDU (Roland McGrath) writes:
| On machines that "don't handle misaligned accesses", what do they do when one
| happens anyway?...
henry@zoo.toronto.edu (Henry Spencer) writes:
| Mostly they trap.  There are probably one or two design groups that were
| foolish enough to just ignore the low bits.

  Ah yes, many older mainframes... I remember DPS-8s with affection, but
not for all the decisions they made.  A doubleword register load from an
odd address used the floor(address), causing an optimization that used
double loads to load the wrong two variables.
  **MY** that was hard to spot. 

--dave
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | "And the next 8 man-months came up like
CANADA. 416-223-8968  |   thunder across the bay" --david kipling

steves@sv001.SanDiego.NCR.COM (Steve Schlesinger x3711) (08/11/90)

From article <40711@mips.mips.COM>, by mash@mips.COM (John Mashey):
> 
> Just out of curiosity, can anyone give some live examples where software
> takes advantage of the mode where the CPU just zeroes the low-order
> bits and continues, as in the 88K?  (or, I think(?), in the RT/PC).
>

When emulating other (older) architectures that permit non-aligned accesses,
there are two basic choices with architectures of this type:

1	examine the nonaligned memory access and read ALIGNED bytes,
	ALIGNED halfwords and/or ALIGNED words as appropriate and
	shift and mask them together.

2	do two NON-ALIGNED reads of the two memory words the nonaligned word
	straddles ( non-aligned-addr and non-aligned-addr+4 ) and then
	shift and mask.

	The low order bits are ignored by the read, but are used by the
	masking and shifting code.

My memory of this is fuzzy, but in most cases, (2) is more efficient.


===============================================================================
Steve Schlesinger, NCR/Teradata Joint Development				   619-597-3711
11010 Torreyana Rd, San Diego, CA 92121		 steve.schlesinger@sandiego.ncr.com
===============================================================================

jon@hitachi.uucp (Jon Ryshpan) (08/12/90)

jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>mash@mips.COM (John Mashey) writes:
>> Just out of curiosity, can anyone give some live examples where software
>> takes advantage of the mode where the CPU just zeroes the low-order
>> bits and continues, as in the 88K?  (or, I think(?), in the RT/PC).

>The only case I've seen is a low level, PROM based, debugger on the 88k which
>runs in this mode.  Presumably, this is done to save error checking or trap
>handling when the user types a misaligned address.

I can't imagine a *worse* place to turn off a trap detecting possible
errors than in a debugger.

Jonathan Ryshpan		<...!uunet!hitachi!jon>

colin@array.UUCP (Colin Plumb) (08/12/90)

>>mash@mips.COM (John Mashey) writes:
>>> Can anyone give some live examples where software takes advantage of the
>>> mode where the CPU just zeroes the low-order bits and continues...

>jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>> The only case I've seen is a low level, PROM based, debugger on the 88k
>> which runs in this mode.

In article <467@hitachi.uucp> jon@hitachi.UUCP (Jon Ryshpan) writes:
> I can't imagine a *worse* place to turn off a trap detecting possible
> errors than in a debugger.

I think we can assume that it restores the state while running the debugged
code; it just turns off the trap for internal use.  I don't think detecting
errors in the PROM monitor does you much good, anyway, so why worry? :-}
-- 
	-Colin

jkenton@pinocchio.encore.com (Jeff Kenton) (08/12/90)

From article <488@array.UUCP>, by colin@array.UUCP (Colin Plumb):
>>>mash@mips.COM (John Mashey) writes:
>>>> Can anyone give some live examples where software takes advantage of the
>>>> mode where the CPU just zeroes the low-order bits and continues...
> 
>>jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>>> The only case I've seen is a low level, PROM based, debugger on the 88k
>>> which runs in this mode.
> 
> In article <467@hitachi.uucp> jon@hitachi.UUCP (Jon Ryshpan) writes:
>> I can't imagine a *worse* place to turn off a trap detecting possible
>> errors than in a debugger.
> 
> I think we can assume that it restores the state while running the debugged
> code; it just turns off the trap for internal use.  I don't think detecting
> errors in the PROM monitor does you much good, anyway, so why worry? :-}

Unfortunately, it doesn't.  As I said in my (partially quoted) posting, I'm
not convinced that the small saving in debugger code is worth the loss of
general error checking it costs you.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

chris@mimsy.umd.edu (Chris Torek) (09/05/90)

(Someone is recirculating old news through the net.  The last time this
came around, I did not comment, but this time I cannot resist :-) )

[Someone asks what machines that require alignment do with unaligned
addresses in loads.  His `least desirable' scenario is that the
processor completely ignores the low address bits.]

In article <140356@sun.Eng.Sun.COM> jputnam@raptor.Eng.Sun.COM
(James M. Putnam) writes:
>The VAX does this for quad word loads. VAX LISP takes advantage of
>this feature to get two free tag bits on LISP pointers that don't 
>need to be masked before being dereferenced.

Since there is no mention of this in the VAX architecture handbook, I
tried it out to make sure.  An unaligned movq reads the unaligned quadword.
If r0 contains 0x1003 and memory at 0x1000 looks like this:

	0x1000: 0x00
	0x1001: 0x01
	0x1002: 0x02
  r0 ->	0x1003: 0x03
	0x1004: 0x04
	0x1005: 0x05
	0x1006: 0x06
	0x1007: 0x07
	0x1008: 0x08
	0x1009: 0x09
	0x100a: 0x0a

then after a `movq (r0),r0', r0 contains the value 0x06050403 and
r1 contains the value 0x0a090807.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

jputnam@raptor.Eng.Sun.COM (James M.) (09/06/90)

In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>(Someone is recirculating old news through the net.  The last time this
>came around, I did not comment, but this time I cannot resist :-) )
>
>[Someone asks what machines that require alignment do with unaligned
>addresses in loads.  His `least desirable' scenario is that the
>processor completely ignores the low address bits.]
>
>In article <140356@sun.Eng.Sun.COM> jputnam@raptor.Eng.Sun.COM
>(James M. Putnam) writes:

    Something he shouldn't have said about VAX quad-word loads ignoring
    low-order bits, and this being a useful feature for LISP.

>Since there is no mention of this in the VAX architecture handbook, I
>tried it out to make sure.  An unaligned movq reads the unaligned quadword.
>-- 
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
>Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

    I stand corrected and thanks to Chris for pointing it out. During a 
    conversation I had with Barry Margolin some time ago about this very 
    topic (non-lispm architecture implementations of LISP pointers), I got 
    the impression that VAX quad-word loads ignored bits in the address.

    Sincere apologies for the misinformation. Does anyone know of a machine
    that has this "feature", or did I construct this out of whole cloth?

	Jim

jkenton@pinocchio.encore.com (Jeff Kenton) (09/06/90)

From article <141881@sun.Eng.Sun.COM>, by jputnam@raptor.Eng.Sun.COM (James M.):
>
>     Something he shouldn't have said about VAX quad-word loads ignoring
>     low-order bits, and this being a useful feature for LISP.
> 
>     Sincere apologies for the misinformation. Does anyone know of a machine
>     that has this "feature", or did I construct this out of whole cloth?
> 

The 88000 will let you run in this mode.  It still seems like laziness to me.

----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
----- jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com ----- 
-----		   ---  always at (617) 894-4508  ---		    -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----

bengsig@oracle.nl (Bjorn Engsig) (09/07/90)

Article <141881@sun.Eng.Sun.COM> by jputnam@raptor.Eng.Sun.COM (James M.) says:
|
|     Does anyone know of a machine that has this "feature" [ignoring low bits
| in unaligned loads and stores]
Yes, the IBM RT does this.  As far as I remember there is a well-hidden switch
to make this give an exception on unaligned addressing instead, but the default
is simply to ignore the last one (two) bits of the address on halfword (fullword)
stores and loads.
-- 
Bjorn Engsig,	Domain:		bengsig@oracle.nl, bengsig@oracle.com
		Path:		uunet!mcsun!orcenl!bengsig

srg@quick.com (Spencer Garrett) (09/09/90)

In article <141881@sun.Eng.Sun.COM>, jputnam@raptor.Eng.Sun.COM (James M.) writes:
>     Something he shouldn't have said about VAX quad-word loads ignoring
>     low-order bits, and this being a useful feature for LISP.
> 
> In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
> >Since there is no mention of this in the VAX architecture handbook, I
> >tried it out to make sure.  An unaligned movq reads the unaligned quadword.

Jim,
   What may well have happened is that some early LISP implementer just
"tried it" and found that on his vax the low order bits were ignored.
This makes a dandy feature if you're trying to fake a tagged architecture,
so maybe he went ahead and used it.  *BUT* since it isn't in the manual,
DEC is free to change it whenever they feel the need, and they *do* feel
that need from time to time, so the code in question wouldn't even be
portable within the vax line.  You may both have been right.  :-)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/10/90)

In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett) writes:

| Jim,
|    What may well have happened is that some early LISP implementer just
| "tried it" and found that on his vax the low order bits were ignored.

  With the wonders of loadable control store, every VAX can have its
own instruction set. For a while we had one with an FFT hardware
instruction. I don't think they worked this way from the factory,
though; the low bits were used on the early 780s we had.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

przemek@liszt.helios.nd.edu (Przemek Klosowski) (09/11/90)

In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett) writes:
>In article <141881@sun.Eng.Sun.COM>, jputnam@raptor.Eng.Sun.COM (James M.) writes:
>>     Something he shouldn't have said about VAX quad-word loads ignoring
>>     low-order bits, and this being a useful feature for LISP.
>> 
>> In article <26376@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>> >Since there is no mention of this in the VAX architecture handbook, I
>
>   What may well have happened is that some early LISP implementer just
>"tried it" and found that on his vax the low order bits were ignored.
>
>so maybe he went ahead and used it.  *BUT* since it isn't in the manual,

But, IT IS IN THE MANUAL!  The second page of the chapter on data representation
in the 1981 edition of the 'VAX architecture handbook' prominently says:

o  Quadword
A quadword is eight contiguous bytes starting on an arbitrary byte boundary

And it really took me about 60 sec to find it ...
	przemek
--
			przemek klosowski (przemek@ndcva.cc.nd.edu)
			Physics Dept
			University of Notre Dame IN 46556

chris@mimsy.umd.edu (Chris Torek) (09/12/90)

>[misstatement about VAX quad-word loads ignoring low-order bits]

In article <26376@mimsy.umd.edu> I wrote:
>Since there is no mention of this in the VAX architecture handbook ...

>In article <1990Sep8.225345.745@quick.com> srg@quick.com (Spencer Garrett)
suggested:
>>   What may well have happened is that some early LISP implementer just
>>"tried it" and found that on his vax the low order bits were ignored.
>>so maybe he went ahead and used it.  *BUT* since it isn't in the manual,

In article <410@news.nd.edu>, przemek@liszt.helios.nd.edu (Przemek Klosowski)
writes:
>But, IT IS IN THE MANUAL!  [see page 33 of the VAX architecture book]

I probably should have followed up to Spencer Garrett's posting myself.
What I meant in <26376@mimsy.umd.edu> was `no mention of ignoring low
order bits on movq', not `no mention of alignment requirements'.  It is
well-known that the VAX architecture (and therefore its handbook :-) )
allows arbitrary alignment for word and longword operations.

Note, however, that there *are* some (exactly four, as far as I know)
instructions that do require strict alignment, namely the interlocked
queue instructions:

	insqhi
	remqhi
	insqti
	remqti

All of these require that their queue be on a quadword boundary (and,
further, that the relative offsets that make these objects into queues
be multiples of 8 as well).  If the address handed to one of these
instructions is not valid, or if the queue offsets are invalid, you
get a reserved operand fault (again, the bits are not ignored).

Incidentally, these instructions are excruciatingly slow---about 150% of
the time for an integer divide with FPA, or twice as long as a subroutine
call, on an 11/780.

(NB: I have no explanation as to why an interlocked instruction should be
faster on a 750.)

[begin text from an article in net.unix that I saved back in 1983]

The following VAX instruction timings were obtained from a former
DEC employee.  I cannot vouch for their accuracy and have no idea
how they were obtained.

  VAX-11/780 vs. VAX-11/750 vs. VAX-11/730 WITH FPA
  INSTRUCTION			      <EXECUTION TIME MICROSECS> <TIMES 780>
					  780	  750	  730	 750	 730

INTERLOCKED INSERT + REMOVE		 30.43	 26.43	 41.02	1.151	0.742

versus, e.g.,
MOVL Reg, Reg				  0.40	  0.93	  1.69	0.430	0.237
MOVL mem, Reg				  0.84	  1.67	  4.94	0.503	0.170
MOVL Reg, mem				  1.31	  2.28	  4.88	0.575	0.268
CMPL AND BLEQ				  1.16	  2.32	  4.26	0.500	0.272
CMPL mem, Reg AND BLEQ			  1.88	  3.24	  7.31	0.580	0.257
TSTL AND BLEQ				  1.00	  2.42	  4.25	0.413	0.235
BRW					  0.80	  2.01	  2.57	0.398	0.311
MULL2 Reg, Reg				  1.85	  5.68	 12.05	0.326	0.154
MULL2 mem, Reg				  2.50	  6.55	 15.14	0.382	0.165
MULL2 Reg, mem				  2.48	  6.41	 15.11	0.387	0.164
DIVL3 Reg, Reg, Reg			  9.64	  8.88	 16.15	1.086	0.597
CALLS #0, ROUTINE + RET			 14.75	 20.87	 36.61	0.707	0.403
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris