[comp.arch] Unaligned Accesses

kck@dana.UUCP (04/13/87)

> 
>>>
>>>> ...the unaligned-access trap facility exists in order to provide some
>>>> level of support for "old" databases.  That is, lots of machines allow
>>>> access to any size of data on any boundary (unaligned 32-bit words,
>>>> unaligned 16-bit halfwords).  There must (we guess) be many databases
>>>> out there that were created under the assumption that such access will
>>>> always be possible...
>>>
>>>     1) Do such databases *really* exist?
> 
> Er, well, I have no proof, but gosh, we just took it on faith....  We should
> probably punish ourselves.
> 


Such databases do exist.  Specifically, HHB Softron has/had a CAE package
called CADAT which required 32 bit loads/stores from arbitrary 16 bit
alignment.


>>>     2) Are they difficult to convert?
> 
> This is a good question.  With proper documentation, the conversion job
> might be easy and quite perferrable to "support" of the old format.
> 


When Ridge Computers (my previous employer) tried to port CADAT to the
Ridge architecture, which requires 16 bit alignment for 16 bit loads/stores,
32 bit alignment for 32 bit loads/stores, and 64 bit alignment for 64 bit
load/stores, they failed due to these alignment problems.  When Pyramid
Technology tried to port CADAT to their architecture they also failed due
to alignment problems.  Pyramid actually provided HHB Softron with a special
version of their microcode which allowed non-aligned accesses.  Last I heard
HHB was still running with this "non-standard" Pyramid microcode.


>>>     3) Will they perform acceptably with all that trap overhead?
>>>

Depends on your definition of acceptable.  In the case of CADAT on a
Ridge we decided the performance degradation wasn't worth the effort
to provide for non-aligned accesses.


>>>     4) Would it be easier to get the compiler to generate 
>>>        whatever sequence of instructions would be required to
>>>        access unaligned operands without a trap?
> 
> The compiler (often/always) can't know the alignment of operands.  Thus,
> it would have to generate slow code for all accesses.  The trap mechanism
> lets at least 32-bit-aligned-32-bit accesses run at full speed.
> 

Correct.


>>>     5) Did the idiots who designed such non-portable databases
>>>        make other mistakes that will kill you anyway? (Byte order,
>>>        anyone?)
> 

The people at HHB weren't idiots.  They were programmers constrained
by small address space machines who did what they could to provide
some useful functionality.  Would you rather they have postponed 
development of their product until they had an architecture that
supported a decent sized address space?  Or would you suggest they
had spent more money for a machine with a larger address space?
Sometimes the real world throws constraints at you that you can't
ignore.

Ken Klingman
Dana Computer
hplabs!dana!kck

baum@apple.UUCP (04/13/87)

--------
[]

In reference to making complicated things go as fast as simple
things: it is rumored that IBM deliberately put wait loops into the
370/145 microcode, so that when the virtual memory was added to make
them 370/148s, they wouldn't run any slower!

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

franka@mmintl.UUCP (04/15/87)

In article <16038@amdcad.AMD.COM> bcase@amdcad.UUCP writes:
>Ok, I think maybe it *is* time to start the discussion of word-addressed
>vs. byte-addressed memory systems.  John Mashey, care to start us off?

While we're at it, lets talk about bit-addressed systems, too.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

mlord@bnr-public.uucp (Mark Lord) (03/23/89)

w-colinp@microsoft.uucp (Colin Plumb) writes:
>In any structure, if you rearrange the components, you can lose at most
>n-1 bytes to padding, where n is the strictest alignment restriction.  For
>most processors, the worst case is a double and a char, 7 bytes out of 16
>wasted.  But if this is a major concern, rewrite the code to use two parallel
>arrays.  You'll waste at most 7 bytes total (in your 100Meg).
>

The "break into multiple aligned arrays" approach can work fine as long
as one is dealing with arrays (of structures) to begin with.  Unfortunately,
I suspect that this is not of much use with dynamically allocated linked
lists of structures, which is very likely when we are talking about multiple
megabytes of data.

Granted, by careful (human) reordering of the fields within a complex
structure, one can almost always reduce wasted space to less than one
machine word/line per allocated item.  This is a good thing that we all do
when possible on new code, but it is much more difficult (and risky)
to go back and try to do this with code in a very large existing base
of (otherwise) good, working software.

The debate for devoting more silicon space to support efficient handling
of misaligned accesses seems to hinge around supporting the huge amounts
of perfectly good software that already exist, written before programmers
in general became keenly aware of alignment/efficiency tradeoffs.

I feel that this must be addressed by chip designers NOW.  The MIPS R2000
R3000 family have included instructions for this purpose, LWL/LWR, SWL/SWR,
but these must usually be used in pairs to achieve the desired effect.
A big problem with such is atomicity of load/store operations, especially
nasty when the software in question is also being used on one or more CISC
machines which never exhibit this problem.  Still, MIPS has at least
provided a starting point for misaligned accesses, while other chip makers
have yet to address the problem at all (ie. MC88100).

-Mark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*  The Company doesn't even know I'm reading this stuff, let alone  *
*  writing it  (maybe THAT's why I'm not working on it anymore!).   *
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

henry@utzoo.uucp (Henry Spencer) (03/25/89)

In article <362@bnr-fos.UUCP> mlord@bnr-public.UUCP (Mark Lord) writes:
>The debate for devoting more silicon space to support efficient handling
>of misaligned accesses seems to hinge around supporting the huge amounts
>of perfectly good software that already exist, written before programmers
>in general became keenly aware of alignment/efficiency tradeoffs.

How quickly we forget.  The lack of attention to alignment/efficiency
tradeoffs was a temporary aberration in the late 70s and early 80s.
I assure you that those of us who wrote code for such unimportant,
commercially-insignificant machines as the 360, the PDP11, and the
68000/010 were keenly aware of such tradeoffs from the beginning, because
the hardware gave us no choice.  I continue to think that the quantity
of good software which disregards alignment is grossly overestimated.
(It is not zero, and attention to it is necessary, but the idea that
a random software package has a 90% chance of suffering from such
carelessness is unjustified.)
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

w-colinp@microsoft.UUCP (Colin Plumb) (03/25/89)

mlord@bnr-public.UUCP (Mark Lord) wrote:
> The "break into multiple aligned arrays" approach can work fine as long
> as one is dealing with arrays (of structures) to begin with.  Unfortunately,
> I suspect that this is not of much use with dynamically allocated linked
> lists of structures, which is very likely when we are talking about multiple
> megabytes of data.

Yes, but if you have at least one 32-bit pointer per code item, odds are
very good your other data is much more than 32 bits long.  So again, the
waste (limited to 3 bytes, usually) is trivial.

> Granted, by careful (human) reordering of the fields within a complex
> structure, one can almost always reduce wasted space to less than one
> machine word/line per allocated item.  This is a good thing that we all do
> when possible on new code, but it is much more difficult (and risky)
> to go back and try to do this with code in a very large existing base
> of (otherwise) good, working software.

Actually, it can be done by machine.  A simple working algorithm is to sort
the elements by size, starting with the largest.  And, as I said, the wasted
space has an upper bound of (greatest alignment restriction)-(size of smallest
data item), usually 4 and 1 bytes, respectively.  Sometimes 8 and 1.

If the structures aren't dumped, raw, to a file, I would have no hesitation
reordering structures in my code.

> The debate for devoting more silicon space to support efficient handling
> of misaligned accesses seems to hinge around supporting the huge amounts
> of perfectly good software that already exist, written before programmers
> in general became keenly aware of alignment/efficiency tradeoffs.

Given that that software needs to be recompiled anyway, and is resistant
to big/little endian changes (i.e. doesn't use disgusting tricks), I
don't see how it can have much dependency on struct packing.

> I feel that this must be addressed by chip designers NOW.  The MIPS R2000
> R3000 family have included instructions for this purpose, LWL/LWR, SWL/SWR,
> but these must usually be used in pairs to achieve the desired effect.
> A big problem with such is atomicity of load/store operations, especially
> nasty when the software in question is also being used on one or more CISC
> machines which never exhibit this problem.  Still, MIPS has at least
> provided a starting point for misaligned accesses, while other chip makers
> have yet to address the problem at all (ie. MC88100).

CISC processors have exactly the same problems.  They still have to break
it up into two accesses.  The only problem is that you don't have the
flexibility.  If the second one fails, the VAX will restart both accesses,
while a 68020 will only reexecute the second.  In both cases, an arbitrarily
long time can pass between the first (the first first, in the case of the
VAX) access and the second.  How is this different from what a RISC chip
provides?

I don't quite understand the problem.  Alignment restrictions are almost
zero hassle in most code, are enforced by compilers even on processors
which don't beed them for efficiency reasons (the Microsoft C compiler
inserts pad bytes in the instruction stream to get the target of a branch
instruction word-aligned), and simplify the hardware tremendously.
It seems like win-win.  I never really used what I'm giving up, anyway.
-- 
	-Colin (uunet!microsoft!w-colinp)

cik@l.cc.purdue.edu (Herman Rubin) (03/25/89)

In article <59@microsoft.UUCP>, w-colinp@microsoft.UUCP (Colin Plumb) writes:
> mlord@bnr-public.UUCP (Mark Lord) wrote:
> > The "break into multiple aligned arrays" approach can work fine as long
> > as one is dealing with arrays (of structures) to begin with.  Unfortunately,
> > I suspect that this is not of much use with dynamically allocated linked
> > lists of structures, which is very likely when we are talking about multiple
> > megabytes of data.

			..........................

Here is an actual example where unaligned is at least highly desirable.  One
has an array of bits; we may assume byte alignment is free.  It is necessary
to take a 24-bit block at a time to the right 24 bits of a 32-bit word.  These
24 bits can be anywhere in the block, but this has to be done repeatedly, and
can be vectorized if desired.  Bits used may not be reused.  It is also
necessary to keep track of pointers, as the input array may need to be
refilled, almost without warning and possibly at an unpredictable time. 
The refill procedure is somewhat costly.

I believe that the penalty for unalignment will be cheaper than anything else
if the necessary hardware is there.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

greg@sce.carleton.ca (Greg Franks) (03/26/89)

People debating the merits of unaligned accesses with various
people at BNR saying they are useful...

Having worked in the telephone industry (though not at BNR), I can
relate to their desire to have the processor handle unaligned
accesses.  Why?  Simple - the applications that they are dealing with
are pieces of switching equipment that have hundreds of processors of
various shapes and sizes.  They all like to pass binary data to and
from each other.  Needless to say, the alignment requirements on a
6800 are rather different from those on a 88000.  Furthemore, parts of
the system are upgraded at different times - why toss out $1,000,000 of
smart line cards when you are are updating your $10,000 processor.  

Just to make things difficult, communication channels may complicate
life.  On the switch that I worked on, we could fire 32 bytes of data
through the message switch to any other processor on the system.
Aligning the data may cause the message to grow above 32 bytes thus
requiring another message.  Our message system did not guarantee
delivery, so just imagine the extra cost of tacking a protocol onto
the system to ensure both messages arrived at their destination.  The
loss of one message is no big deal, the loss of half a message is.

To continue, lots and lots of code on the switch was written in
assembler language.  If the world were perfect, one could change
offsets by simply tweaking .h files.  But we all know that the world
is not perfect. :-(.

Finally, telephone switches that I am familiar with do not have
virtual memory.  It costs money and it complicates real-time response.
Oops - we just paged out dial tone - we'll have to wait a couple of
seconds... :-) (not a real example...).  Adding memory is expensive
and a pain.  We spent a good half year PACKING data (by shuffing
fields around, changing the compiler etc) just so we would not have to
jam in more memory cards.  Memory isn't cheap, and bus slots may not
be available.  (All cards are designed in-house, and fit on a special
bus.  Economies of scale aren't there like they are in klones-R-us).

Sorry that I've bored you to death.
-- 
Greg Franks                  Systems and Computer Engineering, 
utzoo!dciem!nrcaer!sce!greg  Carleton University, Ottawa, Ontario, Canada.
       uunet!mitel!sce!greg          greg@sce.carleton.ca
from Ottawa Ont, where life in the fast lane is skating two inners!

mlord@bnr-rsc.UUCP (Mark Lord) (03/27/89)

In article <1989Mar25.013342.22214@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <362@bnr-fos.UUCP> mlord@bnr-public.UUCP (Mark Lord) writes:
>>The debate for devoting more silicon space to support efficient handling
>>of misaligned accesses seems to hinge around supporting the huge amounts
>>of perfectly good software that already exist, written before programmers
>>in general became keenly aware of alignment/efficiency tradeoffs.
>
 [stuff I agree with deleted]
>...  I continue to think that the quantity
>of good software which disregards alignment is grossly overestimated.
>(It is not zero, and attention to it is necessary, but the idea that
>a random software package has a 90% chance of suffering from such
>carelessness is unjustified.)

I partially agree with Henry here, but would like to point out that over
the past six years or so, a lot of systems, including a very large (many
millions of lines) system with which I am intimately familiar, have taken
the migration path from 16-bit to true 32-bit processors.  Such software
tends to be quite well aligned on 16-bit boundaries, but not always well
aligned on 32-bit boundaries.  Again, this thread is dealing with very
large (10's of megabytes) of allocated data, tightly packed into structures
of related fields.  Yes, such software could be revisited and aligned
properly for 32-bit or 64-bit (or whatever-bit) processors, but only at
great expense in both manpower and risk to a perfectly good working package.

From my experience, I do agree that alignment on 16-bit boundaries is 
probably not at issue here.  But a lot of software is still being moved
into the 32-bit age from 16-bit systems.  This will probably not continue
beyond another ten years or so, but it would be nice to have the hardware
support for it while it is still important.

Cheers

-Mark

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If I were writing this for the Company, I'd demand a lot more money for it.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mlord@bnr-rsc.UUCP (Mark Lord) (03/27/89)

In article <59@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
 [lines from my article deleted]
>
>Yes, but if you have at least one 32-bit pointer per code item, odds are
>very good your other data is much more than 32 bits long.  So again, the
>waste (limited to 3 bytes, usually) is trivial.
Very trivial indeed for a single data item.  Now consider a large number
of such data items, dynamically allocated and maintained in data structures
based on linked lists.  Say.. ten million of them.  This could cost between
10 and 30 megabytes of store, simply because everything was 32-bit aligned.

 [lines from my article deleted]
>
>Actually, it can be done by machine.  A simple working algorithm is to sort
>the elements by size, starting with the largest.  And, as I said, the wasted
>space has an upper bound of (greatest alignment restriction)-(size of smallest
>data item), usually 4 and 1 bytes, respectively.  Sometimes 8 and 1.
>
Okay, no problem.  At least part of the process can be automated.  Be careful,
though, quite often structures are used as sub-fields of other structures,
or are declared in minimal (public) form in interface modules, and then in
extended (private) detail in the actual implementation modules (yes, there
are very high level languages which support this and more..).  Also, there
have been clever programmers who did tricky and nasty things, such as over-
laying one structure with another to achieve certain effects, such as outputing
a stream of bytes to a block device very fast.  Such things can be done in
more portable ways at little speed expense, but remember, this is existing
working code we are talking about, and the tricky programmers may have moved
on to other areas/companies.

Note the case of structures being used as sub-fields of other structures has
potential to increase the alignment overhead (or else add considerable 
compiler overhead/headaches) in some cases, as one would probably want to
ensure correct alignment of both structures.

>If the structures aren't dumped, raw, to a file, I would have no hesitation
>reordering structures in my code.
>
Or used as part of messages which are passed on to other devices/systems..
or overlaid with other structures by nasty code (see above).  Such code is 
not always easy to find in a forest of 14 million lines.  Especially when
*none* of it is *my* code.

 [lines from my article deleted]
>
>Given that that software needs to be recompiled anyway, and is resistant
>to big/little endian changes (i.e. doesn't use disgusting tricks), I
>don't see how it can have much dependency on struct packing.
>
That's right.  Just recompile it and it works.  I like it!  How come real
life is seldom this simple?

 [lines from my article deleted]
>
>CISC processors have exactly the same problems.  They still have to break
>it up into two accesses.  The only problem is that you don't have the
>flexibility.  If the second one fails, the VAX will restart both accesses,
>while a 68020 will only reexecute the second.  In both cases, an arbitrarily
>long time can pass between the first (the first first, in the case of the
>VAX) access and the second.  How is this different from what a RISC chip
>provides?
I didn't actually have data faults in mind here.  Of more concern were 
interrupts occuring between the multiple accesses, with the interrupt code
using or modifying the data in question.  Sloppy programming, by modern
standards, perhaps, but we are talking about not-so-modern code here.

>
>I don't quite understand the problem.  Alignment restrictions are almost
>zero hassle in most code, are enforced by compilers even on processors
>which don't beed them for efficiency reasons (the Microsoft C compiler
>inserts pad bytes in the instruction stream to get the target of a branch
>instruction word-aligned), and simplify the hardware tremendously.
>It seems like win-win.  I never really used what I'm giving up, anyway.
Perhaps I've just had to deal with such problems a little too often over the
past five years.  Thanks for your viewpoint.  Perhaps the world IS mostly sane.

>-- 
>	-Colin (uunet!microsoft!w-colinp)
-Mark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I made all of this up for my own benefit, not the Company's.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

les@unicads.UUCP (Les Milash) (03/28/89)

here's a question about "the large body of existing C code"...
how many programmers assume that the address of a struct is the same
as the address of its first member?  or that the members are arranged
in address space in the same order they appear in the struct definition
in the source code?  i wrote a program once that did that first evil,
but now i realize that K&R never claims it'll work, and also there
are legal ways around it that take no time or space.  it seems like
as long as programmers don't do that kind of junk, then (having a
compiler) sort the members of a struct by size is painless and free.

i personally have always been willing to do bizarre things in the name
of performance (like keep 2 copies of registers--now That's unconventional,
but if it helps, great.)

advSORRYance if i screw this all up; i've barely ever posted before.

let me say however, what a pleasure it's been to read your group for
the past half-year; it's so refreshing to hear people who know how it
works talk.  most programmers (even highly-paid ones) wouldn't know
which end of the soldering iron to hold (at least for about 60 seconds);
most folks around here, if they bought more dynamic memory chips
for their mac would ask why you have to malloc() it and can't use
it for stack.

thanks again for illuminating talk and the HOT hardware.  those 98X,
those R2000, those 88K, 29K, those Sun4, that VLIW, killer stuff.  i can't 
believe how far we/YOU've come in the last 5 years.  Running apps on
a Sun4 with an Iris4D as the XServer hauls butt.  Way faster than the
Cyber I used to keypunch for, seems like.

  T--T   Les Milash (w)303-443-6961 graphix s/w slave, Unicad Inc.
 /| /|   (h)303-444-8552348 Arapahoe #17 Boulder CO USA 80302
T-+T |   my opinions are so different from those of my employer :-(
| T+-T   that i often look for a different employer :-)
|/ |/    and that's what these are.
T--T

tim@crackle.amd.com (Tim Olson) (03/28/89)

In article <343@unicads.UUCP> les@unicads.UUCP (Les Milash) writes:
| here's a question about "the large body of existing C code"...
| how many programmers assume that the address of a struct is the same
| as the address of its first member?  or that the members are arranged
| in address space in the same order they appear in the struct definition
| in the source code?  i wrote a program once that did that first evil,
| but now i realize that K&R never claims it'll work, and also there
| are legal ways around it that take no time or space.  it seems like
| as long as programmers don't do that kind of junk, then (having a
| compiler) sort the members of a struct by size is painless and free.

Unfortunately (?) it is not allowed.  Both K&R (section 8.5) and the
pANS (section 3.5.2.1) say that within a structure, the objects declared
have addresses that increase in the order in which they are declared.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

bruce@tigger.wv.tek.com (Bruce Robertson;685-2912;;;tigger) (03/29/89)

Historically, reordering of structure members was disallowed because C
is so often used to talk to hardware.

However, nothing says you can't write a utility to reorder your
structure members at the C source level, optimizing for a particular
architecture.  Obviously, you would have to avoid using this utility
if the order of structure elements matters (you wouldn't want to do it
to the Unix kernel!)
--

	Bruce Robertson
	Domain: bruce@tigger.wv.tek.com
	UUCP:   tektronix!tekecs!tigger!bruce
--

	Bruce Robertson
	Domain: bruce@tigger.wv.tek.com
	UUCP:   tektronix!tekecs!tigger!bruce

slackey@bbn.com (Stan Lackey) (03/29/89)

In article <844@bnr-rsc.UUCP> mlord@bnr-rsc.UUCP (Mark Lord) writes:
>In article <59@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:

>>CISC processors have exactly the same problems.  They still have to break
>>[unaligned accesses] up into two accesses.  
>>flexibility.  If the second one fails, the VAX will restart both accesses,
>>while a 68020 will only reexecute the second.  In both cases, an arbitrarily
>>long time can pass between the first (the first first, in the case of the
>>VAX) access and the second.  How is this different from what a RISC chip
>>provides?
>I didn't actually have data faults in mind here.  Of more concern were 
>interrupts occuring between the multiple accesses, with the interrupt code

Nope!  Processors typically test for interrupts between instructions,
not during.  Unless it is one of the VAX continuable instructions, in
which case it can take one but only after all specifiers are evaluated
(and fetched if appropriate).  This is architecturally defined.
(Disclaimer: Other architectures that support continuable instructions
may do other things.)

Page faults are the only case (that I can think of right now anyway) that
can cause this class of situation.  An interrupt could be taken (after the
fault frame is pushed), but by now that's beside to point.

I can't imagine any processor that would go to the trouble of taking
an interrupt at any arbitrary time.  It would save a few cycles to
recognize the interrupt earlier, at the cost of many many cycles to
save all internal state.  Much added complexity for only a slight
reduction (not a typo) in performance!

>>I don't quite understand the problem.  Alignment restrictions are almost
>>zero hassle in most code, are enforced by compilers even on processors
Then, why were the suppliers (IBM, DEC, Motorola, etc) pressured so much
by programmers to add this feature in the first place?  What has changed?

>>which don't beed them for efficiency reasons (the Microsoft C compiler
>>inserts pad bytes in the instruction stream to get the target of a branch
>>instruction word-aligned), 
Instruction alignment is very far from the current issue.  Some
processors (in fact, ONLY those with variable-size instructions,
typically non-RISCS's) can be benefited by this, in that the first
fetch after a branch will pick up more of a variable-size instruction.

>>and simplify the hardware tremendously.
This is a major part of the point!  If it were a tremendous savings,
it would never have been done in the first place.  Trust me, if you
had a gate-level schematic of a CPU in front of you, you could look at
it for weeks and never find the unaligned logic.  So many pieces of it
you need anyway: rotators to fetch bytes/words, stalls to deal with
non-zero wait state memory systems, traps to handle page faults.

It's just another small cost-small benefit item.  Like a lot of the other 
stuff that's creeping back into RISC's these days.
:-) Stan

huitema@mirsa.inria.fr (Christian Huitema) (03/29/89)

From experience, I can say that not having to bother with alignment can 
make networking code much faster. Messages exchanged over nets include all
sort of variable size components, and padding is both frowned upon and
somewhat ineffective -- often, one cannot predict where a structure will
start. Thus, being able to simply define:

#define integer_value(mess,x) (*(int *)(mess + x))

rather than:

#define integer_value(mess,x) \
	(((((mess[x]<<8)|mess[x+1])<<8)|mess[x+2]<<8)|mess[x+3])

can speed up a lot the decoding of these messages. Indeed, one can object
the relative frequency of network operations vs computing, but the same holds
for compression programs, disk accesses, etc...

Christian Huitema

henry@utzoo.uucp (Henry Spencer) (03/30/89)

In article <BRUCE.89Mar28134751@tigger.wv.tek.com> bruce@tigger.wv.tek.com (Bruce Robertson;685-2912;;;tigger) writes:
>... nothing says you can't write a utility to reorder your
>structure members at the C source level, optimizing for a particular
>architecture.  Obviously, you would have to avoid using this utility
>if the order of structure elements matters (you wouldn't want to do it
>to the Unix kernel!)

If you look at the kernel sources, especially the older parts, you'll
find that there is a remarkable tendency for (e.g.) "char" members of
structs to be grouped together so they will pack densely.
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

njs@scifi.UUCP (Nicholas J. Simicich) (03/30/89)

Another fairly common convention which limits the utility of
reordering structures is that of using a character vector as the last
element of the structure to simulate variable length records.  One
might have a character vector of MAXPATHLEN bytes, but only read or
write the number of bytes actually contained in the filename.


-- 
Nick Simicich --- uunet!bywater!scifi!njs --- njs@ibm.com (Internet)

bcase@cup.portal.com (Brian bcase Case) (03/31/89)

>I can't imagine any processor that would go to the trouble of taking
>an interrupt at any arbitrary time.

The ability to take an interrupt any time falls out of a simple design.
RISCs as a class are an example.

>>>I don't quite understand the problem.  Alignment restrictions are almost
>>>zero hassle in most code, are enforced by compilers even on processors
>Then, why were the suppliers (IBM, DEC, Motorola, etc) pressured so much
>by programmers to add this feature in the first place?  What has changed?

what has changed:  1) the level of expected performance (mostly through
competition), 2) the size of main memory, and 3) the quality of both
hardware and software engineers (the processor designers now know what
the cost of arbitrary alignment really is, so do software designers).

>>>and simplify the hardware tremendously.
>This is a major part of the point!  If it were a tremendous savings,
>it would never have been done in the first place.  Trust me, if you
>had a gate-level schematic of a CPU in front of you, you could look at
>it for weeks and never find the unaligned logic.  So many pieces of it
>you need anyway: rotators to fetch bytes/words, stalls to deal with
>non-zero wait state memory systems, traps to handle page faults.

You missed the point.  The point is that a fetch of an item that spans
a natural boundary requires *two* fetches.  This is slower and requires
annoying special-case consideration in the design.  It complicates
things, paving the way for design errors, and lengthens some critical
paths.  Knowing that two fetches, i.e., extra time, is required for
unaligned accesses, software engineers that are concerned with the
performance of their code won't do it anyway, even if the hardware
supports it!  Those that aren't concerned with the performance of their
code are welcome to solve the problem anyway they like:  use a processor
that supports unalignment, emulate it in software, ....

guy@auspex.UUCP (Guy Harris) (03/31/89)

>From experience, I can say that not having to bother with alignment can 
>make networking code much faster. Messages exchanged over nets include all
>sort of variable size components, and padding is both frowned upon and
>somewhat ineffective -- often, one cannot predict where a structure will
>start.

Well, for TCP, UDP, and IP headers, at least, items are aligned on their
"natural" boundaries.  Not *all* protocols start stuff on random byte
boundaries....

>Thus, being able to simply define:
>
>#define integer_value(mess,x) (*(int *)(mess + x))
>
>rather than:
>
>#define integer_value(mess,x) \
>	(((((mess[x]<<8)|mess[x+1])<<8)|mess[x+2]<<8)|mess[x+3])
>
>can speed up a lot the decoding of these messages.

Umm, the latter appears to have the advantage that it works regardless
of the byte order of your processor; messages exchanged over nets are
often exchanged between big-endian and little-endian processors, so you
can't necessarily just point at some arbitrary location in your message
and suck up an "int", even if your processor *does* support unaligned
references.

>Indeed, one can object the relative frequency of network operations
>vs computing, but the same holds for compression programs,

Well, "compress" (a common Lempel-Ziv compression program) works on
bytes, or on bit strings; the former don't have alignment problems, and
the latter have alignment problems that the sort of unaligned references
support that is being talked about here won't fix....  (The original
"compress" had VAX bit-field instructions jammed into it; it now has
that controlled by "#ifdef vax" - has anybody bothered using other
processor's bit-field instructions, and has it actually made any
difference?)

>disk accesses,

Huh?  "Disk accesses", as I think of them, are usually to sectors,
typically of 512 bytes.  As for data *on* disk, well:

	1) if you write out structures in the native format, they will
	   presumably be aligned properly;

	2) if you write them out in the native format for one machine,
	   and read them in on a different machine, you have problems
	   other than alignment, such as big-endian vs. little-endian
	   format - see the comments above on the networking case;

	3) if you write them out in some "common" format (e.g., XDR, or
	   ASN.1), you're already doing processing work, so you don't
	   just point some structure pointer at the data and go.

bruce@tigger.wv.tek.com (Bruce Robertson;685-2912;;;tigger) (04/01/89)

   In article <BRUCE.89Mar28134751@tigger.wv.tek.com> bruce@tigger.wv.tek.com (Bruce Robertson;685-2912;;;tigger) writes:
   >... nothing says you can't write a utility to reorder your
   >structure members at the C source level, optimizing for a particular
   >architecture.  Obviously, you would have to avoid using this utility
   >if the order of structure elements matters (you wouldn't want to do it
   >to the Unix kernel!)

In article <1989Mar29.232103.6835@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>  If you look at the kernel sources, especially the older parts, you'll
>  find that there is a remarkable tendency for (e.g.) "char" members of
>  structs to be grouped together so they will pack densely.

What I was thinking of is things like structure definitions of stack
frames, device registers, etc.
--

	Bruce Robertson
	Domain: bruce@tigger.wv.tek.com
	UUCP:   tektronix!tekecs!tigger!bruce

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (04/01/89)

So far, people have brought up two hardware solutions:

 - 1. The CPU can just do it.
 - 2. The CPU can complain (interrupt) when someone tries to do it.

No one has mentioned the third alternative.

Let me begin by saying that I don't know if IBM PC/RTs still use the
third method.  When I used the RT (before it was announced) this method
was, ahem, a design feature, obeying, ahem, RISC principles.

The third method was quit simple: screw up.

An unaligned fetch implies two memory fetches. The ROMP chip would do
the low-order one, obtain the low-order bits, and put them in the
low-order part of the target register. The high-order part of the
register would be unchanged.

Someone out there in Netland may be saying that it's OK: the compiler
can just get everything right. Well, we were writing a compiler, and we
didn't get it right, always, at first. It turns out that un-alignedness
is a property that is preserved: add 4 to an unaligned pointer, and it
still is. It also turned out that this could creep in through lots of
little ratholes.

Let me explain to you about the statistics of data. If only the low
order part of a register is changed, there is a high likelihood that
the high order part is still OK. This is because it was probably zero
before, and it probably still should be zero. Alternatively, the
register held a pointer, and its next value was a pointer to a nearby
place.

The upshot is that the compiler could generate bad code, which would
run fine. Test programs would appear to be perfect - until you tried to
have 256 inputs, or linked them with 64KB of library routines, or
whatever else it took to cross that lurking threshold. Then the program
would pick up bizarre behavior that was hard to backtrack.

I don't know who invented this RISC feature, but I had strong feelings
about him at the time.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
-- 

len@synthesis.Synthesis.COM (Len Lattanzi) (04/01/89)

In article <4618@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
:
:So far, people have brought up two hardware solutions:
:
: - 1. The CPU can just do it.
: - 2. The CPU can complain (interrupt) when someone tries to do it.
:
:No one has mentioned the third alternative.
:
[ load low byte(s) into low part of register, leave top alone ]

The MIPS R2000 has this instruction and its cohort load high bytes into
high part of register leaving low alone. I'm not positive but I believe that
f77 will generate them for suitably unaligned common members.


 Len Lattanzi (len@Synthesis.com) <{ames,pyramid,decwrl}!mips!synthesis!len>
 Synthesis Software Solutions, Inc.				1 408 991 0367
 292 Commercial Avenue, Sunnyvale, California 94086