[comp.sys.m68k] quad-aligning the 68020 stack

mikem@uhccux.uhcc.hawaii.edu (Mike Morton) (08/05/88)

Maybe this is old hat to all you '020 hackers, but I've been wondering about
aligning stack frames on four-byte boundaries.  I assume that in certain
critical situations, you should make sure your local variables are aligned
for speed.

Is there a standard way to do this?  I'll post these code fragments, in
hopes of getting someone to post something better.  These are tested on a
68000, since that's all I have.

/*  To quad-align the stack, execute this in-line.  Uses D0.w and leaves
    2 or 4 bytes on the stack. */
    move.w    sp, d0                    /* copy low bits to get alignment */
    addq.w    #2, d0                    /* toggle quad-align bit (#1) */
    and.w     #2, d0                    /* select just that bit */
    sub.w     d0, sp                    /* MIS-align the stack */
    move.w    d0, -(sp)                 /* then fix it and remember bias */

/*  To restore the stack state, execute this in-line. */
    add.w     (sp)+, sp                 /* pop bias and un-bias */
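
For readers without a 68000 at hand, the arithmetic of the two fragments
above can be modeled in C.  This is only a sketch: align_stack,
restore_stack and saved_bias are made-up names, the stack pointer is
treated as a plain integer, and it is assumed to be even on entry.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the in-line alignment sequence.  `sp` stands in for the
   stack pointer; `*saved_bias` stands in for the word pushed on the
   stack.  Both are illustrative, not real hardware state. */
static uint32_t align_stack(uint32_t sp, uint16_t *saved_bias)
{
    uint16_t d0;

    assert((sp & 1) == 0);        /* the model assumes an even sp */
    d0 = (uint16_t)sp;            /* move.w sp, d0 */
    d0 = (uint16_t)(d0 + 2);      /* addq.w #2, d0 */
    d0 &= 2;                      /* and.w  #2, d0 */
    sp -= d0;                     /* sub.w  d0, sp */
    sp -= 2;                      /* move.w d0, -(sp): predecrement... */
    *saved_bias = d0;             /* ...then store the bias */
    return sp;
}

/* Model of the restore sequence: add.w (sp)+, sp */
static uint32_t restore_stack(uint32_t sp, uint16_t saved_bias)
{
    sp += 2;                      /* (sp)+ pops the bias word */
    sp += saved_bias;             /* then the bias is added back */
    return sp;
}
```

Starting from 0x1000 the model ends at 0x0FFC (4 bytes used); starting
from 0x1002 it ends at 0x1000 (2 bytes used).  In both cases
restore_stack returns the original value.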

Is alignment desirable?  Is there a shorter way to do it?  Comments?

 -- Mike Morton // P.O. Box 11378, Honolulu, HI  96828, (808) 456-8455 HST
      Internet: msm@ceta.ics.hawaii.edu
    (anagrams): Mr. Machine Tool; Ethical Mormon; Chosen Immortal; etc.

ditto@cbmvax.UUCP (Michael "Ford" Ditto) (08/07/88)

First of all, on a 68020 system running primarily over a 16-bit bus,
it doesn't really matter.  Such systems are not too uncommon these
days, but they are usually upgraded 680{00,10} systems for which the
software needs to be compatible with the 68000 anyway, so for these
systems you might as well just put your brain in 16-bit mode.

For real 32-bit 68020 systems, I am a firm believer in keeping all
stacks and structures longword aligned at all times.  This means the
C compiler should promote all function arguments to a multiple of 32
bits and pad structures so that longword elements are at aligned
offsets.  Malloc() and similar functions should return longword aligned
storage.

Aside from the delays in accessing misaligned longword data, exception
processing is an important consideration.  Since the 68000 architecture
requires some fairly complex (i.e. slow) exception processing, you
certainly don't want it to run at half speed because the supervisor
stack isn't longword aligned.  This means that interrupt routines must
keep the stack longword aligned so that any nested interrupts aren't
slowed down, and in the case of certain real-time operating systems
which run user tasks in supervisor mode, all user code should keep the
stack aligned as well.  If there is a lot of user software that doesn't
do this, it might be worthwhile to put stack alignment code in the
system entrypoints and interrupt service routines, so that the
performance problems will not occur when the O.S. itself is running.

-- 
					-=] Ford [=-

	.		.		(In Real Life: Mike Ditto)
.	    :	       ,		ford@kenobi.cts.com
This space under construction,		...!ucsd!elgar!ford
pardon our dust.			ditto@cbmvax.commodore.com

mash@mips.COM (John Mashey) (08/07/88)

In article <4431@cbmvax.UUCP> ditto@cbmvax.UUCP (Mike "Ford" Ditto) writes:
....
>For real 32-bit 68020 systems, I am a firm believer in keeping all
>stacks and structures longword aligned at all times.  This means the
>C compiler should promote all function arguments to a multiple of 32
>bits and pad structures so that longword elements are at aligned
>offsets.  Malloc() and similar functions should return longword aligned
>storage....

Right philosophy, but consider thinking further ahead to 64-bit objects,
at least.  Structures should be aligned to doublewords, especially,
or you may later run into the same problem that 68K compilers did,
if they started aligning longwords on 16-bit boundaries, and then had
incompatible structures when shifting to 68020-preferred alignment,
or to RISC machines that really want objects aligned properly.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ditto@cbmvax.UUCP (Michael "Ford" Ditto) (08/08/88)

In article <4431@cbmvax.UUCP> ditto@cbmvax.UUCP (That's me) wrote:
>For real 32-bit 68020 systems, I am a firm believer in keeping all
>stacks and structures longword aligned at all times.

In article <2727@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Right philosophy, but consider thinking further ahead to 64-bit objects,
>at least.
 [ ... ]

I don't see a real reason for this.  Admittedly, the older 68000 compilers
that only aligned to 16 bits were a mistake, but that is because the 68000
has a 32-bit architecture.  I can't imagine an extension to the 68000
architecture that will have objects larger than 32 bits (other than floats)
nor an object-code-compatible chip with a wider-than-32-bits data bus.
64-bit chips and memories are certainly on their way to popularity, but
they really don't have much effect on 680x0 performance decisions, or
vice-versa.

One related subject that should be considered with the 64-bit chips in
mind, however, is that of alignment in portable data structures, i.e.
data files on disk or headers in network packets which will have to
be described and used on machines of various types and sizes.  If you
create a data structure like:

	struct myrecord
	{
	    short   a;
	    long    b;
	    long    c;
	    quad_t  d;		/* Assume quad_t is a 64-bit integer */
	};

you could have some problems on several types of machines.  The first
problem is that on a 16-bit machine like the 8086, for example, the "b"
field is at offset 2, while a 68020 compiler should put it at offset 4
(skipping 2 bytes after "a").  The next problem is analogous for the "d"
field; suppose that on a 68000 quad_t is typedefed as long[2].  The
compiler would put "d" at offset 12, which is fine for a 68020, but a
machine with instructions to access 64-bit data on a 64-bit bus would
work faster if "d" were at offset 16, a 64-bit boundary.

What this all means, is that such structures should be *explicitly*
padded so that they will have a high probability of having the same
meaning on any machine.  In the above example, a "short" dummy field
should be placed after "a", and a "long" dummy field placed after "c".
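
The explicit padding described above might look like this in C.  A sketch
only: it assumes a compiler that provides the fixed-width types of
<stdint.h> (int64_t standing in for quad_t), and pad1, pad2 and
layout_is_as_intended are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Explicitly padded version of the record.  Every field sits at a
   multiple of its own size, so no compiler should need to add padding
   of its own. */
struct myrecord_padded {
    int16_t a;      /* offset 0 */
    int16_t pad1;   /* offset 2: dummy field, fills the gap before b */
    int32_t b;      /* offset 4 */
    int32_t c;      /* offset 8 */
    int32_t pad2;   /* offset 12: dummy field, fills the gap before d */
    int64_t d;      /* offset 16: a 64-bit boundary */
};

/* Returns 1 if this compiler laid the record out with no hidden padding. */
static int layout_is_as_intended(void)
{
    return offsetof(struct myrecord_padded, b) == 4
        && offsetof(struct myrecord_padded, c) == 8
        && offsetof(struct myrecord_padded, d) == 16
        && sizeof(struct myrecord_padded) == 24;
}
```

Checking the offsets at build or start-up time is cheap, and it catches a
compiler that disagrees before any data is written.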

This is not only for efficiency reasons, but also because a larger
machine might not have a way to declare a structure with a structure-
packing method that is inefficient for its architecture.  For example,
if I write a program on my Z-80 (8-bit) system that writes its output
file in this format:

	struct myrecord
	{
	    char    a;		/* 8 bits at byte offset 0 */
	    short   b;		/* 16 bits at byte offset 1 */
	};

It is impossible to declare or reference that structure type in a
reasonable way on a 68000.  A wise programmer would have put a
one-byte pad field between a and b.
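
One way to cope with such a file on a machine that cannot declare the
packed layout is to skip the struct overlay and decode the bytes by hand.
The sketch below assumes the Z-80 program wrote the 16-bit field low byte
first and that a char is 8 bits; myrecord_unpack and MYRECORD_FILE_SIZE
are hypothetical names.

```c
#include <assert.h>

/* In-memory form; the compiler is free to pad this however it likes. */
struct myrecord {
    char  a;
    short b;
};

#define MYRECORD_FILE_SIZE 3   /* 1 byte for a, 2 bytes for b, no padding */

/* Decode one record from its 3-byte file image.
   Assumption: b was written low byte first. */
static void myrecord_unpack(const unsigned char *buf, struct myrecord *out)
{
    long v = ((long)buf[2] << 8) | (long)buf[1];

    /* Sign-extend the 16-bit value without relying on the width of short. */
    if (v >= 0x8000L)
        v -= 0x10000L;

    out->a = (char)buf[0];
    out->b = (short)v;
}
```

The file image { 'A', 0x34, 0x12 } decodes to a = 'A', b = 0x1234, and no
misaligned access is ever made.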

haugj@pigs.UUCP (Joe Bob Willie) (08/09/88)

In article <2194@uhccux.uhcc.hawaii.edu> mikem@uhccux.uhcc.hawaii.edu (Mike Morton) writes:
>
>Maybe this is old hat to all you '020 hackers, but I've been wondering about
>aligning stack frames on four-byte boundaries.  I assume that in certain
>critical situations, you should make sure your local variables are aligned
>for speed.
>
>Is there a standard way to do this?  I'll post these code fragments, in
>hopes of getting someone to post something better.  These are tested on a
>68000, since that's all I have.

[ long winded example deleted ]

real simple:

to align the stack on entry:

	link	a6,#0		| set up your frame pointer like always ;-)
	move.l	sp,d0		| copy sp to scratch register
	and.l	#-4,d0		| lop off low two bits (mod 4 alignment)
	move.l	d0,sp		| restore sp with now-aligned version

to unalign the stack on exit:

	unlk	a6		| restore stack pointer to value on entry
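
The masking step, and its generalization to other power-of-two
boundaries, can be written in C as below.  A sketch: align_down is an
illustrative name, and rounding down is the right direction only because
the 68000 family's stack grows toward lower addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to a multiple of `align`, which must be a
   power of two.  With align == 4 this clears the low two bits. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

align_down(0x1002, 4) gives 0x1000, and an already aligned value is
returned unchanged.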

>Is alignment desirable?  Is there a shorter way to do it?  Comments?

yes, alignment is desirable on the 68020 and 030.  there is no point in
doing 4 byte alignment on the 68000 or 010, and anything less than 2 byte
alignment (an odd stack pointer) causes address errors.  see my method for
a definition of `short', i believe you can't get any shorter since the and
instruction won't take an address register as an operand (will it on the
020? 030?)
-- 
 jfh@rpp386.uucp	(The Beach Bum at The Big "D" Home for Wayward Hackers)
     "Never attribute to malice what is adequately explained by stupidity"
                -- Hanlon's Razor

ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) (08/09/88)

In article <4434@cbmvax.UUCP> ford@kenobi.cts.com (Mike "Ford" Ditto) writes:
>One related subject that should be considered with the 64-bit chips in
>mind, however, is that of alignment in portable data structures, i.e.
>data files on disk or headers in network packets which will have to
>be described and used on machines of various types and sizes.
....
>What this all means, is that such structures should be *explicitly*
>padded so that they will have a high probability of having the same
>meaning on any machine.
>if I write a program on my Z-80 (8-bit) system ...
...[Z-80 packed struct declaration omitted]
>It is impossible to declare or reference that structure type in a
>reasonable way on a 68000.  A wise programmer would have put a
>one-byte pad field between a and b.
I disagree.  I'd rather pack my structs in the most reasonable way for 
that machine, and convert to a canonical representation (like that specified 
by SUN's XDR, for example)  for the networked or other heterogeneous 
environment.  Even if the bits are all addressable, some machines
have different byte orders (LSB of a short in the 'high' byte).

In Sun's Unix, for example, you have the 'htonl' and various byte ordering 
macros and routines for working this all out.
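
As a rough illustration of the htonl approach: the sketch below assumes a
POSIX-style <arpa/inet.h> and the fixed-width uint32_t type; put_net32
and get_net32 are made-up helper names.  memcpy is used so that the
buffer itself need not be aligned.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value into buf in network (big-endian) byte order. */
static void put_net32(unsigned char *buf, uint32_t host_value)
{
    uint32_t net_value = htonl(host_value);
    memcpy(buf, &net_value, sizeof net_value);
}

/* Fetch a 32-bit network-order value from buf into host order. */
static uint32_t get_net32(const unsigned char *buf)
{
    uint32_t net_value;
    memcpy(&net_value, buf, sizeof net_value);
    return ntohl(net_value);
}
```

Whatever the host's own byte order, put_net32(buf, 0x01020304) leaves the
bytes 01 02 03 04 in the buffer.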

Someone will always come up with an architecture that won't fit in
with your packing scheme.  If you find one that does happen to match, 
consider yourself lucky and enjoy the potential performance advantage.

-- 
					- Ralph W. Hyre, Jr.

Internet: ralphw@ius2.cs.cmu.edu    Phone:(412)268-{2847,3275} CMU-{BUGS,DARK}
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA

henry@utzoo.uucp (Henry Spencer) (08/10/88)

In article <4434@cbmvax.UUCP> ford@kenobi.cts.com (Mike "Ford" Ditto) writes:
>I don't see a real reason for this.  Admittedly, the older 68000 compilers
>that only aligned to 16 bits were a mistake, but that is because the 68000
>has a 32-bit architecture.  I can't imagine...
>an object-code-compatible chip with a wider-than-32-bits data bus.

Why not?  It wouldn't be as big a win as the jump from 16 to 32, but for
caches in particular, wider is better in memory buses.  I wouldn't be at
all surprised to see the 68050 or whatever with a 64-bit memory bus.
There's no reason why this wouldn't be object-code-compatible; the only
things that really care how wide the memory accesses are, usually, are
the I/O devices, which most code doesn't touch anyway.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ditto@cbmvax.UUCP (Michael "Ford" Ditto) (08/10/88)

In article <2641@pt.cs.cmu.edu> ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) writes:
>In article <4434@cbmvax.UUCP> ford@kenobi.cts.com (Mike "Ford" Ditto) writes:
>>One related subject that should be considered with the 64-bit chips in
>>mind, however, is that of alignment in portable data structures, i.e.
>>data files on disk or headers in network packets which will have to
>>be described and used on machines of various types and sizes.

 [ ... ]

>I disagree.  I'd rather pack my structs in the most reasonable way for 
>that machine, and convert to a canonical representation (like that specified 
>by SUN's XDR, for example)  for the networked or other heterogeneous 
>environment.

That is what I said, isn't it?  If not, I'll summarize:  Don't worry about
aligning data for future (or any other) cpu architectures unless you are
designing a machine-independent data format.  Just use whatever works best
on the machine it's running on.  On the 68020 this means align longword
AND 64-bit data on 32-bit boundaries.

>In Sun's Unix, for example, you have the 'htonl' and various byte ordering 
>macros and routines for working this all out.

Even these aren't enough for the assortment of machines existing today.
What about 64-bit machines where both "int" and "long" are 64 bits?
Should htonl() swap only the bottom 4 bytes or all 8?  And how do you
declare a "network long integer" (32 bits) anyway?
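
One possible answer to the last question, sketched in C: never overlay a
long on the wire format at all, and instead write exactly four octets,
most significant first, from whatever integer type the machine offers.
put_be32 and get_be32 are illustrative names, not part of any library.

```c
#include <assert.h>

/* Write the low 32 bits of v as four octets, most significant first.
   Behaves the same whether unsigned long is 32 or 64 bits wide. */
static void put_be32(unsigned char *buf, unsigned long v)
{
    buf[0] = (unsigned char)((v >> 24) & 0xFF);
    buf[1] = (unsigned char)((v >> 16) & 0xFF);
    buf[2] = (unsigned char)((v >> 8) & 0xFF);
    buf[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the value from four octets, most significant first. */
static unsigned long get_be32(const unsigned char *buf)
{
    return ((unsigned long)buf[0] << 24)
         | ((unsigned long)buf[1] << 16)
         | ((unsigned long)buf[2] << 8)
         |  (unsigned long)buf[3];
}
```

Because the code only shifts and masks, it makes no assumption about the
host's byte order or about how wide "long" happens to be.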

>Someone will always come up with an architecture that won't fit in
>with your packing scheme.  If you find one that does happen to match, 
>consider yourself lucky and enjoy the potential performance advantage.

I think that sums it up quite well.

haugj@pigs.UUCP (Joe Bob Willie) (08/11/88)

In article <1988Aug9.175440.2320@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
}In article <4434@cbmvax.UUCP> ford@kenobi.cts.com (Mike "Ford" Ditto) writes:
}>                            I can't imagine...
}>an object-code-compatible chip with a wider-than-32-bits data bus.
}
}Why not?  It wouldn't be as big a win as the jump from 16 to 32, but for
}caches in particular, wider is better in memory buses.  I wouldn't be at
}all surprised to see the 68050 or whatever with a 64-bit memory bus.

me, i'm holding out for when motorola creates a cray zmp compatible chip
with a 256 bit wide bus.  i've been aligning everything, including 
character data on 256 bit boundaries.  the performance improvement should
be astounding in three or four years!

mash@mips.COM (John Mashey) (08/11/88)

In article <4434@cbmvax.UUCP> ford@kenobi.cts.com (Mike "Ford" Ditto) writes:
>In article <4431@cbmvax.UUCP> ditto@cbmvax.UUCP (That's me) wrote:
>>For real 32-bit 68020 systems, I am a firm believer in keeping all
>>stacks and structures longword aligned at all times.
>
>In article <2727@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>>Right philosophy, but consider thinking further ahead to 64-bit objects,
>>at least.
> [ ... ]
>
>I don't see a real reason for this.  Admittedly, the older 68000 compilers
>that only aligned to 16 bits were a mistake, but that is because the 68000
>has a 32-bit architecture.  I can't imagine an extension to the 68000
>architecture that will have objects larger than 32 bits (other than floats)
>nor an object-code-compatible chip with a wider-than-32-bits data bus.
>64-bit chips and memories are certainly on their way to popularity, but
>they really don't have much effect on 680x0 performance decisions, or
>vice-versa.

No major disagreement, i.e., the main motivation was for the broadest
possible portability  (as I have no personal interest in improving
the speed of 68K-based systems :-).  In some systems that
one might design with external caches, it might be a performance or cost
improvement if 64-bit things were usually on the correct boundaries.
(Very minor). 

daveh@cbmvax.UUCP (Dave Haynie) (08/24/88)

in article <4486@cbmvax.UUCP>, ditto@cbmvax.UUCP (Michael "Ford" Ditto) says:

> That's why I say I can't imagine an object-code-compatible chip with a
> wider-than-32-bits data bus.  You might as well switch to an incompatible
> instruction set and/or entire architecture so that you could actually
> gain from the wider bus (add 64-bit registers, 64-bit ALU, 64-bit data
> movement instructions, etc. and then there's every reason to switch to
> "The Big Bus").

I wouldn't expect a complete 64 bit bus on a 680x0, at least in the way
you're thinking, and don't really see much need for one.  However, there
is a use for wider buses.  If you extend the internal Harvard architecture
of the 68030 to two external 32 bit buses, one for I, one for D, you could
probably get a real performance improvement.  External caches could funnel
both buses together to form a single 32 bit system bus, much like the 88k
does.  That's certainly not in the works for the 68040, though if Moto
goes on to the 68050 it might make some sense.  By then they should be able
to build some really nice external cache chips.


-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

andrew@frip.gwd.tek.com (Andrew Klossner) (08/26/88)

	"If you extend the internal Harvard architecture of the 68030
	to two external 32 bit buses, one for I, one for D, you could
	probably get a real performance improvement.  External caches
	could funnel both buses together to form a single 32 bit system
	bus, much like the 88k does."

To finish the thought: if you implement 88k-like caches, you'll have
16-byte cache lines, and you'll get a slight performance improvement if
you align the stack frame to a multiple of 16 bytes (because your stack
frame will use the smallest possible number of cache lines).
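
The effect can be checked with a little arithmetic.  The sketch below
counts how many 16-byte lines a frame touches; lines_spanned and
CACHE_LINE are illustrative names, and the 16-byte line size is the 88k
figure mentioned here.

```c
#include <assert.h>

#define CACHE_LINE 16UL   /* line size in bytes */

/* Number of cache lines touched by `size` bytes starting at `addr`.
   `size` must be at least 1. */
static unsigned long lines_spanned(unsigned long addr, unsigned long size)
{
    assert(size > 0);
    return (addr + size - 1) / CACHE_LINE - addr / CACHE_LINE + 1;
}
```

A 32-byte frame starting at 0x1000 touches 2 lines; the same frame
starting at 0x1004 touches 3.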

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]