[net.lang.c] structure alignment question

rbbb@rice.EDU (David Chase) (09/04/86)

Suppose I have a structure "foo" and function returning a "struct foo".

    struct foo {
	int a,b;
	};
    struct foo bar() {
	struct foo baz;
	...
	return baz;
	};

Note that the structure contains two ints.  I will assume that an "int" is
the same size as some machine quantity called a "word", and that
word alignment is desirable.  I assume therefore that all instances of the
structure generated by the compiler will be word-aligned.

C compilers that I have seen supporting structure-valued functions pass a
hidden parameter to the function containing the address to store the
result.

Can a compiler assume that the storage for the result is word-aligned?
(That is, can a compiler make the same assumptions when returning a
structure that it does when assigning a structure?)
If so, why?  If not, why not?

David

karl@haddock (09/08/86)

rbbb@rice.EDU (David) writes:
>C compilers that I have seen supporting structure-valued functions pass a
>hidden parameter to the function containing the address to store the result.
>Can a compiler assume that the storage for the result is word-aligned?
>(That is, can a compiler make the same assumptions when returning a
>structure that it does when assigning a structure?)

Certainly.  It can make this assumption for *any* expression of type "pointer
to struct foo", whether it's explicit or (as in this example) generated by
the compiler.  *All* instances of a type (including those generated by a
pointer dereference) have the alignment of that type.  The user can sometimes
circumvent this (type punning with casts or unions), but the result is
undefined.

Karl W. Z. Heuer (ima!haddock!karl; karl@haddock.isc.com), The Walking Lint

notes@hcx1.UUCP (09/14/86)

Background:  Our architecture has the following data types and sizes:

               byte   -- 1 byte
	       short  -- 2 bytes
	       int    -- 4 bytes
	       long   -- 4 bytes
	       *ptr   -- 4 bytes

Question:  Should structures and unions always be aligned to 4-byte
           boundaries?

Consider:   Our portable C compiler aligns this structure as follows:

     struct a
     {
        char b ;        ->  byte 0
        union aa
        {
           short bb;    ->  byte 4
           char cc;
        } dd ;
        char c;         ->  byte 8
     } d;

I've been told that on a VAX, however, 'bb' is placed on byte 2 and
'c' is placed on byte 4.  Is this true?  How do other compilers handle
this case?  Which way is "correct"?  If there is sufficient interest,
I will post a summary of responses.

Dave Ray  --  uucp:  {ucf-cs|allegra}!novavax!hrshcx!hcx1!daver

guy@sun.uucp (Guy Harris) (09/17/86)

> Background:  Our architecture has the following data types and sizes:
> (fairly "standard" 32-bit architecture description, which could very well
> be CCI's Power 6; Harris (no relation) sells this machine under their own
> name)
>
> Question:  Should structures and unions always be aligned to 4-byte
>            boundaries?
> Consider:   Our portable C compiler aligns this structure as follows:
> ...
>         short a;        ->  byte 0
>         union aa
>         {
>            short bb;    ->  byte 4
>            char cc;
>         } dd ;
>         char c;         ->  byte 8
> 
> I've been told that on a VAX, however, 'bb' is placed on byte 2 and
> 'c' is placed on byte 4.  Is this true?

Yes, at least in the 4.3BSD version of the compiler, although the other
versions are almost certainly not different enough to align the values
differently.

> How do other compilers handle this case?

Sun's 68000-family compiler does exactly the same thing, although
(obviously) "struct a"s can't be blindly copied between VAXes and Suns.

> Which way is "correct"?

K&R says:

	8.5 Structure and union declarations

	...Each non-field member of a structure begins on an addressing
	boundary appropriate to its type...

K&R does not seem to say anything about the addressing boundaries
appropriate to structures or unions in the Reference Manual, but it implies
that the addressing boundary appropriate to a union, at least, is that of
the member with the most restrictive addressing boundary on page 174 of
chapter 8, where it suggests that in order to align a given structure on a
boundary more restrictive than its "natural" boundary, you place it in a
union with an object with the desired alignment boundary.

The ANSI C draft seems to say much the same thing, and gives little
additional detail.  (If it's stored somewhere not indexed under "alignment",
let me know; I only looked in the Index under "alignment".)

The 4.3BSD VAX C compiler (and, I suspect, all other UNIX VAX C compilers)
follow this rule for structures, as well; if you declare a structure
containing only one member, a "char", the size of the structure will be one
byte and the structure will be aligned only on a byte boundary.

In PCC, the minimum alignment requirement of a structure (or union) is given
by the #define constant "ALSTRUCT"; this is 8 (bits) in the VAX PCC and in
our PCC.  I suspect the person who did the compiler for the machine in
question may have gotten confused and figured that if the machine requires
4-byte alignment of "int"s, it also requires 4-byte alignment of "struct"s
and "union"s (the CCI Power 6 does require 4-byte alignment of "int"s, and
2-byte alignment of "short"s), and set ALSTRUCT to 32 (bits), not realizing
that this is a *minimum*, not *maximum* alignment requirement.

It may be too late to change it now, given that a lot of code has been
compiled for that machine and a lot of on-disk data structures built using
that alignment.  I'd at least try, though.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

mash@mips.UUCP (John Mashey) (09/19/86)

In article <7363@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> ...in reply to:
>> Question:  Should structures and unions always be aligned to 4-byte
>>            boundaries? <....example follows:...>
>
>Yes, at least in the 4.3BSD version of the compiler, ....
Actually, some might be aligned to 8-byte boundaries, on machines that
like doubles on such boundaries.

>K&R does not seem to say anything about the addressing boundaries
>appropriate to structures or unions in the Reference Manual, but it implies
>that the addressing boundary appropriate to a union, at least, is that of
>the member with the most restrictive addressing boundary on page 174 of
>chapter 8, where it suggests that in order to align a given structure on a
>boundary more restrictive than its "natural" boundary, you place it in a
>union with an object with the desired alignment boundary.
>
>The ANSI C draft seems to say much the same thing, and gives little
>additional detail.  (If it's stored somewhere not indexed under "alignment",
>let me know; I only looked in the Index under "alignment".)
>
Guy is correct, as usual.  The key info is (trickily) found under sizeof:
ANSI C, section C.3.3.4 The sizeof operator, says:
"... When applied to an operand that has structure or union type, the result is
the total nubmer of bytes in such a member considered as a member of an array,
including whatever internal and trailing paddiing might be needed to align each
member in such an array properly.... Another use of the sizeof operator is to
compute the number of members in an array:  sizeof array / sizeof array[0];

Note that the net combination requires that:
	a) Any union or structure must be aligned on the most restrictive
	boundary of any of its members, including member structures.
	b) It's size must be rounded up to a multiple of that same size.
Anything else causes the above usage of sizeof to be unrobust in the face of:
	struct { ..... } array[number];

Note that for speed, it is usually best to align data items on the most
restrictive boundaries, even if a specific machine implementation doesn't
really require it.  Lots of machines let words be on arbitrary boundaries,
but you pay for it in speed, if not now, then later, as CPUs come
to have wider data paths.  A perfect case is the surprise going from
680[01]s to 68020s: if you let long's be aligned on 2-byte, but not necesarily
4-byte boundaries, and if you want to convert to 68020s without converting
data saved in such format, you lose performance you might have had.
(Of course, some space is saved; in practice, it's seldom large.)

Question for the audience: what implementations thought ahead about potential
64-bit buses and aligned doubles on 8-byte boundaries? try this:
main(argc, argv) {
	struct x {
		long i;
		double j;
	} y;
	printf("%x,%x, %d\n", &y.i, &y.j, sizeof y);
}
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

guy@sun.uucp (Guy Harris) (09/19/86)

> >Yes, at least in the 4.3BSD version of the compiler, ....
> Actually, some might be aligned to 8-byte boundaries, on machines that
> like doubles on such boundaries.

By way of clarification, my "yes" was to "yes, the VAX aligns the 'union'
containing a 'short' and a 'char' on a 2-byte, not a 4-byte, boundary."
Structures need not always be aligned on the most restrictive boundary
required by the architecture or imposed by the compiler on scalar data, but
it must align a structure so that all its members are aligned, so that it is
aligned on the most restrictive boundary of all its members.

PCC starts out aligning structures on a "minimum alignment required by all
structures" boundary, and then marches through the member list strengthening
the alignment requirement as necessary.  I presume the reason that the
"minimum alignment required by all structures" is not hardcoded to "one
byte" is that there might be machines that prefer, or require, structures
aligned more stringently.  E.g., a machine might have
"register+displacement" addressing where the displacement is in units of
8-byte quadwords, or the implementor might have decided to speed up
structure assignment by using some trick that can copy data faster if it's
aligned on a quadword boundary.  It was *not* intended for machines with
more stringent restrictions on scalar alignment, as the aforementioned march
handles that problem already.

> Note that for speed, it is usually best to align data items on the most
> restrictive boundaries, even if a specific machine implementation doesn't
> really require it.  Lots of machines let words be on arbitrary boundaries,
> but you pay for it in speed, if not now, then later, as CPUs come
> to have wider data paths.

Yes, the VAX PCC aligns "int"s and "long"s on 4-byte boundaries, even though
the machine can fetch longwords off such a boundary; there is, I'd bet, a
performance penalty for such a fetch.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

tiemann@mcc-pp.UUCP (Michael Tiemann) (09/20/86)

In article <7447@sun.uucp>, guy@sun.UUCP writes:
> > >Yes, at least in the 4.3BSD version of the compiler, ....
> > Actually, some might be aligned to 8-byte boundaries, on machines that
> > like doubles on such boundaries.
>  [... text deleted ...]
> > Note that for speed, it is usually best to align data items on the most
> > restrictive boundaries, even if a specific machine implementation doesn't
> > really require it.  Lots of machines let words be on arbitrary boundaries,
> > but you pay for it in speed, if not now, then later, as CPUs come
> > to have wider data paths.
> 
> Yes, the VAX PCC aligns "int"s and "long"s on 4-byte boundaries, even though
> the machine can fetch longwords off such a boundary; there is, I'd bet, a
> performance penalty for such a fetch.
> -- 
> 	Guy Harris
> 	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
> 	guy@sun.com (or guy@sun.arpa)

I was waiting for SOMEBODY to talk performace for once. The last 68000
compiler I used aligned strings on WORD boundaries. This would cost
one byte per string, half the time. But there was a big speed payoff:
I could do word operations in my strnlen, strncmp, strncpy, and whatever
other string processing functions I happened to write. All this code ported
to a Sun-3 no (apparent) problem, but crashed on a Sun-2, because the SUN
compiler allocated strings in a totally stingy 1-byte alignment.
So now, all this "fast" code actually runs slower than a "dumb" byte-copy
model, because the 68020 faults itself to death reading in 32-bit words
on odd boundaries, and doesn't run at all on a Sun-2 because the 68010
can read odd words.

Now I *know* you're not *supposed* to do things like that, but often
high performance comes not by reading some 20 year old manual, but
by analyzing the machine at hand, and trying to more than just
accomodate it. Tune those compilers, comment those machine dependencies,
and watch the code fly!

How many other compilers are so parsimonious? Is this something we should
stand for, or are we trying to get away from performance in the name
of cleanliness? I suppose that this letter was a bit of a flame, sorry.
Y'all are welcome to flame me back,

Michael
tiemann@mcc.com

chris@umcp-cs.UUCP (Chris Torek) (09/21/86)

In article <1705@mcc-pp.UUCP> tiemann@mcc-pp.UUCP (Michael Tiemann) writes:
>... The last 68000 compiler I used aligned strings on WORD boundaries.
>This would cost one byte per string, half the time. But there was
>a big speed payoff: I could do word operations in my strnlen,
>strncmp, strncpy, and whatever other string processing functions
>I happened to write. ... all this "fast" code actually runs slower
>than a "dumb" byte-copy model [on a Sun-3], because the 68020 faults
>itself to death reading in 32-bit words on odd boundaries, and
>doesn't run at all on a Sun-2 because the 68010 can read odd words.

(Does the 68020 really fault?  I thought it just did two bus accesses.)

It is not difficult to do copies in word mode iff the strings
are aligned:

	| Sun mnenonics

	| /*LINTLIBRARY*/
	| strcpy(to, from) char *to, *from; { *to = *from; return (to); }
	| /*UNTESTED!*/
		ENTRY(strcpy)
	TO	=	a0		| I think this works
	FROM	=	a1
		movl	sp@(4),TO	| to
		movl	sp@(8),FROM	| from
	| I forget if this is legal.  If not, copy to d0 first.
		btst	#0,TO		| test for odd destination
		bnes	odd0		| handle odd dst, unknown src
		btst	#0,FROM		| test for odd source
		bnes	hardway		| handle even dst, odd src

	| both addresses are even; do a fast strcpy
	fastcopy:
		movw	FROM@+,d0	| grab entire word
		movw	d0,d1		| need to test high byte first
		lsrw	#8,d1		| throw out low byte
		beqs	fastend		| if high byte zero, go terminate dst
		movw	d0,TO@+		| copy entire word
		tstb	d0		| and see if we are now done
		bnes	fastcopy	| do more if not
		movl	sp@(4),d0	| set return value
		rts			| and return
	fastend:
		movql	#0,d0
		movb	d0,TO@		| terminate destination string
		movl	sp@(4),d0	| set return value
		rts			| and return

	odd0:
		btst	#0,FROM		| test for odd source
		beqs	hardway		| handle odd dst, even src
		movb	FROM@+,TO@+	| copy one byte to make even
		bnes	fastcopy	| and do rest with fast copy
		movl	sp@(4),d0	| set return value
		rts			| and return

	| one address is even, the other odd, so do it a byte at a time.
	hardway:
		movl	TO,d0		| set return value
	hardloop:
		movb	FROM@+,TO@+	| copy ...
		bnes	hardloop	| until we copy a null
		rts			| return

I wonder, though, if this is truly faster.  Should not a movb/bnes
pair run in loop mode?  (Perhaps not; `dbcc' loops do, though, and
one could use a dbra surrounded by a bit of extra logic.)  Machine
dependent `fast' code is often CPU dependent as well, and one must
be prepared to modify marked inner loops when moving among implem-
entations of one architecture.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

guy@sun.uucp (Guy Harris) (09/22/86)

> The last 68000 compiler I used aligned strings on WORD boundaries.
> This would cost one byte per string, half the time. But there was a
> big speed payoff: I could do word operations in my strnlen, strncmp,
> strncpy, and whatever other string processing functions I happened to
> write.

Oh, really?

	char	string1[] = "foo";
	char	string2[] = "xfoo";

	return(strcmp(string1, string2 + 1));

If you can do this with straightforward word operations, and not take a
performance hit relative to byte-by-byte copies, you've got a really clever
"strcmp".  (Hint:  my 68010 manual does *NOT* mention any "swap bytes"
instruction, and the 68010 doesn't have a barrel shifter.)

Furthermore,

	char	string3[128+1];

	strcpy(string3, string1);

would, if it copied a word at a time, have to deal both with words of the
form

	00xx

(when it recognizes a word of that form, it should stop *before* copying it
and stuff a zero byte at the end of the target) and of the form

	xx00

(when it recognizes a word of that form, it should stop *after* copying it).

> All this code ported to a Sun-3 no (apparent) problem, but crashed on a
> Sun-2, because the SUN compiler allocated strings in a totally stingy
> 1-byte alignment.

It would also have crashed on the first example given above *regardless* of
whether the Sun compiler allocated strings on 1-byte or 2-byte boundaries.
If your routines couldn't handle arguments off byte boundaries, I'm sorry,
but they were *not* correct implementations of the string routines!
*!NOTHING*!  guarantees that the arguments to the string routines always
point to the *first* byte of a string.  Furthermore, nothing guarantees that
"strcat" and "strncat" will always do aligned copies, even if the arguments
always point to the first byte of a string; the reason for this should be
obvious.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

karl@haddock (09/22/86)

>In article <1705@mcc-pp.UUCP> tiemann@mcc-pp.UUCP (Michael Tiemann) writes:
>>The last 68000 compiler I used aligned strings on WORD boundaries. ...
>>There was a big speed payoff: I could do word operations in [strncpy, etc.]

Wrong!  Even if strings are word-aligned, substrings (e.g. "p+1") will not
be.  (Unless the chars themselves are word-aligned, in which case your space
penalty is a factor of two instead of a half byte.)

umcp-cs!chris (Chris Torek) writes:
>It is not difficult to do copies in word mode iff the strings are aligned:
>[example of a strcpy() that considers four alignment cases]

Even if the strings are word-aligned, you have to test for a null *byte*
(not a zero word); my guess is that the extra manipulation inside the loop
will make it slower than a byte-by-byte copy.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

tiemann@mcc-pp.UUCP (Michael Tiemann) (09/22/86)

In article <7479@sun.uucp>, guy@sun.uucp (Guy Harris) writes:
> > The last 68000 compiler I used aligned strings on WORD boundaries.
> > This would cost one byte per string, half the time. But there was a
> > big speed payoff: I could do word operations in my strnlen, strncmp,
> > strncpy, and whatever other string processing functions I happened to
> > write.
> 
> Oh, really?
> 
> 	char	string1[] = "foo";
> 	char	string2[] = "xfoo";
> 
> 	return(strcmp(string1, string2 + 1));
> 
> If you can do this with straightforward word operations ...
> [ ... USW ... ]
> *!NOTHING*!  guarantees that the arguments to the string routines always
> point to the *first* byte of a string.  Furthermore, nothing guarantees that
> "strcat" and "strncat" will always do aligned copies, even if the arguments
> always point to the first byte of a string; the reason for this should be
> obvious.
> -- 
> 	Guy Harris
> 	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
> 	guy@sun.com (or guy@sun.arpa)

What I meant was... There are times when I *can* (can-to can-to can-to)
guarantee that I am refering to the first byte of a string. Sure,
when I am doing SUB-string operations I have to be prepared to
pay the price, but if I happen only to want to do STRING operations,
where I *always* refer *only* to the string as a whole, then it
sure would be nice to be rewarded for passing that extra knowledge
on to my compiler.

When I don't know what I'm dealing with, I use strcmp, strcat, etc,
(and I expect them do make pessimistic decisions).
I wrote my own C routines "fmove", "fzero" which are like
"bcopy" and "bzero", except I only call them when I know I'll
be using word aligned things. Also, "fcmp" depends on the fact that a
word-aligned string is terminated by a null word (too much to
ask of a compiler to do that: possible (2*sizeof(int)-1 byte overhead),
but if they're two of MY strings, I call it.

The point is, a word aligned string does not seem to be too much
to ask for, and the performance improvement IS noticable:
Lets be hypothetical for a moment:

int lookup(s)
     char *s;
{
    int where;

    switch (*(int *)s) {
      case 0x41424300:		/* "abc" */
	return ABC;
      case 0x78797a00:		/* "xyz" */
	return XYZ;
      default:
	/* must look up in symbol table... */
	do { ... };
	return where;
    }
}

Keep your reserved words < 4 characters and off you go!
No flames about "portability": these were handled by a little
emacs function (generates hex and ascii comment for short strings),
and each emacs (VAX, SUN, whatever) has its own version.

"A little knowledge is a dangerous thing"... programmers should
*never* learn about machine architectures: they will only want to
write fast code...

I hope that this time I have clarified my position: everybody can
now say I'm wrong (of course I am) or right.

Michael
tiemann@mcc.com

"I'm that C programmer your mother warned you about"

guy@sun.uucp (Guy Harris) (09/22/86)

> (Does the 68020 really fault?  I thought it just did two bus accesses.)

"fault" was a poor choice of words on his part; it already means something,
namely a trap.  The '020 doesn't fault, it just does two bus accesses.

> 	| strcpy(to, from) char *to, *from; { *to = *from; return (to); }
> 	| /*UNTESTED!*/

When tested with some simple test cases (even->even, odd->even, even->odd,
odd->odd), it worked.

> 	TO	=	a0		| I think this works

It does.

> 	| I forget if this is legal.  If not, copy to d0 first.
> 		btst	#0,TO		| test for odd destination

It's not, and I modified it to do so.

> I wonder, though, if this is truly faster.  Should not a movb/bnes
> pair run in loop mode?

Nope.  Only "dbCC" loops run in loop mode.

> (Perhaps not; `dbcc' loops do, though, and one could use a dbra surrounded
> by a bit of extra logic.)

Yes.  The following, courtesy of John Gilmore and Vaughan Pratt, is what is
actually used in the (3.2 version of) "strcpy", etc.:

	moveq	#-1,d1		| maximum possible (16-bit) count
hardloop:
	movb	FROM@+,TO@+	| copy...
	dbeq	d1,hardloop	| until we copy a null or the count is -1
	bne	hardloop	| if not-null, continue copying with count
				| freshly initialized to -1

Now for the numbers.  A test program was built to do a large number of
copies in a loop, and to do the same loop with no body; the times were
subtracted and the result was divided by the number of iterations.  The
program was run with strings of length 2, 10, and 100.  All strings were
"malloc"ed, so they started on word boundaries (the program tested this,
just to make sure).  The results:

Byte-by-byte copy, using "movb"/"dbcc" loop (standard 3.2 "strcpy), Sun-2
(10 MhZ 68010, no caches, zero wait states):

	250000 copies of 2 bytes took 5.720000 seconds
	0.000023 seconds/call

	50000 copies of 10 bytes took 1.760000 seconds
	0.000035 seconds/call

	5000 copies of 100 bytes took 0.860000 seconds
	0.000172 seconds/call

New strcpy, same Sun-2:

	250000 copies of 2 bytes took 8.440000 seconds
	0.000034 seconds/call

	50000 copies of 10 bytes took 3.120000 seconds
	0.000062 seconds/call

	5000 copies of 100 bytes took 1.880000 seconds
	0.000376 seconds/call

Standard strcpy, Sun-3/75 (16.67 MhZ 68020, no caches other than the on-chip
256-byte instruction cache, 1.5 wait states):

	250000 copies of 2 bytes took 1.780000 seconds
	0.000007 seconds/call

	50000 copies of 10 bytes took 0.720000 seconds
	0.000014 seconds/call

	5000 copies of 100 bytes took 0.500000 seconds
	0.000100 seconds/call

New strcpy, same Sun-3/75:

	250000 copies of 2 bytes took 2.800000 seconds
	0.000011 seconds/call

	50000 copies of 10 bytes took 0.960000 seconds
	0.000019 seconds/call

	5000 copies of 100 bytes took 0.520000 seconds
	0.000104 seconds/call

Standard strcpy, Sun-3/200 (25 MhZ 68020, off-chip write-back cache, 0 wait
states):

	250000 copies of 2 bytes took 1.060000 seconds
	0.000004 seconds/call

	50000 copies of 10 bytes took 0.480000 seconds
	0.000010 seconds/call

	5000 copies of 100 bytes took 0.260000 seconds
	0.000052 seconds/call

New strcpy, same Sun-3/200:

	250000 copies of 2 bytes took 1.420000 seconds
	0.000006 seconds/call

	50000 copies of 10 bytes took 0.520000 seconds
	0.000010 seconds/call

	5000 copies of 100 bytes took 0.320000 seconds
	0.000064 seconds/call

These numbers were quite reproducible.

The moral(s) of the story:

	1) Loop mode, on the 010, is a big win.  (The byte-by-byte
	   "strcpy" runs in loop mode on the 010, the other one
	   doesn't; the other one takes about twice as long.)

	2) The instruction cache, on the 020, is a big win.  (The 020
	   versions don't differ by as much, and the other one seems
	   to be catching up as the strings get longer, which didn't
	   happen on the 010.)

	3) With realistic string lengths, and 68K-family machines
	   offered by Sun, at least, the plain vanilla byte-by-byte
	   copy is the right way to do things, even with word-aligned
	   strings.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

guy@sun.uucp (Guy Harris) (09/23/86)

> What I meant was...

Then why the hell didn't you *say* that?  What you *did* say was

	"...I could use do word operations in my strnlen, strncmp,
	strncpy, and whatever other string processing functions I
	happened to write."

which sure sounds to *me* like you were doing word operations in the
standard string operations.  If you weren't referring to them, you shouldn't
have called them "my strnlen, ..."; you should have called them "my string
processing functions", and perhaps emphasized that they were *not*
replacements for the standard ones.

> Also, "fcmp" depends on the fact that a word-aligned string is
> terminated by a null word

As do the string copy and string length routines, unless you're individually
checking each byte of each word that you move.

> The point is, a word aligned string does not seem to be too much
> to ask for,

Yes, it is, considering you can do it yourself.  Just declare all your
strings to be unions of arrays of "char"s and a "short".  If you want to
initialize the string, you can probably get away with declaring it to be an
integral number of words long - if you include the extra padding word, the
compiler will even do you the favor of making that word all-zero.  If you
use ANSI C, you can even initialize the union by declaring the string to be
its first member.

Remember, ANSI C does *not* guarantee, in any way, that strings are aligned
on any boundary more strict than a byte boundary.  As such, you're going to
have to do this sort of thing anyway, unless you "know" how the compiler
works.  Yes, it means you have to type more, but that's life.

> "A little knowledge is a dangerous thing"... programmers should
> *never* learn about machine architectures: they will only want to
> write fast code...

Could you please point out where *anybody* stated their position this way?
Don't just assume that this was somebody's motivation; one of the biggest
source of errors in this business seems to be people assuming they "know"
why something was done, or how it works internally, or....

Sometimes that kind of tweaking is a good idea, and sometimes it just makes
work for you or, more likely, the guy who inherited your code when it comes
time to port it to a new architecture.  I'm tempted, at times, to use Barney
Frank's phrase and call it the programming equivalent of "crack"; it gives a
short-term high, but has bad long-term consequences.  If you're going to do
that sort of thing, 1) make sure you *really* need it, and can't get the
same performance improvement using more portable code, 2) wrap it in #ifdefs
so that people know it's there (and don't have to go through too much
trouble to compile a version for a particular machine), and 3) provide a
portable version so people can at least get it to work on a new machine
fairly quickly.

> I hope that this time I have clarified my position: everybody can
> now say I'm wrong (of course I am) or right.

You're wrong.  You don't need to have the compiler put strings on word
boundaries in order to be able to put your specialized kind of string on
word boundaries.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

levy@ttrdc.UUCP (Daniel R. Levy) (09/25/86)

In article <1705@mcc-pp.UUCP>, tiemann@mcc-pp.UUCP (Michael Tiemann) writes:
>> > Note that for speed, it is usually best to align data items on the most
>> > restrictive boundaries, even if a specific machine implementation doesn't
>> > really require it.  Lots of machines let words be on arbitrary boundaries,
>> > but you pay for it in speed, if not now, then later, as CPUs come
>> > to have wider data paths.
>I was waiting for SOMEBODY to talk performace for once. The last 68000
>compiler I used aligned strings on WORD boundaries. This would cost
>one byte per string, half the time. But there was a big speed payoff:
>I could do word operations in my strnlen, strncmp, strncpy, and whatever
>other string processing functions I happened to write. All this code ported
>to a Sun-3 no (apparent) problem, but crashed on a Sun-2, because the SUN
>compiler allocated strings in a totally stingy 1-byte alignment.
>So now, all this "fast" code actually runs slower than a "dumb" byte-copy
>model, because the 68020 faults itself to death reading in 32-bit words
>on odd boundaries, and doesn't run at all on a Sun-2 because the 68010
>can read odd words.
 ^^^ can't, you mean?

What's more, what if there lay data in the byte after the end of a string
being copied to which was not an integral multiple of your copy-quantity
(4 bytes or whatever)?  That byte (and possibly more) will get clobbered,
unless you stop short of that point and finish the operation using byte
operations.  For that matter, compare of null-terminated string operations
also would need similar handling to make the compares valid (and you then
need to know the length of the string anyhow first).  (If all strings involved
were counted strings a la fortran, that would be a different story.)

But for the functions where word-operations would make sense if possible,
one could check whether the pointer(s) being passed were appropriately
aligned to choose the method (word operations or byte operations).  This
would be something like

func(ptr,count) char *ptr; int count;
{
	if (ptr & 01) {
		/* byte operations */
		...
	} else if (ptr & 02) {
		/* short word operations */
		...
	} else {
		/* long word operations */
		...
	}
}

I'm sure details would need to be worked out, (and why not use the
word operations in the middle of even a misaligned string?) but I wonder
if this would prove useful?

One could probably also force string alignment by using a union, but
it might be unwise to depend on it, as well as it being unwieldy for
places where inline data "texttexttext..." strings would normally be
used.

>Now I *know* you're not *supposed* to do things like that, but often
>high performance comes not by reading some 20 year old manual, but
>by analyzing the machine at hand, and trying to more than just
>accomodate it. Tune those compilers, comment those machine dependencies,
>and watch the code fly!
>I suppose that this letter was a bit of a flame, sorry.
>Y'all are welcome to flame me back,
>
>Michael
>tiemann@mcc.com

If you do this kind of thing, for the sake of portability #ifdef it for
the targeted machine(s) and supply a safely portable, if slow, alternative
for others.  Otherwise somebody is gonna come along, port the code to
some machine you never dreamed of, and it will crash and burn.
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
	   go for it!  			allegra,ulysses,vax135}!ttrdc!levy

perl@rdin.UUCP (Robert Perlberg) (10/02/86)

If Michael says that his word-oriented string routines are faster, I
wouldn't contradict him (after all, he has been running them).
However, I don't see why it is necessary to have a special set of
functions to handle tha word-aligned case.  The standard functions
could test the alignment of their arguments and handle them
appropriatly.  This would have many advantages:

The optimization would be realized on programs that weren't written to
try to use it.

Programs that were written with the optimization in mind would run on
all machines even if they didn't or couldn't support the optimization.

Compilers that put strings on word boundaries would enhance the
frequency with which the optimization is used without the programmer
having to communicate anything to the string routines.

By the way, Michael, how do your string routines handle the '\0'?

Robert Perlberg
Resource Dynamics Inc.
New York
{philabs|delftcc}!rdin!perl

tiemann@mcc-pp.UUCP (Michael Tiemann) (10/05/86)

In article <581@rdin.UUCP>, perl@rdin.UUCP (Robert Perlberg) writes:
> If Michael says that his word-oriented string routines are faster, I
> wouldn't contradict him (after all, he has been running them).

The 68000 had no loop-mode, so they were faster. But with the advent
of superior technology, and a corresponding *lack* of followup
on my part, the great assumptions I made in 1982 no longer applied.
Whether or not they should have applied is a philosophical matter
taht has been resolved to my satisfaction (= no).

> By the way, Michael, how do your string routines handle the '\0'?

I used these routines for symbol-table management, and therefore,
never had any of these nasty cases to deal with. I have been
duly humbled, and will not feel slighted in the least if we drop
this discussion.

By the way... It was the numbers that Guy put on the net, that
convinced me I was wrong: wrong beacuse even if *one* assumption
that I make is allowed for by the compiler (by word aligning
strings, for example), there are too many other assumptions
about the hardware that must be taken into account. My moral
of the story is, let the compiler generate the code it is supposed
to, and write your special-purpose functions from there. Asking
for one more inch here or there may be smart-today-dumb-tomorrow,
as in this example.

Michael
tiemann@mcc.com

thomas@kuling.UUCP (Thomas H{meenaho) (10/08/86)

In article <7479@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> The last 68000 compiler I used aligned strings on WORD boundaries.
>> This would cost one byte per string, half the time. But there was a
>> big speed payoff: I could do word operations in my strnlen, strncmp,
>> strncpy, and whatever other string processing functions I happened to
>> write.
>
>	char	string3[128+1];
>
>	strcpy(string3, string1);
>
>would, if it copied a word at a time, have to deal both with words of the
>form
>
>	00xx
>
>(when it recognizes a word of that form, it should stop *before* copying it
>and stuff a zero byte at the end of the target) and of the form
>
>	xx00
>
>(when it recognizes a word of that form, it should stop *after* copying it).
>

In sysV for the 68K this technique is used for bcopy inside the kernel.
bcopy is easier as one knows how many bytes are to be moved.
bzero is even smarter, it clears a set of registers and uses movem to copy
them to memory.


-- 
Thomas Hameenaho, Dept. of Computer Science, Uppsala University, Sweden
Phone: +46 18 138650
UUCP: thomas@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!thomas)