rbbb@rice.EDU (David Chase) (09/04/86)
Suppose I have a structure "foo" and a function returning a "struct foo":

	struct foo { int a, b; };

	struct foo
	bar()
	{
		struct foo baz;
		...
		return baz;
	}

Note that the structure contains two ints. I will assume that an "int" is the same size as some machine quantity called a "word", and that word alignment is desirable. I assume therefore that all instances of the structure generated by the compiler will be word-aligned. C compilers that I have seen supporting structure-valued functions pass a hidden parameter to the function containing the address at which to store the result. Can a compiler assume that the storage for the result is word-aligned? (That is, can a compiler make the same assumptions when returning a structure that it does when assigning a structure?) If so, why? If not, why not?

David
karl@haddock (09/08/86)
rbbb@rice.EDU (David) writes:
>C compilers that I have seen supporting structure-valued functions pass a
>hidden parameter to the function containing the address to store the result.
>Can a compiler assume that the storage for the result is word-aligned?
>(That is, can a compiler make the same assumptions when returning a
>structure that it does when assigning a structure?)

Certainly. It can make this assumption for *any* expression of type "pointer to struct foo", whether it's explicit or (as in this example) generated by the compiler. *All* instances of a type (including those generated by a pointer dereference) have the alignment of that type. The user can sometimes circumvent this (type punning with casts or unions), but the result is undefined.

Karl W. Z. Heuer (ima!haddock!karl; karl@haddock.isc.com), The Walking Lint
notes@hcx1.UUCP (09/14/86)
Background: Our architecture has the following data types and sizes:

	byte  -- 1 byte
	short -- 2 bytes
	int   -- 4 bytes
	long  -- 4 bytes
	*ptr  -- 4 bytes

Question: Should structures and unions always be aligned to 4-byte boundaries?

Consider: Our portable C compiler aligns this structure as follows:

	struct a
	{
		char b;			-> byte 0
		union aa
		{
			short bb;	-> byte 4
			char cc;
		} dd;
		char c;			-> byte 8
	} d;

I've been told that on a VAX, however, 'bb' is placed on byte 2 and 'c' is placed on byte 4. Is this true? How do other compilers handle this case? Which way is "correct"? If there is sufficient interest, I will post a summary of responses.

Dave Ray -- uucp: {ucf-cs|allegra}!novavax!hrshcx!hcx1!daver
guy@sun.uucp (Guy Harris) (09/17/86)
> Background: Our architecture has the following data types and sizes:
> (fairly "standard" 32-bit architecture description, which could very well
> be CCI's Power 6; Harris (no relation) sells this machine under their own
> name)
>
> Question: Should structures and unions always be aligned to 4-byte
> boundaries?
>
> Consider: Our portable C compiler aligns this structure as follows:
> ...
>	short a;		-> byte 0
>	union aa
>	{
>		short bb;	-> byte 4
>		char cc;
>	} dd ;
>	char c;			-> byte 8
>
> I've been told that on a VAX, however, 'bb' is placed on byte 2 and
> 'c' is placed on byte 4. Is this true?

Yes, at least in the 4.3BSD version of the compiler, although the other versions are almost certainly not different enough to align the values differently.

> How do other compilers handle this case?

Sun's 68000-family compiler does exactly the same thing, although (obviously) "struct a"s can't be blindly copied between VAXes and Suns.

> Which way is "correct"?

K&R says:

	8.5 Structure and union declarations

	...Each non-field member of a structure begins on an addressing
	boundary appropriate to its type...

K&R does not seem to say anything about the addressing boundaries appropriate to structures or unions in the Reference Manual, but it implies that the addressing boundary appropriate to a union, at least, is that of the member with the most restrictive addressing boundary. On page 174 of chapter 8 it suggests that in order to align a given structure on a boundary more restrictive than its "natural" boundary, you place it in a union with an object having the desired alignment boundary.

The ANSI C draft seems to say much the same thing, and gives little additional detail. (If it's stored somewhere not indexed under "alignment", let me know; I only looked in the Index under "alignment".)
The 4.3BSD VAX C compiler (and, I suspect, all other UNIX VAX C compilers) follows this rule for structures as well; if you declare a structure containing only one member, a "char", the size of the structure will be one byte and the structure will be aligned only on a byte boundary.

In PCC, the minimum alignment requirement of a structure (or union) is given by the #define constant "ALSTRUCT"; this is 8 (bits) in the VAX PCC and in our PCC. I suspect the person who did the compiler for the machine in question may have gotten confused and figured that if the machine requires 4-byte alignment of "int"s, it also requires 4-byte alignment of "struct"s and "union"s (the CCI Power 6 does require 4-byte alignment of "int"s, and 2-byte alignment of "short"s), and set ALSTRUCT to 32 (bits), not realizing that this is a *minimum*, not a *maximum*, alignment requirement.

It may be too late to change it now, given that a lot of code has been compiled for that machine and a lot of on-disk data structures built using that alignment. I'd at least try, though.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)
mash@mips.UUCP (John Mashey) (09/19/86)
In article <7363@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> ...in reply to:
>> Question: Should structures and unions always be aligned to 4-byte
>> boundaries? <....example follows:...>
>
>Yes, at least in the 4.3BSD version of the compiler, ....

Actually, some might be aligned to 8-byte boundaries, on machines that like doubles on such boundaries.

>K&R does not seem to say anything about the addressing boundaries
>appropriate to structures or unions in the Reference Manual, but it implies
>that the addressing boundary appropriate to a union, at least, is that of
>the member with the most restrictive addressing boundary on page 174 of
>chapter 8, where it suggests that in order to align a given structure on a
>boundary more restrictive than its "natural" boundary, you place it in a
>union with an object with the desired alignment boundary.
>
>The ANSI C draft seems to say much the same thing, and gives little
>additional detail. (If it's stored somewhere not indexed under "alignment",
>let me know; I only looked in the Index under "alignment".)

Guy is correct, as usual. The key info is (trickily) found under sizeof. ANSI C, section C.3.3.4, The sizeof operator, says:

	"... When applied to an operand that has structure or union type,
	the result is the total number of bytes in such a member considered
	as a member of an array, including whatever internal and trailing
	padding might be needed to align each member in such an array
	properly...."

Another use of the sizeof operator is to compute the number of members in an array:

	sizeof array / sizeof array[0];

Note that the net combination requires that:

  a) Any union or structure must be aligned on the most restrictive
     boundary of any of its members, including member structures.
  b) Its size must be rounded up to a multiple of that same alignment.

Anything else causes the above usage of sizeof to be unrobust in the face of:

	struct {
		.....
	} array[number];

Note that for speed, it is usually best to align data items on the most restrictive boundaries, even if a specific machine implementation doesn't really require it. Lots of machines let words be on arbitrary boundaries, but you pay for it in speed, if not now, then later, as CPUs come to have wider data paths. A perfect case is the surprise going from 680[01]0s to 68020s: if you let longs be aligned on 2-byte, but not necessarily 4-byte, boundaries, and if you want to convert to 68020s without converting data saved in such a format, you lose performance you might have had. (Of course, some space is saved; in practice, it's seldom large.)

Question for the audience: what implementations thought ahead about potential 64-bit buses and aligned doubles on 8-byte boundaries? Try this:

	main(argc, argv)
	{
		struct x {
			long i;
			double j;
		} y;
		printf("%x,%x, %d\n", &y.i, &y.j, sizeof y);
	}

-- 
-john mashey
DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
guy@sun.uucp (Guy Harris) (09/19/86)
> >Yes, at least in the 4.3BSD version of the compiler, ....
> Actually, some might be aligned to 8-byte boundaries, on machines that
> like doubles on such boundaries.

By way of clarification, my "yes" was to "yes, the VAX aligns the 'union' containing a 'short' and a 'char' on a 2-byte, not a 4-byte, boundary." Structures need not always be aligned on the most restrictive boundary that the architecture requires (or the compiler imposes) for scalar data; the compiler need only align a structure so that all its members are aligned, which means aligning it on the most restrictive boundary of any of its members.

PCC starts out aligning structures on a "minimum alignment required by all structures" boundary, and then marches through the member list strengthening the alignment requirement as necessary. I presume the reason that the "minimum alignment required by all structures" is not hardcoded to "one byte" is that there might be machines that prefer, or require, structures aligned more stringently. E.g., a machine might have "register+displacement" addressing where the displacement is in units of 8-byte quadwords, or the implementor might have decided to speed up structure assignment by using some trick that can copy data faster if it's aligned on a quadword boundary. It was *not* intended for machines with more stringent restrictions on scalar alignment, as the aforementioned march handles that problem already.

> Note that for speed, it is usually best to align data items on the most
> restrictive boundaries, even if a specific machine implementation doesn't
> really require it. Lots of machines let words be on arbitrary boundaries,
> but you pay for it in speed, if not now, then later, as CPUs come
> to have wider data paths.

Yes, the VAX PCC aligns "int"s and "long"s on 4-byte boundaries, even though the machine can fetch longwords off such a boundary; there is, I'd bet, a performance penalty for such a fetch.
-- Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com (or guy@sun.arpa)
tiemann@mcc-pp.UUCP (Michael Tiemann) (09/20/86)
In article <7447@sun.uucp>, guy@sun.UUCP writes:
> > >Yes, at least in the 4.3BSD version of the compiler, ....
> > Actually, some might be aligned to 8-byte boundaries, on machines that
> > like doubles on such boundaries.
> [... text deleted ...]
> > Note that for speed, it is usually best to align data items on the most
> > restrictive boundaries, even if a specific machine implementation doesn't
> > really require it. Lots of machines let words be on arbitrary boundaries,
> > but you pay for it in speed, if not now, then later, as CPUs come
> > to have wider data paths.
>
> Yes, the VAX PCC aligns "int"s and "long"s on 4-byte boundaries, even though
> the machine can fetch longwords off such a boundary; there is, I'd bet, a
> performance penalty for such a fetch.
> --
> 	Guy Harris
> 	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
> 	guy@sun.com (or guy@sun.arpa)

I was waiting for SOMEBODY to talk performance for once. The last 68000 compiler I used aligned strings on WORD boundaries. This would cost one byte per string, half the time. But there was a big speed payoff: I could do word operations in my strnlen, strncmp, strncpy, and whatever other string processing functions I happened to write. All this code ported to a Sun-3 with no (apparent) problem, but crashed on a Sun-2, because the SUN compiler allocated strings in a totally stingy 1-byte alignment. So now, all this "fast" code actually runs slower than a "dumb" byte-copy model, because the 68020 faults itself to death reading in 32-bit words on odd boundaries, and doesn't run at all on a Sun-2 because the 68010 can read odd words.

Now I *know* you're not *supposed* to do things like that, but often high performance comes not from reading some 20-year-old manual, but from analyzing the machine at hand, and trying to do more than just accommodate it. Tune those compilers, comment those machine dependencies, and watch the code fly! How many other compilers are so parsimonious?
Is this something we should stand for, or are we trying to get away from performance in the name of cleanliness? I suppose that this letter was a bit of a flame, sorry. Y'all are welcome to flame me back, Michael tiemann@mcc.com
chris@umcp-cs.UUCP (Chris Torek) (09/21/86)
In article <1705@mcc-pp.UUCP> tiemann@mcc-pp.UUCP (Michael Tiemann) writes:
>... The last 68000 compiler I used aligned strings on WORD boundaries.
>This would cost one byte per string, half the time. But there was
>a big speed payoff: I could do word operations in my strnlen,
>strncmp, strncpy, and whatever other string processing functions
>I happened to write. ... all this "fast" code actually runs slower
>than a "dumb" byte-copy model [on a Sun-3], because the 68020 faults
>itself to death reading in 32-bit words on odd boundaries, and
>doesn't run at all on a Sun-2 because the 68010 can read odd words.

(Does the 68020 really fault? I thought it just did two bus accesses.)

It is not difficult to do copies in word mode iff the strings are aligned:

	| Sun mnemonics
	| /*LINTLIBRARY*/
	| strcpy(to, from) char *to, *from; { *to = *from; return (to); }
	| /*UNTESTED!*/
	ENTRY(strcpy)
	TO = a0				| I think this works
	FROM = a1
		movl	sp@(4),TO	| to
		movl	sp@(8),FROM	| from
	| I forget if this is legal.  If not, copy to d0 first.
		btst	#0,TO		| test for odd destination
		bnes	odd0		| handle odd dst, unknown src
		btst	#0,FROM		| test for odd source
		bnes	hardway		| handle even dst, odd src
	| both addresses are even; do a fast strcpy
	fastcopy:
		movw	FROM@+,d0	| grab entire word
		movw	d0,d1		| need to test high byte first
		lsrw	#8,d1		| throw out low byte
		beqs	fastend		| if high byte zero, go terminate dst
		movw	d0,TO@+		| copy entire word
		tstb	d0		| and see if we are now done
		bnes	fastcopy	| do more if not
		movl	sp@(4),d0	| set return value
		rts			| and return
	fastend:
		movql	#0,d0
		movb	d0,TO@		| terminate destination string
		movl	sp@(4),d0	| set return value
		rts			| and return
	odd0:
		btst	#0,FROM		| test for odd source
		beqs	hardway		| handle odd dst, even src
		movb	FROM@+,TO@+	| copy one byte to make even
		bnes	fastcopy	| and do rest with fast copy
		movl	sp@(4),d0	| set return value
		rts			| and return
	| one address is even, the other odd, so do it a byte at a time.
	hardway:
		movl	TO,d0		| set return value
	hardloop:
		movb	FROM@+,TO@+	| copy ...
		bnes	hardloop	| until we copy a null
		rts			| return

I wonder, though, if this is truly faster. Should not a movb/bnes pair run in loop mode? (Perhaps not; `dbcc' loops do, though, and one could use a dbra surrounded by a bit of extra logic.)

Machine dependent `fast' code is often CPU dependent as well, and one must be prepared to modify marked inner loops when moving among implementations of one architecture.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu
guy@sun.uucp (Guy Harris) (09/22/86)
> The last 68000 compiler I used aligned strings on WORD boundaries.
> This would cost one byte per string, half the time. But there was a
> big speed payoff: I could do word operations in my strnlen, strncmp,
> strncpy, and whatever other string processing functions I happened to
> write.

Oh, really?

	char string1[] = "foo";
	char string2[] = "xfoo";

	return(strcmp(string1, string2 + 1));

If you can do this with straightforward word operations, and not take a performance hit relative to byte-by-byte copies, you've got a really clever "strcmp". (Hint: my 68010 manual does *NOT* mention any "swap bytes" instruction, and the 68010 doesn't have a barrel shifter.) Furthermore,

	char string3[128+1];

	strcpy(string3, string1);

would, if it copied a word at a time, have to deal both with words of the form

	00xx

(when it recognizes a word of that form, it should stop *before* copying it and stuff a zero byte at the end of the target) and of the form

	xx00

(when it recognizes a word of that form, it should stop *after* copying it).

> All this code ported to a Sun-3 no (apparent) problem, but crashed on a
> Sun-2, because the SUN compiler allocated strings in a totally stingy
> 1-byte alignment.

It would also have crashed on the first example given above *regardless* of whether the Sun compiler allocated strings on 1-byte or 2-byte boundaries. If your routines couldn't handle arguments off word boundaries, I'm sorry, but they were *not* correct implementations of the string routines! *!NOTHING*! guarantees that the arguments to the string routines always point to the *first* byte of a string. Furthermore, nothing guarantees that "strcat" and "strncat" will always do aligned copies, even if the arguments always point to the first byte of a string; the reason for this should be obvious.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)
karl@haddock (09/22/86)
>In article <1705@mcc-pp.UUCP> tiemann@mcc-pp.UUCP (Michael Tiemann) writes: >>The last 68000 compiler I used aligned strings on WORD boundaries. ... >>There was a big speed payoff: I could do word operations in [strncpy, etc.] Wrong! Even if strings are word-aligned, substrings (e.g. "p+1") will not be. (Unless the chars themselves are word-aligned, in which case your space penalty is a factor of two instead of a half byte.) umcp-cs!chris (Chris Torek) writes: >It is not difficult to do copies in word mode iff the strings are aligned: >[example of a strcpy() that considers four alignment cases] Even if the strings are word-aligned, you have to test for a null *byte* (not a zero word); my guess is that the extra manipulation inside the loop will make it slower than a byte-by-byte copy. Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
tiemann@mcc-pp.UUCP (Michael Tiemann) (09/22/86)
In article <7479@sun.uucp>, guy@sun.uucp (Guy Harris) writes:
> > The last 68000 compiler I used aligned strings on WORD boundaries.
> > This would cost one byte per string, half the time. But there was a
> > big speed payoff: I could do word operations in my strnlen, strncmp,
> > strncpy, and whatever other string processing functions I happened to
> > write.
>
> Oh, really?
>
> 	char string1[] = "foo";
> 	char string2[] = "xfoo";
>
> 	return(strcmp(string1, string2 + 1));
>
> If you can do this with straightforward word operations ...
> [ ... USW ... ]
> *!NOTHING*! guarantees that the arguments to the string routines always
> point to the *first* byte of a string. Furthermore, nothing guarantees that
> "strcat" and "strncat" will always do aligned copies, even if the arguments
> always point to the first byte of a string; the reason for this should be
> obvious.

What I meant was... There are times when I *can* (can-to can-to can-to) guarantee that I am referring to the first byte of a string. Sure, when I am doing SUB-string operations I have to be prepared to pay the price, but if I happen only to want to do STRING operations, where I *always* refer *only* to the string as a whole, then it sure would be nice to be rewarded for passing that extra knowledge on to my compiler. When I don't know what I'm dealing with, I use strcmp, strcat, etc. (and I expect them to make pessimistic decisions). I wrote my own C routines "fmove" and "fzero", which are like "bcopy" and "bzero", except I only call them when I know I'll be using word-aligned things. Also, "fcmp" depends on the fact that a word-aligned string is terminated by a null word (too much to ask of a compiler to do that: a possible 2*sizeof(int)-1 byte overhead), but if they're two of MY strings, I call it.
The point is, a word-aligned string does not seem to be too much to ask for, and the performance improvement IS noticeable. Let's be hypothetical for a moment:

	int
	lookup(s)
		char *s;
	{
		int where;

		switch (*(int *)s) {
		case 0x61626300:	/* "abc" */
			return ABC;
		case 0x78797a00:	/* "xyz" */
			return XYZ;
		default:	/* must look up in symbol table... */
			do { ... };
			return where;
		}
	}

Keep your reserved words < 4 characters and off you go! No flames about "portability": these constants were handled by a little emacs function (generates the hex and an ascii comment for short strings), and each emacs (VAX, SUN, whatever) has its own version.

"A little knowledge is a dangerous thing"... programmers should *never* learn about machine architectures: they will only want to write fast code... I hope that this time I have clarified my position: everybody can now say I'm wrong (of course I am) or right.

Michael
tiemann@mcc.com
"I'm that C programmer your mother warned you about"
guy@sun.uucp (Guy Harris) (09/22/86)
> (Does the 68020 really fault? I thought it just did two bus accesses.)

"fault" was a poor choice of words on his part; it already means something, namely a trap. The '020 doesn't fault, it just does two bus accesses.

> | strcpy(to, from) char *to, *from; { *to = *from; return (to); }
> | /*UNTESTED!*/

When tested with some simple test cases (even->even, odd->even, even->odd, odd->odd), it worked.

> TO = a0		| I think this works

It does.

> | I forget if this is legal. If not, copy to d0 first.
> 	btst	#0,TO	| test for odd destination

It's not, and I modified it to do so.

> I wonder, though, if this is truly faster. Should not a movb/bnes
> pair run in loop mode?

Nope. Only "dbCC" loops run in loop mode.

> (Perhaps not; `dbcc' loops do, though, and one could use a dbra surrounded
> by a bit of extra logic.)

Yes. The following, courtesy of John Gilmore and Vaughan Pratt, is what is actually used in the (3.2 version of) "strcpy", etc.:

		moveq	#-1,d1		| maximum possible (16-bit) count
	hardloop:
		movb	FROM@+,TO@+	| copy...
		dbeq	d1,hardloop	| until we copy a null or the count is -1
		bne	hardloop	| if not-null, continue copying with count
					| freshly initialized to -1

Now for the numbers. A test program was built to do a large number of copies in a loop, and to do the same loop with no body; the times were subtracted and the result was divided by the number of iterations. The program was run with strings of length 2, 10, and 100. All strings were "malloc"ed, so they started on word boundaries (the program tested this, just to make sure).
The results:

Byte-by-byte copy, using "movb"/"dbcc" loop (standard 3.2 "strcpy"), Sun-2 (10 MHz 68010, no caches, zero wait states):

	250000 copies of 2 bytes took 5.720000 seconds	0.000023 seconds/call
	50000 copies of 10 bytes took 1.760000 seconds	0.000035 seconds/call
	5000 copies of 100 bytes took 0.860000 seconds	0.000172 seconds/call

New strcpy, same Sun-2:

	250000 copies of 2 bytes took 8.440000 seconds	0.000034 seconds/call
	50000 copies of 10 bytes took 3.120000 seconds	0.000062 seconds/call
	5000 copies of 100 bytes took 1.880000 seconds	0.000376 seconds/call

Standard strcpy, Sun-3/75 (16.67 MHz 68020, no caches other than the on-chip 256-byte instruction cache, 1.5 wait states):

	250000 copies of 2 bytes took 1.780000 seconds	0.000007 seconds/call
	50000 copies of 10 bytes took 0.720000 seconds	0.000014 seconds/call
	5000 copies of 100 bytes took 0.500000 seconds	0.000100 seconds/call

New strcpy, same Sun-3/75:

	250000 copies of 2 bytes took 2.800000 seconds	0.000011 seconds/call
	50000 copies of 10 bytes took 0.960000 seconds	0.000019 seconds/call
	5000 copies of 100 bytes took 0.520000 seconds	0.000104 seconds/call

Standard strcpy, Sun-3/200 (25 MHz 68020, off-chip write-back cache, 0 wait states):

	250000 copies of 2 bytes took 1.060000 seconds	0.000004 seconds/call
	50000 copies of 10 bytes took 0.480000 seconds	0.000010 seconds/call
	5000 copies of 100 bytes took 0.260000 seconds	0.000052 seconds/call

New strcpy, same Sun-3/200:

	250000 copies of 2 bytes took 1.420000 seconds	0.000006 seconds/call
	50000 copies of 10 bytes took 0.520000 seconds	0.000010 seconds/call
	5000 copies of 100 bytes took 0.320000 seconds	0.000064 seconds/call

These numbers were quite reproducible. The moral(s) of the story:

1) Loop mode, on the 010, is a big win. (The byte-by-byte "strcpy" runs in loop mode on the 010, the other one doesn't; the other one takes about twice as long.)

2) The instruction cache, on the 020, is a big win.
(The 020 versions don't differ by as much, and the other one seems to be catching up as the strings get longer, which didn't happen on the 010.) 3) With realistic string lengths, and 68K-family machines offered by Sun, at least, the plain vanilla byte-by-byte copy is the right way to do things, even with word-aligned strings. -- Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com (or guy@sun.arpa)
guy@sun.uucp (Guy Harris) (09/23/86)
> What I meant was...

Then why the hell didn't you *say* that? What you *did* say was

	"...I could do word operations in my strnlen, strncmp, strncpy,
	and whatever other string processing functions I happened to write."

which sure sounds to *me* like you were doing word operations in the standard string operations. If you weren't referring to them, you shouldn't have called them "my strnlen, ..."; you should have called them "my string processing functions", and perhaps emphasized that they were *not* replacements for the standard ones.

> Also, "fcmp" depends on the fact that a word-aligned string is
> terminated by a null word

As do the string copy and string length routines, unless you're individually checking each byte of each word that you move.

> The point is, a word aligned string does not seem to be too much
> to ask for,

Yes, it is, considering you can do it yourself. Just declare all your strings to be unions of arrays of "char"s and a "short". If you want to initialize the string, you can probably get away with declaring it to be an integral number of words long - if you include the extra padding word, the compiler will even do you the favor of making that word all-zero. If you use ANSI C, you can even initialize the union by declaring the string to be its first member.

Remember, ANSI C does *not* guarantee, in any way, that strings are aligned on any boundary more strict than a byte boundary. As such, you're going to have to do this sort of thing anyway, unless you "know" how the compiler works. Yes, it means you have to type more, but that's life.

> "A little knowledge is a dangerous thing"... programmers should
> *never* learn about machine architectures: they will only want to
> write fast code...

Could you please point out where *anybody* stated their position this way?
Don't just assume that this was somebody's motivation; one of the biggest sources of errors in this business seems to be people assuming they "know" why something was done, or how it works internally, or....

Sometimes that kind of tweaking is a good idea, and sometimes it just makes work for you or, more likely, the guy who inherited your code when it comes time to port it to a new architecture. I'm tempted, at times, to use Barney Frank's phrase and call it the programming equivalent of "crack"; it gives a short-term high, but has bad long-term consequences. If you're going to do that sort of thing, 1) make sure you *really* need it, and can't get the same performance improvement using more portable code, 2) wrap it in #ifdefs so that people know it's there (and don't have to go through too much trouble to compile a version for a particular machine), and 3) provide a portable version so people can at least get it to work on a new machine fairly quickly.

> I hope that this time I have clarified my position: everybody can
> now say I'm wrong (of course I am) or right.

You're wrong. You don't need to have the compiler put strings on word boundaries in order to be able to put your specialized kind of string on word boundaries.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)
levy@ttrdc.UUCP (Daniel R. Levy) (09/25/86)
In article <1705@mcc-pp.UUCP>, tiemann@mcc-pp.UUCP (Michael Tiemann) writes:
>> > Note that for speed, it is usually best to align data items on the most
>> > restrictive boundaries, even if a specific machine implementation doesn't
>> > really require it. Lots of machines let words be on arbitrary boundaries,
>> > but you pay for it in speed, if not now, then later, as CPUs come
>> > to have wider data paths.
>I was waiting for SOMEBODY to talk performance for once. The last 68000
>compiler I used aligned strings on WORD boundaries. This would cost
>one byte per string, half the time. But there was a big speed payoff:
>I could do word operations in my strnlen, strncmp, strncpy, and whatever
>other string processing functions I happened to write. All this code ported
>to a Sun-3 no (apparent) problem, but crashed on a Sun-2, because the SUN
>compiler allocated strings in a totally stingy 1-byte alignment.
>So now, all this "fast" code actually runs slower than a "dumb" byte-copy
>model, because the 68020 faults itself to death reading in 32-bit words
>on odd boundaries, and doesn't run at all on a Sun-2 because the 68010
>can read odd words.
 ^^^ can't, you mean?

What's more, what if there were data in the bytes just past the end of a destination string whose length was not an integral multiple of your copy quantity (4 bytes or whatever)? Those bytes (and possibly more) will get clobbered, unless you stop short of that point and finish the operation using byte operations. For that matter, compares of null-terminated strings would need similar handling to make the compares valid (and you then need to know the length of the string anyhow first). (If all strings involved were counted strings a la Fortran, that would be a different story.) But for the functions where word operations would make sense if possible, one could check whether the pointer(s) being passed were appropriately aligned and choose the method (word operations or byte operations) accordingly.
This would be something like:

	func(ptr, count)
		char *ptr;
		int count;
	{
		if ((int)ptr & 01) {
			/* byte operations */
			...
		} else if ((int)ptr & 02) {
			/* short word operations */
			...
		} else {
			/* long word operations */
			...
		}
	}

(The casts are needed; C won't let you apply & directly to a pointer.) I'm sure details would need to be worked out (and why not use the word operations in the middle of even a misaligned string?), but I wonder if this would prove useful? One could probably also force string alignment by using a union, but it might be unwise to depend on it, as well as it being unwieldy for places where inline "texttexttext..." string data would normally be used.

>Now I *know* you're not *supposed* to do things like that, but often
>high performance comes not by reading some 20 year old manual, but
>by analyzing the machine at hand, and trying to more than just
>accommodate it. Tune those compilers, comment those machine dependencies,
>and watch the code fly!
>I suppose that this letter was a bit of a flame, sorry.
>Y'all are welcome to flame me back,
>
>Michael
>tiemann@mcc.com

If you do this kind of thing, for the sake of portability #ifdef it for the targeted machine(s) and supply a safely portable, if slow, alternative for others. Otherwise somebody is gonna come along, port the code to some machine you never dreamed of, and it will crash and burn.
-- 
Disclaimer: The views contained herein are my own and are not at all those of my employer or the administrator of any computer upon which I may hack.

	dan levy | yvel nad
	an engihacker @ at&t computer systems division, skokie, illinois
	Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,allegra,ulysses,vax135}!ttrdc!levy
perl@rdin.UUCP (Robert Perlberg) (10/02/86)
If Michael says that his word-oriented string routines are faster, I wouldn't contradict him (after all, he has been running them). However, I don't see why it is necessary to have a special set of functions to handle the word-aligned case. The standard functions could test the alignment of their arguments and handle them appropriately. This would have many advantages: the optimization would be realized in programs that weren't written to try to use it; programs that were written with the optimization in mind would run on all machines, even if they didn't or couldn't support the optimization; and compilers that put strings on word boundaries would increase the frequency with which the optimization is used, without the programmer having to communicate anything to the string routines.

By the way, Michael, how do your string routines handle the '\0'?

Robert Perlberg
Resource Dynamics Inc.
New York
{philabs|delftcc}!rdin!perl
tiemann@mcc-pp.UUCP (Michael Tiemann) (10/05/86)
In article <581@rdin.UUCP>, perl@rdin.UUCP (Robert Perlberg) writes:
> If Michael says that his word-oriented string routines are faster, I
> wouldn't contradict him (after all, he has been running them).

The 68000 had no loop mode, so they were faster. But with the advent of superior technology, and a corresponding *lack* of followup on my part, the great assumptions I made in 1982 no longer applied. Whether or not they should have applied is a philosophical matter that has been resolved to my satisfaction (= no).

> By the way, Michael, how do your string routines handle the '\0'?

I used these routines for symbol-table management, and therefore never had any of these nasty cases to deal with. I have been duly humbled, and will not feel slighted in the least if we drop this discussion.

By the way... It was the numbers that Guy put on the net that convinced me I was wrong: wrong because even if *one* assumption that I make is allowed for by the compiler (by word-aligning strings, for example), there are too many other assumptions about the hardware that must be taken into account. My moral of the story is, let the compiler generate the code it is supposed to, and write your special-purpose functions from there. Asking for one more inch here or there may be smart-today-dumb-tomorrow, as in this example.

Michael
tiemann@mcc.com
thomas@kuling.UUCP (Thomas Hämeenaho) (10/08/86)
In article <7479@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> The last 68000 compiler I used aligned strings on WORD boundaries.
>> This would cost one byte per string, half the time. But there was a
>> big speed payoff: I could do word operations in my strnlen, strncmp,
>> strncpy, and whatever other string processing functions I happened to
>> write.
>
>	char string3[128+1];
>
>	strcpy(string3, string1);
>
>would, if it copied a word at a time, have to deal both with words of the
>form
>
>	00xx
>
>(when it recognizes a word of that form, it should stop *before* copying it
>and stuff a zero byte at the end of the target) and of the form
>
>	xx00
>
>(when it recognizes a word of that form, it should stop *after* copying it).

In sysV for the 68K this technique is used for bcopy inside the kernel. bcopy is easier, as one knows how many bytes are to be moved. bzero is even smarter: it clears a set of registers and uses movem to copy them to memory.
-- 
Thomas Hämeenaho, Dept. of Computer Science, Uppsala University, Sweden
Phone: +46 18 138650
UUCP: thomas@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!thomas)