kdb@chinet.chi.il.us (Karl Botts) (09/26/89)
/* The style of this file has been inspired by several references I have encountered in recent months to what has been called, I suspect unfortunately, "literate programming". One article was in JACM by D. E. Knuth. The general idea is to produce files which are both an English language document and compilable source files. Mr. Knuth is working on some specialized tools with an eye toward research papers and educational expositions. I like the general idea a great deal and thought I would try it out on the less formal vehicle of a UseNet article, without using special tools. Karl D. Botts 5480 Hyde Park Blvd. Chicago IL, 60615 {gargoyle|mcdchg|att}!chinet!kdb */ /* The result of compiling this file is an ANSI C memmove(), for systems that don't have memmove() and have a memcpy() that doesn't handle overlaps, i.e., SysV.2 ported by Convergent Technologies, as found on my AT&T 3b1. It is specially optimized for GnuC and an mc68k, as described in detail in the internal comments. It compiles with PCC but the result is not as fast. The internal comments also describe some peculiarities of the Gnu C optimizer discovered while tinkering with this code. All cases are handled, including any of the "null copy operations" -- dest == src || n == 0. Note that in the first case the block will indeed get copied over itself, but this can do nothing except burn cpu unless, I suppose, something in the block is volatile. Note also that n is expected to be unsigned; if size_t is signed (as it is by default on my SysV) and you pass a negative number to this routine it will interpret the number as unsigned and proceed. This will cause a core dump but I believe it is correct. I wound up doing a bunch of timing tests on this code; some results and some remarks are appended to this file. /* /* By the way, I have seen usenet code -- pcomm for one -- which does #define memmove memcpy when configured for the 3b1 and comments that memcpy() handles overlaps and so can be used as memmove(). This is variant (as is stated by memory(3C)) and is not true on my system (as determined the hard way), although pcomm seems to work anyhow. I don't use it regularly. */ #include <stdio.h> /* The goal here is to get a typedef name "size_t" defined. */ #if __STDC__ # include <stddef.h> #else #if defined(SYSV) && defined(PCC) # include <unistd.h>/* defines size_t as int; see below */ #endif #endif /* The goal here is to find a declaration or prototype for memmove(), in the hope of generating an error message if it doesn't match the definition in this file. The main thing to worry about is whether the third arg is signed or unsigned. */ #ifdef KDB # include <kdb/stdlib.h> #else #if __STDC__ # include <stdlib.h> # include <memory.h> #endif #endif #if __STDC__ || defined(PROTOS) void *memmove(void *d, void *s, register size_t n) #else void *memmove(d, s, n) void *d; void *s; register size_t n; #endif { register long *dl = (long *)d, *sl = (long *)s; int nstash; /* The rule for handling overlap is: Always start copying from the end of the source toward which the destination is offset. If the offset is zero it doesn't matter.*/ if ( dl > sl ) { /* This ain't ANSI C (a cast does not form an lvalue) but both SysV.2 PCC and GnuC accept it, and it's damn convenient. Also optimizes very nicely.*/ ((char *)dl) += n; ((char *)sl) += n; } if ( (int)dl & 1 ) { if ( (int)sl & 1 ) goto byte_aligned; /* else drop thru */ } else { if ( ! ((int)sl & 1) ) goto word_aligned; /* else drop thru */ } mis_aligned: /* I don't think I can do much better than the simple bytewise copy. As written here, Gnu C optimizes the loop into the two instruction sequence: L%12: mov.b -(%a2),-(%a3) dbra %d0,L%12 which is the special and only copying loop which the 68010 can keep entirely in the prefetch queue, thus freeing the bus for full-time data and nearly doubling the speed of otherwise apparently equivalent loops. Therefore I can't think of any fancy juggling to handle the odd-alignment that seems likely to turn a profit. The loop here (or actually, its longword analog, below), should still be the fastest way to copy a block of memory, but it won't be as much faster on a 68020 as on a 68010 because equivalent loops with a couple of extra instructions will still get kept in the prefetch queue. The same should be true on a 68000 because none of the loops get kept in the prefetch queue. I don't have a 68020 or a 68000 machine handy so this paragraph is sheer speculation. */ if ( dl > sl ) { if ( n-- ) do *--((char *)dl) = *--((char *)sl); while ( n-- ); } else { /* Here's the strange part. If the following loop is written with the casts-to-lvalue like its complement just above, instead of the local char * vars, Gnu C doesn't optimize it to the complement of the seqeunce given above, which would be: L%12: mov.b (%a2)+,(%a3)+ dbra %d0,L%12 Instead it produces: L%16: mov.l %a3,%a0 addq.w &1,%a3 mov.l %a2,%a1 addq.w &1,%a2 mov.b (%a1),(%a0) dbra %d0,L%16 I noticed this because it made enough difference to badly skew some timing tests I was running on a parsing algorithm. Anyhow, adding the char *temps as here coaxes Gnu C into performing the desired optimization. It is worth noting that SysV.2 PCC, which I also tried, did not make this optimization despite my best attempts to devise suggestively convoluted loops. The Gnu code is almost twice as fast as the best PCC could do (on a 68010 -- see above). The upshot of all this is that the inner loop of the resulting code (together with the longword loop below) is identical to the main loops of the standard library memcpy() routine on my 3b1, which I assume is coded in assembler (but doesn't handle overlap, which is why I had to write this routine in the first place.) This means that memory copying functions, one of the last remaining rationalizations for coding library routines in assembler, will no longer serve as an excuse. It also means that inline C code to copy small blocks can be significantly faster than calling memcpy() (it seldom was before, in practice.) But see the caveats below.*/ register char *dc = (char *)d, *sc = (char *)s;/* casts needed by PCC */ if ( n-- ) do *dc++ = *sc++; while ( n-- ); } return(d); byte_aligned: /* both byte aligned -- copy the odd byte */ if ( n-- ) if ( dl > sl ) *--((char *)dl) = *--((char *)sl); else *((char *)dl)++ = *((char *)sl)++; /* drop thru... */ else return(d); word_aligned: nstash = n; n /= sizeof(long); if ( dl > sl ) { /* GnuC performs an optimization on this loop equivalent to the one described above, but copying longwords. It really cooks along. However, the loop must be written exactly as here; any deviation, even to the extent of not performing the initial decrement of <n> and using predecrement in the loop control expression, results in the desired optimization not being done. I tried nearly a dozen variations, including several array notation equivalents, as well as using an end pointer instead of a loop counter; none resulted in the desired optimization. This is a shame, because what's below is a non-intuitive way to write this loop -- I certainly wouldn't write it this way off the top of my head, and I doubt if it occurs in this precise form in much existing code. It seems to me that the compiler should be able to figure out equivalent loops and perform the optimization. The speed increase is, as noted above, close to doubling in some cases, and it is hard to think of a more useful construct to optimize than block-copy loops.*/ if ( n-- ) do /* main loop -- copy longwords */ *--dl = *--sl; while ( n-- ); n = nstash; if ( n & 2 ) *--((short *)dl) = *--((short *)sl); if ( n & 1 ) *--((char *)dl) = *--((char *)sl); } else { if ( n-- ) do *dl++ = *sl++; while ( n-- ); n = nstash; if ( n & 2 ) *((short *)dl)++ = *((short *)sl)++; if ( n & 1 ) *((char *)dl)++ = *((char *)sl)++; } return(d); } #ifdef TESTING /* These are some functions and code fragments used in the timeing tests, the results of which are appended. */ /* maxcpy() receives the GnuC/68010 optimization but doesn't handle overlap, or any alignment except both destination and source word-aligned. It is about the fastest copy function you can write in C for the 68010, and I don't think you could do much better in assembler. */ char *maxcpy(dl, sl, n) register long *dl, *sl; register int n; { n /= sizeof(long); if ( n-- ) do *dl++ = *sl++; while ( n-- ); } /* null() is the null function. Incidentally, GnuC does significantly better than PCC at optimizing function calls. It is smarter about picking up arguments off the stack into registers and automatic vars, and it pushes the args for and calls several functions before it cleans up the stack. That is, it does: push args call foo() push args call bar() pop args for both foo() and bar() GnuC sometimes calls four or more functions before cleaning up. All this adds up to a null function about 3/2 as fast as PCC produces. */ char *null(d, s, n) register char *d, *s; register int n; { return(d); } /* "inline longwords" is the same code as maxcpy() but in-line; as you would expect, the the times for null() and inline longwords add up pretty closely to the time for maxcpy() in all tests. "inline bytes" is the same, but it copies bytes instead of longwords. Since maxcpy() and inline longwords would cause a core dump in the non-word-aligned tests, the results are not here. */ #endif /* Here are the results, in seconds of copying five sizes of blocks by several methods with each of the basic alignment variations. d is the destination offset from longword alignment (here, always zero), s is the source offset, n is the length of the block in bytes, and i is the number of iterations tested. The total bytes copied , n * i, is the same for each test. This test sequence is mainly to test alignment differences. Note that for small blocks the alignment, which only effects the speed of the innermost loop, is pretty much irrelevent; the function call and setup overhead dominates completely. For big blocks, mis-alignment slows both memcpy() and memmove() by a factor of three -- mis-alignment forces byte by byte copying. d: 0 s: 0 n: 4 i: 262144 null:4.3 d: 0 s: 0 n: 4 i: 262144 memcpy:9.5 d: 0 s: 0 n: 4 i: 262144 maxcpy:8.0 d: 0 s: 0 n: 4 i: 262144 memmove:14.9 d: 0 s: 0 n: 4 i: 262144 inline bytes:4.1 d: 0 s: 0 n: 4 i: 262144 inline longwords:2.0 d: 0 s: 1 n: 4 i: 262144 null:4.2 d: 0 s: 1 n: 4 i: 262144 memcpy:9.5 d: 0 s: 1 n: 4 i: 262144 memmove:14.9 d: 0 s: 1 n: 4 i: 262144 inline bytes:4.1 d: 0 s: 2 n: 4 i: 262144 null:4.2 d: 0 s: 2 n: 4 i: 262144 memcpy:9.2 d: 0 s: 2 n: 4 i: 262144 maxcpy:7.9 d: 0 s: 2 n: 4 i: 262144 memmove:14.8 d: 0 s: 2 n: 4 i: 262144 inline bytes:4.1 d: 0 s: 2 n: 4 i: 262144 inline longwords:2.0 d: 0 s: 3 n: 4 i: 262144 null:4.2 d: 0 s: 3 n: 4 i: 262144 memcpy:9.5 d: 0 s: 3 n: 4 i: 262144 memmove:14.7 d: 0 s: 3 n: 4 i: 262144 inline bytes:4.2 d: 0 s: 0 n: 16 i: 65536 null:1.1 d: 0 s: 0 n: 16 i: 65536 memcpy:2.8 d: 0 s: 0 n: 16 i: 65536 maxcpy:2.4 d: 0 s: 0 n: 16 i: 65536 memmove:4.3 d: 0 s: 0 n: 16 i: 65536 inline bytes:2.2 d: 0 s: 0 n: 16 i: 65536 inline longwords:1.4 d: 0 s: 1 n: 16 i: 65536 null:1.0 d: 0 s: 1 n: 16 i: 65536 memcpy:3.6 d: 0 s: 1 n: 16 i: 65536 memmove:5.2 d: 0 s: 1 n: 16 i: 65536 inline bytes:2.2 d: 0 s: 2 n: 16 i: 65536 null:1.1 d: 0 s: 2 n: 16 i: 65536 memcpy:2.7 d: 0 s: 2 n: 16 i: 65536 maxcpy:2.4 d: 0 s: 2 n: 16 i: 65536 memmove:4.2 d: 0 s: 2 n: 16 i: 65536 inline bytes:2.2 d: 0 s: 2 n: 16 i: 65536 inline longwords:1.3 d: 0 s: 3 n: 16 i: 65536 null:1.1 d: 0 s: 3 n: 16 i: 65536 memcpy:3.6 d: 0 s: 3 n: 16 i: 65536 memmove:5.2 d: 0 s: 3 n: 16 i: 65536 inline bytes:2.2 d: 0 s: 0 n: 64 i: 16384 null:0.3 d: 0 s: 0 n: 64 i: 16384 memcpy:1.1 d: 0 s: 0 n: 64 i: 16384 maxcpy:1.1 d: 0 s: 0 n: 64 i: 16384 memmove:1.6 d: 0 s: 0 n: 64 i: 16384 inline bytes:1.7 d: 0 s: 0 n: 64 i: 16384 inline longwords:0.8 d: 0 s: 1 n: 64 i: 16384 null:0.3 d: 0 s: 1 n: 64 i: 16384 memcpy:2.1 d: 0 s: 1 n: 64 i: 16384 memmove:2.8 d: 0 s: 1 n: 64 i: 16384 inline bytes:1.7 d: 0 s: 2 n: 64 i: 16384 null:0.3 d: 0 s: 2 n: 64 i: 16384 memcpy:1.1 d: 0 s: 2 n: 64 i: 16384 maxcpy:1.1 d: 0 s: 2 n: 64 i: 16384 memmove:1.6 d: 0 s: 2 n: 64 i: 16384 inline bytes:1.7 d: 0 s: 2 n: 64 i: 16384 inline longwords:0.8 d: 0 s: 3 n: 64 i: 16384 null:0.3 d: 0 s: 3 n: 64 i: 16384 memcpy:2.1 d: 0 s: 3 n: 64 i: 16384 memmove:2.8 d: 0 s: 3 n: 64 i: 16384 inline bytes:1.7 d: 0 s: 0 n: 512 i: 2048 null:0.0 d: 0 s: 0 n: 512 i: 2048 memcpy:0.7 d: 0 s: 0 n: 512 i: 2048 maxcpy:0.7 d: 0 s: 0 n: 512 i: 2048 memmove:0.8 d: 0 s: 0 n: 512 i: 2048 inline bytes:1.7 d: 0 s: 0 n: 512 i: 2048 inline longwords:0.6 d: 0 s: 1 n: 512 i: 2048 null:0.0 d: 0 s: 1 n: 512 i: 2048 memcpy:1.7 d: 0 s: 1 n: 512 i: 2048 memmove:2.2 d: 0 s: 1 n: 512 i: 2048 inline bytes:1.6 d: 0 s: 2 n: 512 i: 2048 null:0.0 d: 0 s: 2 n: 512 i: 2048 memcpy:0.7 d: 0 s: 2 n: 512 i: 2048 maxcpy:0.7 d: 0 s: 2 n: 512 i: 2048 memmove:0.9 d: 0 s: 2 n: 512 i: 2048 inline bytes:1.7 d: 0 s: 2 n: 512 i: 2048 inline longwords:0.6 d: 0 s: 3 n: 512 i: 2048 null:0.0 d: 0 s: 3 n: 512 i: 2048 memcpy:1.6 d: 0 s: 3 n: 512 i: 2048 memmove:2.1 d: 0 s: 3 n: 512 i: 2048 inline bytes:1.6 d: 0 s: 0 n: 4096 i: 256 null:0.0 d: 0 s: 0 n: 4096 i: 256 memcpy:0.6 d: 0 s: 0 n: 4096 i: 256 maxcpy:0.6 d: 0 s: 0 n: 4096 i: 256 memmove:0.7 d: 0 s: 0 n: 4096 i: 256 inline bytes:1.6 d: 0 s: 0 n: 4096 i: 256 inline longwords:0.6 d: 0 s: 1 n: 4096 i: 256 null:0.0 d: 0 s: 1 n: 4096 i: 256 memcpy:1.6 d: 0 s: 1 n: 4096 i: 256 memmove:2.1 d: 0 s: 1 n: 4096 i: 256 inline bytes:1.6 d: 0 s: 2 n: 4096 i: 256 null:0.0 d: 0 s: 2 n: 4096 i: 256 memcpy:0.6 d: 0 s: 2 n: 4096 i: 256 maxcpy:0.6 d: 0 s: 2 n: 4096 i: 256 memmove:0.7 d: 0 s: 2 n: 4096 i: 256 inline bytes:1.6 d: 0 s: 2 n: 4096 i: 256 inline longwords:0.7 d: 0 s: 3 n: 4096 i: 256 null:0.0 d: 0 s: 3 n: 4096 i: 256 memcpy:1.6 d: 0 s: 3 n: 4096 i: 256 memmove:2.2 d: 0 s: 3 n: 4096 i: 256 inline bytes:1.6 This sequence of tests does all longword copying, with more iterations and more block sizes for better resolution, to test raw speed. d: 0 s: 0 n: 4 i: 1048576 null:17.0 d: 0 s: 0 n: 4 i: 1048576 memcpy:36.8 d: 0 s: 0 n: 4 i: 1048576 maxcpy:31.6 d: 0 s: 0 n: 4 i: 1048576 memmove:60.0 d: 0 s: 0 n: 4 i: 1048576 inline bytes:16.5 d: 0 s: 0 n: 4 i: 1048576 inline longwords:7.9 d: 0 s: 0 n: 8 i: 524288 null:8.5 d: 0 s: 0 n: 8 i: 524288 memcpy:19.5 d: 0 s: 0 n: 8 i: 524288 maxcpy:16.9 d: 0 s: 0 n: 8 i: 524288 memmove:31.1 d: 0 s: 0 n: 8 i: 524288 inline bytes:11.5 d: 0 s: 0 n: 8 i: 524288 inline longwords:8.1 d: 0 s: 0 n: 16 i: 262144 null:4.2 d: 0 s: 0 n: 16 i: 262144 memcpy:10.9 d: 0 s: 0 n: 16 i: 262144 maxcpy:9.6 d: 0 s: 0 n: 16 i: 262144 memmove:17.0 d: 0 s: 0 n: 16 i: 262144 inline bytes:8.7 d: 0 s: 0 n: 16 i: 262144 inline longwords:5.3 d: 0 s: 0 n: 32 i: 131072 null:2.1 d: 0 s: 0 n: 32 i: 131072 memcpy:6.7 d: 0 s: 0 n: 32 i: 131072 maxcpy:5.9 d: 0 s: 0 n: 32 i: 131072 memmove:9.9 d: 0 s: 0 n: 32 i: 131072 inline bytes:7.6 d: 0 s: 0 n: 32 i: 131072 inline longwords:3.9 d: 0 s: 0 n: 64 i: 65536 null:1.0 d: 0 s: 0 n: 64 i: 65536 memcpy:4.5 d: 0 s: 0 n: 64 i: 65536 maxcpy:4.2 d: 0 s: 0 n: 64 i: 65536 memmove:6.4 d: 0 s: 0 n: 64 i: 65536 inline bytes:6.9 d: 0 s: 0 n: 64 i: 65536 inline longwords:3.2 d: 0 s: 0 n: 128 i: 32768 null:0.5 d: 0 s: 0 n: 128 i: 32768 memcpy:3.5 d: 0 s: 0 n: 128 i: 32768 maxcpy:3.3 d: 0 s: 0 n: 128 i: 32768 memmove:4.6 d: 0 s: 0 n: 128 i: 32768 inline bytes:6.6 d: 0 s: 0 n: 128 i: 32768 inline longwords:2.8 d: 0 s: 0 n: 512 i: 8192 null:0.1 d: 0 s: 0 n: 512 i: 8192 memcpy:2.7 d: 0 s: 0 n: 512 i: 8192 maxcpy:2.6 d: 0 s: 0 n: 512 i: 8192 memmove:3.3 d: 0 s: 0 n: 512 i: 8192 inline bytes:6.4 d: 0 s: 0 n: 512 i: 8192 inline longwords:2.5 */