usenet@nlm.nih.gov (usenet news poster) (08/12/90)
Let's assume that John Mashey is correct (he usually is :-) and that we will soon see full 64-bit microprocessors. You need 64 bits for a lot of floating point, something more than 32 bits would be nice for addressing, sticking with factors of 2 in word size has advantages, and gates keep getting cheaper.

Would it make sense to design these machines with small-word SIMD parallelism? Consider a real-world task like comparing two C character strings. You can get them from memory faster with wide datapaths, but the inner loop:

    get a byte from string1
    test for end of string1
    get a byte from string2
    test for end of string2
    byte compare
    loop

is going to execute at the same speed (clock cycles) on any machine with 8-bit or larger data registers. Now, suppose you had 64-bit registers and a few parallel instructions:

    get 8 bytes from string1
    test if any byte is a terminator
    get 8 bytes from string2
    test if any byte is a terminator
    compare 8 bytes
    loop

Obviously the logic for handling a terminator is going to be more complex, and an efficient mechanism for fetching non-aligned data will be needed, but the bottom line is that for strings of any appreciable length, the parallel loop is going to process data a lot faster.

My impression is that the hardware overhead for implementing a set of parallel instructions (8 x byte, 4 x short, 2 x long) would not be that great, because in most cases (add, mul, compare) the full 64-bit operation calculates them as intermediate results anyway. So why not bring them out, add a parallel set of condition code registers, and build a micro SIMD?

Programming byte or short word operations for a machine like this would be similar to programming floating point on a vector machine like the Cray: a pain in the neck by hand, but the compilers seem to be able to do it moderately well.

David States

A bit may not be a terrible thing to waste, but throwing them away 56 at a time is ridiculous.
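The 8-bytes-per-step loop proposed above can be approximated in portable C on an ordinary 64-bit machine, without any new instructions. The following is only a sketch under stated assumptions, not the hardware scheme the post has in mind. The names has_zero_byte and strcmp8 are hypothetical. It assumes the caller guarantees that both buffers are readable for cap bytes, that cap is a multiple of 8, and that each terminator lies within the first cap bytes, which sidesteps the non-aligned fetch and over-read problems the post mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Test if any byte is a terminator": nonzero if any of the 8 bytes
 * in w is 0x00. A borrow out of a byte lane sets its top bit only
 * when some lane held zero. */
int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical helper: compare two NUL-terminated strings 8 bytes per
 * step. ASSUMPTION: both buffers are readable for cap bytes, cap is a
 * multiple of 8, and each terminator lies within the first cap bytes. */
int strcmp8(const char *s1, const char *s2, size_t cap)
{
    size_t i = 0;

    assert(cap % 8 == 0);

    while (i < cap) {
        uint64_t w1, w2;
        memcpy(&w1, s1 + i, 8);   /* get 8 bytes from string1 */
        memcpy(&w2, s2 + i, 8);   /* get 8 bytes from string2 */
        /* If the words are equal, testing one of them for a
         * terminator is enough. */
        if (w1 != w2 || has_zero_byte(w1))
            break;
        i += 8;
    }

    /* The "more complex" terminator logic: finish byte by byte inside
     * the word that contains the difference or the terminator. */
    for (; i < cap; i++) {
        unsigned char c1 = (unsigned char)s1[i];
        unsigned char c2 = (unsigned char)s2[i];
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (c1 == 0)
            return 0;
    }
    return 0;
}
```

With char a[16] = "hello" and char b[16] = "help", strcmp8(a, b, 16) returns a negative value, as strcmp would. The memcpy loads are one way to stay within the C rules for unaligned access; whether they compile to a single 64-bit load depends on the compiler and target.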
colin@array.UUCP (Colin Plumb) (08/12/90)
Look at the Am29000 (and 29050, soon). They have a CPBYTES instruction which does what you want, 32 bits at a time. It returns

    v1.b0 == v2.b0 || v1.b1 == v2.b1 || v1.b2 == v2.b2 || v1.b3 == v2.b3

so you can write strcmp as:

    loop:
        word1 = *pointer1++;
        word2 = *pointer2++;
        if (word1 != word2) goto notequal;
        if (cpbytes(word1,0) == FALSE) goto loop;
        <check trailing bytes>

Perhaps someone from AMD could comment on the difference it's made to their string instructions.
--
	-Colin
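For readers without a 29000 at hand, the byte-compare operation described in this post can be modelled in plain C. This is a sketch of the semantics as the post states them, not AMD's definition of the instruction. The function strcmp4 and its cap parameter are hypothetical, and it assumes both buffers are readable for cap bytes, with cap a multiple of 4 and each terminator inside the first cap bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software stand-in for the operation described in the post: returns 1
 * if any of the four byte lanes of v1 equals the same lane of v2. */
int cpbytes(uint32_t v1, uint32_t v2)
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t b1 = (v1 >> (8 * lane)) & 0xFFu;
        uint32_t b2 = (v2 >> (8 * lane)) & 0xFFu;
        if (b1 == b2)
            return 1;
    }
    return 0;
}

/* Hypothetical word-at-a-time strcmp following the loop in the post.
 * ASSUMPTION: both buffers are readable for cap bytes, cap is a
 * multiple of 4, and each terminator lies within the first cap bytes. */
int strcmp4(const char *p1, const char *p2, size_t cap)
{
    size_t i = 0;

    assert(cap % 4 == 0);

    for (; i < cap; i += 4) {
        uint32_t word1, word2;
        memcpy(&word1, p1 + i, 4);
        memcpy(&word2, p2 + i, 4);
        if (word1 != word2)
            break;                  /* the "notequal" exit */
        if (cpbytes(word1, 0))
            return 0;               /* equal words holding the terminator */
    }

    /* Locate the first differing byte, stopping at a shared terminator. */
    for (; i < cap; i++) {
        unsigned char c1 = (unsigned char)p1[i];
        unsigned char c2 = (unsigned char)p2[i];
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (c1 == 0)
            break;
    }
    return 0;
}
```

Comparing against zero, as in cpbytes(word1, 0), turns the lane-equality test into a terminator test, which is the use the post's loop makes of it.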