[comp.arch] All those bits

usenet@nlm.nih.gov (usenet news poster) (08/12/90)

Let's assume that John Mashey is correct (he usually is :-) and that we
will soon see full 64-bit microprocessors.  You need 64 bits for a lot
of floating point, something more than 32 bits would be nice for
addressing, sticking with factors of 2 in word size has advantages, and
gates keep getting cheaper.  Would it make sense to design these
machines with small word SIMD parallelism?

Consider a real world task like comparing two C character strings.  You
can get them from memory faster with wide datapaths, but the inner loop:

	get a byte from string1
	test for end of string1
	get a byte from string2
	test for end of string2
	byte compare
	loop

is going to execute at the same speed (clock cycles) on any machine
with 8-bit or larger data registers.  Now, suppose you had 64-bit
registers and a few parallel instructions:

	get 8 bytes from string1
	test if any byte is a terminator
	get 8 bytes from string2
	test if any byte is a terminator
	compare 8 bytes
	loop

Obviously the logic for handling a terminator is going to be more
complex, and an efficient mechanism for fetching non-aligned data will
be needed, but the bottom line is that for strings of any length, the 
parallel loop is going to process data a lot faster.
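
For concreteness, here is a rough sketch of that parallel loop in portable
C, emulating the "test if any byte is a terminator" step with ordinary
64-bit arithmetic rather than a dedicated instruction.  The names
has_zero_byte and wide_strcmp are made up for illustration, and the sketch
assumes each string sits in a buffer that is readable up to the next
multiple of 8 bytes past its terminator.  memcpy stands in for whatever
non-aligned fetch mechanism the hardware would provide.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes packed in w is 0x00. */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical 8-bytes-at-a-time strcmp.
 * Assumption: both buffers are readable in whole 8-byte blocks
 * through the block that holds the terminator. */
static int wide_strcmp(const char *s1, const char *s2)
{
    for (;;) {
        uint64_t w1, w2;
        memcpy(&w1, s1, 8);          /* get 8 bytes from string1 */
        memcpy(&w2, s2, 8);          /* get 8 bytes from string2 */
        if (w1 != w2 || has_zero_byte(w1))
            break;                   /* difference or terminator in this block */
        s1 += 8;
        s2 += 8;
    }
    /* Resolve the final block one byte at a time. */
    for (;;) {
        unsigned char c1 = (unsigned char)*s1++;
        unsigned char c2 = (unsigned char)*s2++;
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (c1 == 0)
            return 0;
    }
}
```

Only string1 needs the terminator test: if the two words are equal, a
terminator in one is a terminator in the other.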

My impression is that the hardware overhead for implementing a set
of parallel instructions (8xbyte, 4xshort, 2xlong) would not be that
great because in most cases (add, mul, compare) the full 64-bit operation 
calculates them as intermediate results anyway.  So why not bring them out,
add a parallel set of condition code registers, and build a micro SIMD?
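
To illustrate how close a full-width add already is to eight byte adds,
here is a small software emulation (the function name add_8x8 is invented
for this sketch).  The only thing separating the two results is the carry
that crosses each byte boundary, which is the part hardware would
presumably gate off.

```c
#include <stdint.h>

/* Add eight unsigned bytes packed in a and b, lane by lane.
 * Each lane wraps modulo 256; no carry leaks into its neighbor. */
static uint64_t add_8x8(uint64_t a, uint64_t b)
{
    const uint64_t HI = 0x8080808080808080ULL;  /* top bit of every lane */

    /* Add the low 7 bits of each lane; any carry lands in that lane's
     * own top bit, so it cannot cross a lane boundary. */
    uint64_t low = (a & ~HI) + (b & ~HI);

    /* Fold in the top bits with XOR, which discards the carry out. */
    return low ^ ((a ^ b) & HI);
}
```

For example, add_8x8(0x01FF, 0x0101) gives 0x0200, where an ordinary
64-bit add would give 0x0300 because the low byte's carry spills upward.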

Programming byte or short word operations for a machine like this would
be similar to programming floating point on a vector machine like the
Cray.  A pain in the neck by hand, but the compilers seem to be able to
do it moderately well.

David States            A bit may not be a terrible thing to waste, but 
			throwing them away 56 at a time is ridiculous.

colin@array.UUCP (Colin Plumb) (08/12/90)

Look at the Am29000 (and 29050, soon).  They have a CPBYTES instruction
which does what you want, 32 bits at a time.  It returns
v1.b0 == v2.b0 || v1.b1 == v2.b1 || v1.b2 == v2.b2 || v1.b3 == v2.b3
so you can write strcmp as:

loop:
	word1 = *pointer1++;
	word2 = *pointer2++;
	if (word1 != word2)
		goto notequal;
	if (cpbytes(word1,0) == FALSE)	/* no terminator in this word */
		goto loop;

	<check trailing bytes>
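
For anyone without the hardware handy, the loop above can be tried out
with a software stand-in.  This is only a sketch: cpbytes here is a plain
C function emulating the any-byte-equal result described above, not the
instruction itself, and word_strcmp is a made-up name.  It assumes both
strings are stored in word-aligned buffers that are readable through the
word containing the terminator.

```c
#include <stdint.h>

/* Emulation: 1 if any byte of a equals the corresponding byte of b. */
static int cpbytes(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;              /* equal byte lanes become 0x00 */
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}

/* Hypothetical word-at-a-time strcmp over word-aligned buffers. */
static int word_strcmp(const uint32_t *p1, const uint32_t *p2)
{
    for (;;) {
        uint32_t w1 = *p1, w2 = *p2;
        if (w1 != w2 || cpbytes(w1, 0))
            break;                   /* mismatch or terminator in this word */
        p1++;
        p2++;
    }
    /* Check trailing bytes in memory order, so the result does not
     * depend on the machine's byte order. */
    const unsigned char *b1 = (const unsigned char *)p1;
    const unsigned char *b2 = (const unsigned char *)p2;
    for (int i = 0; i < 4; i++) {
        if (b1[i] != b2[i])
            return b1[i] < b2[i] ? -1 : 1;
        if (b1[i] == 0)
            return 0;
    }
    return 0;                        /* not reached: the word held a mismatch or a zero */
}
```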

Perhaps someone from AMD could comment on the difference it's made
to their string instructions.
-- 
	-Colin