[comp.arch] String Processing Instruction -- AMD 29000 has *slow* byte access

gnu@hoptoad.UUCP (03/30/87)

In article <15313@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
> In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
> >	(a) significantly slow byte addressing to begin with (ala cray)?
> >if (a), then improving memory byte access speed in the architecture is a
> >more general solution with more payoff overall than the compare gate hack.
> >what is the risc chip cost for byte vs. word addressability, anyway?

James Woods hit the nail on the head with this question.  From the
preliminary 29000 description, there are *no* byte instructions.
This means that *cp turns into about 4 instructions:  load a word, 
shift, mask, etc.  Worse, *cp= turns into many more, since you have
to load the target word, shift a mask to the byte of interest, mask
out the old value, shift the new value, "or" it in, and store the word
back.
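
The two sequences described above can be sketched in C.  This is only an
illustration of the shift-and-mask steps on a word-addressed memory, not
29000 code: the names load_byte, store_byte and shift_for are hypothetical,
and big-endian byte numbering within a 32-bit word is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: byte 0 is the most significant byte
   of a 32-bit word (big-endian numbering). */
static unsigned shift_for(unsigned idx)
{
    assert(idx < 4);
    return (3u - idx) * 8u;
}

/* *cp as a read: load the word, shift, mask. */
uint8_t load_byte(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t word = mem[byte_addr >> 2];            /* load the word   */
    unsigned sh = shift_for(byte_addr & 3u);
    return (uint8_t)((word >> sh) & 0xFFu);         /* shift, mask     */
}

/* *cp= as a write: load, build mask, clear, shift new value, or, store. */
void store_byte(uint32_t *mem, uint32_t byte_addr, uint8_t value)
{
    unsigned sh = shift_for(byte_addr & 3u);
    uint32_t word = mem[byte_addr >> 2];            /* load target word */
    uint32_t mask = (uint32_t)0xFFu << sh;          /* shift a mask     */
    word &= ~mask;                                  /* mask out old     */
    word |= (uint32_t)value << sh;                  /* shift, "or" new  */
    mem[byte_addr >> 2] = word;                     /* store word back  */
}
```

Here mem stands in for word-addressed memory and byte_addr is a byte offset
into it; the write path touches all four bytes of the containing word.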

In other words, the designers of the 29000 did not think at all
about typical Unix code like:

	register char *p, *q;
	while (*p++ = *q++) ;

which takes on the order of 10 instructions PER BYTE.  (I'd be
interested in seeing the generated code for this program.)  The 680x0
does it in 2 instructions, and even the dumb PCC compiler generates them.

I helped to write some of the code to do rasterops on the Sun, and
I remember what kind of code it takes to do bit-aligned copying
on an instruction set that doesn't support bit fields.
Character strings are to the 29000 what bit fields are to the 68010.
First you see if the operands overlap, then...are they aligned, then...
are they wider than a word, then...wider than two words?, then...
You can make it fast, or you can make it simple...or maybe neither.

I think the lack of byte stores, in particular, and byte addressing, in
general, is the worst bug in the 29000.  Then again, it's better than an 8088...
-- 
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu

tim@amdcad.UUCP (03/31/87)

In article <1945@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> James Woods hit the nail on the head with this question.  From the
> preliminary 29000 description, there are *no* byte instructions.
> This means that *cp turns into about 4 instructions:  load a word, 
> shift, mask, etc.  Worse, *cp= turns into many more, since you have
> to load the target word, shift a mask to the byte of interest, mask
> out the old value, shift the new value, "or" it in, and store the word
> back.

Wait a minute, we *do* have byte-insert and byte-extract instructions.
*cp as an rvalue is two instructions (load/byte-extract), *cp as an lvalue
is three (load/byte-insert/store) if no optimization can be performed
(such as a re-use of the loaded value).  Sure, the code size is larger,
but the number of cycles is the same as on other processors (if not
better):  remember byte-alignment networks slow down memory systems.
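
A rough C sketch of the "re-use of the loaded value" idea, under stated
assumptions:  both buffers are word-aligned, bytes are numbered big-endian
within a 32-bit word, and copy_string_words is a hypothetical name.  The
shifts and masks only stand in for byte-extract and byte-insert; this is
not Am29000 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy a NUL-terminated string between word-aligned buffers.
   Returns the number of source words loaded. */
size_t copy_string_words(uint32_t *dst, const uint32_t *src)
{
    size_t loads = 0;
    assert(dst != NULL && src != NULL);

    for (size_t w = 0; ; w++) {
        uint32_t in = src[w];   /* one load serves up to four bytes     */
        uint32_t out = dst[w];  /* keeps bytes that follow the NUL      */
        loads++;

        for (unsigned i = 0; i < 4; i++) {
            unsigned sh = (3u - i) * 8u;
            uint32_t byte = (in >> sh) & 0xFFu;             /* extract  */
            out = (out & ~((uint32_t)0xFFu << sh))          /* insert   */
                | (byte << sh);
            if (byte == 0) {
                dst[w] = out;   /* store the partly filled last word    */
                return loads;
            }
        }
        dst[w] = out;           /* one store per four bytes copied      */
    }
}
```

With aligned operands the loop does one load and one store per word rather
than per byte; unaligned operands would need extra shifting that this
sketch does not attempt.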

> In other words, the designers of the 29000 did not think at all
> about typical Unix code like:
> 
> 	register char *p, *q;
> 	while (*p++ = *q++) ;
> 
> which takes on the order of 10 instructions PER BYTE.  (I'd be
> interested in seeing the generated code for this program.)  The 680x0
> does it in 2 instructions, and even the dumb PCC compiler generates them.

Ok, we have taken deep breaths and counted to 10, so now we can respond
without flaming (that comment *is* a personal insult).  One of the goals
of the Am29000 is flexibility for the user.  The Am29000 allows you to
either use the byte manipulation instructions in conjunction with a word-
oriented memory system *or* implement a byte-oriented memory system and
use the option bits in the load/store control field to select the size of
a memory access.  The flexibility is there, but we feel that a word-
oriented memory has the best overall performance.  Therefore, our simulator
uses this model.  While the above code does appear (and it does compile
to about 10 instructions using our internal compiler) in programs, it is
not frequent enough to warrant optimizing the system for it.  Besides,
according to your own earlier comments, "fix the application, not the
system," we should replace the above code with a call to strcpy ().

> I helped to write some of the code to do rasterops on the Sun, and
> I remember what kind of code it takes to do bit-aligned copying
> on an instruction set that doesn't support bit fields.
> Character strings are to the 29000 what bit fields are to the 68010.
> First you see if the operands overlap, then...are they aligned, then...
> are they wider than a word, then...wider than two words?, then...
> You can make it fast, or you can make it simple...or maybe neither.

You are making too many assumptions!  (We'll be happy to send you a copy of
the manual....)  You say that speed and simplicity are mutually exclusive,
but we think they are compatible:  the Am29000 has both speed and simplicity.

Re:  Bit fields.  The Am29000 does not have explicit bit field instructions;
instead, we have a 32-bit from 64-bit funnel-shifter/extractor (it slices,
dices, and makes hundreds of julienne fries in minutes :-).  This is the
general primitive needed; graphics guys tell us they like it.  It even
comes in handy for strcpy () and strcmp ().
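
For readers who have not met a funnel shifter, here is a minimal C model of
a 32-bit-from-64-bit extract.  The names funnel_extract and
load_unaligned_word are hypothetical, big-endian byte numbering is assumed,
and the exact operand conventions of the Am29000 instruction are not
reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Treat hi:lo as one 64-bit value and return the 32 bits that begin
   'shift' bits below the top of hi (shift in 0..31). */
uint32_t funnel_extract(uint32_t hi, uint32_t lo, unsigned shift)
{
    assert(shift < 32);
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (32u - shift));
}

/* Read a 32-bit word starting at an arbitrary byte offset of a
   word-addressed memory, the kind of step an unaligned strcpy or
   strcmp needs. */
uint32_t load_unaligned_word(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t w = byte_addr >> 2;
    unsigned off = byte_addr & 3u;
    if (off == 0)
        return mem[w];          /* aligned: no second word is touched */
    return funnel_extract(mem[w], mem[w + 1], off * 8u);
}
```

Because one primitive covers every offset, the same two-word extract
handles both bit-field and unaligned-byte cases.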

> I think the lack of byte stores, in particular, and byte addressing, in
> general, is the worst bug in the 29000.  Then again, it's better than an 8088

Well, we have established that there is no lack of byte stores or addressing
in the Am29000.

To see the 8088 and the Am29000 compared, as if they were comparable, is
worse than our worst nightmares.

Tim Olson
Brian Case
Smeeta Gupta