[comp.arch] AMD 29000 byte access -- apology and further commentary

gnu@hoptoad.UUCP (04/05/87)

I stand somewhat corrected.  The AMD 29000 does have byte insert and
extract operations.  They work register-to-register, and take one
cycle.  They *do* depend on state hidden in the system's status
register (the low 2 bits of the last address loaded) so an optimizer is
not free to move the instruction past another load, but they're better
than shifts and masks for sure.  At the time of my posting, I didn't
have the real 29000 manual; it has since arrived.  (Thanx to Phil Ngai for
posting how to order it.)
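
For illustration, here is a rough C model of what a register-to-register
byte extract and insert amount to once the low 2 address bits are known.
The function names, the explicit byte_pointer argument (standing in for
the hidden state), and the choice to number byte 0 as the most
significant byte are all assumptions made for this sketch; none of them
come from the manual.

```c
#include <stdint.h>

/* Hypothetical model: byte_pointer plays the role of the low 2 bits of
   the last address loaded.  Byte 0 is assumed to be the most
   significant byte of the word (big-endian numbering). */
static uint32_t extract_byte(uint32_t word, unsigned byte_pointer)
{
    unsigned shift = (3u - (byte_pointer & 3u)) * 8u;
    return (word >> shift) & 0xFFu;
}

/* Replace the selected byte of word with the low 8 bits of value. */
static uint32_t insert_byte(uint32_t word, unsigned byte_pointer, uint32_t value)
{
    unsigned shift = (3u - (byte_pointer & 3u)) * 8u;
    return (word & ~(UINT32_C(0xFF) << shift)) | ((value & 0xFFu) << shift);
}
```

The shift-and-mask work shown here is what the single-cycle instruction
saves; the dependence on byte_pointer is what ties it to the preceding
load.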

I said:
> In other words, the designers of the 29000 did not think at all
> about typical Unix code...
In article <15337@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> Ok, we have taken deep breaths and counted to 10, so now we can respond
> without flaming (that comment *is* a personal insult).

Sorry for the 'personalness' of the insult; it was unintended.  My 
complaint was with the architecture, not the designers.  I said it badly,
and I'm sorry.

>                                                         One of the goals
> of the Am29000 is flexibility for the user.  The Am29000 allows you to
> either use the byte manipulation instructions in conjunction with a word-
> oriented memory system *or* implement a byte-oriented memory system and
> use the option bits in the load/store control field to select the size of
> a memory access.  The flexibility is there...

Does this mean that a compiler can't tell whether a load instruction
will return the entire word, or just a byte?  It would seem that you'd
need two different compilers, libraries, etc for a machine that really
implemented a byte-oriented memory for the 29000.  If the compiler
generated a 'load' and then an 'extract byte', it would extract the
wrong byte, since the 'load' would have already extracted the
relevant byte into the low end of the destination register.

As I recall, the 29000 also has a mode in which it will trap to the OS
on any unaligned access to memory; this would seem to make both of
the above methods fail.  (Normally it ignores the low 2 bits of the
address.)  Wouldn't this require a third compiler?  And how would you
ever fetch unaligned bytes with this mode turned on?  You could mask
the address before using it, do the word load, then somehow move the
2 low bits into the system status register (may take a few instructions),
then do the extract byte, but it all seems very slow and cumbersome
as a way to get one character from RAM.
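
To make the sequence concrete, here is a small C sketch of fetching one
character through a word-oriented memory: mask the address, load the
aligned word, then pick out the byte using the 2 low bits.  The memory
model (an array of 32-bit words), the name load_char, and the
big-endian byte numbering are assumptions for illustration only, not a
description of the actual 29000 instruction sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-oriented memory: an array of 32-bit words, with
   byte 0 of each word assumed to sit in the most significant position. */
static uint8_t load_char(const uint32_t *memory, size_t byte_address)
{
    size_t   word_index   = byte_address >> 2;          /* drop the low 2 bits */
    unsigned byte_pointer = (unsigned)(byte_address & 3u);
    uint32_t word         = memory[word_index];         /* aligned word load */
    unsigned shift        = (3u - byte_pointer) * 8u;   /* select the byte */
    return (uint8_t)(word >> shift);
}
```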

> > First you see if the operands overlap, then...are they aligned, then...
> > are they wider than a word, then...wider than two words?, then...

Tim objected to this characterization of how to do strcpy on a word
oriented machine.  However, it still looks to me like a good
description of the strcpy code they posted.

I do like the funnel shifter.

> To see the 8088 and the Am29000 compared, as if they were comparable, is
> worse than our worst nightmares.

AMD sells them both, does it not?
-- 
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu

bcase@amdcad.UUCP (04/06/87)

In article <1960@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>I stand somewhat corrected.  The AMD 29000 does have byte insert and
>extract operations.  They work register-to-register, and take one
>cycle.  They *do* depend on state hidden in the system's status
>register (the low 2 bits of the last address loaded) so an optimizer is
>not free to move the instruction past another load, but they're better
>than shifts and masks for sure.

Yeah, the residual control has its restrictions, but an optimizer *can*
move instructions between the load and the byte-insert/extract.

>>                                                         One of the goals
>> of the Am29000 is flexibility for the user.  The Am29000 allows you to
>> either use the byte manipulation instructions in conjunction with a word-
>> oriented memory system *or* implement a byte-oriented memory system and
>> use the option bits in the load/store control field to select the size of
>> a memory access.  The flexibility is there...
>
>Does this mean that a compiler can't tell whether a load instruction
>will return the entire word, or just a byte?  It would seem that you'd
>need two different compilers, libraries, etc for a machine that really
>implemented a byte-oriented memory for the 29000.  If the compiler
>generated a 'load' and then an 'extract byte', it would extract the
>wrong byte, since the 'load' would have already extracted the
>relevant byte into the low end of the destination register.

You are sorta right here, but rather than two compilers, one compiler
with a switch should be sufficient.  There are library implications as
well.  However, one system has one kind of memory, so a set of tools
for that system will do the right things.  The main reason we included
the byte-oriented memory support is for the obscure controller application
where the cost for byte-oriented memory is low (maybe this is when the
total amount of memory is small?) but the cost for software byte-support
is high.

>As I recall, the 29000 also has a mode in which it will trap to the OS
>on any unaligned access to memory; this would seem to make both of
>the above methods fail.  (Normally it ignores the low 2 bits of the
>address.)  Wouldn't this require a third compiler?  And how would you
>ever fetch unaligned bytes with this mode turned on?  You could mask
>the address before using it, do the word load, then somehow move the
>2 low bits into the system status register (may take a few instructions),
>then do the extract byte, but it all seems very slow and cumbersome
>as a way to get one character from RAM.

Well, the unaligned-access trap facility exists in order to provide some
level of support for "old" databases.  That is, lots of machines allow
access to any size of data on any boundary (unaligned 32-bit words,
unaligned 16-bit halfwords).  There must (we guess) be many databases
out there that were created under the assumption that such access will
always be possible.  A program accessing these databases can, on
the Am29000, be run with this trap enabled.  Whenever the lower two
address bits are not both zero (and, therefore, the access might be
unaligned), a trap will be taken and the situation can be correctly
dealt with.  Yeah, it's grungy, but it can work to provide compatibility
and costs very little in the implementation.  Note that the compiler must
assume the second method of byte accesses (option bits are used to select
the access size) making this trap a supplementary facility (not a third
method).  If the memory provides support for variable size accesses
*and* unaligned variable size accesses, then the trap need not be
turned on.
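
As one possible illustration of "dealing with" such a trap, the C
sketch below assembles an unaligned 32-bit word from the two aligned
words that contain it.  This is only a guess at what a handler might
do; the name load_unaligned_word, the array-of-words memory model, and
the big-endian byte order are assumptions, and a real handler would
also have to decode the faulting instruction and write the result back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: memory is modeled as an array of aligned 32-bit
   words, with byte 0 of each word assumed most significant. */
static uint32_t load_unaligned_word(const uint32_t *memory, size_t byte_address)
{
    size_t   index  = byte_address >> 2;
    unsigned offset = (unsigned)(byte_address & 3u);
    uint32_t high   = memory[index];

    if (offset == 0)
        return high;                 /* aligned: one load suffices */

    uint32_t low = memory[index + 1];
    /* Concatenate the two words and take 32 bits starting at the
       requested byte, much as a funnel shifter would. */
    return (high << (offset * 8u)) | (low >> (32u - offset * 8u));
}
```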

>> > First you see if the operands overlap, then...are they aligned, then...
>> > are they wider than a word, then...wider than two words?, then...
>
>Tim objected to this characterization of how to do strcpy on a word
>oriented machine.  However, it still looks to me like a good
>description of the strcpy code they posted.

Well, in the code we posted (and that I "wrote" with the help of my C
compiler), all the alignment and overlap checking is done before the
movement (or compare) method is chosen.  Yes, there is some overhead, but
the effect has been demonstrated to be positive (although for *very*
short strings it might not be positive.  Sigh, there are so many variables
in computer architecture).
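
A minimal C sketch of the "check first, then pick a method" idea
follows.  It is not the code that was posted: the name sketch_strcpy
and the zero-byte test are illustrative choices, overlap checking is
left out, and the word path assumes that the whole aligned word
containing the terminator can be read safely.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes of w is zero (a well-known bit trick). */
static int has_zero_byte(uint32_t w)
{
    return ((w - UINT32_C(0x01010101)) & ~w & UINT32_C(0x80808080)) != 0;
}

static char *sketch_strcpy(char *dst, const char *src)
{
    char *out = dst;

    /* Decide the method once, up front: the word loop is used only if
       both pointers are 4-byte aligned. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 3u) == 0) {
        for (;;) {
            uint32_t w;
            memcpy(&w, src, sizeof w);      /* aligned word load */
            if (has_zero_byte(w))
                break;                      /* terminator is in this word */
            memcpy(dst, &w, sizeof w);      /* aligned word store */
            src += 4;
            dst += 4;
        }
    }

    /* Byte loop: handles the unaligned case and the final partial word. */
    while ((*dst++ = *src++) != '\0')
        ;
    return out;
}
```

For a string of two or three characters the alignment test and the
zero-byte test are pure overhead, which is the short-string caveat
above.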

>I do like the funnel shifter.

>> To see the 8088 and the Am29000 compared, as if they were comparable, is
>> worse than our worst nightmares.
>
>AMD sells them both, does it not?

Sigh, you got me there (but *I* had nothing to do with the decision to
carry the 8088 :-), but seeing the 8088 and the Am29000 compared is
*still* worse than our worst nightmares.

    bcase