[comp.arch] Word vs. Byte Orientation; some kernel #s

mash@mips.UUCP (John Mashey) (04/17/87)

[I've been busy, and writing serious answers takes time!]:

In article <16122@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:

(various comments on the word-vs-byte addressing: Brian has commented in
a later posting that he still believed that word accessing was an
appropriate choice for the 29000, although he did note that one of
his circuit designers allowed that byte-alignment nets weren't necessarily
that expensive in time after all.)

>I don't know how to reconcile the facts that some of the best, most
>important computer scientists think word addressing is wrong with the fact
>that it seems so right for us.  All I can say is that Titan and MIPS
>machines have the advantage of being designed as a "closed" system; i.e.
>nearly all (system) details are controllable.

This is not necessarily irreconcilable.  Minds got changed, at least
partially, by actually implementing more complete environments AFTER
the paper was published.  Also, the primary focus of these efforts
has been on using the chips as the primary computing engines for
complete systems, which may well have different requirements from
controllers, depending on the application.

>I/O, especially where older chips (serial ports, etc.) are concerned, is
>a grungy issue.... just trying to point out that I/O is something
>to be dealt with separately from the processor-memory channel design.
>Dual-ported memory is *not* the only way:  how about a DMA chip to do
>all the alignment/bus-isolation?

I/O is indeed grungy.  The problem is that you can't always separate
the I/O design from memory system design [even though they are often
separate buses], in that:
	a) You MUST solve every problem.
	b) If you solve problems twice, you are probably wasting money.

>>	b) Some systems use block-oriented buses, often with write-back
>>	caches.  If the system is doing write-back for you, doing....
>I think you have a good point here.  Caches are nice in that they often
>don't have ECC so byte writing is much more feasible.  However, this is
>only one possible memory system design.  The Am29000 will be interfaced
>to many different kinds of memory systems.  At 30 MHz and beyond (where
>the Am29000 is intended to be), word-addressing is thought by us to be
>beneficial in many of these environments.

I think the real issue here is that one must think very hard about the
memory system surrounding the chip, i.e., what kinds of designs work,
and which don't.  If you are not exceedingly careful, you can easily
get designs that differ in performance by 1.5-2X, using the same
clock rate and instruction set, by varying the external memory
system.  For example, at higher clock rates, off-chip cache control
can become a severe bottleneck.  This may not be relevant to some
controller applications, but it is absolutely critical for systems
to have big, fast caches.
>
>>B. Performance reasons.
>>Domain: running UNIX and UNIX programs well.
>>When I was at CT, I spent a bunch of time tuning 68K C compilers....
>
>Sigh, please don't tell me about how a *vastly* different processor with
>*vastly* different time/instruction tradeoffs behaved.  I believe every
>word with respect to the 68K, and it would be naive of me to say that
>there is *nothing* valuable to be learned from your experience in that
>experiment.  But to say that the results of that experiment have binding
>implications for a processor like the Am29000 (and I am tempted to say
>the MIPS, but I am certainly not qualified to do so) seems just wrong to
>me.
Oops: I didn't make this clear.  Let me say some more: the 68K experience
included examining huge masses of generated .s files.  What was important
was that the operation:
	fetch partial word to register and convert to long (signed or unsigned)
occurred:
	a) very frequently.
	b) in many places where it could NOT be optimized away, even if
	one had many more registers.
I.e., the comment was really supposed to be a note on the implications of
C code in a generic sense, as derived from what I saw in the 68K code,
rather than on the 68K code itself.  It obviously didn't say this well.
(Note that this belief was confirmed by the numbers I posted: even with
great optimizing compilers, you still see lots of lb/lbu/lh/lhu instrs.)
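
As a sketch of what that operation involves on a machine with only word
loads: the fetch-and-convert step becomes a word load followed by a shift,
a mask, and (for the signed case) a sign extension.  The C below is an
illustration only.  The function names are hypothetical, and the big-endian
byte numbering (byte 0 is the most significant) is an assumption, not
something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: memory is modelled as an array of aligned
   32-bit words, and byte_addr is a byte offset into that array. */

/* Equivalent of an unsigned byte load (zero-extended). */
uint32_t load_byte_unsigned(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t word  = mem[byte_addr >> 2];           /* plain word load  */
    unsigned shift = (3u - (byte_addr & 3u)) * 8u;  /* big-endian lane  */
    return (word >> shift) & 0xFFu;                 /* mask to 8 bits   */
}

/* Equivalent of a signed byte load (sign-extended). */
int32_t load_byte_signed(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t b = load_byte_unsigned(mem, byte_addr);
    return (int32_t)(b ^ 0x80u) - 0x80;             /* portable sign extension */
}

/* Equivalent of an unsigned halfword load; address must be 2-aligned. */
uint32_t load_half_unsigned(const uint32_t *mem, uint32_t byte_addr)
{
    assert((byte_addr & 1u) == 0);
    uint32_t word  = mem[byte_addr >> 2];
    unsigned shift = (2u - (byte_addr & 2u)) * 8u;  /* upper or lower half */
    return (word >> shift) & 0xFFFFu;
}

/* Equivalent of a signed halfword load. */
int32_t load_half_signed(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t h = load_half_unsigned(mem, byte_addr);
    return (int32_t)(h ^ 0x8000u) - 0x8000;
}
```

On a byte-addressed machine each of these four functions corresponds to a
single lb, lbu, lh, or lhu instruction.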
>
>>If simulations are based only on user-level programs, you can get
>>some horrible surprises when you see what UNIX kernels do.  For example,
>>are halfword operations really necessary?
>>ANS: not if you look at their frequency in most UNIX C programs.
>>ANS: if you look at kernel: you bet! many kernel structures are packed
>>for efficiency, some are packed for necessity (you should see the pile
>>of halfword operations in Ethernet code... and you CANNOT sanely get
>>rid of them without rewriting everything).
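
To illustrate why packed structures matter here: with a word-only memory
interface, storing one halfword or byte field means reading the containing
word, merging in the new bits, and writing the word back.  What follows is
a minimal sketch, assuming big-endian byte numbering and hypothetical
helper names (store_half, store_byte).  It is not taken from any real
kernel or from any vendor's tool chain.

```c
#include <assert.h>
#include <stdint.h>

/* Memory is modelled as an array of aligned 32-bit words;
   byte_addr is a byte offset into that array. */

/* Halfword store done as read-modify-write; address must be 2-aligned. */
void store_half(uint32_t *mem, uint32_t byte_addr, uint16_t value)
{
    assert((byte_addr & 1u) == 0);
    unsigned shift = (2u - (byte_addr & 2u)) * 8u;   /* big-endian lane   */
    uint32_t mask  = (uint32_t)0xFFFFu << shift;
    uint32_t word  = mem[byte_addr >> 2];            /* read              */
    word = (word & ~mask) | ((uint32_t)value << shift);  /* merge         */
    mem[byte_addr >> 2] = word;                      /* write back        */
}

/* Byte store done the same way. */
void store_byte(uint32_t *mem, uint32_t byte_addr, uint8_t value)
{
    unsigned shift = (3u - (byte_addr & 3u)) * 8u;
    uint32_t mask  = (uint32_t)0xFFu << shift;
    uint32_t word  = mem[byte_addr >> 2];
    word = (word & ~mask) | ((uint32_t)value << shift);
    mem[byte_addr >> 2] = word;
}
```

Note that the read-merge-write sequence is not atomic with respect to
anything else that writes the same word, which is one reason it is hard
to bolt onto existing code after the fact.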

Here are some static frequencies for UMIPS-BSD 2.0 [4.3BSD + NFS]:

	 count  instr   % total loads   % total stores
	 18523  lw          80.9%
	 10770  sw                          81%
	  1708  sh                          12.8%
	  1455  lh           6.4%
	  1433  lhu          6.3%
	   969  lbu          4.2%
	   818  sb                           6.1%
	   176  lb           0.7%

(Apparently the NFS code had a lot of lw/sw in it; the lw/sw %
used to be around 75%.  UMIPS-V (SVR3+TCP/IP) is essentially similar,
although the fullword percentages are 1-2% lower.)
REMEMBER THESE ARE STATIC PERCENTAGES, NOT DYNAMIC.
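
The store column can be reproduced from the raw counts, assuming that
sw, sh, and sb are the only store opcodes tallied (an assumption; the
post gives no overall totals, so loads are not checked here).  The
function names below are hypothetical and the sketch is only a sanity
check on the arithmetic.

```c
#include <assert.h>
#include <stdio.h>

/* Static store counts quoted in the post. */
enum { SW_COUNT = 10770, SH_COUNT = 1708, SB_COUNT = 818 };

/* Percentage share of one opcode count within a total. */
double share_percent(unsigned long count, unsigned long total)
{
    assert(total != 0);
    return 100.0 * (double)count / (double)total;
}

/* Assumed total: only the three store opcodes listed above. */
unsigned long store_total(void)
{
    return (unsigned long)SW_COUNT + SH_COUNT + SB_COUNT;
}

/* Print each store opcode's share of the assumed total. */
void print_store_shares(void)
{
    unsigned long total = store_total();
    printf("sw %.1f%%\n", share_percent(SW_COUNT, total));
    printf("sh %.1f%%\n", share_percent(SH_COUNT, total));
    printf("sb %.1f%%\n", share_percent(SB_COUNT, total));
}
```

Under that assumption, the partial-word stores (sh plus sb) come to
roughly 19% of the static store count.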
>
(analysis of costs for MIPS to not have the partial word ops)
>
>This seems valid, at first glance, for your situation.  But it is not
>directly applicable to the Am29000 because there is a *cost* associated
>with on-chip byte support.  Thus, you gain some, you lose some.  We
>see about twice as many loads as stores.  Plus, the stack cache decreases
>the load/store percentage overall with respect to a machine (like the
>MIPS) with "only" 32 fixed registers.  We seem to have about half as
>many loads/stores, but it varies (and my compiler ain't the best, e.g.
>no register coloring for memory-resident stuff).  This lower load/store
>percentage might be another reason that word orientation is more appropriate
>for the Am29000 ....

Actually, there is some evidence that the marginal utility of extra
registers in machines like IBM 801, MIPS R2000, etc [i.e., 32-register
ones] drops off in the 24-28 range.  We have a bunch of statistics that
seem to support that, at least somewhat [i.e., we have numbers of
times each register was used, dynamically.]  The only exception is if
you're using ultra-state-of-the-art interprocedural register allocation,
in which case more registers might be useful.
Also, you may find you'll need to allocate more registers for holding
addresses, or else burn the cycles to rematerialize absolute or
nonzero-offset addresses.  [The numbers I've got on those issues
are not obviously applicable elsewhere.]
>
(analysis of statistics for various programs, incl. Dhrystone)
>But, just a few lines later you'll point out how having a word-oriented
>processor-memory channel *helps* (artificially, since dhrystone is
>artificial) dhrystone performance.  I'm sorry, but you must stick
>to one argument. :-)
Oops. The point was: Dhrystone is a program whose word-heavy orientation
is beneficial to word-accessed machines, i.e., such machines would
do better on Dhrystone than on the class of programs with more typical
partial-word statistics.
>
>
>Just in case you are trying to make a subtle intimation:  WE DID NOT
>"OPTIMIZE" THE AM29000 ARCHITECTURE FOR ANY PARTICULAR PROGRAM.  The
>architecture was pretty much fixed before we had significant simulation
>results (I know, I know; that was the wrong way to do things, but we
>had no choice).  We *did* add the now-infamous compare-bytes instruction
>very late (after we had simulation results).....

Actually, I hadn't been intending to intimate that the 29000 had been
designed by analyzing Dhrystone.  The comment was more a caution
to anybody starting now to watch out for it, and other small benchmarks.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@amdcad.UUCP (04/22/87)

In article <311@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>Plus, the stack cache decreases
>>the load/store percentage overall with respect to a machine (like the
>>MIPS) with "only" 32 fixed registers.  We seem to have about half as
>>many loads/stores, but it varies (and my compiler ain't the best, e.g.
>>no register coloring for memory-resident stuff).  This lower load/store
>>percentage might be another reason that word orientation is more appropriate
>>for the Am29000 ....
>
>Actually, there is some evidence that the marginal utility of extra
>registers in machines like IBM 801, MIPS R2000, etc [i.e., 32-register
>ones] drops off in the 24-28 range.

I believe your numbers are right.  Without the stack cache model,
I think the Am29000 would have load/store percentages very close to
those for machines like the MIPS, and indeed, more than 32 registers
would not be of much use for one process (but could be used to improve
process context switch times somewhat).  A stack cache performs, if
you will, a "dynamic" interprocedural analysis.  Not really, but some
of the benefits of real interprocedural analysis are gained.

One of the things I know for sure is that some potential Am29000 users
are really happy about the large register file, but not necessarily
because of its ability to implement a stack cache.

    bcase

mash@mips.UUCP (John Mashey) (04/23/87)

In article <16295@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
....notes on where dropoff is on extra register utility....
>One of the things I know for sure is that some potential Am29000 users
>are really happy about the large register file, but not necessarily
>because of its ability to implement a stack cache.

This certainly seems reasonable, in that I can think of some
controller applications where the use of multiple register sets
may be quite useful.  As I've noted earlier, our analyses didn't
seem to indicate that as a good tradeoff for general-purpose systems.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086