[comp.arch] Word vs. Byte Orientation

bcase@amdcad.UUCP (04/13/87)

In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Well, I've been staying away from this, but since you asked, OK.
>But Brian, I'm sad to say you may not be happy when you're finished
>reading this.

Well, I guess I am not overjoyed or anything, but if you meant "I think you
will regret designing the Am29000 with the emphasis on word-addressing"
you are wrong.  If you meant "You'll be sorry you asked", never!  Let's
have some lively discussions!  I'll try to make my points with as few words
as possible.

>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
>  1) Byte extraction from words should be free in time; it'll cost a few gates.
>	Basically this requires one or more cross-couplings in the memory path.
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
>penalty for having load-byte, load-byte unsigned, load-half (signed and
>unsigned), and even load-partial-word (left and right) for dealing with
>unaligned operands.  It takes some silicon, but it didn't add to the
>critical path. [Don't ask me how this is done, but I assure you it is
>possible.]

Sigh, byte extraction *does* take only a few gates.  Silicon area was
never the issue.  Depending, I guess greatly, on implementation details,
it is *not* free in time.  I'll try to get our circuit designer(s) to post
a comment on this one.  All I can say is that we designed the Am29000 to
*start* at 25 MHz and go from there.  There is a critical path in the chip
from the pins to the execute unit.  Yes, I know this circuitry will scale
as well as any on the chip, but the set-up and hold time effects may not.

>Now, for some history: Brian earlier cited the 1982 paper "Hardware/Software
>Tradeoffs for Increased Performance" by Hennessy, et al, which argued
>fairly heavily for word-addressed memory with byte extract/insert.
>Now, there are the following facts:
>	I.e., they have changed their minds, at least somewhat.

I don't know how to reconcile the fact that some of the best, most
important computer scientists think word addressing is wrong with the fact
that it seems so right for us.  All I can say is that Titan and MIPS
machines have the advantage of being designed as a "closed" system; i.e.
nearly all (system) details are controllable.

>The remainder of this will deal with some structural reasons why word
>addressing with extract/insert is painful in certain environments,
>followed by a bunch of statistical analyses that describe the performance
>loss MIPS machines would suffer if we did it that way.
>
>A. Structural reasons: (this is mostly systems, maybe not some controllers)
>1) Memory system design:
>	a) Memory system designs with dual-porting of memory, including
>	an I/O bus, usually have to respond to partial-word operations
>	to keep I/O controllers happy.  Having done that, it is perfectly
>	feasible to deal with byte writes, either cheaply, with parity,
>	or somewhat more expensively with ECC.

I/O, especially where older chips (serial ports, etc.) are concerned, is
a grungy issue.  But I would most *certainly* not slow the memory system
with respect to the processor just for the sake of very infrequent
interchanges with I/O chips.  Notice:  an on-chip alignment network for
byte extraction/insertion (the only alignment network implementation that
makes sense in the majority of instances and the one that we are debating
here, I am assuming) does *not* solve this problem (it cannot do the
alignment needed by the I/O devices when they deal directly with memory).
Besides, the bulk of I/O-to-memory transactions are block transfers from disk and 
tape devices.  Why can't these be word-oriented?  For the cheapest systems,
it might be nice to hang the old serial port chip right on the processor
bus, but I don't think you want to buy an Am29000 (or a MIPS chip) just
so you can slow the thing down with stupid system design.  Don't get me
wrong, I am not flaming:  just trying to point out that I/O is something
to be dealt with separately from the processor-memory channel design.
Dual-ported memory is *not* the only way:  how about a DMA chip to do
all the alignment/bus-isolation?

>	b) Some systems use block-oriented buses, often with write-back
>	caches.  If the system is doing write-back for you, doing
>	load-word [causing WB-cache to fetch the cache line, if needed]
>	insert byte
>	store-word
>	VERSUS JUST:
>	store-byte [causing cache line to be fetched, if necessary]
>	sure looks like there is at least a 2-cycle hit, maybe a 3-cycle
>	hit, if you don't have 1-cycle cache accesses.

I think you have a good point here.  Caches are nice in that they often
don't have ECC so byte writing is much more feasible.  However, this is
only one possible memory system design.  The Am29000 will be interfaced
to many different kinds of memory systems.  At 30 MHz and beyond (where
the Am29000 is intended to be), word-addressing is thought by us to be
beneficial in many of these environments.

>	c) Some systems use write-thru caches with write-buffers
>	[VAX-780, I assume 8700s, etc, although not 8600/8650].
>	Sometimes the write-buffers gather contiguous bytes, then send
>	a whole word to memory. Again, having code that does lw/insert/sw
>	just adds cycles.

Another good point.  Same comments in general.

>2) I/O system design.  This is clearly not true of all systems, but
>you run into it often enough. IT IS OFTEN EXCEEDINGLY INCONVENIENT
>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>DEVICE CONTROLLERS.  [other stuff]

Agreed, but again, why not solve the problem (with an interface chip
or design approach) instead of propagating it to the processor-memory
channel?  I have sympathy for OS people:  I was an OS person for just
a short while.  The choice between dumb system design and creating
problems for the OS people when they must deal with older chips/boards
is a tough one (really, because:  the OS is part of the system design
too).

>As I read the 29000 specs, maybe it would be possible to use both modes,
>where main memory uses word+insert/extract, and the I/O path has the
>alignment network, and uses the load/store control fields to yield
>partial-word ops.  It will be interesting to see how the C compiler
>compiles a device driver that uses both memory and I/O addresses...
>There's probably some way around it, but I do believe that it's more than
>picking up an off-the-shelf controller and its associated driver,
>making a few tweaks, and running it.

Well, we don't have any plans right now for a compiler which would
allow "mixed-mode" memory orientation.  More likely, some (a significant
amount?  Just a little?  In between?  I don't know...) assembly language
programming will have to be done.  Perhaps the OS guys will start forcing
the hardware guys to design-in only coherently-designed peripheral chips
(if they exist) or forcing them to design hardware to hide (is this
possible? in some cases it is) the problems.

>B. Performance reasons.
>Domain: running UNIX and UNIX programs well.
>
>1. Some qualitative observations:
>
>When I was at CT, I spent a bunch of time tuning 68K C compilers.
>In particular, I looked at the prevalence of code like:
>	move byte to register, extend, extend
>	move byte to register, and with 255 to get byte alone OR
>	clear register, move byte to register
>I was able to get noticeable improvements in at least some programs
>by optimizing away some of the unnecessary cases.  It was sure clear that
>a lot of cycles were burnt by the extends, or the and/clear, i.e.,
>one really wished for load byte [signed or unsigned].

Sigh, please don't tell me about how a *vastly* different processor with
*vastly* different time/instruction tradeoffs behaved.  I believe every
word with respect to the 68K, and it would be naive of me to say that
there is *nothing* valuable to be learned from your experience in that
experiment.  But to say that the results of that experiment have binding
implications for a processor like the Am29000 (and I am tempted to say
the MIPS, but I am certainly not qualified to do so) seems just wrong to
me.

>If simulations are based only on user-level programs, you can get
>some horrible surpises when you see what UNIX kernels do.  For example,
>are halfword operations really necessary?
>ANS: not if you look at their frequency in most UNIX C programs.
>ANS: if you look at kernel: you bet! many kernel structures are packed
>for efficiency, some are packed for necessity (you should see the pile
>of halfword operations in Ethernet code... and you CANNOT sanely get
>rid of them without rewriting everything).

I am sure that you are right; I really can't speak too well from experience.
The fact that we were simply ill-equipped to do kernel-level simulations
was one of our biggest weaknesses.  But again, even in light of the
fact that the kernel does lots of sub-word size stuff, does this really
mean that the Am29000 should assume a byte-oriented/half-word oriented
memory?

>2. Some quantitative observations.  As most people in this newsgroup
>know, we do a huge amount of simulation on very large programs
>to analyze performance, look at different board designs and future
>chip tradeoffs.  We get complete instruction traces, so we get outputs
>that look like:
>Summary:

Wow, our simulation output looks much the same, with some of the numbers
being represented differently.  Great minds think alike.  :-)

>Thus, we have really precise statistics on what's going on, at least on
>our machines, at the user-level, for anything from typical UNIX programs
>(like nroff), to large simulators [spice, espresso],
>parts of the compiler system [assembler, optimizer, debugger],
>to benchmarks like whetstone, dhrystone, linpack.

Sigh, I wish we could do such simulations.

>I think one can find a gross cost [to us, in our architecture, no
>necessary applicability to others] in user programs, as follows,
>if we had done byte extract/insert, instead of what we did:
>
>For each partial-word load, add 1 cycle. (for the extract)
>For each partial-word store, add 2+N cycles (where you have a load,
>insert added, and where N (might be 0) is the extra actual cycle cost
>to get data from the cache, noting that some of the cost might be
>taken care of by pipeline scheduling.)

This seems valid, at first glance, for your situation.  But it is not
directly applicable to the Am29000 because there is a *cost* associated
with on-chip byte support.  Thus, you gain some, you lose some.  We
see about twice as many loads as stores.  Plus, the stack cache decreases
the load/store percentage overall with respect to a machine (like the
MIPS) with "only" 32 fixed registers.  We seem to have about half as
many loads/stores, but it varies (and my compiler ain't the best, e.g.
no register coloring for memory-resident stuff).  This lower load/store
percentage might be another reason that word orientation is more appropriate
for the Am29000 (but note that a given system need not implement a stack
cache in the local register file (register banking for fast context
switching may be a better use of the registers); in this case, the load/
store percentage will go back up and bets are off; Sigh, what's a
computer architect to do?).

>So here are a few examples: I'll give the % of instruction cycles
>for each instruction, and compute the penalty by using N=0.
>I'll ignore numbers too small to matter much.
>
>as1 (assembler 1st pass)
>opcode	%	penalty (dynamic)
>lbu	4.6%	4.6%
>sb	1.5%	3.0%
>lh	0.27%	0.27%
>lhu	0.07%	0.07%
>sh	0.02%	0.02%
>TOT		8%  penalty in instruction cycles, assuming N=0 (best case)

This is OK assuming that byte/halfword alignment costs nothing.  Again, I
am just drawing attention to this missing side of the argument.

>There is also a static code-size penalty, I'll only do one since I don't
>think this is a major issue, but it is interesting;
>opcode	%	penalty (static)
>lbu	4.7%	4.7%
>sb	3.2%	6.4%
>lh	0.27%	0.27%
>sh	0.14%	0.28%
>TOT		11.6%

Unquestionably there is a code size penalty.  This may or may not be an
issue given ROM/RAM constraints in some environments.

>Note the significance of the static numbers: the byte operations are all over
>the place, i.e., the dynamic counts aren't substantial just because they're
>in strcpy or something like that [actually we have tuned routines anyway],
>but because there's partial word code all over the place.

You are so right in pointing out that there is partial word code sprinkled
throughout many existing applications.  As an after-the-fact observation,
I guess that many Am29000 applications will be running "new" code.  Now,
whether or not the coders will know the right things to do (use the fast
library routines, etc.) is not knowable but nonetheless critical.  I guess
that means that we need to print some sort of "Am29000 Programming
Suggestions."

>
>Now, this is an ultra-simplistic analysis, because there are things like:
>write-buffer effects, cache effects, memory system interference,
>pipeline scheduling, etc, etc.  Consider this a first approximation.
>
>Now, a few more examples:
>
>Dhrystone:
>lbu	6.9%	6.9%
>lb	4.7%	4.7%
>lwl	1.2%	1.2%	(unaligned word stuff)
>lwr	1.2%	1.2%
>sb	0.43%	0.8%
>swl	0.14%	0.3%
>TOT		14.1%

But, just a few lines later you'll point out how having a word-oriented
processor-memory channel *helps* (artificially, since dhrystone is
artificial) dhrystone performance.  I'm sorry, but you must stick
to one argument. :-)

>(This has nothing do to do with word-vs-byte, but I ran across it in
>looking at these numbers).
>QUIZ: how many load/stores use 0-displacements off the base register,
>rather than non-zero ones?
>
>ANS: a few were around 50%.
>	most were in the 10-20% range.
>	some were down in the 5-10% range.
>	Dhrystone: 50%
>I.e., Dhrystone uses zero-offset addressing considerably more than
>most programs, although not more than all programs. [Relevant to 29000
>discussion, if you remember how they did things.]

Just in case you are trying to make a subtle intimation:  WE DID NOT
"OPTIMIZE" THE AM29000 ARCHITECTURE FOR ANY PARTICULAR PROGRAM.  The
architecture was pretty much fixed before we had significant simulation
results (I know, I know; that was the wrong way to do things, but we
had no choice).  We *did* add the now-infamous compare-bytes instruction
very late (after we had simulation results).  I wanted the load/store
instructions to have only register-indirect addressing mode from the
beginning, but only for the sake of simplicity and optimization
opportunities.  In the end, we realized that we had done a great thing:
As far as normal instruction execution is concerned, there cannot be
contention between jump and load/store instructions for the TLB.  With
our pipeline, an addressing mode would have been a minor disaster.

>WHEW.  That was a lot of info.  Sorry about that, but architectural
>arguments cannot be settled by intuition.  Note again that these are
>the numbers we get, and you cannot analyze choices in a vacuum,
>so they may or may not be relevant to other architectures and software.

Yes; this is an important point.  Rarely, if ever, does a team implement
in the same technology two versions of a processor with just one variable
(e.g. byte alignment/no byte alignment) changed.  That would be nice.

>In our case, this does say:
>	a) Byte instructions are a substantial win on many real programs.
>	b) Non-zero offsets are frequently-used.

(But less frequently when there is a stack cache.)

>and finally, for everybody:
>	c) Be very, very careful on WHICH benchmarks you use to tune
>	your architecture.  DON'T use Dhrystone.

This is good advice.

Thanks, John, for taking the time to post.

    bcase

phil@amdcad.UUCP (04/14/87)

In article <16122@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>2) I/O system design.  This is clearly not true of all systems, but
>>you run into it often enough. IT IS OFTEN EXCEEDINGLY INCONVENIENT
>>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>>DEVICE CONTROLLERS.  [other stuff]
>
>Agreed, but again, why not solve the problem (with an interface chip
>or design approach) instead of propagating it to the processor-memory
>channel?  I have sympathy for OS people:  I was an OS person for just
>a short while.  The choice between dumb system design and creating
>problems for the OS people when they must deal with older chips/boards
>is a tough one (really, because:  the OS is part of the system design
>too).

If we are talking about trying to use existing controllers, such as
(particularly, actually) Unibus controllers, it's likely they jammed
the 16-bit registers one after another and a 32-bit word machine
will find it hard to cope with these controllers.

If we're talking about building new controllers there's no reason why
you couldn't give each register its own word. It uses a little more
address space but only a few bytes more, nothing really.

What the chips do is irrelevant. You can always set the chip decode
logic up (at the board level) to make the chip think word addresses
are byte addresses. 

>>Thus, we have really precise statistics on what's going on, at least on
>>our machines, at the user-level, for anything from typical UNIX programs
>>(like nroff), to large simulators [spice, espresso],
>>parts of the compiler system [assembler, optimizer, debugger],
>>to benchmarks like whetstone, dhrystone, linpack.
>
>Sigh, I wish we could do such simulations.

Brian, there are plenty of faster machines available at AMD and you
ought to consider using them if CPU time is the only constraint.  Or
(hee hee) buy an R2000 to do simulations on. Limiting yourself to your
existing resources is very silly, I think. (I won't post that Brian
does his simulations on an IBM-PC to avoid embarrassing him.)

-- 
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com

tim@amdcad.UUCP (04/14/87)

In article <16122@amdcad.AMD.COM>, bcase@amdcad.AMD.COM (Brian Case) writes:
> In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
> 
> >Thus, we have really precise statistics on what's going on, at least on
> >our machines, at the user-level, for anything from typical UNIX programs
> >(like nroff), to large simulators [spice, espresso],
> >parts of the compiler system [assembler, optimizer, debugger],
> >to benchmarks like whetstone, dhrystone, linpack.
> 
> Sigh, I wish we could do such simulations.

I think Brian misread the previous paragraph to mean that the MIPS simulator
is able to run these programs in a simulated UNIX environment (i.e.
simulating the entire UNIX kernel), but I see only user-level mentioned, above.
Note that we *are* able to perform such simulations, but only in a single-
tasking, stand-alone environment.

John -- does the MIPS simulator incorporate a simulated UNIX kernel, and have
you performed multiprogramming simulations with it?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.AMD.COM)

hansen@mips.UUCP (04/15/87)

In article <16126@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <16122@amdcad.AMD.COM>, bcase@amdcad.AMD.COM (Brian Case) writes:
> > In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
> > >Thus, we have really precise statistics on what's going on, at least on
> > >our machines, at the user-level, for anything from typical UNIX programs
> > >(like nroff), to large simulators [spice, espresso],
> > >parts of the compiler system [assembler, optimizer, debugger],
> > >to benchmarks like whetstone, dhrystone, linpack.

> > Sigh, I wish we could do such simulations.

> I think Brian misread the previous paragraph to mean that the MIPS simulator
> is able to run these programs in a simulated UNIX environment (i.e.
> simulating the entire UNIX kernel), but I see only user-level mentioned, above.
> Note that we *are* able to perform such simulations, but only in a single-
> tasking, stand-alone environment.

> John -- does the MIPS simulator incorporate a simulated UNIX kernel, and have
> you performed multiprogramming simulations with it?

Yes - for example, we've run the Byte multi-shell benchmark on our simulator.

Actually, we have several MIPS simulators, and in addition to the user-level
simulations, we've been running kernel-level simulations as well.  The
kernel-level simulations take just a little longer, since we use an
instruction-level simulator to generate the address trace.  (The user-level
simulations generate the address trace by an object-code recompilation
technique that is also employed by our profiler.  As a side note, by using
this technique, we can simulate and/or profile any MIPS object code, without
having separate profiling libraries.)

The user-level simulations take into account the effects of multiprogramming
and kernel code execution in simple ways. We have found that our simulations
have matched within better than 5% to the actual run-time as reported by
the csh time command on our M-series systems. Now that we've got all our
MIPS systems connected by NFS, I've been able to run plenty of simulations.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

jesup@steinmetz.UUCP (04/15/87)

[re: discussion on word addressed memory vs. byte addressed w/ alignment net]

	Having an alignment network on chip does not necessarily cost you
in critical path, depending on your design.  In the one design I am familiar
with, the net doesn't cost us anything, even at considerably more than 30 MHz.
It is done at the end of the cycle that latches it onto the chip (if I remember
correctly).  In any case, it is not on the critical path.
	In the other direction, it goes through a network again, and 4 lines
are driven as appropriate (write lines for each byte.)
	According to our figures, load/stores are about 40-50%, with about
2 loads/1 store.
	Lack of direct byte support can (depending on application) cost you
a fair amount.
	It all comes down to the hardware:  If it costs you more cycles
(on average) to add the alignment net than it would cost to synthesize
the byte/halfword load/stores from word load/stores and byte insert/extract
instructions, then don't use the net.  But if it's even, or in favor of the 
net, definitely use the net.  If you don't, you'll need to decode at least
two more instructions.
	Picking numbers out of the air, if the alignment net costs 1 cycle
on loads and stores, and 65% are loads, and 40% of instructions are loads or
stores, and the extra cycle can't be filled 50% of the time, it will cost you:
	40% * 65% * 1 cycle * 50% ~= .1 cycles/instruction
(the extra cycle on a store doesn't block later instructions, just takes
longer for it to get to memory.)
	If you must do byte insert/extract, each of which costs one cycle,
(I assume there are halfword insert extract, otherwise it'll be worse), and
assuming 80% are word, 20% non-word, it will cost you:
	40% * 100% * 1 cycle * 20% ~= .1 cycles/instruction
	If there are destination/source interlocks, and 50% of the interlocks
are fillable, add .5 cycles/instruction to that, making it .6 cycles/
instruction.
	Now, all these numbers are fiction, but they aren't far from the
actual numbers we see in our data (I'm at home now).  From my point of view,
any work that might reduce the penalty of the alignment net to less than
1 cycle is a big win (and as I said, 0 is definitely possible, even above
30MHz, depending on design.)  Also, you reduce the decode complexity (maybe),
by having a smaller number of instructions.  If you can save .1 cycles/
instruction, you should get about 10% performance increase.  Worth a lot
of work and silicon, if you ask me.

	Randell Jesup
	jesup@steinmetz.uucp
	jesup@ge-crd.arpa

mash@mips.UUCP (04/15/87)

In article <16125@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
> ...sequence begun by inconvenience of I/O Controllers and
> word-addressing, with rebuttals, and comments....

>If we are talking about trying to use existing controllers, such as
>(particularly, actually) Unibus controllers, it's likely they jammed
>the 16-bit registers one after another and a 32-bit word machine
>will find it hard to cope with these controllers.
Yes.
>
>If we're talking about building new controllers there's no reason why
>you couldn't give each register its own word. It uses a little more
>address space but only a few bytes more, nothing really.

Yes, this is clearly the thing to do. I've been assuming that part of the
logic behind all of this is to expect AMD to come out with carefully-
designed controller chips that do this [which will help us all anyway].
However, it is sad but true, that when you go to build densely-packed
high-performance systems, your choice is often limited.
I.e., the original posting was a current and near-term reality analysis,
not a "how it should really be" discussion.
>>...

(bcase)
>>Sigh, I wish we could do such simulations.
>
(phil again)
>Brian, there are plenty of faster machines available at AMD and you
>ought to consider using them if CPU time is the only constraint.  Or
>(hee hee) buy an R2000 to do simulations on. Limiting yourself to your
>existing resources is very silly, I think. (I won't post that Brian
>does his simulations on an IBM-PC to avoid embarassing him.)

I'm sure MIPS would be happy to sell Brian a nice M/500 suitable
for running lots of simulations. :-)  Doug VanLeuven is the local sales guy...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

phil@amdcad.UUCP (04/15/87)

In article <305@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>If we're talking about building new controllers there's no reason why
>>you couldn't give each register its own word. It uses a little more
>>address space but only a few bytes more, nothing really.
>
>Yes, this is clearly the thing to do. I've been assuming that part of the
>logic behind all of this is to expect AMD to come out with carefully-
>designed controller chips that do this [which will help us all anyway].
>However, it is sad but true, that when you go to build densely-packed
>high-performance systems, your choice is often limited.
>I.e., the original posting was a current and near-term reality analysis,
>not a "how it should really be" discussion.

I think my point has been overlooked here. The question of whether a
chip's registers appear at byte, 16-bit, or 32-bit boundaries is
outside of the control of the chip designer. The board designer
determines this. 

To be boringly explicit about this, consider a chip with 8 registers.
You'll get 3 address pins (call them CA0, CA1, CA2) to select one of
the 8 registers and a Chip Select line (CS) to select the chip. Now,
if the board designer connects these three address lines to the low 3
address lines on the boards (CA0-BA0, CA1-BA1, CA2-BA2), the registers
will appear at byte boundaries. If the board designer skips the bottom
address line and instead hooks up (CA0-BA1, CA1-BA2, CA2-BA3), the
registers will appear at 16-bit boundaries. Finally, (CA0-BA2,
CA1-BA3, CA2-BA4) will space the registers on 32-bit boundaries. 

There are other problems associated with trying to fit all the needed
functionality into a limited number of pins but register placement is
not one of them. 

If you're not tired of reading about hardware yet, consider the
interface of the Z8530, a dual serial communications controller. Let
us consider just one half of the device, the other half is essentially
identical.  It has only one control "register" at the chip interface
level, (it has a data register too). First, you load the register with
a pointer value in the range 0-15 and then access the actual register,
one of 15. The problem comes in dealing with interrupts. If one comes
in after the pointer is loaded and before the actual register is used,
and the interrupt handler needs to use the SCC, two bad things happen.
1) the IH thinks it is writing a pointer when it is really writing a
register 2) after the IH returns, the routine thinks it is writing a
register but the chip thinks it is receiving a pointer. 

The only way to deal with this is to use a software locking mechanism
to reserve the SCC, since it has this hidden state. Unfortunately,
this makes using a DMA controller rather hard, since it won't respect
any software locks. Rather, the DMA controller must be turned off
before accessing the SCC. 

This interface saves three pins. Even though I design hardware, I
think this is incredibly ugly. 

Let's not talk about write only registers, chips with weird timing
dependencies (the SCC has a cycle recovery time requirement.  When I
first used it, I thought I could just warn the programmers about the
problem. But they don't read the board manual so I put in extra
hardware to hide the cycle recovery time.) and registers with inverted
logic and other atrocities perpetrated on helpless programmers by
narrow-minded or otherwise mis-guided hardware engineers. 

-- 
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com

rpw3@amdcad.UUCP (04/16/87)

In article <305@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16125@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
>>[...sequence begun by inconvenience of I/O Controllers and
>> word-addressing, with rebuttals, and comments....]

[...then some bemoaning the difficulties of interfacing old chips...]

>>If we're talking about building new controllers there's no reason why
>>you couldn't give each register its own word. It uses a little more
>>address space but only a few bytes more, nothing really.

Note that you can do the same thing with the older chips, by using "partial
address decode" in your I/O address decoders. Say you have a chip with an
8-bit data path, and it has 16 internal registers (byte-wide) which are 
addresses with A<3:0>. Simply connect chip D<7:0> to bus D<7:0>, and connect
chip A<3:0> to bus A<5:2>, and there you are -- "word-addressed" registers.

Of course, you'll read garbage on D<31:8>, so if you want to avoid masking
that in software, you should add some gates to drive zeros on D<31:8> whenever
an 8-bit device is being read. This need exist only once per system, and can
be enabled by the central I/O decode system or, if you have plug-in cards with
8-bit I/O on them, can be a common bus line that such devices pull when they
are being read. (Exercise for the reader: Make this work if you have a mix
of 8- & 16-bit devices, without adding any extra "zero" drivers. Hint: You
may have to add another common bus line.)

If you put on your hardware/software tradeoff hat, you'll remember that we
have all been through this before, in one form or another. Nobody "waited"
for Motorola to come out with 68k peripheral chips, we just took our old
favorite Z-80 (or whatever) peripherals and glued them on.

<<begin bit-level digression>>

Sometimes the solutions get really weird, such as once when I had to use
Z-80 PIO chips to build a 16-bit PIO. But I needed to be able to talk
to the Z-80 PIOs separately, since only one set of registers on each was
involved in the 16-bit thing. Solution? Use "unary decoding" in the address
decoder. (Now that address space is really cheap, at least in the I/O sub-
systems, we shouldn't forget this ancient but useful technique.) "Unary decode"
or "unit select" or "linear selection" (it never really had a standard name)
means using a single bit of the address space to select a device (usually
using some high-order binary-decode field to say we were doing this, so as
not to chew up the WHOLE address space!) In this way, if you turn on more
than one address bit at a time (in the affected range), you select ALL of
the addressed devices! (Obviously, not good for reading, unless they drive
different bits on the data bus.)

So for example, let's say we have a 32-bit word machine, and we want
to create a 32-bit parallel register for some I/O function, and for
some godawful reason we want to use Z-80 PIOs to do this, rather than
74F374's. (Look, contrived examples are contrived, o.k.?) So use four
Z-80 PIOs, and put each PIO on a separate byte of the data bus. Then
let's say that address 0x87650000 enables this whole mess as a unit
(that is, we tie up 64K of address space -- no biggy). Then give the
upper unit (the one on D<31:24>) a unit select address of 0x8000, the next
0x4000, and so on. Finally, wire each chip's address <1:0> to bus A<3:2>.
(There is another, even weirder way to do this, described below.)

So to simultaneously access Register 2 of all the chips, use the address
0x8765F008. To access Register 1 of the D<23:16> chip (but no others),
use the address 0x87654004 (but remember that the data is going in and
out on bits <23:16>, so your code has to shift/mask). If you wanted to,
this arrangement could give you three 32-bit registers, or six 16-bit ones,
or twelve 8-bit registers, OR... one 32-bit register plus two 16-bit plus
four 8-bit, etc. (permutations ad nauseam).

If you want to get even weirder (as mentioned above), wire the address
lines of the PIOs to separate bit fields (we've got plenty), for example,
the <31:24> PIO's A<1:0> could be wired to bus A<11:10> (mask 0xC00),
the <23:16> PIO's A<1:0> to A<9:8> (mask 0x300), etc. Then to simultaneously
access register 0 of the high byte, register 1 of the next, register 2 of the
next, and register 3 of the low (<7:0>) byte use address 0x8765F1B0.
[I know, I know, PIOs only have 3 data regs.]

Actually, if I were to do this again, I would probably allocate the addresses
a bit differently, putting the unit select and the register address in a
separate nybble for each PIO, making hex debugging easier. In that case,
you'd use address 0x87658888 to access reg0 of all chips, and 0x876589AB
to get the weird ripple addressing of the previous example.

<<end of bit-level digression>>

Finally, the existence of 8- or 16- or 32-bit "peripheral chips" is not going
to make any difference at all to the deployment of the new 32-bit processors,
since any good-sized system is going to have complete CPUs as I/O processors
anyway, and it's on the I/O processor that such games as the above examples
will be played.

The real factor will be the bandwidth *and* latency of the channel between
the "main" CPU (or CPUs, in an mP system) and the I/O system as a whole.
Note that for development reasons one may choose to use the same processor
chip on the I/O system as for the "main" CPU(s), but it's on the I/O system
that the compatibility interface with 8-bit chips will take place.

[Counter-point: We may find these new 32-bit chips also being used in
"embedded controller" applications for traditional mainframes, where
some of the bit-tweaking above still applies.]


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

henry@utzoo.UUCP (Henry Spencer) (04/18/87)

> I think my point has been overlooked here. The question of whether a
> chip's registers appear at byte, 16-bit, or 32-bit boundaries is
> outside of the control of the chip designer. The board designer
> determines this. 

However, the chip designer is within his rights to put constraints on this
(e.g. "memory-mapped I/O will be in big trouble if you don't put the
registers at 32-bit boundaries, however much this offends your esthetic sense").
Doesn't mean the board designer will listen, of course... and chip-company
management might object on the grounds that the restriction might reduce
sales 0.01%.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

haas@msudoc.UUCP (Paul R Haas) (04/18/87)

In article <16122@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Unquestionably there is a code size penalty.  This may or may not be an
>issue given ROM/RAM constraints in some environments.
>
A code size penalty can become a performance penalty, if a loop won't
fit in the cache.  There is also a cost for the time taken to move the
extra code around on various data paths, delaying other users of that
data path (delayed writes, other processors, dma devices, etc...).

- Paul Haas, haas@msudoc.egr.mich-state.edu ...!ihnp4!msudoc!haas

steckel@alliant.UUCP (Geoff Steckel) (04/19/87)

One point that I believe needs to be emphasized in the word-vs-byte discussion:

Many existing and useful boards (VME, Multibus I & (yuck) II, QBUS, etc, etc,
not excepting IBMPCbus) cannot be used except by a CPU with bus-level byte
addressing.  The life cycle cost of redesigning an entire system of peripherals
just to be able to use the newest CPU seems very high.  If I/O space is
separate from memory space, one can kluge some very clumsy fixes, but memory
mapped peripherals have problems...

Under my OS and I/O hat, I have quite a stock of horror stories about CPUs
with this problem.  Given the actual cost of CPUs vs peripherals, guess which
one got canned when the problems showed up?  Hint - we HAD to keep the
peripherals.
	Geoff Steckel (steckel@alliant.uucp, gwes@wjh12.uucp)