[comp.arch] How to use silicon

schow@bnr-public.uucp (Stanley Chow) (03/19/89)

In a thread discussing what to do with all the transistor sites
becoming available on big chips, various suggestions have been made
about what is good and what is not.

It is not often that I disagree with the likes of Tim Olson and
Henry Spencer, so I make a lot of splash when it happens. (It's
okay, I have my asbestos suit.)

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>| 
>| Again, why?  Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one.  However,
>| the trend today is just the opposite:  the CPUs are outrunning the
>| main memory.  Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive.  Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access.  Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions.  Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>	- Another adder (to increment the address register in parallel)
>
>	- Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

For many people, auto-increment *is* something better!

The point under discussion is that with increasing density, the tendency
is to add complexity to the chips. There can be debate on the trade-offs of
different additions, but I doubt that high-performance chips of
the (near) future will worry about another adder. Adding more
ports to register files is a challenge for the silicon groups, but, hey,
they've got to earn their money too!

Caches (especially data caches) top out very easily. For many Unix-box
type applications, we have already reached the point of vastly diminished
returns. Adding more cache won't bring your performance up.  (Or I could
talk about our application, where the diminishing returns start *before*
you start adding cache.)

It is precisely because the CPU is running faster than memory (even
cached memory) that one has to maximize the amount of work done in
each memory cycle.

Adding address modes does not mean a machine that is difficult to pipeline
or to design. Don't assume that all architectures with many addressing modes
will be as messy as the VAX instruction encoding.
It is quite possible to have a good, clean instruction encoding
that has lots of modes - it just requires lots of gates to be *really*
fast.  Fortunately, lots of gates is exactly where we are headed.

With enough gates, the CPU will get more functional units; this means
small RISC instructions will not be able to keep all the functional
units busy. The i860 solves this by essentially having a short VLIW mode
(hmmm, Short Very Long?). It is also possible to have bigger instructions
that keep more units busy longer. Please note that different implementations
of the architecture can have a super-fast version that does all instructions
in a single clock (at least in instruction dispatch) and also a
cheap version that is (here it comes:) *micro-coded* or whatever.

Just because DEC could/would not do it for the VAX, don't conclude that
the concept is bad.

>
>| As for hardware handling of unaligned data, this is purely a concession
>| to slovenly programmers.  Those of us who have lived with alignment
>| restrictions all our professional lives somehow don't find them a problem.
>| Mips has done this right:  the *compilers* will emit code for unaligned
>| accesses if you ask them to, which takes care of the bad programs, while
>| the *machine* requires alignment.  High performance has always required
>| alignment, even on machines whose hardware hid the alignment rules.
>| Again, why bother doing it in hardware?
>
>The R2000/R3000 can also trap unaligned accesses and fix them up in a
>trap handler.  This is what the Am29000 does, as well.  This is mainly a
>backwards compatibility problem (FORTRAN equivalences, etc.).  It is
>infrequent in newer code, mainly appearing in things like packed data
>structures in communication protocols.
>

It would be more accurate to call this "a concession to past constraints".
Remember, many of these old programs were written in the days when memory
was not cheap and performance was expensive. It is not fair to call the
people "slovenly" just because you now have bigger and faster machines.
(If you were talking about some programmers who didn't know what they were
doing, then I agree with you.)

Even now, there are real money issues in memory alignment. If you have
a system with 100 megabytes of main memory and "correct" alignment makes
it 150 MB, you have just made the system 50% more expensive. Or how about
alignment bumping your memory requirement from 63K to 65K, causing extra chips
and possible board layout problems (not to mention the cost)?

Having the H/W be tolerant of alignment means a lot of flexibility in the
design trade-off.

Also, with more and more gates on a chip, it is conceivable that someone
will put together a cache that can handle misalignment in the cache, as
long as the whole data item is in the same line. I.e., data can cross a
word boundary with little or no penalty, but crossing a line boundary will
be slow or disallowed.  With the trend to wider
buses (i.e., wider line sizes), this may well make the performance penalty
of misalignment negligible.



Stanley Chow  ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
	      (613)  763-2831


Please don't tell Bell Northern Research about these silly ideas; I have
them convinced that I know everything about processor architecture. They
are even paying me to work on it. [If I don't want to tell them, do you
think I could represent them?]

baum@Apple.COM (Allen J. Baum) (03/21/89)

[]
In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>| 
>| Again, why?  Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one.  However,
>| the trend today is just the opposite:  the CPUs are outrunning the
>| main memory.  Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive.  Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access.  Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions.  Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>	- Another adder (to increment the address register in parallel)
>
>	- Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

Well, I'll have to slightly disagree here. Auto-increment does not cost another
adder (for my particular definition of auto-increment); it just writes the
result of the effective address calculation back to the base register. If you 
want to be tricky, you can use a multiplexor to select the memory address to
be the base register itself, or the effective address calculation, giving you
pre- or post-  auto-increment. It does cost an extra writeback port. This
can be finessed, perhaps, by waiting for a cycle not using the writeback port,
but you can't count on it.
   Now, the question is, can loops profitably use this kind of addressing mode?
Or, should you just schedule the address updates in branch and load shadows
because you can't find anything else to put there?
   Note that if you have a superscalar architecture, and can do two inst.
in parallel (see the Intel 80960CA paper in newest Compcon proceedings), you
can do this kind of thing as a matter of course; but it's a lot more expensive
to do it that way- you do need a separate read port and adder, as well as a 
write port. 
   If, in fact, compilers can generate this code (and I believe they
can), and it can be scheduled (i.e. there aren't lots of dead cycles hanging
around just waiting to be filled with these address update instructions),
then it looks like a reasonable tradeoff. It's probably time to dust off those
benchmarks and see how often it occurs, and how many cycles it will save.
   Since this kind of operation is used almost exclusively inside a loop,
it has quite a bit of leverage. Yes, instruction caching is most effective
there, but that just means it won't cost you additional cycles, above and 
beyond the separate update instruction, not that it won't save you any cycles.
   Besides, who says you can't find something else to
do with the extra write port when you're not doing address updates?
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

bcase@cup.portal.com (Brian bcase Case) (03/21/89)

>..., but I doubt that high-performance chips of
>the (near) future will worry about another adder. Adding more
>ports to register files is a challenge for the silicon groups, but, hey,
>they've got to earn their money too!

Right, another adder is not a problem.  Right, more ports on register
files is not too much of a problem, but is adding a port just so that
auto-increment goes fast the right thing?  I say no.  Keep reading.

>Caches (especially data caches) top out very easily. For many Unix-box
>type applications, we have already reached the point of vastly diminished
>returns. Adding more cache won't bring your performance up.  (Or I could
>talk about our application, where the diminishing returns start *before*
>you start adding cache.)

Maybe so, but I think that we have a little farther to go than 4K
instruction and 8K data, virtual no less.  Even if we have 2 million
transistors, I don't think the caches are going to be "too big."

>It is precisely because the CPU is running faster than memory (even
>cached memory) that one has to maximize the amount of work done in
>each memory cycle.

"Running faster than memory" is a very misleading statement.  Maybe
latencies are a problem, but bandwidth can be had in abundance.  The
problem is what to do with it, not how to get it!

>Adding address modes does not mean a machine that is difficult to pipeline
>or to design. Don't assume that all architectures with many addressing modes
>will be as messy as the VAX instruction encoding.

It's true that not all architectures with many addressing modes will be as
messy as the VAX.  However, you simply have to answer the question:  are
those addressing modes, beyond register+register and register+offset,
really buying you anything?  (BTW, the lack of even those doesn't seem
to cripple the 29K too much.... If you dislike the 29K because of its
lack of addressing modes, blame me.  :-)

>With enough gates, the CPU will get more functional units; this means
>small RISC instructions will not be able to keep all the functional
>units busy. The i860 solves this by essentially having a short VLIW mode

I thought RISC instructions were too big!  :-) :-)  Note that the i860's
dual-instruction mode is essentially a VLIW mode.

>(hmmm, Short Very Long?). It is also possible to have bigger instructions
>that keep more units busy longer. Please note that different implementations
>of the architecture can have a super-fast version that does all instructions
>in a single clock (at least in instruction dispatch) and also a
>cheap version that is (here it comes:) *micro-coded* or whatever.

I claim a much better use of multiple functional units is to execute many
"small" RISC instructions at the same time, i.e., "super scalar" or
multiple instructions per clock.  It just doesn't make sense to bundle,
bind is a better word, many operations into one instruction.  Doing so
simply thwarts compiler optimization.  Addressing modes are probably the
worst form of semantic binding, in my opinion.  So, if we are going to
have "too many transistors," we should use them to realize a superscalar
*implementation*, not a complex *architecture.*

>Even now, there are real money issues in memory alignment. If you have
>a system with 100 megabytes of main memory and "correct" alignment makes
>it 150 MB, you have just made the system 50% more expensive. Or how about
>alignment bumping your memory requirement from 63K to 65K, causing extra chips
>and possible board layout problems (not to mention the cost)?

If you can't afford to go to 150 Mbytes (or more likely, paging) or you
can't afford to go to 65K of RAM (try getting such a small amount, I 
challenge you), then performance must not be the most important thing.
By all means, then, you should allow un-naturally aligned data and you
can handle it in hardware or software, as you wish.  If performance is
your first priority, which it pretty much is in everything but the
cheapest embedded systems (which also accounts for the highest volume!),
then you *don't* want to allow un-natural alignment.

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-off.

?????

>Also, with more and more gates on a chip, it is conceivable that someone 
>will put together a cache that can handle misalignment in the cache, as
>long as the whole data item is in the same line. I.e., data can cross

The problem is not implementing hardware handling of misalignment; the
problem is the performance implication.  A mis-aligned load/store takes
two accesses; a good compiler or programmer will know this and align the
access whether the hardware can handle it or not.  So what's the point
of having the hardware?  If data must be packed as tightly into memory
as possible, then fine, but you must know that you are giving up
performance.  At this point, performance no longer is the first priority,
so handling it in software is probably acceptable (with simple primitives
like those of the MIPS processor, e.g.).

>word boundary with little or no penalty, but crossing a line boundary will
>be slow or disallowed.  With the trend to wider
>buses (i.e., wider line sizes), this may well make the performance penalty
>of misalignment negligible.

I'm not sure how feasible it is to force the compiler/programmer to know
whether or not data is going to cross a cache line.  Things like
dynamically-allocated data structures might be a problem; this would
take some thought.  But, without much thought needed, I do know that the
hardware needed to permit misaligned access within a cache line is likely
to make cache access slower.  Since cache access is probably the limiting
stage in integer pipelines (maybe not in FP, but maybe), this is not a
good idea.

stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/21/89)

In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
> ...lotsa stuff deleted.
> 
>Caches (especially data caches) top out very easily. For many Unix-box
>type applications, we have already reached the point of vastly diminished
>returns. Adding more cache won't bring your performance up.  (Or I could
>talk about our application, where the diminishing returns start *before*
>you start adding cache.)

Before you start giving away the on-chip caches, let's get them to a respectable
size first.  The average on-chip cache we are seeing now is 512 to 1K bytes
for both D and I.  Hit rates of 80% for the I cache and 40-50% for the D
cache are prevalent.  We still have a way to go in this arena from what
I can see. (Yeah, I know the 88200 has 16K, but it isn't the same chip
as the 88100...;-)

Caches are a proven method of keeping the off-chip accesses to a minimum
and we still haven't been able to put really large caches on the same
chip as the CPU.  Let's really saturate this path first...

Steve Wilson

These are my opinions, not those of my employer. 

w-colinp@microsoft.UUCP (Colin Plumb) (03/21/89)

schow@bnr-public.UUCP (Stanley Chow) wrote:
> [A lot of things about addressing modes I agree with]

> Even now, there are real money issues in memory alignment. If you have
> a system with 100 megabytes of main memory and "correct" alignment makes
> it 150 MB, you have just made the system 50% more expensive. Or how about
> alignment bumping your memory requirement from 63K to 65K, causing extra chips
> and possible board layout problems (not to mention the cost)?
> 
> Having the H/W be tolerant of alignment means a lot of flexibility in the
> design trade-off.

And a lot of headaches in the cache miss and page fault recovery departments.

In any structure, if you rearrange the components, you can lose at most
n-1 bytes to padding, where n is the strictest alignment restriction.  For
most processors, the worst case is a double and a char, 7 bytes out of 16
wasted.  But if this is a major concern, rewrite the code to use two parallel
arrays.  You'll waste at most 7 bytes total (in your 100Meg).

In C, this is a bit of a bother, but not too bad.  I think requiring alignment
is one thing that'll never go out of style.  On any chip, you want to do it
because it's more efficient, anyway.  The only need for unaligned accesses is
to handle old data formats, which presumably need old programs run on them,
which will (except in pathological cases) run faster on the new machine
anyway.
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor

henry@utzoo.uucp (Henry Spencer) (03/22/89)

In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>With enough gates, the CPU will get more functional units; this means
>small RISC instructions will not be able to keep all the functional
>units busy....

Right, we will find things like VLIW getting more popular, assuming that
the compiler technology is up to it.  (It's not clear that Intel's is.)
However, we will *not* find dedicated adders being thrown in just for
address arithmetic -- we will find extra ALUs that *can* be used for
that but can also be used for other things.  We will not find autoincrement
addressing modes coming back, but we may find VLIWish machines that can
do the memory access and address-register increment in the same cycle,
as two separate operations independently controlled by the program.
Autoincrement addressing modes simply aren't a worthwhile investment.

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-off.

It's just as easy to have the software cope with it (either directly or
via the compiler generating special code) in the rare cases where it is
needed.  This lets occasional needy software use it, *without* investing
any hardware complexity in it.

>Also, with more and more gates on a chip, it is conceivable that someone 
>will put together a cache that can handle misalignment in the cache...

Sure.  But who would *bother*?  It's just not worth it.
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bcase@cup.portal.com (Brian bcase Case) (03/22/89)

>Well, I'll have to slightly disagree here. Auto-increment does not cost another
>adder (for my particular definition of auto-increment); it just writes the
>result of the effective address calculation back to the base register. If you 

Yes, this is the i860's way of autoincrement for the floating-point memory
reference instructions.

>   Note that if you have a superscalar architecture, and can do two inst.
>in parallel (see the Intel 80960CA paper in newest Compcon proceedings), you
>can do this kind of thing as a matter of course; but it's a lot more expensive
>to do it that way- you do need a separate read port and adder, as well as a 
>write port. 

Exactly my point about superscalar.  But note that for the expense of the
added data path (I assume it is essentially a duplicate of the primary
integer (and/or) floating-point pipe), you can now execute *any two*
instructions that don't have deleterious dependencies.  Sure, adding only
the hardware needed for auto-increment is cheaper, but do you really want
that garbage in your architecture forever?  When you do go to a super-scalar
implementation (and you will, whoever you are, just to keep up
with the joneses), you now have two data paths that have the added
complexity of auto-increment.  Super-scalar is a good argument for simple
architectures.

>It's probably time to dust off those
>benchmarks and see how often it occurs, and how many cycles it will save.

Well, I'm all for simulation and experimentation.  If it is better and the
cost now and in the future is not prohibitive, then great.  But it certainly
isn't clear that auto-increment is the right thing!  My position is that it
is reasonably clear that one should be skeptical.

>   Since this kind of operation is used almost exclusively inside a loop,
>it has quite a bit of leverage.

Yes, this is true.  This is why one would like to look at the idea seriously.

>   Besides, who says you can't find something else to
>do with the extra write port when you're not doing address updates?

I'm surprised to hear you say that!  I think a more realistic outlook is
to say that "Besides, who says you *can* find something else to do with
the extra write port."  Conjecturing, instead of proving, that an added
feature (with a significant cost) can be used for something else does not
constitute the rigorous pursuit of good computer architecture.  Shame!  :-)
:-) :-) :-) :-)

stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (03/22/89)

In article <16058@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>I claim a much better use of multiple functional units is to execute many
>"small" RISC instructions at the same time, i.e., "super scalar" or
>multiple instructions per clock.  It just doesn't make sense to bundle,
>bind is a better word, many operations into one instruction.  Doing so
>simply thwarts compiler optimization.  Addressing modes are probably the
>worst form of semantic binding, in my opinion.  So, if we are going to
>have "too many transistors," we should use them to realize a superscalar
>*implementation*, not a complex *architecture.*

If anything, VLIW is more RISCy than RISC in the sense that it exposes
all of the functional units' pipelines to the compiler.  You just can't
say VLIW in the same sentence as "simply thwarts compiler optimization."
You're denying the basis of going to VLIW in the first place.  The compiler
has the option to "optimize" along straight-line code just as you suggest,
or, in the innermost loop of an application, you can have multiple
instances of the same loop running simultaneously.  All due to the compiler!
If anything, you've got more opportunity for compiler optimizations, not
less.

Steve Wilson

The above opinions are mine, not those of my employer.

schow@bnr-public.uucp (Stanley Chow) (03/22/89)

In article <2160@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <355@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>> ...lotsa stuff deleted.
>> 
>>Caches (especially data caches) top out very easily. For many Unix-box
>>type applications, we have already reached the point of vastly diminished
>>returns. Adding more cache won't bring your performance up.  (Or I could
>>talk about our application, where the diminishing returns start *before*
>>you start adding cache.)
>
>Before you start giving away the on-chip caches, let's get them to a respectable
>size first.  The average on-chip cache we are seeing now is 512 to 1K bytes
>for both D and I.  Hit rates of 80% for the I cache and 40-50% for the D
>cache are prevalent.  We still have a way to go in this arena from what
>I can see. (Yeah, I know the 88200 has 16K, but it isn't the same chip
>as the 88100...;-)
>
>Caches are a proven method of keeping the off-chip accesses to a minimum
>and we still haven't been able to put really large caches on the same
>chip as the CPU.  Let's really saturate this path first...
>
>Steve Wilson
>
>These are my opinions, not those of my employer. 

Different people have different ideas of what a "respectable" cache is.
Most Unix-type programs/filters/... are pretty happy with small caches
(by small, I mean <64K). This is of course how RISC (and CISC) workstations
get their performance.

Someone mentioned (in an article about vector processing and the i860,
I think) that the 32 MB cache in an ETA machine is not enough for some
applications.

Realistically, I do not foresee on-chip cache being expanded to a megabyte
anytime soon. Going from 512 to 4K to 16K to 64K will buy performance for
some (even many) applications, but other applications will still be killed
by the miss rate.

I am suggesting that instead of spending the silicon real estate on bigger
caches, other options may be open. In particular, I am suggesting that
more complex instructions can be a useful way to decrease the bandwidth
demand on the cache/memory system. John Mashey talks about the almost
constant bandwidth needed per MIP. Caches are a way to up the perceived
bandwidth of the memory. I am suggesting that silicon can be used to
decrease the demand.


Stanley Chow   ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
		 (613) 763-2831

Cache? Don't carry it, I use plastic money.

[This way, my employer knows I have to go to conferences and courses
to learn about this stuff. Unfortunately, this means they don't let
me represent them either.]

mash@mips.COM (John Mashey) (03/22/89)

In article <1989Mar21.194914.3284@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
...
>Autoincrement addressing modes simply aren't a worthwhile investment.
Really, this isn't an absolutely obvious conclusion.   There are certainly
instruction encodings, pipelines, register file designs where the incremental
cost of this might be almost zero, in which case one would certainly put it in.
Of course, there might be later implementations where you'd be sorry.
It's just like anything else: you have to simulate it and see if the cost
is worth it.  Depending on what you already have, it might or might not
be worth it. At least HP thought it was OK, I think (HP PA).
...
>>Having the H/W be tolerant of alignment means a lot of flexibility in the
>>design trade-off.
...
>It's just as easy to have the software cope with it (either directly or
>via the compiler generating special code) in the rare cases where it is
>needed.  This lets occasional needy software use it, *without* investing
>any hardware complexity in it.

The alignment issue is MUCH nastier and more impactful on a design
than auto-increment because it's much more likely to complexify critical paths,
the pipeline, and the cache design.  (Yes, we certainly prefer to provide
a few primitives and let the software do it!)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

baum@Apple.COM (Allen J. Baum) (03/23/89)

[]
In article <16080@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
(after quoting me about auto-increment, its cost, etc. ....)

>Exactly my point about superscalar.  But note that for the expense of the
>added data path (I assume it is essentially a duplicate of the primary
>integer (and/or) floating-point pipe), you can now execute *any two*
>instructions that don't have deleterious dependencies.  Sure, adding only
>the hardware needed for auto-increment is cheaper, but do you really want
>that garbage in your architecture forever?  When you do go to a super-scalar
>implementation (and you will, whoever you are, just to keep up
>with the joneses), you now have two data paths that have the added
>complexity of auto-increment.  Super-scalar is a good argument for simple
>architectures.

If auto-increment is frequent enough, then it can be done in addition to
executing *any two* operations at once. The leverage really hits you- a
10 inst. loop, including a couple of auto-incs, shrinks to 5 if you can
average two instructions/cycle. At very little cost in hardware (I assert
this as a hardware design type), maybe this shrinks to 4 insts., a 20%
saving. Try to get 20% some other way- it's real tough! Your mileage may
vary, of course.

>>It's probably time to dust off those
>>benchmarks and see how often it occurs, and how many cycles it will save.
>
>Well, I'm all for simulation and experimentation.  If it is better and the
>cost now and in the future is not prohibitive, then great.  But it certainly
>isn't clear that auto-increment is the right thing!  My position is that it
>is reasonably clear that one should be skeptical.

Um, that was my point also, although perhaps I lean towards less skeptical than
you.

>>   Since this kind of operation is used almost exclusively inside a loop,
>>it has quite a bit of leverage.
>
>Yes, this is true.  This is why one would like to look at the idea seriously.
>
>>   Besides, who says you can't find something else to
>>do with the extra write port when you're not doing address updates?
>
>I'm surprised to hear you say that!  I think a more realistic outlook is
>to say that "Besides, who says you *can* find something else to do with
>the extra write port."  Conjecturing, instead of proving, that an added
>feature (with a significant cost) can be used for something else does not
>constitute the rigorous pursuit of good computer architecture.  Shame!  :-)
>:-) :-) :-) :-)

I did have something in mind for that hardware. I dispute the significant cost
issue- it is roughly equivalent to register scoreboarding logic, and if you
have that, the additional cost is small (again, I assert this in my capacity
as a hardware design type that has gone through the exercise). I didn't
conjecture that it might be used for something else, I know it can, and I
know the kind of speedup it will give me, as well as the extra cost to use
it for that something else. This is an exercise for the reader- Part A: what
can an extra write port to a register file be used for (and what other hardware
is required to make it useful)? Part B: Now, suppose this extra write port can
be a read/write port?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

bcase@cup.portal.com (Brian bcase Case) (03/23/89)

>>It just doesn't make sense to bundle,
>>bind is a better word, many operations into one instruction.  Doing so
>>simply thwarts compiler optimization.  Addressing modes are probably the
>>worst form of semantic binding, in my opinion.
>
>If anything, VLIW is more RISCy than RISC in the sense that it exposes
>all of the functional units' pipelines to the compiler.  You just can't
>say VLIW in the same sentence with "simply thwarts compiler optimization."

I didn't!!!!  I'm sorry if there was some confusion.  My reply, from which
the above quote is taken, was to a plea for auto-increment.  I would
certainly never say that VLIW thwarts compiler optimization.  I'm sorry
if you misunderstood, but I was not knocking "semantic bundling" in the
context of VLIW, but in the context of complex addressing modes and the
like.  VLIWs are sorta like super scalars, and for that similarity, I
like them.

dps@halley.UUCP (David Sonnier) (03/24/89)

In article <21931@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>
>The cost is certainly far from 0 if an incrementation has to be "undone"
>when a conditional branch is taken.
>
>   Norm

This is actually the critical point.  It is very difficult to make
auto-increment idempotent.  However, the branch case is not the critical
case (the compiler could handle that).  The critical case is for exception
processing.  In PDP-11 unix(TM), a great deal of software effort is required
to un-increment registers in case of exceptions.  The VAX adds a great
deal of hardware to solve the same problem.  The MIPS Rn000 on the other
hand, simply defines all instructions to be idempotent, which means that
if you get an exception, you can simply restart the instruction.

With interrupt processing time being such a critical path in the system,
I think not having auto-increment is a definite win (IMHO).

I think that the most important thing RISC has done for us is to move
system design from "gut feeling" and "neat features" towards real
engineering.  Measure the cost of each feature and quantify the
trade-offs.
-- 
David Sonnier @ Tandem Computers, Inc.  14231 Tandem Blvd.  Austin, Texas  78728
...!{rutgers,ames,ut-sally}!cs.utexas.edu!halley!dps        (512) 244-8394

baum@Apple.COM (Allen J. Baum) (03/24/89)

[]
>In article <21931@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>>In article xxx henry@utzoo.uucp (Henry Spencer) writes:

>>>Autoincrement addressing modes simply aren't a worthwhile investment.
>
>>Really, this isn't an absolutely obvious conclusion.   There are certainly
>>intruction encodings, pipelines, register file designs where the incremental
>		       ^^^^^^^^^                                  ^^^^^^^^^^^
>>cost of this might be almost zero,in which case one would certainly put it in
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>The cost is certainly far from 0 if an incrementation has to be "undone"
>when a conditional branch is taken.
>
>   Norm

Why would an autoincrement be undone on a branch? Do you mean on a trap?
Besides, who says that the hardware cost for undoing the autoincrement is
significant? It could be that the register wasn't actually updated until
all possible traps have been evaluated, in which case the cost IS zero.
His statement stands as it is written: there ARE pipelines where the
incremental cost IS zero (oh yes, not counting that pesky extra register
file port. That's when you have to decide if it is worth it).
 It seems to me that optimizing compilers can figure out when to use 
autoincrement, so by that (RISC) criterion, it shouldn't be counted out.
Only the statistics should count it out, and I ain't seen none (in either
direction, to be sure).

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

bcase@cup.portal.com (Brian bcase Case) (03/24/89)

>If auto-increment is frequent enough, then it can be done in addition to 
>executing *any two* operations at once. The leverage really hits you- a
>10 inst. loop, including a couple of auto-incs., shrinks to 5 if you can
>average two instructions/cycle. At very little cost in hardware (I assert
>this as a hardware design type), maybe this shrinks to 4 insts., a 20%
>saving. Try to get 20% some other way- it's real tough! Your mileage may
>vary, of course.

Allen, you got me.  (This is quite fair since I said the same thing, "try
getting 20% some other way," about something I felt strongly about in an
internal report when Allen and I worked for the same company!)  You are
quite right, if it really does make 20% difference.  However, "your mileage
may vary" is the right caveat:  can that loop really be executed at 2 inst.
per cycle if some of the parallelism is taken away by adding the autoinc?
I don't know the answer, I am just in violent agreement with you (and
John Mashey):  you must simulate and measure and think, and then be able
to predict the future :-).
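The loop being argued over is the familiar C copy idiom; here is a minimal
sketch (the function names are my own, for illustration). A compiler for a
machine with post-increment addressing (68K, VAX) can fold the pointer bump
into the load/store in the second form; on a RISC the increment is a separate
add, hidden in a load-delay slot only if the scheduler gets lucky.

```c
/* Two spellings of the same copy loop. With an auto-increment
   addressing mode the second form folds the pointer bump into the
   memory access; on a RISC each bump is a separate add instruction. */

void copy_indexed(int *dst, const int *src, int n) {
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];          /* address = base + 4*i, plus the i++ */
}

void copy_autoinc(int *dst, const int *src, int n) {
    while (n-- > 0)
        *dst++ = *src++;          /* load/store with post-increment */
}
```

Whether the second form really buys the claimed 20% depends, as noted, on
whether the freed slots could have issued other useful work.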

>I did have something in mind for that hardware. I dispute the significant cost
>issue- it is roughly equivalent to register scoreboarding logic, and if you
>have that, the additional cost is small (again, I assert this in my capacity
>as a hardware design type that has gone through the exercise). I didn't
>conjecture that it might be used for something else, I know it can, and I
>know the kind of speedup it will give me, as well as the extra cost to use
>it for that something else. This is an exercise for the reader- Part A: what
>can an extra write port to a register file be used for (and what other hardware
>is required to make it useful)? Part B: Now, suppose this extra write port can
>be a read/write port?

Oh, if you already have the answer, what it will also speed up, then
I stand corrected.  Maybe I am thinking about a different set of
implementation trade-offs.  What is the answer to your exercise?  Does
it have anything to do with loads/stores?  I don't
mean to say that autoincrement is *absolutely* wrong, I don't know all
the possible implications for every architectural-cross-implementation
approach.  But without proof that it is good, I tend to be skeptical.  (I
guess you can tell!  ;-) :-).  Will/can you say how it fits in and why
it is very good to have?

news@bnr-fos.UUCP (news) (03/24/89)

In article <13@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>schow@bnr-public.UUCP (Stanley Chow) wrote:
>
>In any structure, if you rearrange the components, you can lose at most
>n-1 bytes to padding, where n is the strictest alignment restriction.  For
>most processors, the worst case is a double and a char, 7 bytes out of 16
>wasted.  But if this is a major concern, rewrite the code to use two parallel
>arrays.  You'll waste at most 7 bytes total (in your 100Meg).
>
 
I only wish this were true. Many, many applications have "natural" data
structures that are inconvenient to align. Using parallel arrays means more
obscure code and more indexing time. There are of course other problems in
multi-processing (like keeping different bits of a word from being written
by different processors or processes).
 
Typically, the worst problems come from many copies of a small data structure
that is not a nice multiple of the word size. A million copies of a 33-bit
structure wastes about 4 megabytes in padding alone; it is not possible to
pack them together.
 
For example, gate level simulation typically has an array of gate description
with connectivity and state information. The natural (or logically clear)
ordering of the fields is probably not the most compact ordering.
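Plumb's padding arithmetic can be sketched in a few lines of C. The struct
names and sizes below are illustrative and assume a typical machine with
8-byte doubles aligned on 8-byte boundaries; the C standard guarantees none
of this.

```c
/* Padding under typical alignment rules: 8-byte doubles, 8-byte aligned. */
#include <stddef.h>

struct mixed { char tag; double val; };       /* 1 + 7 pad + 8 = 16 bytes */
struct reordered { double val; char tag; };   /* still 16 in an array:
                                                 7 trailing pad bytes so the
                                                 next element stays aligned */

/* Parallel arrays: the same 100 records with no per-element padding. */
enum { N = 100 };
static double vals[N];   /* 800 bytes */
static char   tags[N];   /* 100 bytes: 900 total vs. 1600 for an array
                            of struct mixed */
```

As Chow notes, the parallel-array trick trades the padding for an extra base
register or index calculation per access, which is exactly the cost being
debated.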
 
 
>In C, this is a bit of a bother, but not too bad.  I think requiring alignment
>is one thing that'll never go out of style.  On any chip, you want to do it
>because it's more efficient, anyway.  The only need for unaligned accesses is
>to handle old data formats, which presumably need old programs run on them,
>which will (except in pathological cases) run faster on the new machine
>anyway.
 
Ah, but this is precisely the point. Many old programs *need* misaligned
accesses. If you don't allow that, the old programs will not run at all!
 
Incidentally, the historical trend is to be progressively more tolerant of
misalignment, e.g. the IBM /360 and /370 and Motorola 68K families. The
"tolerant" machines still attach a *penalty* to misalignment, though. It is
only the very recent crop of so-called RISC chips that requires alignment
again. (Please note
that I said historical *trend*, not *all* CPU families.)
 
 
 
 
 
------------------
 
 
 
In article <16058@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
  ('>>' is Case quoting from my article)
>>Caches (especially Data-cache) top out very easily. For many Unix-box
>>type application, we have already reached the point of vastly-diminished
>>return. Adding more cache won't bring your performance up.  (Or I could
>>talk about our application where the diminishing return starts *before*
>>you start adding cache.)
>
>Maybe so, but I think that we have a little farther to go than 4K
>instruction and 8K data, virtual no less.  Even if we have 2 million
>transistors, I don't think the caches are going to be "too big."
>
 
The point is that for small applications, the existing workstations are already
getting hit-rates in the high 90's. Some big applications will thrash any
cache, no matter how big. Some applications will only be happy with megabyte
caches.
 
I am not saying the caches will be too big, I am saying that there are
different kinds of applications:
  - the small ones that are already running very fast with a 64K cache
  - the medium ones that will run faster with a bigger (say 256K) cache
  - the large ones that will not run any faster until you get to 128M
 
By putting more cache on chip (and this obviously depends on the board-level
cache and memory system), the small ones will not speed up, the large ones
will not speed up. Only a portion of the medium applications will run faster
by varying amounts.
 
Depending on which applications you care about, the bigger cache may or may
not be worth it to you.
 
 
 
>>It is precisely because the CPU is running faster than memory (even
>>cached memory) that one have to maximize the amount of work done in
>>each memory cycle.
>
>"Running faster than memory" is a very misleading statement.  Mabye
>latencies are a problem, but bandwidth can be had in abundance.  The
>problem is what to do with it, not how to get it!
 
By "the CPU running faster than memory", I mean that within the current
semiconductor and PCB processes, we can build:
 
  - very fast ALU functions.
  - very fast on-chip register files.
  - not so fast on-chip cache (I & D).
  - slow off-chip (but on-board) memory.
  - very slow off-board memory.
 
As a result, the problem is getting instructions into the pipeline, not
executing them. Latency is a very bad problem, bandwidth is merely
a bad problem.
 
I'd like to know why you think "bandwidth can be had in abundance".
 
 
 
>
>>Adding address modes does not mean a difficult machine to pipeline or
>to design. Don't assume that all architectures with many addressing modes
>>will be as messy as the VAX instruction encoding.
>
>It's true that not all architectures with many addressing modes will be as
>messy as the VAX.  However, you simply have to answer the question:  are
>those addressing modes, beyond register+register and register+offset,
>really buying you anything?
 
That is of course the $64K question. As other people have pointed out, the
question is worth at least serious simulation. I suspect everyone will come
up with different answers anyway, but it seems to me premature to dismiss
it.
 
 
 
>
>>With enough gates, the CPU will get more functional units, this means
>>small RISC instructions will not be able to keep all the functional
>>units busy. The i860 solves this by essentially have a short VLIW mode
>
>I thought RISC instructions were too big!  :-) :-)  Note that the i860's
>dual-instruction mode is essentially a VLIW mode.
>
Do I hear an echo here? :-)
 
 
>
>I claim a much better use of multiple functional units is to execute many
>"small" RISC instructions at the same time, i.e., "super scalar" or
>multiple instructions per clock.  It just doesn't make sense to bundle,
>bind is a better word, many operations into one instruction.  Doing so
>simply thwarts compiler optimization.  Addressing modes are probably the
>worst form of semantic binding, in my opinion.  So, if we are going to
>have "too many transistors," we should use them to realize a superscalar
>*implementation*, not a complex *architecture.*
 
Since you believe there is lots of bandwidth to burn, your conclusions above
are quite logical. Since I believe there is insufficient bandwidth now, I
disagree.
 
 
>
>>Even now, there are real money issues in memory alignments. If you have
>>a system with a 100 MegaBytes main memory and "correct" alignment makes
>>it 150 MB, you have just made the system 50% more expensive. Or how about
>>alignment bumps your memory requirement from 63K to 65K causing extra chips
>>and possible board layout problems (not to mention the cost)?
>
>If you can't afford to go to 150 Mbytes (or more likely, paging) or you
>can't afford to go to 65K of RAM (try getting such a small amount, I
>challenge you), then performance must not be the most important thing.
>By all means, then, you should allow un-naturally aligned data and you
>can handle it in hardware or software, as you wish.  If performance is
>your first priority, which it pretty much is in everything but the
>cheapest embedded systems (which is also accounts for the highest volume!),
>then you *don't* want to allow un-natural alignment.
>
 
Am I the only one who worries about production cost and performance/cost
ratios? I thought only the US government gets to go for maximum performance
at any cost.
 
If I can put out a product at maximum performance with 150 Mbytes or another
product at 95% performance with 100 Mbytes (thereby costing only 75%), how
do you think the decision will go?
 
Performance may or may not be "the most important" criterion, but I have never
worked on a project where performance is the *only* criterion.
 
 
 
From: schow@bnr-public.uucp (Stanley Chow)
Path: bnr-public!schow

Stanley Chow ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
	     (613) 763-2831


I do not represent anyone except myself. Even then, I don't often let me
represent myself.

aglew@mcdurb.Urbana.Gould.COM (03/25/89)

>With interrupt processing time being such a critical path in the system,
>I think not having auto-increment is a definite win (IMHO).
>-- 
>David Sonnier@Tandem Computers, Inc. 14231 Tandem Blvd. Austin, Texas  78728
>...!{rutgers,ames,ut-sally}!cs.utexas.edu!halley!dps        (512) 244-8394

What data can people provide that says that interrupts are a critical
path in the system?
     That didn't come out quite right - how do you distinguish between 
[A] the work that has to be done *some* time, that you can choose to be done at
interrupt time or elsewhere, and [B] the work that is inherent in
processing interrupts, and can be minimized by making the interrupt 
dispatch/return more efficient?
     Several comments to the effect that "the work the hardware does
in interrupt dispatch is only a minuscule fraction of our interrupt
bottleneck" make me think that [A] is the bottleneck.
     If so, then how important would a bit of work in the
interrupt handler, to decode and back up autoincremented registers, really be?
I.e., how much does a [B] << [A] really matter?  (Especially if you have an
instruction set that is easy to decode, unlike the VAXes', so the 
autoincrement would be easy to back up (BTW, on Gould machines we had 
to do a little bit of instruction decoding to restore state on an
interrupt (usually just +/- 2 or 4 on the PC) -- it was fast and easy)).


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

alan@rnms1.paradyne.com (0000-Alan Lovejoy(0000)) (03/27/89)

An idea:

Suppose every machine instruction for some hypothetical machine were of 
the form <load/store op><non-load/store op> Ra, Rb, Rc, Rx, Ry, Rz, where
Ra is the source/destination of a load/store, Rb is a base address, Rc is
an optional index, Rx is the destination of some operation (e.g., addition)
and Ry and Rz are operands.  For example:

LD.B/AND.L R1, (R2), (R3), R4, R4, R5; R1.B := *(R2 + R3), R4.L &= R5.L

Such instructions name six registers.  If there are 32 (visible) registers,
then 30 bits are required per instruction just to specify all the registers.
If all instructions were...say...forty-eight bits long, then this machine
should achieve 25% more work per instruction bit than conventional RISCs.
If the machine fetches 48 bits of instruction per cycle, and can execute 
both the load/store and register-data operations in one cycle, then it
also gets UP TO twice as much work done per cycle as conventional RISCs.
Since the ALU operations and the data access function can easily be made
to operate in parallel, executing such instructions in one cycle should
not be difficult.

Given the frequency of load/store operations, this could be a very big win.
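The register-count arithmetic can be checked with a toy encoder. The field
layout below (9 bits for each opcode, 5 bits for each of the six registers)
is my own assumption for illustration; the proposal only fixes the six
register fields at 30 bits total of a 48-bit word.

```c
/* Assumed 48-bit layout, packed in the low bits of a 64-bit word:
   [op1:9][op2:9][Ra:5][Rb:5][Rc:5][Rx:5][Ry:5][Rz:5]  (9+9+30 = 48) */
#include <stdint.h>

typedef uint64_t insn48;   /* only the low 48 bits are used */

static insn48 encode(unsigned op1, unsigned op2,
                     unsigned ra, unsigned rb, unsigned rc,
                     unsigned rx, unsigned ry, unsigned rz) {
    return ((insn48)op1 << 39) | ((insn48)op2 << 30)
         | ((insn48)ra  << 25) | ((insn48)rb  << 20) | ((insn48)rc << 15)
         | ((insn48)rx  << 10) | ((insn48)ry  << 5)  |  (insn48)rz;
}

/* Extract one 5-bit register field at the given bit position. */
static unsigned reg_field(insn48 w, int shift) {
    return (unsigned)((w >> shift) & 0x1f);
}
```

With 32 registers, the six fields consume 30 bits, leaving 18 bits to split
between the two opcodes however the designer likes.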

Comments? 

Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
__American Investment Deficiency Syndrome => No resistance to foreign invasion.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

bcase@cup.portal.com (Brian bcase Case) (03/29/89)

>If all instructions were...say...forty eight bits long, then this machine
>should achieve 25% more work per instruction bit than conventional RISCs.

Two comments:  non-power-of-two-sized instructions are not a good idea
(you can't fit an integral number into a page, unless your pages are
non-power-of-two-sized, and that is a *bad* idea).  Also, you should
look at Wulf's WM machine; his architecture is designed along the same
lines, i.e., two things per instruction, but WM has 32-bit instructions
and two ALU ops per instruction and data streaming (i.e., loads and
stores happen without an explicit initiating load or store instruction).
Bill, if you're out there, what's the status of the WM machine?
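The page-size objection is simple arithmetic, sketched below (4 KB pages are
an assumption; the argument holds for any power-of-two page size).

```c
/* With power-of-two pages, 4-byte instructions tile a page exactly;
   6-byte (48-bit) instructions do not, so some instructions straddle
   a page boundary and one fetch may need two address translations. */
enum { PAGE = 4096 };

static int tiles_page(int insn_bytes) {
    return PAGE % insn_bytes == 0;   /* 1 iff instructions never straddle */
}
```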

Your idea is a good one:  it is VLIW, in some sense.  However, it is
probably not possible to keep the load/store side of the instruction
busy a large fraction of the time in general code, although scientific
codes can probably take advantage of this "vector" capability.