tom@nud.UUCP (Tom Armistead) (06/18/88)
On RISC processors without a scoreboard*, how are the results of memory
references guaranteed to be available before they are used in subsequent
computations? I thought through several scenarios and couldn't find a very
good solution and, being unfamiliar with RISCs other than the 88k, thought
the net might have some answers.

1) If the processor waited until the load was complete before launching the
next instruction, that would guarantee the correct result was available
before an attempt was made to use it. This is obviously very poor from a
performance standpoint, though.

2) If the compiler/assembly writer is required to wait a certain number of
ticks after a load before using the results of the load, how is the
required amount of delay determined? Can you assume you always get a cache
hit (I doubt it)? Do you have to use the worst case (then why use a cache
at all)? What if the access is across a shared bus and the tape controller
is currently hogging the bus and will be for the next 10ms (you certainly
wouldn't want to use a 10ms delay after each load)? Does the memory system
have to give absolute top priority to CPU requests so that the required
delay is fixed? What if memory refresh were also pending - it seems here
you would have a nasty choice of letting memory be trashed or letting the
CPU use bogus results. Does #2 mean that a binary compiled for a
particular system won't run on a different system (having a different
memory speed) even though they have the same CPU chip, and that instead
the source must be recompiled for each different system the program is to
run on (or if the current system is sped up/slowed down)? This seems like
it would be clumsy and difficult to manage from a software distribution
aspect.

3) Is there some internal register that can be referenced that indicates
when the result is available, with software required to interrogate the
register to determine if the result is ready? This would add overhead to
every load instruction.
4) Is there some other method I haven't thought of that is the method of
choice? Which RISC processors don't have scoreboarding, and what method do
they use?

* For those unfamiliar with what scoreboarding is: the scoreboard is
merely an internal register which keeps track of which registers have
stale data (for example, the destination register of a load instruction
which has been launched but not completed). If the CPU encounters an
instruction which uses a stale register, it waits until the scoreboard
indicates the register is valid again before it begins the instruction's
execution, thus assuring the correct operands are used by the instruction.
Note this is a hardware feature - software is not required to examine the
scoreboard to accomplish this. As an example, on the 88k you could do the
following sequence:

	ld	r2,r0,0		; Assume this ld takes a long time.
	add	r3,r2,1		; Add 1 to result of ld and put in r3.

and the result would be correct (for any length of ld time). Without a
scoreboard, the result in r3 would be based on the stale data in r2 (which
is probably the wrong value).
--
Just a few more bits in the stream.
			The Sneek
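[The scoreboard behavior in the ld/add sequence above can be mimicked with
a toy simulation. The 5-cycle load latency and the instruction encoding
below are invented figures, purely for illustration, not 88k timing.]

```python
# Toy model of a per-register scoreboard: each destination register is
# marked stale until its result arrives; issue stalls on stale sources.
# Latencies and instruction format are illustrative only.

def run(program):
    """In-order issue; stall while any source register is marked stale.
    program: list of (opname, dest, sources, result_latency_in_cycles)."""
    valid_at = {}                     # reg -> cycle its value becomes valid
    cycle = 0
    for op, dest, srcs, latency in program:
        for s in srcs:                # scoreboard check: wait on stale regs
            cycle = max(cycle, valid_at.get(s, 0))
        cycle += 1                    # issue takes one cycle
        valid_at[dest] = cycle + latency - 1   # dest stale until then
    return max([cycle] + list(valid_at.values()))

# ld  r2,r0,0   ; assume this ld takes 5 cycles (a long time)
# add r3,r2,1   ; the hardware stalls until r2 is valid again
prog = [("ld",  "r2", ["r0"], 5),
        ("add", "r3", ["r2"], 1)]
print(run(prog))  # the add issues only after the load result is valid
```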
tim@amdcad.AMD.COM (Tim Olson) (06/19/88)
In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
| On RISC processors without a scoreboard*, how are the results of memory
| references guaranteed to be available before they are used in subsequent
| computations? I thought through several scenarios and couldn't find a
| very good solution and, being unfamiliar with RISCs other than the 88k,
| thought the net might have some answers.

The scoreboarding on the 88k (correct me if I'm wrong) associates a bit
with each register in the register file to tell whether the value is stale
or not. During the issuing of an instruction, the operand registers are
checked for staleness, and the pipe is stalled if required.

The bit-per-register is one way of detecting dependencies. Another way
(used on the Am29000) is to associate a destination register pointer (and
a "ready" bit) with each functional unit. Forwarding logic matches the
register operand fields of the instruction being issued with the
destination register pointers of the functional units, stalling if it
finds a match and the value is not available yet. When the value is
available, it is forwarded to the correct operand bus (A or B, or perhaps
both) so that the instruction does not have to wait for the value to be
written into the register file. This method has the benefit of integrating
closely with the forwarding logic.

-- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
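[The two dependency-detection schemes described here - a stale bit per
register vs. a destination pointer and ready bit per functional unit - can
be contrasted in a rough sketch. The unit names and data structures below
are invented for illustration, not actual 88k or Am29000 hardware.]

```python
# Two ways to detect that an operand is not yet valid, per the post above.
# Names and structure are illustrative, not real hardware.

# 88k-style: one "stale" bit per register.
scoreboard_stale = set()            # registers with results outstanding

def stale_scoreboard(reg):
    return reg in scoreboard_stale

# Am29000-style: each functional unit holds a destination pointer + ready bit.
functional_units = [
    {"name": "load",   "dest": "r2", "ready": False},  # load in flight to r2
    {"name": "fp_add", "dest": None, "ready": True},   # idle
]

def stale_comparators(reg):
    # Compare the operand field against every pending destination pointer.
    return any(u["dest"] == reg and not u["ready"] for u in functional_units)

scoreboard_stale.add("r2")          # mark the load's destination stale
print(stale_scoreboard("r2"), stale_comparators("r2"))  # both detect the hazard
```

The comparator scheme costs hardware per functional unit rather than per
register, which is why it pairs naturally with forwarding: the matching
unit is exactly the one whose result bus must be steered to the operand.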
mash@mips.COM (John Mashey) (06/19/88)
In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
> On RISC processors without a scoreboard*, how are the results of memory
>references guaranteed to be available before they are used in subsequent
>computations?....

This was discussed a while back. Sounds like it's time to do it again.
NOTE: caveats: there are certainly architectures for which some of the
following doesn't fit. It is intended to be mostly applicable to
"mainline" RISC micros.

>2) If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined? Can you assume you always get a
>cache hit (I doubt it)?....

Following are:
1) How the MIPS R2000/R3000 do it.
2) Some thoughts on when scoreboarding does/doesn't make sense, and why.

HOW THE R2000/R3000 WORK

1) A load requires one cycle to initiate, plus one "architectural" cycle
of latency, which must be covered by the software, i.e., it must insert a
nop if it can't find anything useful.

2) If we ever built a machine in which the primary cache had 2 or more
cycles of latency, the software would be responsible for the 1st cycle of
latency, and the hardware responsible for cycles 2,...n. (This preserves
the ability to run the existing object code unchanged.) However, given the
domain of problems we worry about, this is probably irrelevant, as we'll
probably never build a system whose first-level cache has more than 1
cycle of latency, given the performance impacts. Of course, it's hard to
build a machine that has a zero-cycle load latency :-) I.e., there's a
reason for having exactly 1 cycle of latency, not 2 or 0.

3) A lot of statistics say:
	a) Reorganizers get to fill 70-80% of load delays in 1-cycle
	latency machines.
	b) Reorganizers get to fill (as I recall, it's been awhile) 30%
	of delays in 2-cycle latency machines.
	c) After that, it gets bad (10% or less).
[What does this mean? ans: if you load something, you expect to use it
pretty soon.]

4) All of that assumes a cache hit (which is, of course, the most frequent
case). If there is a cache miss, the main pipeline freezes while the cache
refill is done:
	a) One or more words of data are brought into the D-cache.
	b) The stalled instruction resumes after the fixup.
Thus, the software could care less how far away memory is, whether or not
DRAM refresh is running, etc.

5) There are split I- and D-caches, and a 1-cycle latency branch is used,
with software also being required to fill the branch delay slot.
Approximately similar numbers (1 slot rather than 0 or 2) apply as in the
load case.

6) As it happens, when I said the main pipeline freezes, what I glossed
over was that there are numerous independent functional units:
	a) The main integer unit (i.e., the one that really knows where
	the PC is), including all loads/stores.
	b) Integer multiplier/divider
	c) FP adder
	d) FP multiplier
	e) FP divider
The results of b) - e) are interlocked, i.e., if you attempt to use the
result of b) - e) and it isn't ready, it is interlocked by the hardware,
rather than software. You can consider this "partial scoreboarding",
although we don't particularly call it that, as the main pipeline does
completely freeze, rather than trying to issue the following instruction.

7) Now, why should it be that we interlock some things, and not others?
	a) Note that the 2 things that are NOT interlocked (load and branch):
		Have small (1-cycle) delays that are unlikely to be
		shortened, and are undesirable to lengthen.
		Have about a 70% chance of having a 1-cycle delay slot
		filled by software.
	b) Note that the things that ARE interlocked:
		Are multiple-cycle operations (DP ADD = 2, rest are longer).
		Are operations whose performance might well change in a
		future version (i.e., you might throw more silicon at DP
		MULT to improve it from 5 cycles, etc.).
		Are generally difficult for compilers to handle
		completely, for both of the above reasons.
SCOREBOARDING

1) If you look at where scoreboarding came from (I remember it first from
the CDC 6600 (see the Bell & Newell book, for example), but there may be
earlier ones), you had machines with:
	a) Many independent functional units
	b) Many multiple-cycle operations (integer add was 3 cycles in
	the 6600, for example, and the FP operations were of course longer).
	c) Multiple memory pipes, with a long latency to memory, and no
	caches.

2) Note that RISC micros probably have a), but normally only the FP ops
and maybe a few others have long multi-cycle operations, and they sure
don't have multiple memory pipes, and they often have caches.

3) Whether a RISC machine has no scoreboarding, some, or complete
scoreboarding, it had better be designed with a pipeline reorganizer in
mind, because it's absolutely silly not to be using one. With one
exception (in our case), a scoreboard can NEVER do any better than a
reorganizer. I'll show the exception later, but here is an example why I
say this. Consider:

	a = b + c; d = 0;

	1	lw	r1,b
	2	lw	r2,c
	3	add	r3,r2,r1
	4	sw	r3,a
	5	sw	zero,d

On a machine with a simple load interlock and 1-cycle-latency loads, the
machine would stall for 1 cycle before instruction 3. Suppose you do more
complex scoreboarding, i.e., you continue attempting to issue
instructions? Then you might do 1, 2, try 3 and save it, and then either
come back and do 3 again (probably) or go on to 4 and discover that it
stalls also. If one looks at integer code sequences, one discovers that it
is hard to find many things that don't quickly depend on something else
(barring, of course, Multiflow-style compiler technology and hardware
designs not currently feasible in VLSI micros). Now, the point is that no
simple scoreboarding hardware is going to be able to figure out that the
best sequence is:

	1	lw	r1,b
	2	lw	r2,c
	5	sw	zero,d
	3	add	r3,r2,r1
	4	sw	r3,a

which ALWAYS wins (given the memory system described).
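[The win from reordering here can be checked with a small cycle-count
sketch - a simple 1-cycle load-interlock model, not actual MIPS hardware;
the tuple encoding is invented for illustration.]

```python
# Count cycles for a sequence under a simple load interlock: a load's
# result is unavailable for one cycle after issue (1-cycle latency).

LOAD_LATENCY = 1

def cycles(seq):
    """seq: list of (dst, srcs, is_load). In-order issue, stall on a
    not-yet-ready source. Stores are modeled as writing a memory label."""
    ready = {}               # reg/label -> cycle at which its value is ready
    t = 0
    for dst, srcs, is_load in seq:
        t = max([t] + [ready.get(s, 0) for s in srcs])  # interlock stall
        t += 1
        ready[dst] = t + (LOAD_LATENCY if is_load else 0)
    return t

original = [("r1", [], True),               # 1  lw  r1,b
            ("r2", [], True),               # 2  lw  r2,c
            ("r3", ["r1", "r2"], False),    # 3  add r3,r2,r1 (stalls 1 cycle)
            ("a",  ["r3"], False),          # 4  sw  r3,a
            ("d",  [], False)]              # 5  sw  zero,d

reordered = [original[0], original[1], original[4],   # sw zero,d fills
             original[2], original[3]]                # the delay slot

print(cycles(original), cycles(reordered))  # 6 cycles vs. 5: reordering wins
```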
Note that a 2-cycle latency adds 1 more cycle of delay, even with the
reorganized code. Also, note that if you're willing to add mucho more
hardware, you can look ahead further (really, keeping 1 independent PC or
pipeline for each extra instruction, but compilers can always look ahead
further). Similar things go on for floating point, i.e., no matter how
much scoreboarding you do, the compiler MUST do some code-spreading or you
will lose performance. Note also that streamlined RISC micros usually do 2
register reads and 1 write per cycle; it's hard to get anything else
squeezed in. As one last example, one of the MIPSers here once wrote a
reorganizer for a machine architecture that was fully interlocked, and
improved the performance at least 5% on some benchmarks.

SUMMARY OF THIS PART: any pipelined machine ought to be designed expecting
to have a good reorganizer, even if fully scoreboarded.

The exception I mentioned:

	function(x) =>
		lw	r2,x
		jal	function
		nop

We can't move the lw to cover the jal delay slot, because we're not sure
the function's first instruction doesn't use it. It turns out that the
number of jal delay slots left unfilled is fairly low, and the fraction
that might have been filled by a load is even lower, as there is almost
always some argument-gathering that can be moved into the delay slot.
(Most of the jal delay slots come from code like this:

	if (something)
		func1();
	else
		func2();

Of course, one can turn on a linker-optimizer that looks at the targets of
jals and copies the target into the delay slot if it can.)

Anyway, I think the bottom line is that full scoreboarding:

	Is not very relevant to the integer part of the CPU.
	Might be useful for long multi-cycle operations (such as FP),
	maybe (although Earl Killian has a long analysis that argues
	otherwise that I don't have time to condense & post; maybe Earl
	will).

	Might be helpful for compilers that can't do code scheduling,
	even though it is silly to do high-performance pipelined RISCs
	without having a good reorganizer.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@urbsdc.Urbana.Gould.COM (06/20/88)
> On RISC processors without a scoreboard*, how are the results of memory
>references guaranteed to be available before they are used in subsequent
>computations? I thought through several scenarios and couldn't find a very
>good solution and, being unfamiliar with RISCs other than the 88k,
>thought the net might have some answers.
>...
>2) If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined? Can you assume you always get a
>cache hit (I doubt it)?

Usually by assuming a cache hit; the entire processor stalls on a cache
miss. Details of the stall may vary.
kds@blabla.intel.com (Ken Shoemaker) (06/21/88)
But one very important constraint that John imposed was that the
processor/data memory system allow only one external data access at a
time. This just isn't a requirement with the current level of integration
possible on chips: having multiple pending external cycles is something
that is certainly possible. I don't know if the 88000 supports this, but
judging from some of the microarchitectural tradeoffs that they made, it
would seem that they had something along these lines in mind, maybe in a
future chip?

Another thing that register scoreboarding can be used for is to prefetch
data into caches: if you know that you are going to use a particular
vector at some time in the future, you can preload it into the cache so
that it is available when you get back to it. This certainly is a more
elegant solution than a "do the load, but don't wait for the data to ever
come back" method of doing the same thing. This could be useful when
working with very large data sets, i.e., those that don't fit into your
cache, in a typical system where the main memory latency is long but
transfer time is short.

But what do I know? And why am I defending Mot?

You don't have to break many eggs to hate omelets -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
mce@tc.fluke.COM (Brian McElhinney) (06/21/88)
In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Anyway, I think the bottom-line is that full scoreboarding:
>
>	Is not very relevant to the integer part of the CPU.
>
>	Might be useful for long multi-cycle operations (such as FP),
>	maybe
>
>	Might be helpful for compilers that can't do code scheduling,
>	even though it is silly to do high-performance pipelined RISCs
>	without having a good reorganizer.

Register scoreboarding is also useful to software types and other end
users of this complicated compiler technology. No matter how mature the
compiler is, it is still a large software program and will have bugs. It
may not be worth the hardware trade-offs, but the software costs should
also be considered. Anything that makes the software "safer" seems like a
win to me, but I'll admit I'm biased!

Brian McElhinney
Software Engineer
mce@tc.fluke.com
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <22147@amdcad.AMD.COM> tim@delirun.UUCP (Tim Olson) writes:
>The scoreboarding on the 88k (correct me if I'm wrong) associates a bit
>for each register in the register file to tell whether the value is
>stale or not. During the issuing of an instruction, the operand
>registers are checked for staleness, and the pipe is stalled if
>required.

This sums it up very well.

[ Concerning a different method of handling stale registers ]
>value to be written into the register file. This method has the benefit
>of integrating closely with the forwarding logic.

The scoreboard method can allow register forwarding also.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <28200166@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>> On RISC processors without a scoreboard*, how are the results of memory
>>references guaranteed to be available before they are used in subsequent
>>2) If the compiler/assembly writer is required to wait a certain number
>>of ticks after a load before using the results of the load, how is the
>>required amount of delay determined? Can you assume you always get a
>
>Usually by assuming a cache hit; the entire processor stalls on a cache miss.
>Details of the stall may vary.

This makes sense for RISC processors with a common data and code bus.
After all, if the data cache misses, the bus is being hogged by the data
unit, and the instruction unit then has to stall until the data
transaction finishes (although a few instructions might have been
prefetched). I had gotten used to the split instruction and data buses on
the 88k, which yield different behaviour in the above case. On the 88k, a
data cache miss doesn't necessarily create an instruction pipe stall. I
think this is better from a performance standpoint, but it does require
some type of scoreboarding to handle the data dependencies properly.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <2417@mipos3.intel.com> kds@mipos2.intel.com. (Ken Shoemaker) writes:
>But one very important constraint that John imposed was that the processor/
>data memory system allow only one external data access at a time. This
>just isn't a requirement with the current level of integration possible on
>chips: having multiple pending external cycles is something that is
>certainly possible. I don't know if the 88000 supports this, but judging

It does; up to 3 memory transactions (4 in a special case) can be pending
simultaneously. 2 of the transactions can be in progress at the same time
(although at different stages of completion) on the external bus (the
PBUS). For example, the following sequence can be done:

	ld	r1,r2,0
	st	r3,r4,0
	ld	r5,r6,0

and all 3 transactions will then be simultaneously in progress by the
memory unit. The PBUS timing would look like (assuming cache hits):

	Clock	PBUS address	PBUS data	PBUS R/W	PBUS response
	1 rise	r2		-		R		-
	1 fall	r2		-		R		-
	2 rise	r4		data at *r2	W		-
	2 fall	r4		data in r3	W		SUCCESS to ld
	3 rise	r6		data in r3	R		-
	3 fall	r6		-		R		SUCCESS to st
	4 rise	-		data at *r6	-		-
	4 fall	-		-		-		SUCCESS to ld

Briefly put, the address lines on the current clock tick correspond to the
next tick's data lines, and the current tick's data lines correspond to
the previous tick's address lines, so 2 transactions are in progress at
the same time. This allows a memory access to be completed each clock tick
on a sustained basis.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/23/88)
In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Following are:
>1) How the MIPS R2000/R3000 does it.
>result of b) - e), and it isn't ready, it is interlocked by the hardware,
>rather than software. You can consider this "partial scoreboarding",
>although we don't particularly call it that, as the main pipeline does
>completely freeze, rather than trying to issue the following instruction.

This is similar (sort of) to what the 88k does, although we do call it
scoreboarding, since we interlock on a per-register stale flag rather than
a per-functional-unit done flag. We use the scoreboard on memory loads as
well as on the FP unit.

>a) Note that the 2 things that are NOT interlocked (load and branch):
>	Have small (1-cycle delays) that are unlikely to be shortened,
>	and are undesirable to lengthen
>	Have about a 70% chance of having a 1-cycle delay slot filled
>	by software.

But why not interlock on loads as well? It removes the *requirement* that
a load be followed by a nop if something useful can't be found to put
there, and yet still allows the compiler/coder to put a useful instruction
in the delay slot if one exists. This provides the following benefits:

1) More compact code, since the loads that don't have a useful delay slot
instruction don't have to have a nop. Of course, this is a less common
case (only 30% of loads) than those that do, so the difference will likely
be slight, but it is nonetheless a slight improvement over not
interlocking.

2) As one other poster pointed out, it makes the coding easier and more
bug-proof, since the compiler or coder doesn't have to remember to insert
a nop on the loads with no useful delay slot instruction. On the 88k,
there is no way a user program can access stale registers, regardless of
how dumb/bug-ridden the compiler or writer is.
I see the above 2 items as improvements created by interlocking on load
results, and since the interlock circuitry is already partially in place
(for the FP instructions), I don't see any additional disadvantages - so
why not interlock on loads?

>SCOREBOARDING
>1) If you look at where scoreboarding came from (I remember it first from
>CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
>you had machines with:
>	a) Many independent functional units
>	b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
>	for example, and the FP operations were of course longer).
>	c) Multiple memory pipes, with a long latency to memory, and no caches.
>2) Note that RISC micros probably have a), but normally, only the FP ops,
>and maybe a few others have long multi-cycle operations, and they sure don't
>have multiple memory pipes, and they often have caches.
      ^^^^^^^^^^^^^^^^^^^^^

The 88k has this as well as a & b (on-board FP unit).

>On a machine with a simple load-interlock, and 1-cycle-latency loads,
>the machine would stall for 1 cycle before instruction 3.
>Suppose you do more complex scoreboarding, i.e., you continue attempting
>to issue instructions? Then, you might do 1, 2, try 3 and save it,
>and then either come back and do 3 again (probably) or go on to 4 and
>discover that it stalls also. If one looks at integer code sequences,
>one discovers that it is hard to discover many things that don't quickly
>depend on something else (barring, of course, Multiflow-style compiler
>technology and hardware designs not currently feasible in VLSI micros).

The 88k scoreboarding logic doesn't perform this complex a task. It merely
monitors the instruction stream, stalls the instruction unit if an attempt
is made to execute an instruction which uses stale registers, and then
lets the instruction unit continue once the correct operands are
available.
For example, in the following code sequence (assume the ld is a cache miss):

	1)	ld	r2,r3,0		; Get value.
	2)	add	r3,r3,16	; Bump pointer.
	3)	add	r2,r2,1		; Increment value.
	4)	sub	r4,r4,1		; Dec count.

the instruction unit will stall on instruction 3 since it attempts to use
stale data. Even though instruction 4 theoretically could be executed (as
it isn't dependent on the ld results), it won't be started until the ld is
complete and instruction 3 is completed.

Efficient code on the 88k is just as dependent on a good compiler (or
coder) as on any other RISC micro. Scoreboarding on the 88k simply ensures
you won't use stale data inadvertently - it's up to the compiler to
produce efficient code that minimizes stalls.

>	Might be useful for long multi-cycle operations (such as FP),
>	maybe (although Earl Killian has a long analysis that
>	argues otherwise that I don't have time to condense & post;
>	maybe Earl will),

Consider the case where multiple FP operations can be in progress at the
same time (as on the 88k). How could you handle this situation without a
per-register good/stale indicator (read: scoreboard)? Without a
scoreboard, the processor would seem to be required to process memory
accesses and FP operations sequentially. This might be acceptable, but it
probably will impact performance adversely in FP- or memory-intensive
applications.

So it seems to me that scoreboarding is not required as long as the
multi-cycle functional units are single-threaded. Here, the
per-functional-unit done flag seems like it would work OK. However, to
obtain the performance boost from pipelined functional units (like on the
88k), register scoreboarding is a requirement and not an option.
--
Just a few more bits in the stream.
			The Sneek
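[The stall in this example can be reproduced in a toy in-order issue
model. The 10-cycle miss penalty is an invented figure for illustration,
not 88k timing.]

```python
# In-order issue with per-register stale bits: the sub (instruction 4) is
# independent of the load, but still issues only after the stall on
# instruction 3 resolves, exactly as described above.

MISS_LATENCY = 10   # hypothetical cache-miss penalty, in cycles

def trace(seq):
    """Return the issue cycle of each instruction.
    seq: list of (dest, sources, result_latency)."""
    ready, t, issued = {}, 0, []
    for dst, srcs, lat in seq:
        t = max([t] + [ready.get(s, 0) for s in srcs])  # stall on stale regs
        issued.append(t)
        t += 1
        ready[dst] = t + lat - 1
    return issued

code = [("r2", ["r3"], MISS_LATENCY),  # 1) ld  r2,r3,0   (cache miss)
        ("r3", ["r3"], 1),             # 2) add r3,r3,16
        ("r2", ["r2"], 1),             # 3) add r2,r2,1   (stalls on r2)
        ("r4", ["r4"], 1)]             # 4) sub r4,r4,1   (independent, but late)

print(trace(code))  # issue cycles [0, 1, 10, 11]: the sub waits behind the add
```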
tim@amdcad.AMD.COM (Tim Olson) (06/24/88)
In article <1095@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
| [ Concerning a different method of handling stale registers ]
| >value to be written into the register file. This method has the benefit
| >of integrating closely with the forwarding logic.
|
| The scoreboard method can allow register forwarding also.

Can you elaborate a bit? Forwarding requires that you also know which
functional unit(s) to forward from -- don't you need other hardware than
the scoreboard bits for this?

-- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
tom@nud.UUCP (Tom Armistead) (06/25/88)
In article <22196@amdcad.AMD.COM> tim@delirun.amd.com (Tim Olson) writes:
>| The scoreboard method can allow register forwarding also.
>
>Can you elaborate a bit? Forwarding requires that you also know which
>functional unit(s) to forward from -- don't you need other hardware than
>the scoreboard bits for this?

Yes, other hardware is required. My posting was intended to point out that
scoreboarding doesn't *preclude* register forwarding - not that
scoreboarding itself is sufficient to implement the feature.

Why do you say that forwarding requires knowledge of which functional unit
will be supplying the fed-forward result? The instruction unit knows which
register(s) it is waiting on and can enable feed-forward circuitry so
that, when the result is available, it is fed to the instruction unit
simultaneously with the result being written to the register file. With
this method, the feed-forward logic can be independent of the functional
units. The instruction unit doesn't know, and doesn't care, what is
supplying the result; it just knows that it needs that particular result
for the next instruction, and snags it.

Forwarding could also be implemented by connecting the input of the
instruction unit to the output of the particular functional unit that will
provide the result, but this would require a separate feed-forward path
for each functional unit. This would also work but, as you say, requires
knowledge of which functional unit the result is expected from.
--
Just a few more bits in the stream.
			The Sneek
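[The "snag the result as it is written back" scheme described above can be
sketched roughly. The class structure and names are invented for
illustration, not actual 88k hardware.]

```python
# Forwarding that is independent of the functional units: the instruction
# unit only watches the register-file writeback for the register it needs,
# so it never has to know which unit produced the value.

class InstructionUnit:
    def __init__(self):
        self.waiting_on = None   # register the stalled instruction needs
        self.operand = None      # value captured by the feed-forward path

    def writeback(self, reg, value, regfile):
        regfile[reg] = value     # normal register-file write
        if reg == self.waiting_on:
            self.operand = value # forwarded the same cycle; no re-read needed
            self.waiting_on = None

regfile = {}
iu = InstructionUnit()
iu.waiting_on = "r2"             # stalled waiting on the result of a load
iu.writeback("r2", 42, regfile)  # any unit's result; the source doesn't matter
print(iu.operand, regfile["r2"])
```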
earl@mips.COM (Earl Killian) (06/25/88)
In article <1098@nud.UUCP>, tom@nud.UUCP (Tom Armistead) writes:
> >SCOREBOARDING
> >1) If you look at where scoreboarding came from (I remember it
> >first from CDC 6600 (see Bell & Newell book, for example), but
> >there may be earlier ones), you had machines with:
> > a) Many independent functional units
> > b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
> > for example, and the FP operations were of course longer).
> > c) Multiple memory pipes, with a long latency to memory, and no caches.
> >2) Note that RISC micros probably have a), but normally, only the FP ops,
> >and maybe a few others have long multi-cycle operations, and they sure don't
> >have multiple memory pipes, and they often have caches.
>       ^^^^^^^^^^^^^^^^^^^^^
> The 88k has this as well as a & b (on board FP unit).

John meant the ability to do multiple data references per cycle, not the
ability to do an instruction fetch and a data fetch per cycle (which any
well-designed RISC supports). So I don't think the 88k qualifies here.

> Consider the case where multiple FP operations can be in progress at
> the same time (as on the 88k). How could you handle this situation without
> a per register good/stale indicator (read scoreboard)? Without a
> scoreboard, the processor would seem to be required to process memory
> accesses and FP operations sequentially. This might be acceptable but
> it probably will impact performance adversely in FP or memory intensive
> applications.

You don't need a scoreboard to do what you say, but when the number of
pending results is large, it is the most appropriate technique. The
alternative, providing a register-field comparator for each pending
result, is appropriate when the number of pending results is small.

But my question in all this is: why did the 88000 choose to fully pipeline
floating point and thus allow such a large number of pending results? I
understand why the 6600 and its successors did it, but the same analysis
for the 88000 suggests it is unnecessary. You don't have the cache
bandwidth to make fp pipelining useful even on large, highly vectorizable
problems (i.e., 32b per cycle isn't enough); you can't feed the fp
pipeline fast enough. For example, for linpack on a load/store machine
with 32b/cycle, you need only start a new add every 8 cycles and a new
multiply every 8 cycles to run at memory speed. With fp latencies < 8
cycles, no pipelining is necessary in the individual fp units. Go through
the 24 Livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.

The price you paid for pipelining appears to be enormous: the 88100's fp
op latencies average 2.5x longer than the R3010's when measured in cycles;
3x longer when measured in ns.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
mash@mips.COM (John Mashey) (06/26/88)
In article <1098@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes: > But why not interlock on loads as well?.... >1) More compact code... This is a legitimate issue, although it's about a 5% effect. We considered doing load-interlocks (and can do so in the future in an upward -compatible way), but didn't for philosophical reasons, i.e., we had a rule that we wouldn't put features in that we'd have to live with forever if we couldn't prove they were worthwhile for performance (1%); there was some concern that for some design directions, there might be a cycle-time hit if this were in the critical path somewhere. Anyway, it is a legitimate topic for debate. >2) As one other poster pointed out, it makes the coding easier and more >bug proof since the compiler or coder doesn't have to remember to insert >a nop on the loads with no useful delay slot instruction. On the 88k, >there is no way for a user program can access stale registers >regardless of how dumb/bug ridden the compiler or writer is. This one, however, is completely a non-issue, although, to be fair, it's frequently asked by people running into the R2000 for the first time: a) It is trivial to make the assembler insert load-nops where needed. I doubt that our code took more than 10 minutes to write and debug, and I don't ever remember having problems with that in almost 4 years. b) Any software system that couldn't be trusted with this problem, shouldn't be trusted with ANY problem: really, there are zillions of harder problems that need to be solved. (well, maybe not zillions, but many :-) c) Figuring that a RISC compiler should do optimization, but worrying that this feature might be buggy, is like worrying about the safety of flying in a 747 and bringing your own seat-belt because you don't really trust Boeing to remember to include them :-) .... > The 88k scoreboarding logic doesn't perform this complex of a task. 
It >merely monitors the instruction stream and stalls the instruction unit if >an attempt is made to execute an instruction which uses stale >registers and then lets the instruction unit continue once the correct >operands are available. For example, in the following code sequence >(assume the ld is a cache miss): > >1) ld r2,r3,0 ; Get value. >2) add r3,r3,16 ; Bump pointer >3) add r2,r2,1 ; Increment value. >4) sub r4,r4,1 ; Dec count. > >the instruction unit will stall on instruction 3 since it attempts to >use stale data. Even though instruction 4 theoretically could be >executed (as it isn't dependent on the ld results), it won't be started >until the ld is complete, and instruction 3 is completed. Thanx: we weren't sure whether it had multiple streams or not. The example seems to indicate that the 88K indeed has a load with 2 cycles of latency (i.e., cycles 2 & 3 above). From the example in <1097@nud.UUCP> that gave cycles for the ld/st/ld code, one would have thought there was only 1 latency cycle. Can you say: a) Are there indeed 2 latency cycles (i.e., that instruction 3 will indeed stall above)? b) If so, what is the reason for the second latency slot? (I realize that you may not want to answer this one :-) Note that our numbers say that in our machines, it would cost us 10-15% in overall performance to go from 1 cycle latency to 2, and the similarity of machines probably means about the same amount for an 88K. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
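Mashey's point (a) above — that it is trivial to make the assembler insert load-nops where needed — can be sketched in C roughly as follows. The instruction record and the single-delay-slot rule here are hypothetical illustrations, not the actual MIPS assembler, which also tries to fill the slot with useful work rather than a nop:

```c
#include <string.h>

/* Hypothetical 3-address instruction record (not a real encoding). */
struct insn {
    char op[8];
    int dest, src1, src2;   /* register numbers; -1 = unused */
};

static int is_load(const struct insn *i) { return strcmp(i->op, "ld") == 0; }

static int uses_reg(const struct insn *i, int r) {
    return r >= 0 && (i->src1 == r || i->src2 == r);
}

/* Copy `in` to `out`, inserting a nop whenever the instruction right
 * after a load reads the load's destination register (one delay slot).
 * Returns the number of instructions written to `out`. */
int insert_load_nops(const struct insn *in, int n, struct insn *out) {
    int m = 0;
    for (int k = 0; k < n; k++) {
        out[m++] = in[k];
        if (is_load(&in[k]) && k + 1 < n && uses_reg(&in[k + 1], in[k].dest)) {
            struct insn nop = { "nop", -1, -1, -1 };
            out[m++] = nop;
        }
    }
    return m;
}
```

A one-instruction lookahead like this is indeed a 10-minute job; the genuinely hard part of a reorganizer is scheduling a useful independent instruction into the slot instead of a nop.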
ron@mucmot.UUCP (Ron Voss) (06/28/88)
In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: > c) Figuring that a RISC compiler should do optimization, but worrying > that this feature might be buggy, is like worrying about the safety > of flying in a 747 and bringing your own seat-belt because you > don't really trust Boeing to remember to include them :-) I have observed a few cases where hours were wasted searching for a programming bug, only to find that the CISC (68020) compiler's optimizer is buggy (so one turns on the optimizer only for a release, since it slows compilations anyway, thoroughly tests again after turning it on, etc.). Are there really *good* reasons to put more trust in RISC optimizers? Methinks the metaphor a bit overdone. -- ------------------------------------------------------------ Ron Voss Motorola Microsystems Europe, Munich mcvax!unido!mucmot!ron CIS 73647,752 my opinions are just that, and not necessarily my employer's ------------------------------------------------------------
colwell@mfci.UUCP (Bob Colwell) (06/29/88)
In article <318@mucmot.UUCP> ron@mucmot.UUCP (Ron Voss) writes: >In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: >> c) Figuring that a RISC compiler should do optimization, but worrying >> that this feature might be buggy, is like worrying about the safety >> of flying in a 747 and bringing your own seat-belt because you >> don't really trust Boeing to remember to include them :-) > >I have observed a few cases where hours were wasted searching for a >programming bug, only to find that the CISC (68020) compiler's optimizer >is buggy (so one turns on the optimizer only for a release, since it >slows compilations anyway, thoroughly tests again after turning it on, etc.). > >Are there really *good* reasons to put more trust in RISC optimizers? Ron, I couldn't let that one slide on by. I think it's a truism that good designers produce good designs, including compilers/optimizers, and not-so-good designers produce everything else. My experience has been that the quality of a piece of software is a far stronger function of the abilities of its creators than of its intrinsic complexity. One could perhaps make the case that a RISC optimizer is likely to be intrinsically less complex than its CISC counterpart due to the more-easily-understood machine spec, but it's probably impossible to prove, and I don't think it's as strong an argument as the above. I also have the feeling that more talented designers are turning their attentions to compilers than ever before, so perhaps the overall quality of this kind of software is improving, but I won't even try to defend that thesis if you don't like it. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
phil@osiris.UUCP (Philip Kos) (06/29/88)
In article <318@mucmot.UUCP>, ron@mucmot.UUCP (Ron Voss) writes: > Are there really *good* reasons to put more trust in RISC optimizers? I have to agree. I once saw an old version of the Pyramid cc optimizer totally trash a for(i = 0; i < 7; i++) loop with a printf() in the middle. I guess that bit of code didn't really do anything. :-) (Talk about FUN to debug...) And we've been told to avoid the global optimizer like the plague when any of the code to be tuned is asynchronous, like the signal handlers we have in 99.99% of our applications... Phil Kos Information Systems ...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital Baltimore, MD
tom@nud.UUCP (Tom Armistead) (06/30/88)
In article <2459@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >But my question in all this is why did the 88000 choose to fully >pipeline floating point and thus allow such a large number of pending >results? I understand why the 6600 and its successors did it, but the >same analysis for the 88000 suggests it is unnecessary. You don't >have the cache bandwidth to make fp pipelining useful even on large >highly vectorizable problems (i.e. 32b per cycle isn't enough). You >can't feed the fp pipeline fast enough. Assuming the FP operands are not in cache, this is true. However, there will be some class of problems which can make effective use of the FP pipelining and assuming that FP pipelining has no bad side effects (see paragraph below), it only makes sense to provide the feature. >The price you paid for pipelining appears to be enormous: the 88100's I didn't design the 88100 (I am not a chip designer at all). However, I doubt that any speed difference is due to a tradeoff made to provide FP pipelining. Whether the FP unit was pipelined or not, I think the FP latencies would still be the same. Do you have some reason to think that not pipelining can decrease FP latency? If no increase in latency is incurred to provide pipelining, then providing it can only help performance. >fp op latencies average 2.5x longer than the R3010's when measured in >cycles; 3x longer when measured in ns. Would you post the R3010 FP stats? Not that I doubt you, I'm just interested in how each particular R3010 FP instruction compares to the equivalent 88K instruction. -- Just a few more bits in the stream. The Sneek
tom@nud.UUCP (Tom Armistead) (06/30/88)
In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: > a) Are there indeed 2 latency cycles (i.e., that instruction 3 Yes, 2 on loads. 0 latency on stores. > b) If so, what is the reason for the second latency slot? Address calculation. The 88k provides a basic set of three addressing modes. The most frequently used construct for referencing memory is "register pointer + offset" (offset is rarely 0). Without providing this basic addressing mode, the code would have to do something like: add ptr,ptr,offset ; ld dest,ptr,0 sub ptr,ptr,offset ; If pointer must be preserved. for most loads and stores. This overhead is worse than a 2 tick latency on loads. Also note that there is 0 latency on 88k store instructions*. Since many lds have a corresponding st instruction somewhere, the averaged 88k latency will be < 2 ticks. * What is the latency on a R3000 store instruction? >Note that our numbers say that in our machines, it would cost us >10-15% in overall performance to go from 1 cycle latency to 2, Assuming the addressing modes and other aspects of the machine remain the same, this figure is in the ballpark (although a little high). However, I think the static machine assumption is not a valid one to make. -- Just a few more bits in the stream. The Sneek
guy@gorodish.Sun.COM (Guy Harris) (06/30/88)
> And we've been told to avoid the global optimizer like the plague when any > of the code to be tuned is asynchronous, like the signal handlers we have > in 99.99% of our applications... Umm, this has nothing whatsoever to do with RISC vs. CISC nor optimizer bugs; the problem here is that signal handlers (and other forms of asynchrony) violate assumptions that optimizers tend to make. For example, in the following case: extern int j; int i, k, l, a, b, c; ... i = j; k = a + b + c; l = j; if the value assigned to "i" happens to be lying around in a register (as a side effect of the code that performs that assignment) when the value of "j" is assigned to "l", you can just store that register into "l", right? Wrong. A signal could arrive between the point at which the value was loaded into the register and the point at which the value was stored into "l", and the signal handler can change the value of "j" in that interval. The same can happen with values stored in registers on I/O devices, or variables shared between processes. Unfortunately, optimizers tend not to know that certain external variables are "safe", and won't change asynchronously, while others are "unsafe" and may change asynchronously. Some optimizers may simply assume that all external variables are "safe", in which case you'd better disable the optimizer when writing code where not all external variables are "safe" (or, at least, cut the optimization down to a level where the optimizer no longer makes this assumption). To see a painfully elaborate discussion of ways to handle this problem, just search for the word "volatile" in all articles in "comp.lang.c" on your machine and print all the articles in question. :-)
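The escape hatch the comp.lang.c discussion converges on is the `volatile` qualifier: it tells the optimizer the object may change asynchronously, so every source-level read must be a real load. A minimal sketch of the signal-flag idiom, assuming an ANSI-style compiler with `volatile` and `sig_atomic_t`:

```c
#include <signal.h>

/* `volatile sig_atomic_t` is the type intended for objects written
 * from a signal handler; `volatile` forces the compiler to reload
 * the flag on every read instead of caching it in a register. */
static volatile sig_atomic_t got_signal = 0;

static void handler(int sig) {
    (void)sig;
    got_signal = 1;
}

/* Spin until the handler has run.  Without `volatile`, an optimizer
 * could legally load got_signal once and turn this into an infinite
 * loop -- exactly the asynchrony problem described above. */
void wait_for_signal(void) {
    while (!got_signal)
        ;
}
```

With the flag declared `volatile`, the optimizer may still cache genuinely "safe" variables in registers; only the declared-unsafe ones pay the reload cost.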
mash@mips.COM (John Mashey) (06/30/88)
In article <318@mucmot.UUCP> ron@mucmot.UUCP (Ron Voss) writes: >In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: >> c) Figuring that a RISC compiler should do optimization, but worrying >> that this feature might be buggy, is like worrying about the safety >> of flying in a 747 and bringing your own seat-belt because you >> don't really trust Boeing to remember to include them :-) >I have observed a few cases where hours were wasted searching for a >programming bug, only to find that the CISC (68020) compiler's optimizer >is buggy (so one turns on the optimizer only for a release, since it >slows compilations anyway, thoroughly tests again after turning it on, etc.). >Are there really *good* reasons to put more trust in RISC optimizers? >Methinks the metaphor a bit overdone. Sorry, let me try again. I wasn't trying to say optimizers were easy or safe at all: they're not, in fact, they are like building a 747. The point was, in fact, that if you're doing an optimizer (building a 747), THAT's what you worry about, because it's hard. You don't worry about getting nops after loads ("this feature") (seatbelts), because if somebody can't get THAT right, they have no business building optimizers (747s). [No disrespect to Boeing, I just picked the biggest plane I've flown in.] 1) I have marginally more trust in RISC optimizers than in CISC ones, on general principles [i.e., at least some parts are simpler.] 2) What I mostly trust/distrust is the optimizer's author(s); I'd much rather have an expert's CISC optimizer than a turkey's RISC one. (Fortunately, I get to use a super-expert's RISC optimizer. :-) -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@urbsdc.UUCP (07/01/88)
> ... we've been told to avoid the global optimizer like the >plague when any of the code to be tuned is asynchronous, like the signal >handlers we have in 99.99% of our applications... > > Phil Kos > Information Systems >...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital > Baltimore, MD I'm interested... at first I thought you said that 99.99% of your applications were in the signal handlers, but that was wrong. Still, what do you do in the signal handlers - set flags that other steps can use? Sounds like you need volatile.
smryan@garth.UUCP (07/01/88)
In article <1639@osiris.UUCP> phil@osiris.UUCP (Philip Kos) writes: >In article <318@mucmot.UUCP>, ron@mucmot.UUCP (Ron Voss) writes: >> Are there really *good* reasons to put more trust in RISC optimizers? Hey! Why not ask somebody who has to write the compiler? Don't you trust us? No? Smart idea. Seriously, the reason given for delayed branches is that (1) the compiler can do it, (2) nobody uses assembly anyway. I would like to think that (1) is true, but compilers have just as many problems as anybody else. As for (2), I have not yet heard of any machine in general production without an assembler. Even if that were the case, because (1) is false, somebody has to debug the compiler. It is hard enough to backtranslate assembly through the compiler to the source language on a conventional machine. Given the current ratio of software cost to hardware cost, I feel the more extreme RISC proposals are misguided.
earl@mips.COM (Earl Killian) (07/01/88)
In article <1110@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
TA# I didn't design the 88100 (I am not a chip designer at all).
TA# However, I doubt that any speed difference is due to a tradeoff made
TA# to provide FP pipelining. Whether the FP unit was pipelined or not, I
TA# think the FP latencies would still be the same. Do you have some reason
TA# to think that not pipelining can decrease FP latency? If no increase
TA# in latency is incurred to provide pipelining, then providing it can only
TA# help performance.
.........................................................................
The answer to this is, yes, I believe the lack of pipelining does
partially explain why the R3010 has smaller latency than the pipelined
88k. Do you have an alternate suggestion?
For example, non-pipelined functional units can reuse hardware
resources. If it takes 2 shifts to do an fp add, a non-pipelined adder
can have just one shifter but a pipelined adder needs 2 shifters.
This is especially nasty for things like the divider which reuse some
hardware entities (e.g. quotient digit lookup table) many, many times.
An N-stage-pipelined divider would have needed N of these lookup tables.
By avoiding the need for extra hardware, the non-pipelined implementation
frees up valuable chip area. In the R3010 we used that area for
(i) 64-bits x 16-words x 4-ports register file; (ii) separate divider
independent of multiplier; (iii) delay-optimized physical layout and hand
tweaked circuit design employing LARGE transistors; (iv) control unit's
resource scheduler that permits 4 instructions to execute concurrently.
.........................................................................
TA# Would you post the R3010 FP stats? Not that I doubt you, I'm just
TA# interested in how each particular R3010 FP instruction compares to the
TA# equivalent 88K instruction.
.........................................................................
88100 R3010 cycle
Operation cycles cycles ratio
======================================
sp add 5 2 2.5
dp add 6 2 3.0
sp mul 5 4 1.2
dp mul 10 5 2.0
sp div 30 12 2.5
dp div 60 19 3.2
sp convert 5 1-3 1.7-5
dp convert 6 1-3 2-6
abs/neg/mov 5? 1
.........................................................................
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
earl@mips.COM (Earl Killian) (07/01/88)
In article <1110@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
eak> You don't have the cache bandwidth to make fp pipelining useful
eak> even on large highly vectorizable problems (i.e. 32b per cycle
eak> isn't enough). You can't feed the fp pipeline fast enough.
TA# Assuming the FP operands are not in cache, this is true. However,
TA# there will be some class of problems which can make effective use of
TA# the FP pipelining and assuming that FP pipelining has no bad side effects
TA# (see paragraph below), it only makes sense to provide the feature.
.........................................................................
I *was* talking about when the operands are in cache. Let me repeat the
example in my original posting:
For example, for linpack on a load/store machine with 32b/cycle you
need only start a new add every 8 cycles and a new multiply every 8
cycles to run at cache speed. With fp latencies < 8 cycles, no
pipelining is necessary in the individual fp units. Go through the 24
livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.
What's going on here? The DAXPY inner loop of linpack
DY(I) = DY(I) + DA*DX(I)
has a fp add, fp multiply, two loads, and one store (ignoring loop
overhead, which can be effectively eliminated by unrolling). With a
32b interface to the data cache, the two loads and one store take 6
cycles. The two fp ops take 2 cycles to issue. Total 8 cycles for
2 flops. So if your add latency and multiply latency are <= 8 cycles
you can execute DAXPY without any pipelining, running at 75% of your
cache bandwidth.
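Spelled out in C, the DAXPY kernel in question is just the following (a sketch only; the real BLAS routine also carries stride arguments, omitted here):

```c
/* DAXPY: DY := DY + DA*DX.  Per element: two loads, one store, one
 * add, one multiply.  With a 32-bit cache port, the three double-word
 * references take 6 cycles and the two fp issues take 2, giving the
 * 8-cycles-per-2-flops figure derived above. */
void daxpy(int n, double da, const double *dx, double *dy) {
    for (int i = 0; i < n; i++)
        dy[i] += da * dx[i];
}
```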
Let's formalize this:
Let m be the ratio of memory reference cycles to floating point ops
for a computation. Let f1, f2, ... fn be the frequencies of the
different floating point ops (they sum to 1). Let t1, t2, ... tn be
the latencies of the different floating point ops. Let r1, r2, ... rn
be the pipeline rates of the different op units. Assume the
computation is infinitely parallelizable (the most favorable
circumstance for pipelining).
Consider a load/store architecture with one instruction issue per
cycle. This means the time to do one flop is bounded below by m+1
cycles. A given functional unit will require pipelining (ri < ti) to
run at this rate iff ti*fi > m+1. The pipelining required is a new
op every ri=(m+1)/fi cycles. Alternatively, the latency required for
no pipelining is ti=(m+1)/fi.
Example: linpack with one 32-bit access per cycle (e.g. r3000 or
88000): n=2, f1=0.5, f2=0.5, m=3. Thus r1=r2=8.
Example: linpack with one 64-bit access per cycle, or half-precision
linpack with one 32-bit access per cycle: m=1.5. So 5-cycle adds and
multiplies are sufficient (R3010 latencies work fine). Or pipelining
to start new ops every 5.
So what you need to do to refute my claim is to show important
problems with low m values and with fi almost one.
One such computation is something like livermore loops 11 and 12 in
single precision, for which n=1, f1=1, m=2, so you need a new add
every 3 cycles. But of course the R3010's 2 cycle add handles that
just fine without pipelining.
Another such computation is something like the multiple-vector
techniques for LU-decomposition. E.g. for 8-way vectors, m=0.625,
f1=0.5, f2=0.5, so you need 3.25 cycle add and multiply to run at peak
rate for sp, 4.5 cycle add and multiply for dp.
Still, I would not consider these to prove the need for pipelining
with 32b cache interfaces. Do you know a computation which does?
What are its parameters in the above terms?
.........................................................................
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
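Earl's criterion above reduces to a one-line formula that can be checked mechanically: a functional unit keeps up with the issue-rate bound without pipelining iff its latency t satisfies t <= (m+1)/f.

```c
/* Under Earl's model: m = memory-reference cycles per flop, f = the
 * op's fraction of all flops.  Returns the largest latency (cycles)
 * an unpipelined unit may have and still run at one flop per m+1
 * cycles, i.e. (m+1)/f. */
double max_unpipelined_latency(double m, double f) {
    return (m + 1.0) / f;
}
```

Plugging in the numbers quoted above: linpack with a 32-bit port (m=3, f=0.5) gives 8 cycles; the 64-bit-port case (m=1.5, f=0.5) gives 5; loops 11/12 in sp (m=2, f=1) give 3.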
mash@mips.UUCP (07/01/88)
In article <1111@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes: >In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >> a) Are there indeed 2 latency cycles (i.e., that instruction 3 > Yes, 2 on loads. 0 latency on stores. >> b) If so, what is the reason for the second latency slot? > Address calculation. The 88k provides a basic set of three addressing >modes. The most frequently used construct for referencing memory is >"register pointer + offset" (offset is rarely 0). Without providing this basic >addressing mode, the code would have to do something like: > add ptr,ptr,offset ; > ld dest,ptr,0 > sub ptr,ptr,offset ; If pointer must be preserved. >for most loads and stores. This overhead is worse than a 2 tick latency >on loads. Also note that there is 0 latency on 88k store instructions*. >Since many lds have a corresponding st instruction somewhere, the averaged 88k >latency will be < 2 ticks. We agree on the desirability of non-zero offsets 100% (although our friends down the street at AMD may not :-) For this kind of design, I can't imagine that people would do add/ld/sub, but would use a temp register: add temp,ptr,offset ; > ld dest,temp,0 Anyway, maybe I'm missing something: I still don't understand where the second latency cycle comes from in the hardware design. You compute the address and get it on the bus, and the third instruction following can actually use the data. Moto info says that the CMMU returns data in one cycle, so it almost sounds like it's costing a cycle on the CPU to make the data available (i.e., forwarding logic is not fast enough, or something). Anyway, I still don't understand what address-mode computation has to do with it, at least not from the given example [which is identical to what R2000s do]. So please say some more. >* What is the latency on a R3000 store instruction? 0. 
>>Note that our numbers say that in our machines, it would cost us >>10-15% in overall performance to go from 1 cycle latency to 2, > Assuming the addressing modes and other aspects of the machine remain >the same, this figure is in the ballpark (although a little high). >However, I think the static machine assumption is not a valid one to make. Can you say what your numbers are, i.e., why is this high? My reasoning is as follows: a) Loads are about 20% of the instructions. b) You get to schedule approx. 70% of the 1st delay slots. c) You get to schedule approx 30% of the 2nd delay slots, i.e., the penalty for a second delay slot is the 70% that you don't get to schedule: .7 * 20% = 14% Now, in this area, R2000s and 88Ks are similar enough that I think the statistics carry over, but counterarguments would be useful. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
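Mashey's estimate above is a simple product, worth writing out since it is the crux of the 1-vs-2 latency-cycle argument:

```c
/* Cost of a second load delay slot, per Mashey's reasoning: loads are
 * ~20% of instructions and ~70% of second slots go unfilled, each
 * unfilled slot wasting one cycle, so the overall penalty is about
 * load_frac * unfilled_frac of performance. */
double second_slot_penalty(double load_frac, double unfilled_frac) {
    return load_frac * unfilled_frac;
}
```

With the figures given (0.20 and 0.70) this comes to about 14%, the number quoted.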
mpaton@oakhill.UUCP (Michael Paton) (07/02/88)
In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >> b) If so, what is the reason for the second latency slot? The reason for the additional delay slot is twofold. First, the M88100 has allotted .2 clocks to the data path, .7 clocks to a 32 bit adder, and more than .25 of a clock for driving a signal off chip with .1 clock setup margins. Since we desired all signals to be synchronous, we rounded this 1.25 clocks up to 1.5 clocks. These 1.5 clocks are spent performing the addition and delivering the address to the M88200 cache. The cache has 1 clock from valid address to valid data (1.2 valid address to hit). Then the processor uses the remaining .5 clock to perform byte extraction, sign extension, delivering the result to the result bus. Therefore the memory pipeline is 3 clocks in length. The MIPS Co. designers sliced this into 2 clocks and integrated it into their basic (only) pipeline. This means that when the data cache misses or is not available (e.g. bus snooping) that their entire processor skips a beat. Since the M88100 data pipeline is decoupled from the instruction pipeline, the beat is not skipped unless a data dependency occurs. This is more beneficial in the case of outbound store traffic on write-through pages than on the nominal load traffic. In two clocks they must perform a 32+16 addition and the mmu function before driving the pins. Moreover, the multiplexed bus requires that the addresses be latched (externally on both the R2000 and the R3000). Memory must still be accessed and delivered to the result bus/register file/forwarding unit. Therefore, the time must come from somewhere, and it seems to have come out of the memory SRAM speed in the R3000 and the circuit speeds and margins. According to the data sheet (using a F373 transparent latch), the MIPS design requires 20ns SRAMs. 
Our design (assuming a glue part cache implementation: F373 latch and 3ns t_OHA* SRAM with spec sheet timing) allows 32ns SRAM parts at 40ns cycle or 42ns SRAM at 50ns. The MIPS design requires 0ns SRAM at 40 MHz, while we do not hit this brick wall until 120 MHz. Since the two SRAM arrays' data pins are connected together and to three other interfaces, they must be sampled at 2/3 of the phase period by all interested parties and then rapidly removed from the multiplexed data bus (the data is only valid for 3.5ns on the data bus). This requires very strict timing control over this vital feature. In addition, the Cypress specs I checked had t_HZOE** of 15ns maximum while the R3000 requires this in 8ns. Note that this is not a problem as the other bank of SRAMs has a minimum output enable time of 9ns while the Cypress parts (CY7C161 16K*4) did not. The MIPS processors do not snoop their bus and therefore leave memory coherence to the write-through mechanism. In multiprocessing applications, the memory bus can become saturated with a few processors on the bus (~4?). Write-back caches cause a sufficient reduction in memory bus traffic to allow twice the number of processing ensembles to utilize the bus. Therefore, the answer to the original question is: we were more conservative in allotting nanoseconds to functions, and, in particular, we attempted to beat on the SRAM technology less hard. If we are correct, this should be more scalable in the future (read ECL/GaAs) as off-chip delays approach .4 cycle. Alternatively, our design costs less to manufacture in high volume and allows less costly SRAM parts than the MIPS Co. design. This should allow us to push the clock frequency farther than MIPS Co., but not at the current mask set. Remember the recent announcement of 33 MHz M68020 and M68030: a little shrink and a lot of tuning can obtain many MHz. We have just begun to tune these M88k parts.
>>Note that our numbers say that in our machines, it would cost us >>10-15% in overall performance to go from 1 cycle latency to 2, Our numbers show that with the decoupling of the pipeline and earliest load scheduling we suffer 6-8% cycle-to-cycle performance degradation when comparing the MIPS design to the M88100 design. This is not too bad when one considers the relaxation of external RAM speed requirements and allowing the hit logic to run in arrears of the data delivery. Independent of this current discussion on the length of load pipelines, one might want to ask the folks at MIPS Co. this question: Why did you multiplex your memory bus? Consider factors related to power dissipation. The current M88100 processors are running between .25 and .5 watts @ 20Mhz. If we were to multiplex the 2 memory ports as did MIPS Co., our worst case power consumption would be 4 watts. The problem is that the AC power dissipation is given by: 2*C*V**2*F*N, where: C = load capacitance (F), V = voltage swing (V), F = frequency (Hz), and N = number of pins which make transitions. (For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz) The addresses from the instruction port are very highly correlated (about 1.4 bits per cycle change). The addresses from the data port are only partially correlated (less so with better compilers). Mixing these two streams results in almost uncorrelated address streams and therefore a bigger N, resulting in more power dissipation. Notice that the pin counts on the two packages are not that much different (144 for the R3000 vs. 180 for the MC88100) and neither are the power/grounds pin counts (30 for the R3000 vs. 36 for the MC88100). So why multiplex the memory bus? Your pin count isn't that much lower, your power dissipation suffers greatly and you tend to create difficulties in interfacing to your memory system. 
* t_OHA = time of output hold from address change. ** t_HZOE = time of output enable high to high-Z (impedance). Mitch Alsup, Manager of Architecture for the M88000. Remember: SPARC spelled backwards is .....!
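The AC power formula in the post is easy to evaluate. V = 3.8 V and F = 20 MHz are given above; any per-pin load capacitance C and toggling-pin count N plugged in below are illustrative assumptions, since the post does not state them:

```c
/* AC power dissipation per the formula in the post: P = 2*C*V^2*F*N,
 * where C = load capacitance (F), V = voltage swing (V),
 * F = frequency (Hz), N = number of pins making transitions. */
double ac_power(double c_farads, double v_volts, double f_hz, double n_pins) {
    return 2.0 * c_farads * v_volts * v_volts * f_hz * n_pins;
}
```

For instance, assuming a 50 pF load on each of 32 address pins toggling every cycle, ac_power(50e-12, 3.8, 20e6, 32) comes to roughly 0.9 W, the right order of magnitude for the multi-watt worst case discussed.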
earl@mips.COM (Earl Killian) (07/06/88)
In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
mp> The MIPS processors do not snoop their bus and therefore leave
mp> memory coherence to the write-through mechanism. In
mp> multiprocessing applications, the memory bus can become saturated
mp> with a few processors on the bus (~4?). Write-back caches cause a
mp> sufficient reduction in memory bus traffic to allow twice the
mp> number of processing ensembles to utilize the bus.
Write through to a 32b bus does indeed limit you to about 4 processors
(the max supported by the 88000). If you want more than that, use a
64b bus (~8 processors), or use a secondary cache (which has its own
benefits) and make it write-back. Both approaches have already been
implemented in R2000-based systems.
When you build a R3000-based MP, you don't have to limit the amount of
cache per processor (unlike, e.g., the 88000, which allows only 16KB
per processor in a 4-processor system). If you're building an MP,
presumably you're interested in performance, so it seems strange to
cripple each processor with a small cache.
mp> ...in particular, we attempted to beat on the SRAM technology less
mp> hard. If we are correct, this should be more scalable in the
mp> future (read ECL/GaAs) as off-chip delays approach .4 cycle.
It's hard to envision ECL output drive-times approaching 0.4 clocks;
modern ECL parts (e.g. Sony's CXB1100Q 3-NOR) can receive signals from
off-chip, do the NOR function, and drive off chip again (using 100K
levels) in 390 picoseconds. {EDN 6-23-88, p. 97} So (0.4 * Tclock) =
390ps giving Tclock = 975 picoseconds (1.03 GHz) !!! A more
"believable" clock period might be 4ns (a la Cray-2), in which case
the drive-off time is 0.1 clock.... a smaller fraction of the cycle
than you quote for your CMOS design.
mp> Alternatively, our design costs less to manufacture in high volume
mp> and allow less costly SRAM parts than the MIPS Co. design.
An R3000 may require faster SRAMs, but these are multiple-sourced,
off-the-shelf, commodity devices, and the price of 30 16Kx4 20ns SRAMs
is actually lower than that of eight 88200s (sole-sourced).
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
earl@mips.COM (Earl Killian) (07/06/88)
In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
mp> Independent of this current discussion on the length of load
mp> pipelines, one might want to ask the folks at MIPS Co. this
mp> question:
mp> Why did you multiplex your memory bus?
mp> Consider factors related to power dissipation. The current M88100
mp> processors are running between .25 and .5 watts @ 20Mhz. If we
mp> were to multiplex the 2 memory ports as did MIPS Co., our worst
mp> case power consumption would be 4 watts. The problem is that the
mp> AC power dissipation is given by:
mp> 2*C*V**2*F*N,
mp> N = number of pins which make transitions.
mp> The addresses from the instruction port are very highly correlated
mp> (about 1.4 bits per cycle change). The addresses from the data
mp> port are only partially correlated (less so with better
mp> compilers). Mixing these two streams results in almost
mp> uncorrelated address streams and therefore a bigger N, resulting
mp> in more power dissipation.
Your observation is interesting. But I hope the 88100's packaging
isn't based on an _average_case_ analysis. Malicious programmers (who
write, for example, a branch-to-branch infinite loop) can't melt the
88100, can they? So the worst case is really 4 watts, right? What
does the datasheet say?
Also the cpu subsystem power consumption isn't much reduced by your
observation, since it only applies to the address outputs on the cpu,
which is a small fraction of the total pins in the system (the I-cache
SRAM pins change every cycle even if the address doesn't change much).
Hmm, I guess the address outputs in an 88000 system are a significant
fraction of the pins, unlike the R3000, because you send the full 32b
virtual address off chip, instead of only 18b as in the R3000. That
wastes as much power as you save on average from the demultiplexing,
doesn't it?
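The correlation argument in the quoted posting can be illustrated with a toy transition count (the addresses below are my own invented examples, not measured data: sequential instruction fetches versus scattered data references):

```python
# Count address-pin toggles for separate vs. multiplexed address buses.
def transitions(stream):
    """Total pin toggles seen when `stream` is driven over one bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(stream, stream[1:]))

# Sequential instruction fetches: successive word addresses, few bits flip.
iaddrs = [0x1000 + 4 * i for i in range(8)]
# Data references: scattered, only weakly correlated addresses.
daddrs = [0x80004000, 0x2000FF00, 0x80004004, 0x2000FF08,
          0x80004008, 0x2000FF10, 0x8000400C, 0x2000FF18]

separate = transitions(iaddrs) + transitions(daddrs)
# Multiplexing interleaves the two streams onto one set of pins, so the
# bus swings between unrelated addresses every cycle: a bigger N.
multiplexed = transitions([a for pair in zip(iaddrs, daddrs) for a in pair])
print(separate, multiplexed)   # multiplexed > separate
```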
mp> Notice that the pin counts on the two packages are not that much
mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
mp> MC88100).
MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
time the difference between 180 pins and 144 pins was significant.
The i80386, designed at roughly the same time, actually has fewer
pins -- 132. The 2nd generation (R3000) could have changed the
interface, but we felt it was desirable to provide an upgrade path for
R2000 designs.
Back when RISC vs. CISC was still an issue, people complained RISC
required more bandwidth. We just provided the necessary bandwidth.
When we need more, we'll add more. If that means modifying the
R3000-style cache interface, so what?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
oconnor@nuke.steinmetz (Dennis M. O'Connor) (07/08/88)
An article by earl@mips.COM (Earl Killian) says:
] In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
] mp> [...] one might want to ask the folks at MIPS Co. this
] mp> question:
] mp> Why did you multiplex your memory bus?
] mp> Consider factors related to power dissipation. The current M88100
] mp> processors are running between .25 and .5 watts @ 20Mhz. If we
] mp> were to multiplex the 2 memory ports as did MIPS Co., our worst
] mp> case power consumption would be 4 watts. The problem is that the
] mp> AC power dissipation is given by:
] mp> 2*C*V**2*F*N,
] mp> N = number of pins which make transitions.
] mp> The addresses from the instruction port are very highly correlated
] mp> (about 1.4 bits per cycle change). The addresses from the data
] mp> port are only partially correlated (less so with better
] mp> compilers). Mixing these two streams results in almost
] mp> uncorrelated address streams and therefore a bigger N, resulting
] mp> in more power dissipation.
]
] Your observation is interesting. But I hope the 88100's packaging
] isn't based on an _average_case_ analysis. Malicious programmers (who
] write, for example, a branch-to-branch infinite loop) can't melt the
] 88100, can they? So the worst case is really 4 watts, right? What
] does the datasheet say?
Seems to me a branch-to-branch i-loop would execute entirely
out of an onboard I-Cache if there is one. But, if not, a
5V 20MHz signal driving, say, 30 pF dissipates 30mW. Multiply
by 32 and you get about a 1 Watt power increase. During this
branch-to-branch loop, the data address bus would, of course, be idle.
Nastier would be something like :
7FFFFFFF : STORE-0-TO-H55555555
80000000 : BRANCH-TO-HFFFFFFFF
. . .
FFFFFFFF : STORE-FFFFFFFF-TO-HAAAAAAAA
00000000 : BRANCH-TO-H7FFFFFFF
This sequence changes, on average, 64 output pins per
cycle: 96 during the cycles with the stores, 32 on the branches,
assuming the buses do the most conservative thing ( float ) during
their non-driven periods. This produces 3 Watts of power dissipation.
Unless, of course, there's an on-board I-Cache, in which case
only 32 pins change per cycle, average ( 64 in the store cycles, 0 in
the branch cycles ), for a dissipation of 1 Watt.
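For what it's worth, the per-pin figure used above is consistent with the 2*C*V**2*F*N formula quoted from the earlier posting (a sketch assuming the same 30 pF load, 5 V swing, and 20 MHz rate):

```python
# Cross-check the 30 mW/pin number against P = 2 * C * V**2 * F * N.
C, V, F = 30e-12, 5.0, 20e6    # 30 pF per pin, 5 V swing, 20 MHz

def pin_power(n_pins):
    """AC dissipation for n_pins each making a transition every cycle."""
    return 2 * C * V**2 * F * n_pins

print(pin_power(1))    # ~0.03 W  -> the 30 mW per pin used above
print(pin_power(32))   # ~0.96 W  -> the "about 1 Watt" for a 32-bit bus
```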
] Also the cpu subsystem power consumption isn't much reduced by your
] observation, since it only applies to the address outputs on the cpu,
] which is a small fraction of the total pins in the system (the I-cache
] SRAM pins change every cycle even if the address doesn't change much).
How to handle an extra Watt of power dissipation on a chip is a
more difficult question than how to handle it in a board or
hybrid, or how to add it to your power supply.
] Hmm, I guess the address outputs in an 88000 system are a significant
] fraction of the pins, unlike the R3000, because you send the full 32b
] virtual address off chip, instead of only 18b as in the R3000. That
] wastes as much power as you save on average from the demultiplexing,
] doesn't it?
A 16-bit processor would consume less power than a 32-bit processor,
all else being equal. People build the 32-bit ones anyway. But I
digress. So, what can you address with 18 bits? 256K words? Not
enough. I'd be kinda surprised if MIPS's address bus were only 18 bits,
unless they have some kind of clever external mapping scheme.
] mp> Notice that the pin counts on the two packages are not that much
] mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
] mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
] mp> MC88100).
] MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
] time the difference between 180 pins and 144 pins was significant.
] The i80386, designed at roughly the same time, actually has fewer
] pins -- 132. The 2nd generation (R3000) could have changed the
] interface, but we felt it was desirable to provide an upgrade path for
] R2000 designs.
Big-pin-count packages are a necessary evil. They cost more, take
up more real estate, have lower packaging yields, are more difficult
to reliably attach to boards, have more problems from thermal-
expansion coefficients, et cetera. PGAs can't be surface mounted,
which means you can't put them on both sides of the board, and
multi-layer routing gets painful. Fine-pitch pins have their
own problems.
The RPM40 uses a 132 pin LCC. It only outputs new instruction
addresses on branches, traps, et cetera. It uses the same pins
for instruction addresses as for operand addresses, even though
the data buses for each are separate. The instruction memory
system is therefore necessarily a pipelined look-ahead system,
much like those used on vector processors for vector fetches
( after all, instruction memory is essentially a set of vectors,
one vector per basic block ).
We felt we had to go this route, no choice : at the time ( late '85 )
the 132-pin LCC was the biggest thing we knew of that was fully
qualified at 40MHz, and trying to multiplex the instruction and
operand addresses at 80MHz was considered beyond CMOS capability.
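The RPM40-style fetch scheme described above can be modeled in a few lines (a hypothetical sketch; the class and method names are my own, purely to illustrate that sequential fetches need no instruction-address traffic on the shared pins):

```python
# Toy model: the CPU drives the shared address pins only on branches or
# traps; between branches, the external instruction-memory logic streams
# sequential words on its own, with no address-pin activity.
class LookaheadIMem:
    def __init__(self, memory):
        self.memory = memory
        self.pc = 0

    def redirect(self, target):
        """Branch/trap: the only time an instruction address crosses the pins."""
        self.pc = target

    def next_word(self):
        """Sequential fetch: external counter increments, address pins idle."""
        word = self.memory[self.pc]
        self.pc += 1
        return word

imem = LookaheadIMem(memory=list(range(100)))
straight_line = [imem.next_word() for _ in range(3)]   # no address traffic
imem.redirect(40)                                      # one address transfer
print(straight_line, imem.next_word())                 # [0, 1, 2] 40
```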
] Back when RISC vs. CISC was still an issue, people complained RISC
] required more bandwidth. We just provided the necessary bandwidth.
] When we need more, we'll add more. If that means modifying the
] R3000-style cache interface, so what?
] --
] UUCP: {ames,decwrl,prls,pyramid}!mips!earl
] USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
"RISC vs. CISC" is, I think, not so much an absolute issue
as an issue of which fits a particular technology better.
When memory speeds are up there with adder speeds, RISC wins.
When adders are much faster than memory, perhaps CISC does.
Different horses for different courses, I think.
Woops, gotta go.
--
Dennis O'Connor oconnor%sungod@steinmetz.UUCP ARPA: OCONNORDM@ge-crd.arpa
"Never confuse USENET with something that matters, like PIZZA."
hankd@pur-ee.UUCP (Hank Dietz) (07/09/88)
A while back, my brother and I sent the following abstract to ICCD88... we never heard anything as to how they liked it, but it seems a very reasonable thing to post here, given the recent discussion of power dissipation....

	Limiting Switching Transients in High-Speed Digital Processors

	Henry G. Dietz			Paul H. Dietz
	hankd@ee.ecn.purdue.edu		phd@speech1.cs.cmu.edu

Abstract

Pin counts on CMOS VLSI processors are currently very high and will probably continue to grow. This causes a variety of problems, not least of which is the possibility of encountering unacceptable switching transients when many output pins change state simultaneously. These transients can drastically reduce the noise immunity of internal gates, severely limiting performance. To limit the number of output pins simultaneously changing state, we propose to directly manage output requests on the basis of predictions of the switching transients implied in each output request. Each chip would be designed assuming a well-specified parameter, the Maximum Allowable Switching Transient (MAST), and an output request which could exceed the MAST would be serialized so that the MAST is not exceeded. This direct control of switching transients can be implemented in either a hardware-intensive or software-intensive style. The overall effect is that a processor chip may incorporate many pins, yet need not be designed to survive the worst case of all output pins attempting to change state simultaneously.

1. Background

There are a number of techniques that have been used to limit switching transients. These can be grouped into two major categories: reduction in the number of output pins that are active at any one time, or reduction of the observed transient itself. The number of output pins can be reduced by transmitting data serially or by time-multiplexing data buses to serve multiple functions. Alternatively, output times for various signals can be slightly skewed so that the outputs are not set simultaneously.
Unfortunately, the quest for higher operating speeds often precludes the obvious application of these techniques. To reduce the switching transient generated per output pin, some manufacturers have devoted large die areas to decoupling capacitors, but this is not practical for designs which are already pushing die-size constraints. Other manufacturers use off-chip capacitors mounted in the same package as the die, which can provide much larger decoupling capacitances; however, the series inductance inherent in going off-chip is greater, limiting the effectiveness. Another approach, perhaps more generally applicable, is to maintain separate power buses for output buffers and internal state logic [Car88]. Also, by careful design of the output buffer [GaT88], one can make buffer power consumption more consistent, hence reducing the worst-case values and achieving significant improvement. It is reasonable to assume that next-generation devices will incorporate some combination of these methods. Yet all of these techniques require that the chip be designed for the worst case; additional performance gains can be made by restricting simultaneous output operations only when the MAST otherwise would be exceeded.

2. Approach

There are two difficulties in directly controlling output state transitions based on potential MAST violation. The first problem is how to detect or predict when a MAST violation may occur; this may be done by placing the main burden either on hardware (detection) or on software (prediction). The second problem is, given that a particular output request would exceed the MAST if done simultaneously, how can the hardware arrange to perform the output pin state transitions without exceeding the MAST? We will discuss this second problem first.

2.1. Output Serialization

Given that a particular logically-simultaneous output operation would exceed the MAST, hardware must intervene to insure that the limit is observed.
This can be done by permitting only a fraction of the output pins to change simultaneously in one cycle and performing the rest of the output on successive cycles which are inserted just for that purpose. We say that such an output operation has been serialized. Although output serialization requires only relatively simple hardware, some care must be taken; for example, strobe/ready bits must change state only once the corresponding data bits are in the correct state. When the MAST is not exceeded, the requested output is performed in a single cycle. (In this case, the additional circuitry has no effect.) This is an efficient technique because, for example, localities in instruction address space often correspond to minor bit changes in the address outputs.[1]

In some cases, an optimizing compiler/linker/loader can significantly enhance this kind of locality; these code transformations are discussed in detail in the full version of the paper. A simple example of the type of optimization possible is to generate code so that jump and call targets (labels and function/procedure entries) are placed at addresses which differ from the invoking instruction's address in only a few bits (more precisely, causing changes in fewer than MAST bits). Another example is that a loop whose code would normally span a high-order memory address bit change could be moved to a portion of the address map where fewer address bits change. Even data-related outputs sometimes can be transformed to minimize simultaneous bit changes, either by careful layout of data structures or by recognition of properties of the operations being performed.

2.2. MAST Violation Detection/Prediction

As discussed in section 2.1, compiler technology (e.g., flow and other static analyses [AhS86] [DiC88]) can be used to predict, and hence to alleviate by code motion, etc., possible violations of the MAST.
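The output serialization of section 2.1 might be sketched as follows (a minimal sketch; the function names and the MAST value of 8 are my own assumptions, not from the paper):

```python
# If a requested output would switch more pins than the MAST allows,
# flip the pins in groups over successive inserted cycles instead.
def serialize_output(current, requested, mast=8):
    """Return the sequence of bus states driven, each differing from
    the previous state by at most `mast` bits."""
    changed = current ^ requested
    states = []
    while changed:
        group = 0
        for _ in range(mast):
            if not changed:
                break
            lsb = changed & -changed   # lowest pin still to be flipped
            group |= lsb
            changed ^= lsb
        current ^= group               # flip at most `mast` pins this cycle
        states.append(current)
    return states

# A full 32-bit swing is spread over four inserted cycles at MAST = 8;
# a small change (a sequential address step) completes in one cycle.
print([hex(s) for s in serialize_output(0x00000000, 0xFFFFFFFF)])
print([hex(s) for s in serialize_output(0x1000, 0x1004)])   # ['0x1004']
```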
This same compiler technology can be used to predict when the MAST will be violated and to directly encode that information in the instructions it generates; hardware would simply serialize any operation which the compiler tagged as suspect.[2] Of course, the compiler must conservatively assume that any operation which it can't prove is less than the MAST is actually greater than the MAST. This isn't always true: some output changes are unknown until runtime, and the compiler must assume that all of these change.

The more hardware-intensive alternative is to simply use a circuit to detect, at runtime, when a proposed output would actually exceed the MAST, and to invoke serialization only then. In the full-length paper, several techniques are presented for constructing such a circuit. Compared to software prediction, hardware detection insures that all outputs that can be done in a single cycle are so accomplished, whereas compiler tagging may cause some to be unnecessarily serialized. The trade-off is that the hardware is fairly complex and that the compiler cannot know precisely how long each instruction will take to execute (which reduces the effectiveness of many conventional compiler optimizations).

3. Conclusion

Using either the software-intensive or the hardware-intensive technique proposed, the concept of directly managing output pin state changes can provide substantial performance increases with only minor impact on the processor design. Typically, a circuit using these techniques will be running at or near its MAST, thus making the best possible use of the available bandwidth.

_________________________
[1] Although it might not be practical, use of Gray-coded rather than 2's-complement integers to represent addresses would insure that sequential addresses differ by only a single bit.
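Footnote [1]'s observation is easy to demonstrate (a sketch; `gray` is the standard reflected-binary-code conversion, not code from the paper):

```python
def gray(n):
    """Convert a binary integer to its reflected Gray code."""
    return n ^ (n >> 1)

def bits_changed(a, b):
    """Number of output pins that would switch between two addresses."""
    return bin(a ^ b).count("1")

# Worst case for binary counting: 0x7FFFFFFF -> 0x80000000 flips all 32 bits.
print(bits_changed(0x7FFFFFFF, 0x80000000))               # 32
# The same sequential step in Gray code flips exactly one bit.
print(bits_changed(gray(0x7FFFFFFF), gray(0x80000000)))   # 1
```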
[2] For those who would rather not place such faith in the compiler, a simple circuit can detect a glitch on the power bus, thereby detecting an instruction which the compiler failed to tag but which exceeds the MAST. The circuit would simply initiate a cold-start.

References

[AhS86] Aho, A. V., Sethi, R., and Ullman, J. D., Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Massachusetts, 1986.
[Car88] Carley, L. R., personal communication, Jan. 31, 1988.
[DiC88] Dietz, H. G. and Chi, C-H., "A Compiler-Writer's View of GaAs Computer System Design," IEEE Proc. of HICSS-21, pp. 256-265, Jan. 1988.
[GaT88] Gabara, T. and Thompson, D., "Ground Bounce Control in CMOS Integrated Circuits," to appear in IEEE Proc. of International Solid-State Circuits Conference, 1988.
andrew@frip.gwd.tek.com (Andrew Klossner) (07/12/88)
Earl Killian wrote: "When you build a R3000-based MP, you don't have to limit the amount of cache per processor (unlike, e.g., the 88000, which allows only 16KB per processor in a 4-processor system)."
Presumably Earl meant "data cache" where he wrote "cache." Just to keep the record straight: in a 4-processor 88k system on a single M-bus (the bus implemented by the CMMU pinouts), each processor has a 16KB data cache and a 16KB instruction cache, for a total of 32KB of cache. If you want more than that, you need more than one M-bus. Some system designers are doing this.
-=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
(andrew%tekecs.tek.com@relay.cs.net) [ARPA]
Earl Killian wrote: "When you build a R3000-based MP, you don't have to limit the amount of cache per processor (unlike, e.g., the 88000, which allows only 16KB per processor in a 4-processor system)." Presumably Earl meant "data cache" where he wrote "cache." Just to keep the record straight: in a 4-processor 88k system on a single M-bus (the bus implemented by the CMMU pinouts), each processor has 16KB data cache and 16KB instruction cache, for a total of 32KB cache. If you want more than that, you need more than one M-bus. Some system designers are doing this. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]