[comp.arch] Maximum MIPS for a given memory bandwidth?

mangler@cit-vax.Caltech.Edu (Don Speck) (06/13/88)

A while ago, Rick Richardson was looking for a microprocessor
that could squeeze 4000 Dhrystones out of a 4 MHz 16-bit bus.

Is this even possible?	That's only 3 MB/s of bandwidth per MIPS,
barely enough to fetch instructions.  Even the MC68010, which was
designed for slow memory, needs more like 7 MB/s per MIPS.

What's the lowest memory/cache bandwidth requirement per MIPS that
has been attained?  I.e. please add some numbers to this table:

  Processor	    avg read	bus	bandwidth	 MB/s:MIPS
		     latency   width   at the CPU   MIPS   ratio
SUN2 (68010)	      400ns	 16	  5 MB/s     0.7     7
Microvax II	      400ns	 32	 10 MB/s     0.9    11
VAX-11/750	     ~440ns	 32	  9 MB/s     0.6    15
VAX-11/780	     ~440ns	 32	 12 MB/s     1.0    12
PDP-11/55	      300ns	 16	  7 MB/s      1?     7?
88000		       45ns?	 32	185 MB/s?    17     11?
Cray-1S 	      137ns	 64	640 MB/s     20?    32?
16 MHz MIPSco	       ?	 32	120 MB/s?    10     13?
70 MHz WM	       ?	 32    3000 MB/s?   100?    30?
Acorn RISC Machine		 32			     ?

(I'm less sure about numbers appearing later in the table).

I'm wondering if there is some formula for the maximum number of
MIPS that can be extracted from a memory system, based on its
bandwidth, bus size, and latency, i.e. "with that memory/cache
system you can't get more than N mips"?  With a large enough table
of the above type, perhaps one could derive some rules of thumb in
this direction?
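
To make concrete the kind of rule I'm fishing for, here is a trivial
sketch (in C, with made-up per-instruction byte counts -- they depend
entirely on the instruction set and the benchmark) of the bound I have
in mind:

/* If an "average" instruction moves about Bi bytes of instruction
   fetch and Bd bytes of data across the memory interface, the memory
   system caps throughput at roughly bandwidth / (Bi + Bd). */
#include <stdio.h>

int main(void)
{
    double bw_mbps = 8.0;  /* 4 MHz x 16 bits = 8 MB/s peak, Rick's bus */
    double bi = 2.0;       /* assumed instruction-fetch bytes per instruction */
    double bd = 1.0;       /* assumed data bytes per instruction */

    printf("max ~%.1f native MIPS\n", bw_mbps / (bi + bd));
    return 0;
}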

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

jesup@cbmvax.UUCP (Randell Jesup) (06/13/88)

In article <6921@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>  Processor	    avg read	bus	bandwidth	 MB/s:MIPS
>		     latency   width   at the CPU   MIPS   ratio
>SUN2 (68010)	      400ns	 16	  5 MB/s     0.7     7
>Microvax II	      400ns	 32	 10 MB/s     0.9    11
>VAX-11/750	     ~440ns	 32	  9 MB/s     0.6    15
>VAX-11/780	     ~440ns	 32	 12 MB/s     1.0    12
>PDP-11/55	      300ns	 16	  7 MB/s      1?     7?
>88000		       45ns?	 32	185 MB/s?    17     11?
>Cray-1S 	      137ns	 64	640 MB/s     20?    32?
>70 MHz WM	       ?	 32    3000 MB/s?   100?    30?
>16 MHz MIPSco	       ?	 32	120 MB/s?    10     13?
Try these for the MIPSco (john?)	32+32	128?	     12?   ~10?

40 Mhz Rpm-40	      100ns	32+16   120 MB/s     33     ~4
			       data inst	   native

	If one is talking Vax Mips (which from the original msg we aren't):

40 Mhz Rpm-40	      100ns	32+16   120 MB/s    14-16  ~8-9
			       data inst	     vax

Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

hankd@pur-ee.UUCP (Hank Dietz) (06/13/88)

In article <6921@cit-vax.Caltech.Edu>, mangler@cit-vax.Caltech.Edu (Don Speck) writes:
> A while ago, Rick Richardson was looking for a microprocessor
> that could squeeze 4000 Dhrystones out of a 4 MHz 16-bit bus.
...
>   Processor	    avg read	bus	bandwidth	 MB/s:MIPS
> 		     latency   width   at the CPU   MIPS   ratio
> SUN2 (68010)	      400ns	 16	  5 MB/s     0.7     7
> Microvax II	      400ns	 32	 10 MB/s     0.9    11
> VAX-11/750	     ~440ns	 32	  9 MB/s     0.6    15
> VAX-11/780	     ~440ns	 32	 12 MB/s     1.0    12
...
> I'm wondering if there is some formula for the maximum number of
> MIPS that can be extracted from a memory system, based on its
> bandwidth, bus size, and latency, i.e. "with that memory/cache
> system you can't get more than N mips"?  With a large enough table
> of the above type, perhaps one could derive some rules of thumb in
> this direction?

Well, obviously there is such a formula using your definition of
bandwidth... in fact, you effectively used the formula above.  The major
source of inconsistency is in what constitutes a MIP.  Consider:

1. The average number of bits of memory referenced per instruction executed
   (hence also per MIP) depends on the instruction set and its encoding.
   The lower bound is 0 (i.e., processor crunching a microcoded instruction
   within its own registers) and the maximum is large-but-finite.

2. Your "bandwidth at the CPU" measure simply makes the use of CPU-internal
   registers/cache/instruction-decode logic and the operand precision of the
   machine all important.

For example, if we assume that, on average, a 32-bit operand will be
loaded/stored from CPU-external memory every 4 instructions and there are
8 bits per instruction, we would find that we need 2 MB/s (16 Mbit/s) for
one MIP, giving a ratio of 2:1 in your terminology.  Once you've picked your
benchmark (presumably Dhrystones) and set the precision of the operands,
you're measuring how space-efficiently instructions are encoded and how well
the CPU-internal memory system works -- not really all that interesting,
because the choice of what to call CPU-internal and what to call
CPU-external is completely arbitrary.
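
Spelled out, for this hypothetical machine, the arithmetic is just:

/* 8-bit instructions, one 32-bit operand touched every 4th
   instruction, running at one million instructions per second. */
#include <stdio.h>

int main(void)
{
    double instr_bytes = 1.0;       /* 8 bits per instruction */
    double data_bytes  = 4.0 / 4.0; /* 32-bit operand every 4 instructions */
    double mips        = 1.0;       /* one MIP */
    double mbps = (instr_bytes + data_bytes) * mips;

    printf("%.0f MB/s (%.0f Mbit/s) per MIP\n", mbps, mbps * 8.0);
    return 0;
}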

If you break down the bandwidth measure into bandwidths of the component
parts (i.e., on-chip registers, cache, etc.), then you might get some
interesting results...?

					-hankd

tim@amdcad.AMD.COM (Tim Olson) (06/14/88)

In article <6921@cit-vax.Caltech.Edu>, mangler@cit-vax.Caltech.Edu (Don Speck) writes:

| A while ago, Rick Richardson was looking for a microprocessor
| that could squeeze 4000 Dhrystones out of a 4 MHz 16-bit bus.
| 
| Is this even possible?	That's only 3 MB/s of bandwidth per MIPS,
| barely enough to fetch instructions.  Even the MC68010, which was
| designed for slow memory, needs more like 7 MB/s per MIPS.

The Am29000 gets 35600 Dhrystones (1.1) at 25MHz with two-cycle first
access, single-cycle burst caches.  This is well over the 1000
Dhrystones / MHz that Rick requested (although with a 32-bit bus instead
of 16 bits).  At 4 MHz, memory access times would be 250ns, easily
within DRAM range. 
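
The arithmetic behind that, for reference:

#include <stdio.h>

int main(void)
{
    printf("%.0f Dhrystones/MHz\n", 35600.0 / 25.0);      /* ~1424, over the 1000 asked for */
    printf("%.0f ns per cycle at 4 MHz\n", 1000.0 / 4.0); /* 250 ns, comfortably DRAM speed */
    return 0;
}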


| What's the lowest memory/cache bandwidth requirement per MIPS that
| has been attained?  I.e. please add some numbers to this table:
| 
|   Processor	    avg read	bus	bandwidth	 MB/s:MIPS
| 		     latency   width   at the CPU   MIPS   ratio
| SUN2 (68010)	      400ns	 16	  5 MB/s     0.7     7
| Microvax II	      400ns	 32	 10 MB/s     0.9    11
| VAX-11/750	     ~440ns	 32	  9 MB/s     0.6    15
| VAX-11/780	     ~440ns	 32	 12 MB/s     1.0    12
| PDP-11/55	      300ns	 16	  7 MB/s      1?     7?
| 88000		       45ns?	 32	185 MB/s?    17     11?
| Cray-1S 	      137ns	 64	640 MB/s     20?    32?
| 16 MHz MIPSco	       ?	 32	120 MB/s?    10     13?
| 70 MHz WM	       ?	 32    3000 MB/s?   100?    30?
| Acorn RISC Machine		 32			     ?

These numbers were derived from the Am29000 Architectural Simulator
running Dhrystone 1.1 (since the original question pertained to
Dhrystones).  The bandwidth requirements are actual, not theoretical
peak.  Since you didn't specify whether MIPS were native or VAX-MIPS, I
have calculated both:


			Am29000 (Video DRAM)
			--------------------
Ave Read Latency	160 ns (load/store/jump)
			 40 ns (instruction burst)

Bus Width		32 bits (* 2 -- separate instruction & data buses)


Bandwidth at CPU	44.7 MB/s instruction,
			11.5 MB/s data.

MIPS			12.7 Native, 15.2 VAX MIPS

MB/s/MIPS ratio:	3.70 (VAX)
			4.43 (Native)


			Am29000 (Caches)
			----------------
Ave Read Latency	 80 ns (load/store/jump)
			 40 ns (instruction burst)

Bus Width		32 bits (* 2 -- separate instruction & data buses)


Bandwidth at CPU	62.2 MB/s instruction,
			15.8 MB/s data.

MIPS			17.4 Native, 22.3 VAX MIPS

MB/s/MIPS ratio:	3.50 (VAX)
			4.48 (Native)
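
If anyone wants to re-derive the ratios from the figures above, it is
just total average bandwidth divided by MIPS:

/* Re-deriving the MB/s:MIPS ratios from the figures above. */
#include <stdio.h>

static void ratio(const char *cfg, double ibw, double dbw,
                  double native, double vax)
{
    double total = ibw + dbw;  /* total average bandwidth at the CPU, MB/s */
    printf("%-11s %.2f (VAX)  %.2f (Native)\n",
           cfg, total / vax, total / native);
}

int main(void)
{
    ratio("Video DRAM:", 44.7, 11.5, 12.7, 15.2);  /* 3.70, 4.43 */
    ratio("Caches:",     62.2, 15.8, 17.4, 22.3);  /* 3.50, 4.48 */
    return 0;
}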


	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

daver@daver.UUCP (Dave Rand) (06/14/88)

In article <22050@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
>
>			Am29000 (Video DRAM)
>MIPS			12.7 Native, 15.2 VAX MIPS
>			Am29000 (Caches)
>MIPS			17.4 Native, 22.3 VAX MIPS

I am confused. How can a risc machine have a higher "vax mips" than
native mips? MORE (not less) risc instructions are required to
do the same task, when compared to a vax.

If you are saying that the 29000 is 22.3 TIMES FASTER than a vax, then
say that - what you have said is not reasonable. I cannot believe that
you can execute a vax instruction in 78% of the time of a native
instruction (17.4/22.3).

This implies that, if you can execute a native instruction in 1 clock, that
you can execute a vax instruction (memory-to-memory add, for example), in
0.78 clocks!

-- 
Dave Rand
{pyramid|hoptoad|nsc|vsi1}!daver!daver

george@wombat.UUCP (George Scolaro) (06/14/88)

In article <22050@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
>			Am29000 (Video DRAM)
>Bandwidth at CPU	44.7 MB/s instruction,
>			11.5 MB/s data.
>MIPS			12.7 Native, 15.2 VAX MIPS

>			Am29000 (Caches)
>Bandwidth at CPU	62.2 MB/s instruction,
>			15.8 MB/s data.
>
>MIPS			17.4 Native, 22.3 VAX MIPS

The implication is that the 29000 only requires the above bandwidth to
achieve the MIPS indicated.
The 29000 has a 100 Mbyte/second bus bandwidth. I am sure if you limited the
bus bandwidth to the above stated figures the MIPS would decrease quite a
bit. Sure, the average bandwidth required is as stated, but to achieve the
high MIPS you still need the peak 100 Mbytes/second, otherwise wait states
wouldn't hurt the 29000 performance, right?

-- george scolaro.
{pyramid|hoptoad|nsc|vsi1}!daver!wombat!george

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (06/14/88)

An article by jesup@cbmvax.UUCP (Randell Jesup) says:
] In article <...>mangler@cit-vax.Caltech.Edu (Don Speck) writes:
] >  Processor	    avg read	bus	bandwidth	 MB/s:MIPS
] >		     latency   width   at the CPU   MIPS   ratio
] 40 Mhz Rpm-40	      100ns	32+16   120 MB/s     33     ~4
] 			       data inst	   native
] 
] 	If one is talking Vax Mips (which from the original msg we aren't):
] 40 Mhz Rpm-40	      100ns	32+16   120 MB/s    14-16  ~8-9
] 			       data inst	     vax
] 
] Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

How soon they forget :-). Correct figures for bandwidth of the RPM40
are 160 MBytes/sec of data, and 80 MBytes/sec of instructions, for
a total of 240 MB/s of AVAILABLE bandwidth. Unless Randell is
quoting AVERAGE figures, but those would depend on the instruction
mix (i.e. the application).

Hi Randell : got your mail, mail to you bounced. I'll try again.
--
 Dennis O'Connor   oconnor%sungod@steinmetz.UUCP  ARPA: OCONNORDM@ge-crd.arpa
    "Never confuse USENET with something that matters, like PIZZA."

tim@amdcad.AMD.COM (Tim Olson) (06/14/88)

In article <291@wombat.UUCP> george@wombat.UUCP (George Scolaro) writes:
| The implication is that the 29000 only requires the above bandwidth to
| achieve the MIPS indicated.

Sorry -- I didn't mean to imply that.  Obviously if you want to execute
at close to an instruction per cycle, you must be able to supply that
peak rate at the pins.  However, I think that average bandwidth
requirements are much more interesting -- it tells more about the cost
and complexity of a memory design than the peak rating, and seemed to be
more in line with what the original poster was asking.

| The 29000 has a 100 Mbyte/second bus bandwidth.

Actually, it is 100MB/s for instruction and 100MB/s for data, although
the "sustained" peak is more like 170MB/s (running a series of loads or
stores). 

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

george@wombat.UUCP (George Scolaro) (06/15/88)

In article <22063@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>In article <291@wombat.UUCP> george@wombat.UUCP (George Scolaro) writes:
>| The implication is that the 29000 only requires the above bandwidth to
>| achieve the MIPS indicated.
>
>Sorry -- I didn't mean to imply that.  Obviously if you want to execute
>at close to an instruction per cycle, you must be able to supply that
>peak rate at the pins.  However, I think that average bandwidth
>requirements are much more interesting -- it tells more about the cost
>and complexity of a memory design than the peak rating,

Does average bandwidth tell more about the memory design? Output from
the AMD29000 simulator (V4.21 PC) indicates that with 0 wait states
the device attains 20.71 MIPS. With 1 wait state on every memory access
the device attains 14.08 MIPS (Quote from Byte May 88). So, just 1 wait
state impacts the performance quite dramatically. For the 29000 to achieve
maximum performance the memory must support burst mode and as near to
zero wait states as possible. Thus even though the average bandwidth
requirement that was quoted was around 70 Mbytes/second, one wait state,
which reduces the bandwidth to 100 Mbytes/second, has a major impact on
the Dhrystone benchmark. Of course adding cache changes the memory speed
requirements, but then cache is just high-speed memory hidden behind a
high-tech name.
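
Reading those simulator numbers another way (my arithmetic, and I'm
assuming the quoted MIPS are for a 25 MHz part), one wait state on every
access costs roughly half a cycle per instruction on average:

#include <stdio.h>

int main(void)
{
    double mhz  = 25.0;          /* assumed clock rate */
    double cpi0 = mhz / 20.71;   /* zero wait states */
    double cpi1 = mhz / 14.08;   /* one wait state on every access */

    printf("CPI %.2f -> %.2f  (+%.2f cycles/instruction)\n",
           cpi0, cpi1, cpi1 - cpi0);
    return 0;
}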

>Actually, it is 100MB/s for instruction and 100MB/s for data, although
>the "sustained" peak is more like 170MB/s (running a series of loads or
>stores). 

Yeah, neat. I like the support for burst mode on both the instruction
and data paths. Also noted is support for burst write on the data path.

george scolaro.
{pyramid|hoptoad|nsc|vsi1}!daver!wombat!george

mangler@cit-vax.Caltech.Edu (Don Speck) (06/15/88)

In article <22063@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
>				   I think that average bandwidth
> requirements are much more interesting -- it tells more about the cost
> and complexity of a memory design than the peak rating, and seemed to be
> more in line with what the original poster was asking.

Average bandwidth requirements are the interesting thing for shared-memory
multiprocessors, but I was asking about uniprocessors, where all of the
bandwidth is dedicated to one processor and costs the same to provide
whether the processor uses all of it or not.

I consider caches to be part of the memory system, i.e. part of the
von Neumann bottleneck.

Instead of using the ambiguous term "MIPS", I should have said "number
of times the speed of a VAX/780".  Unfortunately it wouldn't fit in the
column headings.  Dhrystones would have been less ambiguous.  I didn't
expect enough accuracy that it would make much difference.

So the table is amended as follows:

  Processor	    avg read	bus	bandwidth   VAX  MB/s:MIPS
		     latency   width	available  "MIPS"  ratio
25 MHz 88000	       45ns?   32+32	185 MB/s?    17     11?
16 MHz MIPSco		?      32+32	120 MB/s?    10?    13?
40 MHz RPM40	      100ns    32+16	240 MB/s     15     16
25 MHz AMD 29000       80ns    32+32	170 MB/s     22      8

The AMD 29000 is remarkably bandwidth-efficient, despite using
(on average) less than half of the memory cycles available.
(Maybe this points out the efficacy of their optimizer).
How much would the 29000 slow down if it had only one 32-bit
path to a combined instruction+data cache, i.e. half as much
peak memory bandwidth available?

I had assumed that efficient use of bandwidth would require a
narrow path to memory (with bit-addressable bit-serial being
the most efficient).  Perhaps this is not necessary.

I still suspect that there's some lower bound on the number of
bytes exchanged with cache/memory to perform the work of a
"mythical" instruction.

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

tpmsph@ecsvax.UUCP (Thomas P. Morris) (06/16/88)

As many or all of the readers of this group are well aware, what most
literature refers to as "MIPS" or "VAX MIPS" is not really a "MIP",
per se. Then the more enlightened literature points out that the 
comparison is really to a mythical 1.0 "MIP" VAX 11/780. Why don't
we just refer to "VUPs" (VAX Units of Performance), the term DEC coined
and uses in their own literature?  At least they are making
a nod at the apparent fact that a 780 is not `really' a 1 MIP machine...

-- 
-----------------------------------------------------------------------------
Tom Morris                                 BITNET: TOM@UNCSPHVX
UNC School of Public Health                UUCP  : ...!mcnc!ecsvax!tpmsph
-----------------------------------------------------------------------------

hankd@pur-ee.UUCP (Hank Dietz) (06/16/88)

In article <6955@cit-vax.Caltech.Edu>, mangler@cit-vax.Caltech.Edu (Don Speck) writes:
...
> I consider caches to be part of the memory system, i.e. part of the
> von Neumann bottleneck.
...
> Instead of using the ambiguous term "MIPS", I should have said "number
> of times the speed of a VAX/780".  Unfortunately it wouldn't fit in the
> column headings.  Dhrystones would have been less ambiguous.  I didn't
> expect enough accuracy that it would make much difference.
...
> I still suspect that there's some lower bound on the number of
> bytes exchanged with cache/memory to perform the work of a
> "mythical" instruction.

What the *!%# are you measuring?

1. What are "MIPS"?
   (a) Is it millions of instructions executed per second or is it
       relative speed (VAX 780 = 1 MIP)?
   (b) Is it a peak rating or an average for some code?
   (c) If for a code, what code, with what precision requirements (e.g.,
       is it fair to compare 16-bit to 32-bit operations?), and is it the
       hand-generated best code or are we benchmarking compilers?

2. Bandwidth of what?
   (a) Do you measure bandwidth at:  Main memory?  Main memory with some
       VM paging overhead?  Caches?  On-chip (i.e., CPU internal) caches
       and registers?
   (b) Peak or average?
   (c) Any concept of shared access?  I.e., are you considering that I/O
       or other processors (e.g., an FPU) might share access to "memory"?
       If so, does their bandwidth count?

I gave you the (rather trivial) formula for determining the ratio in my last
posting...  let me just repeat that the minimum bandwidth is essentially
ZERO.  This would be achieved by a machine which had a single (probably
microcoded) instruction to, for example, perform the Dhrystone benchmark
using values kept in registers.  With no memory references (we don't count
program loading, right?), the number of MIPS has nothing to do with the
bandwidth.

SO, what is MY point?  The only way to get numbers to compare is to get
numbers relating comparable things...  ALL numbers should be as completely
broken-down as possible (e.g., list register, cache, main mem. bandwidth
separately) and fully labelled.

						-hankd

jesup@cbmvax.UUCP (Randell Jesup) (06/16/88)

In article <11234@steinmetz.ge.com> oconnor%sungod@steinmetz.UUCP writes:
>An article by jesup@cbmvax.UUCP (Randell Jesup) says:
>] >  Processor	    avg read	bus	bandwidth	 MB/s:MIPS
>] >		     latency   width   at the CPU   MIPS   ratio
>] 40 Mhz Rpm-40	      100ns	32+16   120 MB/s     33     ~4
>] 			       data inst	   native
>] 
>] 	If one is talking Vax Mips (which from the original msg we aren't):
>] 40 Mhz Rpm-40	      100ns	32+16   120 MB/s    14-16  ~8-9
>] 			       data inst	     vax

>How soon they forget :-). Correct figures for bandwidth of the RPM40
>are 160 MBytes/sec of data, and 80 MBytes/sec of instructions, for
>a total of 240 MB/s of AVAILABLE bandwidth.

	Oops.  So I can't count.  Corrected figures:

 40 Mhz Rpm-40	      100ns	32+16   240 MB/s     33     ~8
 			       data inst	   native
 
 	If one is talking Vax Mips (which from the original msg we aren't):
 40 Mhz Rpm-40	      100ns	32+16   240 MB/s    14-16  ~15
 			       data inst	     vax

Actually, the real numbers we want are for what it does with a fast cache, and
a slow memory bus beyond that.  Then measure the Mips/main memory bandwidth.
Of course, then we get into cache sizing problems...

>Hi Randell : got your mail, mail to you bounced. I'll try again.
> Dennis O'Connor   oconnor%sungod@steinmetz.UUCP  ARPA: OCONNORDM@ge-crd.arpa

Dennis:  Try uunet!cbmvax!jesup (or steinmetz!uunet!cbmvax!jesup)

Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (06/16/88)

>In article <11234@steinmetz.ge.com> oconnor%sungod@steinmetz.UUCP writes:
>>How soon they forget :-). Correct figures for bandwidth of the RPM40
>>are 160 MBytes/sec of data, and 80 MBytes/sec of instructions, for
>>a total of 240 MB/s of AVAILABLE bandwidth.

	I did some thinking, and have a guess at actual average bandwidth
(VERY rough):  100% of instruction bandwidth + 30-40% data bandwidth ~= 140MB/s

 40 Mhz Rpm-40	      100ns	32+16   140 MB/s     33     ~4.5
 			       data inst  avg	   native
 
 	If one is talking Vax Mips (which from the original msg we aren't):
 40 Mhz Rpm-40	      100ns	32+16   140 MB/s    14-16  ~9-10
 			       data inst  avg	     vax
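
As a sanity check of that guess against the available-bandwidth figures
Dennis gave (80 MB/s instructions, 160 MB/s data):

#include <stdio.h>

int main(void)
{
    double ibw = 80.0, dbw = 160.0;

    printf("low:  %.0f MB/s\n", ibw + 0.30 * dbw);  /* 128 */
    printf("high: %.0f MB/s\n", ibw + 0.40 * dbw);  /* 144 */
    return 0;
}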

Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup

tim@amdcad.AMD.COM (Tim Olson) (06/16/88)

In article <292@wombat.UUCP> george@wombat.UUCP (George Scolaro) writes:
| Does average bandwidth tell more about the memory design? Output from
| the AMD29000 simulator (V4.21 PC) indicates that with 0 wait states
| the device attains 20.71 MIPS. With 1 wait state on every memory access
| the device attains 14.08 MIPS (Quote from Byte May 88). So, just 1 wait
| state impacts the performance quite dramatically.

That is true, it does if that wait-state is inserted into all
instruction requests.  I was thinking more along the lines of burst
mode: there are many schemes whereby we can supply an instruction per
cycle in burst mode, with some increased latency for starting the burst
access.  Video-DRAM designs, static-column memories, and interleaved DRAMs
are all examples.  These designs are usually much cheaper than
the single-cycle SRAM needed to provide peak bandwidth, which may not
even be required.
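
A toy model shows why burst-mode parts get close to single-cycle SRAM on
instruction fetch.  The setup penalty and burst lengths below are
illustrative guesses, not Am29000 or VDRAM specs:

/* S setup cycles before the first word of a burst, then one word per
   cycle; effective fetch rate approaches 1 word/cycle as bursts grow. */
#include <stdio.h>

int main(void)
{
    int setup = 3;                     /* assumed first-access penalty, cycles */
    int lengths[] = { 1, 4, 16, 64 };  /* assumed burst lengths, words */
    int i;

    for (i = 0; i < 4; i++) {
        int len = lengths[i];
        printf("burst of %2d words: %.2f words/cycle\n",
               len, (double)len / (setup + len));
    }
    return 0;
}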

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

tim@amdcad.AMD.COM (Tim Olson) (06/16/88)

In article <6955@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
| So the table is amended as follows:
| 
|   Processor	    avg read	bus	bandwidth   VAX  MB/s:MIPS
| 		     latency   width	available  "MIPS"  ratio
| 25 MHz 88000	       45ns?   32+32	185 MB/s?    17     11?
| 16 MHz MIPSco		?      32+32	120 MB/s?    10?    13?
| 40 MHz RPM40	      100ns    32+16	240 MB/s     15     16
| 25 MHz AMD 29000       80ns    32+32	170 MB/s     22      8
						    ^^^^
Well, on Dhrystone 1.1, anyway! ;-) It would probably be more
"reasonable" to reduce this to 17, which is what we see for large UNIX
utilities.

| The AMD 29000 is remarkably bandwidth-efficient, despite using 
| (on average) less than half of the memory cycles available.  
| (Maybe this points out the efficacy of their optimizer).  

That certainly has to be taken into account.

| How much would the 29000 slow down if it had only one 32-bit 
| path to a combined instruction+data cache, i.e.  half as much 
| peak memory bandwidth available? 

I just ran the benchmarks.  Both models are Video-DRAM memory with
4-cycle jumps, loads, and stores, and 1-cycle instruction burst
capability.  The first model has split I/D (i.e. can have an instruction
burst concurrent with a load or store).  The second must drop I-burst
for every load or store, wait for the load or store to complete, then
start up the I-burst again (another 4 cycles).  This simulates
connection to the memory through a single shared I/D bus.

	Model		Dhrystones (1.1)
	Split I/D:	24294
	Shared I/D:	18428

This is a drop in performance of 24%.  Part of this is due to not being
able to execute other instructions concurrently with an in-progress load
or store, because they cannot be fetched simultaneously.  The other part
is due to restarting the I-burst after a random load or store breaks it.

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

mash@mips.COM (John Mashey) (06/16/88)

In article <5275@ecsvax.UUCP> tpmsph@ecsvax.UUCP (Thomas P. Morris) writes:

>As many or all of the readers of this group are well aware, what most
>literature refers to as "MIPS" or "VAX MIPS" is not really a "MIP",
>per se. Then the more enlightened literature points out that the 
>comparison is really to a mythical 1.0 "MIP" VAX 11/780. Why don't
>we just refer to "VUPS", the term DEC coined an uses in their own
>literature? (VAX 11/780 Unit Processor(S))  At least they are making
>a nod at the apparent fact that a 780 is not `really' a 1 MIP machine...

As noted before (some discussions are like boomerangs, they always come back):
1 VUP == 1 VAX 11/780, with VMS compilers.

Note that some people also compare to MicroVAX II's (an MVUP!) which are
slower than 780s, especially on floating point (be careful to compare
apples with apples when looking at the Digital Review benchmarks,
for example, which are often expressed as MVUP ratings).

For DEC, VUPs work fine, because they compare CPUs in a family, using the 
same software.  Thus, if the compilers optimize better over time,
the processor ratios remain grossly constant per benchmark, and given that
DEC uses a lot of benchmarks, I'd guess that compiler improvements that
favor one model over another probably get washed out.

When used elsewise, a VUP is a moving target, and if your compilers
don't improve as fast as DEC's, your VUP-rating can diminish over time!

As we've said many times, trying to boil even CPU performance down to
1 number is nonsense [for example, a "6-VUP" 8700 can be anywhere from
3-7X faster than a 780; other vendors' systems can vary even more relative
to the 780], but as much as you hate it, you get forced into it.  argh.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

alan@pdn.UUCP (Alan Lovejoy) (06/16/88)

>In article <6955@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>| So the table is amended as follows:
>| 
>|   Processor	    avg read	bus	bandwidth   VAX  MB/s:MIPS
>| 		     latency   width	available  "MIPS"  ratio
>| 25 MHz 88000	       45ns?   32+32	185 MB/s?    17     11?

Where did you get a 25 MHz 88000?  I believe all benchmark figures so
far are based on the 20 MHz part, including the 17 VUPS rating.

Motorola says that the pcc/88k compiler produces Dhrystone code that
runs 25,000 times/sec on the 20 MHz part, the Green Hills C/88k compiler
gets 34,000/sec, and Tadpole Technology claims 45,000/sec with their
compiler.  All at 20 MHz, but not necessarily with the same cache
sizes in the case of Tadpole.


-- 
Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.
Disclaimer: Do not confuse my views with the official views of Paradyne
            Corporation (regardless of how confusing those views may be).
Motto: Never put off to run-time what you can do at compile-time!  

randys@mipon2.intel.com (Randy Steck) (06/17/88)

In article <6955@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>So the table is amended as follows:
>
>  Processor	    avg read	bus	bandwidth   VAX  MB/s:MIPS
>		     latency   width	available  "MIPS"  ratio
>25 MHz 88000	       45ns?   32+32	185 MB/s?    17     11?
>16 MHz MIPSco		?      32+32	120 MB/s?    10?    13?
>40 MHz RPM40	      100ns    32+16	240 MB/s     15     16
>25 MHz AMD 29000       80ns    32+32	170 MB/s     22      8
>

Another data point:
  Processor	    avg read	bus	bandwidth   VAX  MB/s:MIPS
		     latency   width	available  "MIPS"  ratio

20 MHz 80960           ?        32      53.3 MB/s    8       6.7

The ratio is so low because of the on-board instruction cache and the
existence of more complete addressing modes in the load/store
instructions.  The external address bus is a multiplexed bursting bus.
Performance degradation is about 7% for each wait state.  This contrasts
nicely with the 15-20% degradations (and the separate buses) typically
seen elsewhere.

A better number to give for this table would be one that took into account
the bandwidth available from the internal instruction cache.
Unfortunately, this is a relatively difficult number to calculate and
really depends on the data/instruction access mix.  When I find some time,
I will try to do this for the Dhrystone benchmark.

Randy Steck
Intel Corp.      ...intelca!mipon2!randys