[net.micro.16k] 32xxx bus cycles & CPU speed

doug@terak.UUCP (Doug Pardee) (03/15/85)

> 1) The VAX memory access speed relies on write back cache - the bus cycles
> are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
> response times of 200ns.  This saves time.  

Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 MHz, each
clock is 100 ns.  Except for slave processor register accesses, the
shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.
If you have an MMU, it's 5 clocks, or 500 ns.
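
Just to spell the arithmetic out (a throwaway C sketch, nothing more; the
4- and 5-clock figures are the ones stated above, not re-checked against
a data sheet here):

    /* bus cycle arithmetic for a zero-wait-state 32xxx system */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz = 10.0;                /* 10 MHz part       */
        double clock_ns  = 1000.0 / clock_mhz;  /* 100 ns per clock  */
        int t_no_mmu = 4;                       /* T1..T4            */
        int t_mmu    = 5;                       /* extra MMU state   */

        printf("clock period:       %.0f ns\n", clock_ns);
        printf("bus cycle, no MMU:  %.0f ns\n", t_no_mmu * clock_ns);
        printf("bus cycle, w/ MMU:  %.0f ns\n", t_mmu * clock_ns);
        return 0;
    }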

One of the "unique" aspects of the 32xxx is that there's no point in
putting a memory cache on it, 'cause the bus cycle will take forever
anyway.

BTW, another "unique" aspect is that if you're using the TCU, there is
no usable specification for how rapidly the memory must respond on a
read operation.  The data must be ready 10 ns (CPU only) or 15 ns
(CPU w/MMU) before the falling edge of PHI2 in T3, but the TCU does
not have any timing specification for the falling edge of PHI2!

> 2) it is not clear that the 32016 doesn't compare to a VAX.  With the right 
> kind of paging algorithms and hardware, one might very well outperform an 
> 11/750 WITH FPA. I haven't tried it, but it looks possible.

My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
compiler supplied with Genix is terribly slow, taking twice as long
as the VAX/UNIX C compiler.
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug

jss@sjuvax.UUCP (J. Shapiro) (03/20/85)

Me:
> > 1) The VAX memory access speed relies on write back cache - the bus cycles
> > are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
> > response times of 200ns.  This saves time.  

Doug Pardee:
> Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 Mhz, each
> clock is 100 ns.  Except for slave processor register accesses, the
> shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.
> If you have an MMU, it's 5 clocks, or 500 ns.

Pulling out the data book, it appears to me that you are incorrect.
Though the basic processor cycling takes 4 T-states, and each of these
takes 100 ns on a 10 MHz chip, the turnaround time is measured from address
valid to data valid, which is 200 ns.  ~ADS goes low midway through T-state
1 and data is read midway through T-state 3.  On the VAX this is not the
case, though I don't have the hardware manual handy to spell it out.
I seem to recall the full VAX cycle takes longer.  If I am wrong, would
someone yank out their hardware manual and outline it for me?

Typically, one does not just access memory, one does something with it, and
the access itself is only a fraction of the time (though a significant
one).  Cut memory access time by a quarter or a half and lots of things
speed up, precisely because memory access is often the most significant
fraction of an operation's time (where applicable, of course).
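
To put rough numbers behind that (just a sketch; the fractions below are
invented for illustration, not measured):

    /* Amdahl-style estimate: if a fraction f of an operation's time
     * is memory access and the access time is cut by a factor k, the
     * whole operation speeds up by 1 / (1 - f + f/k).  The values of
     * f and k here are made up purely for illustration.
     */
    #include <stdio.h>

    static double speedup(double f, double k)
    {
        return 1.0 / ((1.0 - f) + f / k);
    }

    int main(void)
    {
        printf("f=0.50, access halved:    %.2fx\n", speedup(0.50, 2.0));
        printf("f=0.75, access halved:    %.2fx\n", speedup(0.75, 2.0));
        printf("f=0.75, access quartered: %.2fx\n", speedup(0.75, 4.0));
        return 0;
    }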

> One of the "unique" aspects of the 32xxx is that there's no point in
> putting a memory cache on it, 'cause the bus cycle will take forever
> anyway.

At present, I can get 150 ns RAMs (256Kx1) at $10 apiece in quantity 10 or
so.  100 ns RAMs are not that much more expensive, and how many processors
out there (really out there, not just in internal engineering samples, so no
talk about the 68020) run significantly faster than 100 ns cycles?  Of these,
how many expect to do a memory fetch in one cycle?  The fastest thing
around in general use that I know of is the NCR/32 series, which expects to
get a 32-bit read from memory with a turnaround of 80 ns (though it will
wait as necessary).

The point I am making is, I can't think of *any* current microprocessor
which, in generally available hardware, can possibly benefit from a memory
cache.  Memory speed is no longer the bottleneck in a memory cycle, and
hasn't been for a few years.

> > 2) it is not clear that the 32016 doesn't compare to a VAX.  With the right 
> > kind of paging algorithms and hardware, one might very well outperform an 
> > 11/750 WITH FPA. I haven't tried it, but it looks possible.
> 
> My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
> ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
> compiler supplied with Genix is terribly slow, taking twice as long
> as the VAX/UNIX C compiler.

With respect, the quality of UNIX compilers (including the C compiler) is
low, in part because many of them are pcc derivatives.  pcc was written to
be portable, not necessarily to take full advantage of the virtual machine
on which it is used.  This doesn't change my comment; not having used the
Genix compiler, I can't say whether it applies there.  Several people have
commented that the development machine National has is terribly slow in
practice; again, I haven't yet had a chance to use one, so this is hearsay.
If that is the case, might it not in part account for your speed differences?

To further support the point about C compilers, pick your favorite C
compiler benchmarks and run them under 4.1 (which is faster than 4.2) and
VMS 4.0.  Though of course there will be exceptional benchmarks, the DEC
compiler wins hands down, even where the program is fairly straightforward
computation.  This doesn't make UNIX better or worse than VMS.  It means the
VMS compiler is in most places better written *for the VAX.*  It is, needless
to say, not terribly portable.


It seems I have waxed longer than I intended. Time to pack away the soapbox
until next week.

Jon Shapiro

jans@mako.UUCP (Jan Steinman) (03/21/85)

In article <446@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes, quotes:
>> ...  With a 10 Mhz 32032 *or* 32016 one can get real memory response times
>> of 200ns...
>
>... At 10 Mhz, each clock is 100 ns... the shortest possible bus cycle on a
>32xxx is 4 clocks.  That's 400 ns.  If you have an MMU, it's 5 clocks, or
>500 ns.

I don't think National literature backs either of you; however, these numbers
were heard over the phone from National:

                byte    aligned   unaligned   aligned   unaligned
                         word       word       double     double

'016              3        3          7           7         11
'016+MMU          4        4          9           9         14
'032              3        3          3           3          7
'032+MMU          4        4          4           4          9

These numbers represent operand fetch times, and do not include processor
overhead.  Note that multiple CPU's *could* operate the bus at these speeds
if driven lock-step from a multi-phase clock.  This may be a nit-pick, but
bus cycle time is not instruction sequence time.
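
For anyone who wants to turn those clock counts into nanoseconds, here is
the same hearsay packed into a lookup table (a convenience only, not
National's specification):

    /* phone-quoted operand fetch clock counts, encoded for convenience */
    #include <stdio.h>

    enum access { BYTE, AL_WORD, UN_WORD, AL_DBL, UN_DBL };

    /* rows: '016, '016+MMU, '032, '032+MMU; columns follow enum access */
    static const int clocks[4][5] = {
        { 3, 3, 7, 7, 11 },   /* 32016       */
        { 4, 4, 9, 9, 14 },   /* 32016 + MMU */
        { 3, 3, 3, 3,  7 },   /* 32032       */
        { 4, 4, 4, 4,  9 },   /* 32032 + MMU */
    };

    int main(void)
    {
        double clock_ns = 100.0;   /* 10 MHz */
        printf("'016+MMU, unaligned double: %d clocks = %.0f ns\n",
               clocks[1][UN_DBL], clocks[1][UN_DBL] * clock_ns);
        return 0;
    }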
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

doug@terak.UUCP (Doug Pardee) (03/22/85)

JS> 1) The VAX memory access speed relies on write back cache - the bus cycles
JS> are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
JS> response times of 200ns.  This saves time.  
 
me> Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 Mhz, each
me> clock is 100 ns.  Except for slave processor register accesses, the
me> shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.
me> If you have an MMU, it's 5 clocks, or 500 ns.
 
JS> Though the basic processor cycling takes 4 T states, and each of these
JS> takes 100ns on a 10Mhz chip, the turnaround time is measured from address
JS> valid to data valid, which is 200ns.

True, but this is of no importance to the CPU.  It is of importance to
a) the designer, who has to build a 200 ns memory to support a CPU which
has a 400 or 500 ns bus cycle; and b) multi-ported memories where the
shorter memory cycle reduces contention as seen from the other ports.

It is hard to see how "this saves time".
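
Put another way, using only the figures already quoted (nothing new here):

    /* back-to-back operand fetches are paced by the full bus cycle,
     * not by the address-valid-to-data-valid window
     */
    #include <stdio.h>

    int main(void)
    {
        double access_ns = 200.0;  /* what the memory designer must meet  */
        double cycle_ns  = 500.0;  /* what the CPU ties up per fetch, MMU */
        int fetches = 10;

        printf("memory busy:  %.0f ns\n", fetches * access_ns);
        printf("bus/CPU busy: %.0f ns\n", fetches * cycle_ns);
        /* the difference is slack for other bus masters,
           not time the CPU gets back */
        return 0;
    }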

me> My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
me> ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
me> compiler supplied with Genix is terribly slow, taking twice as long
me> as the VAX/UNIX C compiler.
 
JS> Several people have commented that the
JS> development machine National has is terribly slow in practice

Not really too surprising -- the DB16000 is run at 6 MHz with 2 wait
states.
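
(Back-of-the-envelope, using the 4-clock base cycle discussed earlier in the
thread; the rest is plain arithmetic:)

    /* why 6 MHz plus 2 wait states feels slow next to 10 MHz, 0 waits */
    #include <stdio.h>

    static double bus_cycle_ns(double mhz, int base_clocks, int waits)
    {
        return (base_clocks + waits) * (1000.0 / mhz);
    }

    int main(void)
    {
        double db16000 = bus_cycle_ns(6.0, 4, 2);   /* ~1000 ns */
        double target  = bus_cycle_ns(10.0, 4, 0);  /*   400 ns */
        printf("DB16000:      %.0f ns per bus cycle\n", db16000);
        printf("10 MHz, 0 ws: %.0f ns per bus cycle (%.1fx faster)\n",
               target, db16000 / target);
        return 0;
    }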

JS> If it is the
JS> case, might this not in part account for your speed differences?

No, simply because the benchmarks were not run on the DB16000.

I see that I've goofed.  I didn't state that where the Genix C compiler
falls down is *compile time*, and folks naturally assumed that I meant
that the object code was poor.  Sorry about misleading y'all.

The object code is not a problem.  But it takes so dratted long to
compile even the smallest programs that one hates to turn on the
optimization feature.
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug

jack@boring.UUCP (03/22/85)

Hey, come on!  The days when cache was used because memory
was slow are years behind us.
CACHE IS USED SO THAT THE PROCESSOR KEEPS ITS GRABBING HANDS
OFF THE BUS!!
If you have a system which is more or less balanced in its
CPU and disk usage, you gain an awful lot by adding a cache.

This will enable the CPU to continue running while the disk
controller is stuffing bytes into memory, without having to add
wait states because someone else is the bus master.

For this, your cache doesn't even have to be extremely fast.
Even if you have to let the CPU wait while you are invalidating
a cache location, this won't have much influence on performance:
since you hardly ever look at a buffer that is currently being
filled by a disk controller, there will be hardly any cache hits
on the disk i/o's addresses.
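
A toy model of the effect (the utilization and hit-rate numbers are pulled
out of thin air, purely to show the shape of the argument):

    /* with no cache every CPU reference fights the disk DMA for the
     * bus; with a cache only the misses do.  All numbers invented.
     */
    #include <stdio.h>

    static double cpu_bus_demand(double miss_rate)
    {
        double refs_per_us = 2.0;   /* CPU memory references per microsec */
        double cycle_us    = 0.4;   /* 400 ns bus cycle per reference     */
        return refs_per_us * miss_rate * cycle_us;
    }

    int main(void)
    {
        double dma_load = 0.5;                  /* disk holds the bus 50% */
        double no_cache = cpu_bus_demand(1.0);
        double cached   = cpu_bus_demand(0.1);  /* say, 90% hit rate      */

        printf("total bus demand, no cache: %.2f\n", dma_load + no_cache);
        printf("total bus demand, cache:    %.2f\n", dma_load + cached);
        /* once total demand passes 1.0 of the bus, somebody waits --
           usually the CPU */
        return 0;
    }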

Of course, all these things become even more valid
when you have multiple CPUs on one bus....
-- 
	Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
It's wrong to wish on space hardware.

doug@terak.UUCP (Doug Pardee) (03/26/85)

The figures reproduced below were obtained by "fudging" out the last
clock cycle of an operand access (T4).  This is the "wind-down" cycle,
where the various control lines are released.  Notice that while a
single-cycle is shown to need 4 clocks w/MMU, a double-cycle needs
9 and a triple-cycle needs 14 -- all but the last cycle are shown at
the full 5 clocks.

The figures below represent the CPU's view of bus timing, since it
has all of the operand data it needs without waiting for the wind-down
clock cycle to complete.  But it still cannot access the bus again
until that clock cycle has finished.

These figures are an accurate picture of operand access time on an
otherwise idle memory bus.  They don't represent end-to-end memory
bus cycle times.

> I don't think National literature backs either of you, however, these numbers
> were heard over the phone from National:
> 
>                 byte    aligned   unaligned   aligned   unaligned
>                          word       word       double     double
> 
> '016              3        3          7           7         11
> '016+MMU          4        4          9           9         14
> '032              3        3          3           3          7
> '032+MMU          4        4          4           4          9
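
In other words, the quoted figures follow a simple pattern; a little C
sketch of my reading of them (my interpretation, not National's):

    /* each access costs the full 4 clocks (5 with MMU) on the bus, but
     * the CPU has its data one clock before the wind-down (T4) ends,
     * so n back-to-back cycles show up as 4n-1 (or 5n-1) clocks
     */
    #include <stdio.h>

    static int table_clocks(int bus_cycles, int with_mmu)
    {
        int per_cycle = with_mmu ? 5 : 4;
        return bus_cycles * per_cycle - 1;   /* last T4 hidden from CPU */
    }

    int main(void)
    {
        /* reproduces the quoted '016 and '016+MMU rows: 3/7/11, 4/9/14 */
        for (int n = 1; n <= 3; n++)
            printf("%d cycle(s): %2d clocks (no MMU), %2d (w/MMU)\n",
                   n, table_clocks(n, 0), table_clocks(n, 1));
        return 0;
    }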
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug