[comp.arch] recent 386 timings from Intel

gnu@hoptoad.UUCP (03/30/87)

In article <2130@intelca.UUCP>, clif@intelca.UUCP (Clif Purkiser) writes:
>                  The system we used to run was an Intel Multibus I system 
> running Unix System V Release 3.0.  The CPU board was a 386/24 
> MultiBus I which has a 64 Kbyte direct-mapped write-through
> cache and 2-3 wait states for cache misses.

Hmm, let's make sure:  cache hits run with 0 wait states, cache misses
run with 2-3 wait states?  I'm curious about the construction of such a
cache.  What is the basic cycle time of the machine, and how many
cycles does a cache hit take?  Is it accessing main memory over the
Multibus, or on a local bus?  Is main memory static ram, or dynamic?

DRAMs take at least 100ns to fire up, so unless they are starting a RAM
access even before the cache is checked, that would seem to mean 100ns
(2 cycles at 20MHz) just for RAM access, not counting bus delay and time to
drive addresses to RAM chips (required if you intend to support a
reasonably sized main memory, e.g. >128 chips), and the time required
for the RAMs to be ready for the *next* address (another 100ns
or so).  And if the cache is running the RAM all the time even when
it hits, the DRAMs will not be ready to jump into action when the miss
comes along.
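
A rough check of the arithmetic (a minimal sketch in C; the 100ns DRAM
access and the 20MHz clock come from the paragraph above, and it assumes,
as argued there, that the DRAM access can't begin until the cache has
already reported a miss):

#include <stdio.h>

/* If the DRAM access can't start until the cache has reported a miss,
 * the access time shows up directly as extra clocks.  Bus delay,
 * address drivers, and RAS precharge (the items listed above) would
 * add more on top of this. */
int main(void)
{
    double clock_mhz   = 20.0;                 /* clock from the parenthetical above */
    double clock_ns    = 1000.0 / clock_mhz;   /* 50 ns per clock        */
    double dram_access = 100.0;                /* ns to "fire up" a DRAM */

    printf("clock period : %.1f ns\n", clock_ns);
    printf("DRAM access  : %.0f ns = %.0f clocks of waiting, minimum\n",
           dram_access, dram_access / clock_ns);
    return 0;
}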

For example, in a 68020 design where the address lines zap straight out
into RAM drivers, with no cache at all, the DRAM cycle time is ~270ns,
or 4 clocks (1 wait state).  Adding a cache just increases the wait
states ON A MISS, though it speeds things up on a hit.

If these figures are true, I suspect the system is a "hot rod" with a
custom static RAM main memory on a local bus.  This amounts to building
the whole main memory out of expensive cache RAMs.  I'm willing to be
corrected, and/or to learn something new about memory design.

	John Gilmore
-- 
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu

davidsen@steinmetz.UUCP (03/31/87)

In article <1946@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>+++ stuff +++
>DRAMs take at least 100ns to fire up, so unless they are starting a RAM
>access even before the cache is checked, that would seem to mean 100ns
>(2 cycles at 20MHz) just for RAM access, not counting bus delay and time to
>drive addresses to RAM chips (required if you intend to support a
>reasonably sized main memory, e.g. >128 chips), and the time required
>for the RAMs to be ready for the *next* address (another 100ns
>or so).  And if the cache is running the RAM all the time even when
>it hits, the DRAMs will not be ready to jump into action when the miss
>comes along.
>+++ stuff +++

>If these figures are true, I suspect the system is a "hot rod" with a
>custom static RAM main memory on a local bus.  This amounts to building
>the whole main memory out of expensive cache RAMs.  I'm willing to be
>corrected, and/or to learn something new about memory design.

I am not a hardware type, so bear with me if I am a bit empirical about
this: I am qualified to make some measurements, but not to justify them
on theoretical grounds.

I have an 80386 system running at 16 MHz, with 1 MB of 100ns DRAM (32
bits wide) on the motherboard and 2 MB of 120ns DRAM on the 16-bit
bus.  I have 64K of 35ns static cache which I can disable.

Without the cache the 16-bit memory runs about 180% slower than the 32-bit
memory.  With the cache enabled the penalty drops to 30%.  Also, the 32-bit
memory runs about 30% faster than it did without the cache, meaning that
the 16-bit memory with the cache is running as fast as the 32-bit memory
without it.
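
A quick check of those percentages in C (taking the 32-bit memory without
cache as 1.0, and reading "30% faster" as x1.3, "30% penalty" as /1.3, and
"180% slower" as 2.8x the time; those readings are one interpretation, not
part of the original measurements):

#include <stdio.h>

/* Sanity check on the quoted percentages, 32-bit no-cache = 1.0. */
int main(void)
{
    double m32_nocache = 1.0;                 /* baseline                        */
    double m32_cache   = m32_nocache * 1.3;   /* "30% faster" with cache         */
    double m16_cache   = m32_cache / 1.3;     /* "penalty drops to 30%"          */
    double m16_nocache = m32_nocache / 2.8;   /* "180% slower" read as 2.8x time */

    printf("32-bit, no cache : %.2f\n", m32_nocache);
    printf("32-bit, cache    : %.2f\n", m32_cache);
    printf("16-bit, cache    : %.2f\n", m16_cache);   /* == 1.00, as claimed */
    printf("16-bit, no cache : %.2f\n", m16_nocache);
    return 0;
}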

-- 
bill davidsen			sixhub \
      ihnp4!seismo!rochester!steinmetz ->  crdos1!davidsen
				chinet /
ARPA: davidsen%crdos1.uucp@ge-crd.ARPA (or davidsen@ge-crd.ARPA)

tomk@intsc.UUCP (04/10/87)

> In article <2130@intelca.UUCP>, clif@intelca.UUCP (Clif Purkiser) writes:
> >                  The system we used to run was an Intel Multibus I system 
> > running Unix System V Release 3.0.  The CPU board was a 386/24 
> > MultiBus I which has a 64 Kbyte direct-mapped write-through
> > cache and 2-3 wait states for cache misses.
And John Gilmore replies:
> 
> Hmm, let's make sure:  cache hits run with 0 wait states, cache misses
> run with 2-3 wait states?  I'm curious about the construction of such a
> cache.  What is the basic cycle time of the machine, and how many
> cycles does a cache hit take?  Is it accessing main memory over the
> Multibus, or on a local bus?  Is main memory static ram, or dynamic?

The Multibus I board that was used for the measurements is a standard production
board.  It has 64K bytes of direct-mapped cache based on 45ns data RAMs and
35ns tag RAMs.  The DRAMs are the 120ns access-time variety; they could have
been 150s, but 120s are what we buy a lot of.  The DRAM is local on the
CPU board and is dual-ported to the Multibus.  The 386/20 board (that's
what we are talking about) will support up to 16MB of DRAM (the Multibus I limit).
The DRAM cycles are not started until a cache miss is detected.
The first access on a cache miss will cause 3 wait states.  When a cache
miss occurs the CPU is switched into pipelined address mode, and any
subsequent misses will be 2 wait states.  When a cache hit occurs again,
the CPU resumes operating in non-pipelined address mode.

With this setup we have measured an average of 0.7 wait states running UNIX
OS code.
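
For those who want to see what "direct-mapped, write-through" amounts to, a
minimal lookup sketch in C follows.  The 16-byte line size is an assumption
for illustration (the post doesn't give one); the 64K size and the 2-3 wait
miss penalty are from above, and the last line just back-solves the 0.7
average.

#include <stdio.h>
#include <stdint.h>

/* Sketch of a 64 Kbyte direct-mapped cache lookup.  The 16-byte line
 * size is assumed; write-through means stores always go to DRAM as
 * well, so writes can stall even when they hit. */
#define CACHE_SIZE  (64 * 1024)
#define LINE_SIZE   16
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE)

static uint32_t tag_ram[NUM_LINES];
static int      valid[NUM_LINES];

static int lookup(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % NUM_LINES;  /* which cache line */
    uint32_t tag   = addr / CACHE_SIZE;               /* which 64K region */

    if (valid[index] && tag_ram[index] == tag)
        return 1;                      /* hit: 0 wait states               */
    valid[index]   = 1;                /* miss: fill line, 2-3 wait states */
    tag_ram[index] = tag;
    return 0;
}

int main(void)
{
    printf("first  access: %s\n", lookup(0x123456) ? "hit" : "miss");
    printf("second access: %s\n", lookup(0x123456) ? "hit" : "miss");

    /* Back-solving the quoted average: 0.7 waits per bus cycle at
       roughly 2.5 waits per stall means about 28% of bus cycles stall
       (misses plus write-throughs, presumably). */
    printf("stalling bus cycles: ~%.0f%%\n", 100.0 * 0.7 / 2.5);
    return 0;
}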

The basic bus cycle time of the machine is 2 CPU clocks.  At 16MHz that is
125ns, 100ns @ 20MHz.  Each wait state adds 62.5ns @ 16MHz and 50ns @ 20MHz.
The basic instruction execution time is 4.5 clocks on the average with some
magical instruction mix (details available on request). Adding a wait state
slows down execution approx. 20%.  
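
The 20% figure can be checked from the numbers above.  A small C sketch;
the 4.5 clocks per instruction and the 2-clock bus cycle are from this
post, while "about one bus cycle per instruction" is an added assumption:

#include <stdio.h>

/* Wait-state cost check: 2-clock bus cycles, 4.5 clocks/instruction. */
int main(void)
{
    double mhz[] = { 16.0, 20.0 };
    for (int i = 0; i < 2; i++) {
        double clk = 1000.0 / mhz[i];          /* ns per CPU clock */
        printf("%.0f MHz: bus cycle %.1f ns, one wait state adds %.1f ns\n",
               mhz[i], 2.0 * clk, clk);
    }

    double clocks_per_instr     = 4.5;         /* quoted average */
    double bus_cycles_per_instr = 1.0;         /* assumed        */
    printf("one wait state per bus cycle: ~%.0f%% slower\n",
           100.0 * bus_cycles_per_instr / clocks_per_instr);
    return 0;
}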

For those curious about the compiler: the benchmarks were run with the
Green Hills C compiler with the optimization switch OFF.  The Green Hills
technology does a lot of optimization even without the -O switch, so it
is hard to tell how badly it destroys the inner loops of the Dhrystone
benchmark.  The other side of the coin, though, is that it does the same
type of optimizations on the other machines.  Again: compare systems, not CPUs.
This is also why I always tell anyone interested in the 386 to come in 
with their favorite benchmark and run it on the box I have.  So far the 
only place the 25MHz 68020s have beaten the 16MHz 386 is when the main
loop of the code fits in 256 bytes.


------
"Ever notice how your mental image of someone you've 
known only by phone turns out to be wrong?  
And on a computer net you don't even have a voice..."

  tomk@intsc.UUCP  			Tom Kohrs
					Regional Architecture Specialist
		   			Intel - Santa Clara

PS: John there will be a 386/20 manual in the mail to you as soon as I can
find one.