gnu@hoptoad.UUCP (03/30/87)
In article <2130@intelca.UUCP>, clif@intelca.UUCP (Clif Purkiser) writes:
> The system we used to run was an Intel Multibus I system
> running Unix System V Release 3.0.  The CPU board was a 386/24
> MultiBus I which has a 64 Kbyte direct-mapped write-through
> cache and 2-3 wait states for cache misses.

Hmm, let's make sure: cache hits run with 0 wait states, cache misses
run with 2-3 wait states?  I'm curious about the construction of such a
cache.  What is the basic cycle time of the machine, and how many
cycles does a cache hit take?  Is it accessing main memory over the
Multibus, or on a local bus?  Is main memory static RAM, or dynamic?

DRAMs take at least 100ns to fire up, so unless they are starting a RAM
access even before the cache is checked, that would seem to mean 100ns
(2 cycles at 20MHz) just for RAM access, not counting bus delay and time
to drive addresses to RAM chips (required if you intend to support a
reasonably sized main memory, e.g. >128 chips), and the time required
for the RAMs to be ready for the *next* address (another 100ns or so).
And if the cache is running the RAM all the time even when it hits, the
DRAMs will not be ready to jump into action when the miss comes along.

For example, in a 68020 design where the address lines zap straight out
into RAM drivers, with no cache at all, the DRAM cycle time is ~270ns,
or 4 clocks (1 wait state).  Adding a cache just increases the wait
states ON A MISS, though it speeds things up on a hit.

If these figures are true, I suspect the system is a "hot rod" with a
custom static RAM main memory on a local bus.  This amounts to building
the whole main memory out of expensive cache RAMs.  I'm willing to be
corrected, and/or to learn something new about memory design.

	John Gilmore
--
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	gnu@ingres.berkeley.edu
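[Editor's aside: Gilmore's clock arithmetic above can be checked with a few
lines of Python.  The figures (100ns DRAM start-up at a 50ns/20MHz clock,
the ~270ns 68020 DRAM cycle at 4 clocks) come from the post; the helper
function itself is illustrative, not from any board's documentation.]

```python
import math

def cycles_for(latency_ns, cycle_ns):
    """Clock cycles needed to cover a latency, rounded up to whole cycles."""
    return math.ceil(latency_ns / cycle_ns)

# 100 ns of raw DRAM access on a 50 ns clock (20 MHz) costs 2 full cycles,
# matching the "(2 cycles at 20MHz)" figure in the post:
assert cycles_for(100, 50) == 2

# The 68020 example: a ~270 ns DRAM cycle on a 67.5 ns clock comes out to
# 4 clocks -- i.e. the normal 3-clock bus cycle plus 1 wait state:
assert cycles_for(270, 67.5) == 4
```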
davidsen@steinmetz.UUCP (03/31/87)
In article <1946@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>+++ stuff +++
>DRAMs take at least 100ns to fire up, so unless they are starting a RAM
>access even before the cache is checked, that would seem to mean 100ns
>(2 cycles at 20MHz) just for RAM access, not counting bus delay and time to
>drive addresses to RAM chips (required if you intend to support a
>reasonably sized main memory, e.g. >128 chips), and the time required
>for the RAMs to be ready for the *next* address (another 100ns
>or so).  And if the cache is running the RAM all the time even when
>it hits, the DRAMs will not be ready to jump into action when the miss
>comes along.
>+++ stuff +++
>If these figures are true, I suspect the system is a "hot rod" with a
>custom static RAM main memory on a local bus.  This amounts to building
>the whole main memory out of expensive cache RAMs.  I'm willing to be
>corrected, and/or to learn something new about memory design.

I am not a hardware type, so bear with me if I am a bit empiric about
this: I am qualified to make some measurements, but not to justify them
on theoretical grounds.

I have an 80386 system, running at 16 MHz, with 1 MB of 100ns DRAM (32
bits wide) on the motherboard, and 2 MB of 120ns DRAM on the 16 bit
bus.  I have 64k of 35ns static cache which I can disable.

Without cache the 16 bit memory runs about 180% slower than the 32 bit
memory.  With cache enabled the penalty drops to 30%.  Also, the 32 bit
memory runs about 30% faster than it did without cache, meaning that
the 16 bit memory with cache runs as fast as the 32 bit memory without.
--
bill davidsen		sixhub \
			ihnp4!seismo!rochester!steinmetz -> crdos1!davidsen
			chinet /
ARPA: davidsen%crdos1.uucp@ge-crd.ARPA (or davidsen@ge-crd.ARPA)
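[Editor's aside: a toy average-access-time model shows why a cache narrows
the gap between fast and slow memory the way davidsen measured.  The 35ns
cache figure is his; the hit rate and the effective raw access times of
the two memory paths below are hypothetical, chosen only to illustrate
the effect, not measured values from his board.]

```python
# Toy model: average access time = hits from cache + misses from memory.
# Hit rate and raw path times are hypothetical, for illustration only.

def avg_access_ns(hit_rate, cache_ns, memory_ns):
    """Average access time with a cache in front of one memory path."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

CACHE_NS = 35.0    # the 35ns static cache from the post
FAST_MEM = 200.0   # hypothetical effective time, 32-bit motherboard DRAM
SLOW_MEM = 560.0   # hypothetical effective time, 16-bit bus DRAM (2.8x)

hit = 0.97  # hypothetical hit rate
fast = avg_access_ns(hit, CACHE_NS, FAST_MEM)
slow = avg_access_ns(hit, CACHE_NS, SLOW_MEM)
print(f"no cache:   16-bit path is {SLOW_MEM / FAST_MEM:.1f}x the 32-bit time")
print(f"with cache: 16-bit path is {slow / fast:.2f}x the 32-bit time")
```

With a high enough hit rate, a 2.8x raw gap (roughly "180% slower")
collapses to a few tens of percent, because most references never touch
the slow path at all.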
tomk@intsc.UUCP (04/10/87)
> In article <2130@intelca.UUCP>, clif@intelca.UUCP (Clif Purkiser) writes:
> > The system we used to run was an Intel Multibus I system
> > running Unix System V Release 3.0.  The CPU board was a 386/24
> > MultiBus I which has a 64 Kbyte direct-mapped write-through
> > cache and 2-3 wait states for cache misses.

And John Gilmore replies:

> Hmm, let's make sure: cache hits run with 0 wait states, cache misses
> run with 2-3 wait states?  I'm curious about the construction of such a
> cache.  What is the basic cycle time of the machine, and how many
> cycles does a cache hit take?  Is it accessing main memory over the
> Multibus, or on a local bus?  Is main memory static ram, or dynamic?

The Multibus I board that was used for the measurements is a standard
production board.  It has 64K bytes of direct-mapped cache based on
45ns data RAMs and 35ns tag RAMs.  The DRAMs are the 120ns access time
variety.  They could have been 150s, but 120s are what we buy a lot of.
The DRAM is local on the CPU board and is dual-ported to the Multibus.
The 386/20 board (that's what we are talking about) will support up to
16MB of DRAM (the Multibus I limit).

The DRAM cycles are not started until after a cache miss is detected.
The first access on a cache miss will cause 3 wait states.  When a
cache miss occurs the CPU is switched into pipelined address mode, and
any subsequent misses will be 2 wait states.  When a cache hit occurs
again, the CPU resumes operating in non-pipelined address mode.  With
this setup we have measured an average of 0.7 wait states running UNIX
OS code.

The basic bus cycle time of the machine is 2 CPU clocks.  At 16MHz that
is 125ns, 100ns @ 20MHz.  Each wait state adds 62.5ns @ 16MHz and 50ns
@ 20MHz.  The basic instruction execution time is 4.5 clocks on the
average with some magical instruction mix (details available on
request).  Adding a wait state slows down execution approx. 20%.

For those curious about the compiler:
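[Editor's aside: the wait-state figures above (0 on a hit, 3 on the first
miss of a run, 2 on subsequent pipelined misses, 0.7 average, 62.5ns per
clock at 16MHz) can be turned into a rough model.  The split between
"first" and "pipelined" misses is an assumption of this sketch, not a
figure from the post.]

```python
# Rough model of the 386/20 cache wait states described in the post.
# first_frac (fraction of misses paying the 3-wait first-miss penalty)
# is hypothetical; the post does not give this breakdown.

CLOCK_NS = 62.5          # one CPU clock at 16 MHz
BASE_CYCLE_CLOCKS = 2    # basic bus cycle is 2 CPU clocks (125 ns @ 16 MHz)

def avg_waits(miss_rate, first_frac=0.2):
    """Average wait states per bus cycle for a given miss rate."""
    waits_per_miss = first_frac * 3 + (1.0 - first_frac) * 2
    return miss_rate * waits_per_miss

# What miss rate reproduces the measured 0.7 average wait states?
for miss in (0.25, 0.30, 0.32, 0.35):
    w = avg_waits(miss)
    cycle_ns = (BASE_CYCLE_CLOCKS + w) * CLOCK_NS
    print(f"miss rate {miss:.0%}: {w:.2f} avg waits, {cycle_ns:.0f} ns bus cycle")
```

Under this (assumed) 20/80 first/pipelined split, a miss rate in the
low thirties of percent lands on roughly 0.7 average wait states, i.e.
an effective bus cycle of about 2.7 clocks instead of 2.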
The benchmarks were run with the Greenhills C compiler with the
optimization switch OFF.  The Greenhills technology does a lot of
optimization even without the -O switch, so it is hard to tell how
badly it destroys the inner loops of the dhrystone benchmark.  The
other side of the coin, though, is that they do the same type of
optimizations on the other machines.  Again: compare systems, not CPUs.
This is also why I always tell anyone interested in the 386 to come in
with their favorite benchmark and run it on the box I have.  So far the
only place the 25MHz 68020s have beaten the 16MHz 386 is when the main
loop of the code fits in 256 bytes.
------
"Ever notice how your mental image of someone you've known only by
phone turns out to be wrong?  And on a computer net you don't even
have a voice..."

tomk@intsc.UUCP 			Tom Kohrs
					Regional Architecture Specialist
					Intel - Santa Clara

PS: John, there will be a 386/20 manual in the mail to you as soon as I
can find one.