doug@terak.UUCP (Doug Pardee) (03/15/85)
> 1) The VAX memory access speed relies on write back cache - the bus cycles
> are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
> response times of 200ns.  This saves time.

Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 MHz, each
clock is 100 ns.  Except for slave processor register accesses, the
shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.  If
you have an MMU, it's 5 clocks, or 500 ns.

One of the "unique" aspects of the 32xxx is that there's no point in
putting a memory cache on it, 'cause the bus cycle will take forever
anyway.

BTW, another "unique" aspect is that if you're using the TCU, there is
no usable specification for how rapidly the memory must respond on a
read operation.  The data must be ready 10 ns (CPU only) or 15 ns (CPU
w/MMU) before the falling edge of PHI2 in T3, but the TCU does not have
any timing specification for the falling edge of PHI2!

> 2) it is not clear that the 32016 doesn't compare to a VAX.  With the right
> kind of paging algorithms and hardware, one might very well outperform an
> 11/750 WITH FPA.  I haven't tried it, but it looks possible.

My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
compiler supplied with Genix is terribly slow, taking twice as long as
the VAX/UNIX C compiler.
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
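[Editor's note: the clock arithmetic being disputed above is easy to check. A minimal sketch, assuming only the figures quoted in the thread (10 MHz clock, 4 or 5 T-states per bus cycle); the helper name is ours, not National's:]

```python
def bus_cycle_ns(clock_mhz, t_states):
    """Bus cycle time: number of T-states times the clock period."""
    period_ns = 1000.0 / clock_mhz  # one clock period in nanoseconds
    return t_states * period_ns

# 10 MHz NS32xxx: 4 T-states without MMU, 5 with MMU
print(bus_cycle_ns(10, 4))  # 400.0 ns
print(bus_cycle_ns(10, 5))  # 500.0 ns

# Shapiro's 200 ns figure spans only address-valid (mid-T1) to
# data-valid (mid-T3), i.e. about 2 clocks of the 4-clock cycle.
print(bus_cycle_ns(10, 2))  # 200.0 ns
```

Both sides' numbers are thus consistent; they simply measure different spans of the same cycle.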
jss@sjuvax.UUCP (J. Shapiro) (03/20/85)
Me:
> > 1) The VAX memory access speed relies on write back cache - the bus cycles
> > are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
> > response times of 200ns.  This saves time.

Doug Pardee:
> Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 Mhz, each
> clock is 100 ns.  Except for slave processor register accesses, the
> shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.
> If you have an MMU, it's 5 clocks, or 500 ns.

Pulling out the data book, it appears to me that you are incorrect.
Though the basic processor cycle takes 4 T-states, and each of these
takes 100 ns on a 10 MHz chip, the turnaround time is measured from
address valid to data valid, which is 200 ns.  ~ADS goes low midway
through T-state 1 and data is read midway through T-state 3.

On the VAX this is not the case, though I don't have the hardware
manual handy to spell it out.  I seem to recall the full VAX cycle
takes longer.  If I am wrong, would someone yank out their hardware
manual and outline it for me?

Typically, one does not just access memory, one does something with it,
and the access itself is only a fraction of the time (though a
significant one).  Cut memory access time by 1/4 or 1/2 and lots of
things get sped up, precisely because the most significant fraction of
the time of an operation is often memory (where applicable, of course).

> One of the "unique" aspects of the 32xxx is that there's no point in
> putting a memory cache on it, 'cause the bus cycle will take forever
> anyway.

At present, I can get 150 ns RAMs (256Kx1) at $10 apiece in qty 10 or
so.  100 ns RAMs are not that much more expensive, and how many
processors out there (really out there, not just internal engineering
samples, so no talk about the 68020) run significantly faster than
100 ns cycles?  Of these, how many expect to do a memory fetch in one
cycle?
The fastest thing around in general use that I know of is the NCR/32
series, which expects to get a 32-bit read from memory with a
turnaround of 80 ns (though it will wait as necessary).  The point I am
making is, I can't think of *any* current microprocessor which in
general-use hardware can possibly benefit from a memory cache.  Memory
is no longer the bottleneck in memory cycles, and hasn't been for a few
years.

> > 2) it is not clear that the 32016 doesn't compare to a VAX.  With the right
> > kind of paging algorithms and hardware, one might very well outperform an
> > 11/750 WITH FPA.  I haven't tried it, but it looks possible.
>
> My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
> ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
> compiler supplied with Genix is terribly slow, taking twice as long
> as the VAX/UNIX C compiler.

With respect, the quality of UNIX compilers (including the C compiler)
is low, in part because many of them are pcc derivatives.  pcc was
written to be portable, not necessarily to take full advantage of the
virtual machine on which it is used.  This doesn't change my comment.
Not having used the Genix compiler, I can't say.  Several people have
commented that the development machine National has is terribly slow in
practice; again, I haven't yet had a chance to use one, so this is
hearsay.  If that is the case, might it not in part account for your
speed differences?

To further support the point about C compilers, pick your favorite C
compiler benchmarks and run them under 4.1 (which is faster than 4.2)
and VMS 4.0.  Though of course there will be exceptional benchmarks,
the DEC compiler wins hands down, even where the program is fairly
straightforward computation.  This doesn't make UNIX better or worse
than VMS.  It means the VMS compiler is in most places better written
*for the VAX.*  It is, needless to say, not terribly portable.

It seems I have waxed longer than I intended.
Time to pack away the soapbox until next week.

Jon Shapiro
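[Editor's note: Shapiro's "cut memory access time by 1/4 or 1/2 and lots of things get sped up" claim is Amdahl-style arithmetic. A minimal sketch with illustrative fractions, not measured figures; the function name is ours:]

```python
def overall_speedup(mem_fraction, mem_speedup):
    """Amdahl-style: only the memory-bound fraction of the work benefits."""
    return 1.0 / ((1.0 - mem_fraction) + mem_fraction / mem_speedup)

# If half of an operation's time is memory access and you halve that
# access time, the operation as a whole runs about 1.33x faster:
print(round(overall_speedup(0.5, 2.0), 3))  # 1.333

# If memory is only a quarter of the time, the same halving buys less:
print(round(overall_speedup(0.25, 2.0), 3))  # 1.143
```

This is why the gain hinges on his "where applicable" caveat: the payoff depends entirely on how memory-bound the workload is.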
jans@mako.UUCP (Jan Steinman) (03/21/85)
In article <446@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes, quotes:
>> ... With a 10 Mhz 32032 *or* 32016 one can get real memory response times
>> of 200ns...
>
>... At 10 Mhz, each clock is 100 ns... the shortest possible bus cycle on a
>32xxx is 4 clocks.  That's 400 ns.  If you have an MMU, it's 5 clocks, or
>500 ns.

I don't think National literature backs either of you; however, these
numbers were heard over the phone from National:

              byte   align   unalign   align    unalign
                     word    word      double   double
                                                or
                                                aligned
                                                double

   '016        3       3        7        7        11
   '016+MMU    4       4        9        9        14
   '032        3       3        3        3         7
   '032+MMU    4       4        4        4         9

These numbers represent operand fetch times (in clocks), and do not
include processor overhead.  Note that multiple CPUs *could* operate
the bus at these speeds if driven lock-step from a multi-phase clock.
This may be a nit-pick, but bus cycle time is not instruction sequence
time.
-- 
:::::: Jan Steinman          Box 1000, MS 61-161     (w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans Wilsonville, OR 97070   (h)503/657-7703 ::::::
doug@terak.UUCP (Doug Pardee) (03/22/85)
JS> 1) The VAX memory access speed relies on write back cache - the bus cycles
JS> are 400 ns.  With a 10 Mhz 32032 *or* 32016 one can get real memory
JS> response times of 200ns.  This saves time.

me> Say what????  200 ns bus cycle times on a 10 MHz 32K?  At 10 Mhz, each
me> clock is 100 ns.  Except for slave processor register accesses, the
me> shortest possible bus cycle on a 32xxx is 4 clocks.  That's 400 ns.
me> If you have an MMU, it's 5 clocks, or 500 ns.

JS> Though the basic processor cycling takes 4 T states, and each of these
JS> takes 100ns on a 10Mhz chip, the turnaround time is measured from address
JS> valid to data valid, which is 200ns.

True, but this is of no importance to the CPU.  It is of importance to
a) the designer, who has to build a 200 ns memory to support a CPU
which has a 400 or 500 ns bus cycle; and b) multi-ported memories,
where the shorter memory cycle reduces contention as seen from the
other ports.  It is hard to see how "this saves time".

me> My experience is that a 10 MHz 32016 w/MMU and FPU is in the same
me> ballpark as (or slightly faster than) a VAX 11/750.  But -- the C
me> compiler supplied with Genix is terribly slow, taking twice as long
me> as the VAX/UNIX C compiler.

JS> Several people have commented that the
JS> development machine National has is terribly slow in practice

Not really too surprising -- the DB16000 is run at 6 MHz with 2 wait
states.

JS> If it is the
JS> case, might this not in part account for your speed differences?

No, simply because the benchmarks were not run on the DB16000.

I see that I've goofed.  I didn't state that where the Genix C compiler
falls down is *compile time*, and folks naturally assumed that I meant
that the object code was poor.  Sorry about misleading y'all.  The
object code is not a problem.  But it takes so dratted long to compile
even the smallest programs that one hates to turn on the optimization
feature.
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
jack@boring.UUCP (03/22/85)
Hey, come on!  The days when cache was used because memory was slow lie
years behind us.  CACHE IS USED SO THAT THE PROCESSOR KEEPS ITS
GRABBING HANDS OFF THE BUS!!

If you have a system which is more or less balanced in its CPU and disk
usage, you gain an awful lot by adding a cache.  This will enable the
CPU to continue running while the disk controller is stuffing bytes
into memory, without having to add wait states because someone else is
the bus master.

For this, your cache doesn't even have to be extremely fast.  Even if
you have to let the CPU wait while you are invalidating a cache
location, this won't have too much influence on performance, since you
hardly ever look at a buffer that is currently being filled by a disk
controller, so there will be hardly any hits because of the disk I/O.

Of course, all these things become even more valid when you have
multiple CPUs on one bus....
-- 
Jack Jansen, {decvax|philabs|seismo}!mcvax!jack
It's wrong to wish on space hardware.
doug@terak.UUCP (Doug Pardee) (03/26/85)
The figures reproduced below were obtained by "fudging" out the last
clock cycle of an operand access (T4).  This is the "wind-down" cycle,
where the various control lines are released.  Notice that while a
single-cycle is shown to need 4 clocks w/MMU, a double-cycle needs 9
and a triple-cycle needs 14 -- all but the last cycle are shown at the
full 5 clocks.

The figures below represent the CPU's view of bus timing, since it has
all of the operand data it needs without waiting for the wind-down
clock cycle to complete.  But it still cannot access the bus again
until that clock cycle has finished.  These figures are an accurate
picture of operand access time on an otherwise idle memory bus.  They
don't represent end-to-end memory bus cycle times.

> I don't think National literature backs either of you, however, these numbers
> were heard over the phone from National:
>
>               byte   align   unalign   align    unalign
>                      word    word      double   double
>                                                 or
>                                                 aligned
>                                                 double
>
>    '016        3       3        7        7        11
>    '016+MMU    4       4        9        9        14
>    '032        3       3        3        3         7
>    '032+MMU    4       4        4        4         9
-- 
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
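[Editor's note: Pardee's "fudged T4" accounting reproduces every row of the quoted table. A minimal sketch of that bookkeeping, assuming his reading (all bus cycles run the full 4 or 5 clocks except the last, which drops the one-clock wind-down); the function name is ours:]

```python
def operand_clocks(accesses, cycle_clocks):
    """Operand fetch time in clocks: all but the last bus cycle run
    the full length; the final wind-down clock (T4) of the last
    cycle is not counted, since the CPU already has its data."""
    return (accesses - 1) * cycle_clocks + (cycle_clocks - 1)

# Without MMU: 4-clock cycles, so 1/2/3 accesses cost 3/7/11 clocks
# (the '016 row: byte 3, unaligned word 7, unaligned double 11).
print([operand_clocks(n, 4) for n in (1, 2, 3)])  # [3, 7, 11]

# With MMU: 5-clock cycles, so 1/2/3 accesses cost 4/9/14 clocks
# (the '016+MMU row).
print([operand_clocks(n, 5) for n in (1, 2, 3)])  # [4, 9, 14]
```

The '032 rows follow from the same formula, since its 32-bit bus needs only one access where the '016 needs two.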