rpw3@fortune.UUCP (02/07/84)
#R:utzoo:-349300:fortune:6600011:000:4379 fortune!rpw3 Feb 6 23:15:00 1984

Please, please, please, folks... don't fall into the trap of comparing CPU
clock speeds across different machine architectures (such as a 20 MHz 68k
vs. a 6 MHz 16k). "It ain't that simple!" [Murphy's Law #27]

The CPU clock has to do only with the internal fineness of the particular
state-machine/microcode-engine used to implement the chip. You have to look
at how many clocks it takes for a memory cycle, AND what access time is
demanded of the memory to achieve that cycle. Comparing CPU clocks is like
saying, "My car is faster than yours because my wheels have higher RPMs."
(What's the diameter of the wheels, Ollie?)

To get valid comparisons one must normalize the CPU clock to the memory
access time; memory cycle times can then be calculated using the bus
sequence of the particular chip. Since processor clock speeds generally
evolve more quickly than memory access times (in the marketplace), one has
to look at how well the (expensive) memory is being used. In extreme
examples, equal-speed memories can result in one architecture being two or
more times faster than another, simply because the memory is left idle.

This explains, for example, why the obscure 6809 can stomp the familiar
Z80, given equal-access-time memories, even though the Z80 may be running
with a 2.5-times-faster CPU clock. The 6809 uses one clock per memory
cycle; the Z80 needs three (data) or four (instruction fetch). The Z80
also leaves the RAMs idle for a longer fraction of the cycle. (To get
equivalent performance from the Z80, you have to run the CPU clock at a
MUCH higher rate to balance the duty cycle while adding back wait states
to match the access time.)

One of the main reasons I happen to like the 68000/68010 is simply that
the bus access-to-cycle time ratio nicely matches the access-to-cycle
ratio of current (and near-future) dynamic RAMs.
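The 6809/Z80 normalization above can be put in numbers. A minimal Python sketch; the clock rates and the `memory_cycle_ns` helper are illustrative assumptions, not datasheet figures:

```python
# Sketch of the normalization above: given RAMs of equal access time,
# how fast does each CPU actually cycle memory? Clock rates here are
# illustrative assumptions, not datasheet values.

def memory_cycle_ns(cpu_clock_mhz, clocks_per_cycle):
    """Effective memory cycle time driven by the CPU, in nanoseconds."""
    return clocks_per_cycle * 1000.0 / cpu_clock_mhz

# 6809: one clock per memory cycle (assume a 1 MHz bus clock).
m6809_cycle = memory_cycle_ns(1.0, 1)   # 1000 ns per memory cycle

# Z80 with a 2.5x faster clock: four clocks per instruction fetch.
z80_fetch = memory_cycle_ns(2.5, 4)     # 1600 ns per fetch

# Despite its "slower" clock, the 6809 drives memory harder.
print(m6809_cycle, z80_fetch)
```

The point is that clocks-per-memory-cycle, not raw clock rate, sets the memory bandwidth the CPU can actually use.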
(For hardware hackers: the 68000 leaves the memories idle for just about
the "RAS precharge time".) It makes good use of the memories. (Who knows
about the 68020?)

But don't let Motorola hype you. With the RAM chips we are going to have
available over the next 1-2 years, you don't NEED a 20 MHz CPU; 12-16 MHz
will do just fine, thank you. (I have not done a careful study of the
16000, but from the few minutes I have looked at the bus timing diagrams,
it didn't look quite as memory-efficient. Be that as it may, ...)

To do a fair comparison, one needs to presume some RAM access time, add
bus driver/receiver and memory-system delays (to get a memory SYSTEM
access and cycle time), add MMU delays, and then compute the fastest CPU
clock speed (for each chip) that just makes that access time work. (If
one of the CPUs won't go fast enough to keep commercial memory chips busy,
you've got a real problem with that one.) From that clock and the number
of clocks per memory cycle, you can calculate the effective system memory
cycle time as driven by each processor. Divide the raw memory-system cycle
time by the CPU-cum-memory system cycle time to get percentage effective
memory utilization. The result is a pretty good first-order comparison of
throughput between the CPU architectures.

If you have reason to believe that one machine is GROSSLY more
instruction-stream efficient than the other (average bits/instruction),
then you can scale a little for that, but be careful. Such interpretations
are tricky (what is an "average instruction"?). The best way to do that is
to take some fairly large modules of frequently used code (say, pieces of
"libc") and hand-code them in assembler as tight as possible. (Comparisons
of individual instructions are meaningless.) Look at total memory cycles
required for the entire function (don't forget a byte often costs the same
as a word), and scale by the memory utilization calculated above.
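The fair-comparison recipe above is straightforward arithmetic. A minimal Python sketch; every device parameter here (access times, delays, clocks per access window and per cycle, the 12.5 MHz cap) is a hypothetical placeholder, not a measurement of any real chip:

```python
# Sketch of the comparison recipe above. All device parameters are
# hypothetical placeholders chosen for illustration.

def compare(ram_access_ns, system_delay_ns, ram_cycle_ns,
            access_clocks, cycle_clocks, max_cpu_mhz):
    """Return (cpu_clock_mhz, effective_cycle_ns, utilization)."""
    # Memory SYSTEM access time: RAM access plus driver/receiver/MMU delays.
    sys_access_ns = ram_access_ns + system_delay_ns
    # Fastest CPU clock whose access window just covers that access time,
    # capped at the fastest available grade of the chip.
    clock_mhz = min(access_clocks * 1000.0 / sys_access_ns, max_cpu_mhz)
    # Effective memory cycle time as driven by this CPU.
    eff_cycle_ns = cycle_clocks * 1000.0 / clock_mhz
    # Fraction of the time the memory system is actually kept busy.
    return clock_mhz, eff_cycle_ns, ram_cycle_ns / eff_cycle_ns

# Hypothetical chip: 2-clock access window, 4 clocks per memory cycle,
# 150 ns RAMs with 50 ns of system delay and a 300 ns raw RAM cycle.
clock, cycle, util = compare(150.0, 50.0, 300.0,
                             access_clocks=2, cycle_clocks=4,
                             max_cpu_mhz=12.5)
print(clock, cycle, util)   # 10.0 MHz, 400 ns cycle, 0.75 utilization
```

Running the same function with each chip's access/cycle clock counts gives the first-order throughput comparison described above.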
That gives you "functions per mem-access-time", which is a measure that
can be used across a fairly large evolution in CPU clock and memory
access times (which occurs as chips get better).

Whatever you do, don't try to compare CPU clock speeds alone. Even within
a chip family, it's bogus. (A 20 MHz 68000 is twice as fast as a 10 MHz
68000 ONLY with an infinitely fast memory system with no real-world
components.)

Rob Warnock

UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphins Drive, Redwood City, CA 94065
rcm@tropix.UUCP (Robert C. Moore) (02/10/84)
Rob's comments on the relative speed comparisons of micros are quite
correct. It is important to note the effective instruction execution
speed, including the effects of MMUs, bus arbitration, memory speeds, and
so forth --- unless there is an intervening cache. The cache speed is
then most important, as well as the cache size (and thus its hit rate).

For example, the 16k can get data from memory in only 3 clock cycles, but
with its MMU, the number jumps to 4 (assuming very fast memory). If the
68451 MMU is used alone with the 68000, getting only 2 waits at 12.5 MHz
is considered pretty good (i.e., 6 clock cycles). But if a translation
cache is put around it, no-wait operation (4 clocks) is pretty easy with
conventional dynamic RAM. With both a translation cache and a data cache,
no-wait operation of the 68000 at 12.5 MHz is trivial, although the cache
size will determine the degradation due to imperfect hit rate.

The 68020 provides a virtual cache inside the processor, neatly avoiding
the delays in address translation and main-memory cycle time. A hidden
benefit is the fact that the internal cycles are synchronous, avoiding
the need to repeatedly sample the DTACK (actually DSACK) asynchronous
handshake line to prevent metastable states from propagating into the
chip. ("You are not expected to understand this.") In short, an average
of one byte is consumed off the instruction stream on each clock cycle.
(The shortest instructions require 2 cycles and are two bytes long.)

Compare this to the 32032. There the state machine is unchanged from the
16032. It contains no cache. The 16032 already underutilizes its bus (in
fact the 16008 is almost as fast, as the 8-byte prefetch queue is almost
always full). The 32032 will only go slightly faster than the 16032 in
such circumstances. It will, however, leave enough bus time available
that one could credibly run two processors on the same bus!
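The hit-rate degradation mentioned above is just a weighted average of hit and miss access times. A minimal Python sketch; the hit rate and both timings are assumptions for illustration, not measurements:

```python
# Effective memory access time seen by the CPU behind a cache:
# a weighted average of hit and miss times. Numbers are assumptions.

def effective_access_ns(hit_rate, hit_ns, miss_ns):
    """Average access time given a cache hit rate."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

# Assume 80 ns cache hits, 400 ns main-memory misses, 95% hit rate.
avg = effective_access_ns(0.95, 80.0, 400.0)
print(avg)   # about 96 ns: near cache speed, but misses still dominate the slack
```

Note how sensitive the result is to the miss penalty: even a few percent of misses pulls the average well away from the raw cache speed, which is why cache size (and thus hit rate) matters as much as cache speed.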
All this discussion assumes that, with register-rich instruction sets,
most of the effects on system timing are due to the time needed to access
text (programs), and that data cycles have very little effect. Does
anyone have any hard numbers on 16k bus utilization, or on text/data
access ratios, for either of these chips?

bob moore
ihnp4!tropix!rcm