[net.micro.68k] 68020 vs 16k - is the 020 worth

rpw3@fortune.UUCP (02/07/84)

#R:utzoo:-349300:fortune:6600011:000:4379
fortune!rpw3    Feb  6 23:15:00 1984

Please, please, please, folks... don't fall into the trap of comparing
CPU clock speeds across different machine architectures (such as a
20 MHz 68k vs. a 6 MHz 16k). "It ain't that simple!" [Murphy's Law #27]

The CPU clock has only to do with the internal fineness of
the particular state-machine/microcode-engine used to implement
the chip. You have to look at how many clocks it takes for a
memory cycle, AND what access time is demanded of the memory
to achieve that cycle. Comparing CPU clocks is like saying,
"My car is faster than yours because my wheels have higher RPMs."
(What's the diameter of the wheels, Ollie?)

To get valid comparisons one must normalize the CPU clock to the memory
access time and then memory cycle times can be calculated using the bus
sequence of the particular chip.  Since processor clock speeds generally
evolve more quickly than memory access times (in the marketplace), one
has to look at how well the (expensive) memory is being used.

In extreme examples, equal speed memories can result in one
architecture being two or more times faster than another, simply
because the memory is left idle. This explains, for example, why the
obscure 6809 can stomp the familiar Z80, given equal access time
memories, even though the Z80 may be running with a 2.5 times faster
CPU clock. The 6809 uses one clock per memory cycle, the Z80 needs
three (data) or four (instruction fetch). The Z80 also leaves the
RAMs idle for a longer fraction of the cycle. (To get equivalent
performance from the Z80, you have to run the CPU clock at a MUCH
higher rate to balance the duty cycle while adding back wait states
to match the access time.)
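The 6809-vs-Z80 arithmetic above can be sketched in a few lines (modern
Python, purely back-of-envelope; the clock rates are illustrative, the
clocks-per-memory-cycle figures are the ones quoted in the text):

```python
def mem_cycles_per_second(cpu_clock_hz, clocks_per_mem_cycle):
    """Effective memory cycle rate as driven by the CPU."""
    return cpu_clock_hz / clocks_per_mem_cycle

# 1 MHz 6809, 1 clock per memory cycle
m6809 = mem_cycles_per_second(1.0e6, 1)
# 2.5x faster Z80 clock, but 4 clocks per instruction fetch
z80 = mem_cycles_per_second(2.5e6, 4)

# The 6809 drives memory 1.6x as often despite the slower clock
print(m6809 / z80)
```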

One of the main reasons I happen to like the 68000/68010 is simply that
the bus access-to-cycle time ratio nicely matches the access-to-cycle
ratio of current (and near-future) dynamic RAMs. (For hardware hackers,
the chip leaves the memories idle for just about the "RAS precharge
time".) It makes good use of the memories. (Who knows about the 68020?)
But don't let Motorola hype you. With the RAM chips we are going to have
available over the next 1-2 years, you don't NEED a 20 MHz CPU; 12-16 MHz
will do just fine, thank you.

(I have not done a careful study of the 16000, but from the few minutes
I have spent looking at the bus timing diagrams, it didn't look quite as
memory efficient. Be that as it may, ...)

To do a fair comparison, one needs to presume some RAM access time,
add bus driver/receiver and memory system delays (to get a memory
SYSTEM access and cycle time), add MMU delays, and then compute the
fastest CPU clock speed (for each chip) that just makes that access
time work. (If one of the CPUs won't go fast enough to keep commercial
memory chips busy, you've got a real problem with that one.) From that
clock and the number of clocks per memory cycle, you can calculate
the effective system memory cycle time as driven by each processor.
Divide the raw memory system cycle time by the CPU-cum-memory system
cycle time to get percentage effective memory utilization. The result
is a pretty good first-order comparison of throughput between the CPU
architectures.
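The recipe above can be turned into a small sketch (modern Python; every
timing number below is a made-up placeholder, not a vendor spec):

```python
def mem_utilization(ram_access_ns, glue_ns, mmu_ns,
                    access_clocks, cycle_clocks, raw_cycle_ns):
    """Fraction of the raw memory cycle time the CPU actually uses."""
    # Memory SYSTEM access time: RAM access + driver/receiver + MMU delays
    system_access = ram_access_ns + glue_ns + mmu_ns
    # Fastest clock period at which the chip's access window
    # (access_clocks clocks) still covers that access time, zero waits
    period = system_access / access_clocks
    # Memory cycle time as driven by the CPU at that clock
    cpu_cycle = cycle_clocks * period
    return raw_cycle_ns / max(cpu_cycle, raw_cycle_ns)

# Hypothetical chip A: 3 access clocks out of a 4-clock bus cycle
print(mem_utilization(150, 20, 30, 3, 4, 240))   # ~0.9, memory rarely idle
# Hypothetical chip B: only 2 access clocks out of a 4-clock bus cycle
print(mem_utilization(150, 20, 30, 2, 4, 240))   # ~0.6, memory idle 40%
```

Same RAM, same glue, same MMU: the chip with the better access-to-cycle
ratio keeps the expensive memory substantially busier.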

If you have reason to believe that one machine is GROSSLY more instruction
stream efficient than the other (average bits/instruction), then you can
scale a little for that, but be careful. Such interpretations are tricky
(what is an "average instruction"?). The best way to do that is to take
some fairly large modules of frequently used code (say, pieces of "libc")
and hand code them in assembler as tight as possible. (Comparisons of
individual instructions are meaningless.) Look at total memory cycles
required for the entire function (don't forget a byte often costs the
same as a word), and scale by the memory utilization calculated above.
That gives you "functions per mem-access-time", which is a measure that
can be used across a fairly large evolution in CPU clock and memory
access times (which occur as chips get better).
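The "functions per mem-access-time" figure of merit above might be
computed like this (modern Python sketch; the cycle counts and
utilization numbers are invented for illustration):

```python
def functions_per_mem_time(total_mem_cycles, utilization):
    """Work done per unit of raw memory cycle time: hand-counted memory
    cycles for the whole function, scaled by memory utilization."""
    return utilization / total_mem_cycles

# Hypothetical chip A: more cycles per function, but a busier bus
a = functions_per_mem_time(1200, 0.90)
# Hypothetical chip B: fewer cycles per function, but an idler bus
b = functions_per_mem_time(1000, 0.60)

print(a > b)   # fewer memory cycles doesn't win if the bus sits idle
```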

Whatever you do, don't try to compare CPU clock speeds alone. Even
within a chip family, it's bogus. (A 20 MHz 68000 is twice as fast
as a 10 MHz 68000 ONLY with an infinitely fast memory system with
no real-world components.)
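That last parenthetical is easy to check numerically (modern Python
sketch; the 3-access-clock, 4-clock-bus-cycle figures are illustrative,
not 68000 data-sheet timings):

```python
import math

def bus_cycle_ns(clock_mhz, base_clocks, access_clocks, mem_access_ns):
    """Bus cycle time once wait states are added for a real memory."""
    period = 1000.0 / clock_mhz
    # Wait states needed so the access window still covers the memory
    waits = max(0, math.ceil(mem_access_ns / period) - access_clocks)
    return (base_clocks + waits) * period

slow = bus_cycle_ns(10, 4, 3, 300)   # zero waits needed: 400 ns cycle
fast = bus_cycle_ns(20, 4, 3, 300)   # 3 waits needed: 350 ns cycle
print(slow / fast)                   # ~1.14x, nowhere near 2x
```

Doubling the clock against the same 300 ns memory buys about 14%, not
100%, because the wait states eat the difference.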

Rob Warnock

UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphins Drive, Redwood City, CA 94065

rcm@tropix.UUCP (Robert C. Moore) (02/10/84)

Rob's comments on the relative speed comparisons of micros are quite
correct.  It is important to note the effective instruction execution
speed including the effects of MMUs, bus arbitration, memory speeds,
and so forth --- unless there is an intervening cache.  The cache speed
is then most important, as well as the cache size (and thus its hit
rate).

For example, the 16k can get data from memory in only 3 clock cycles,
but with its MMU, the number jumps to 4 (assuming very fast memory).
If the 68451 MMU is used alone with the 68000, getting only 2 waits at
12.5 MHz is considered pretty good (i.e., 6 clock cycles).  But if a
translation cache is put around it, no-wait operation (4 clock cycles)
is pretty easy with conventional dynamic RAM.  With both a translation
cache and a data cache, no-wait operation of the 68000 at 12.5 MHz is
trivial, although the cache size will determine the degradation due to
imperfect hit rate.
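The hit-rate degradation is the standard effective-access-time formula
(modern Python sketch; the 95% hit rate and the two latencies are
assumed numbers, not measurements of any of these parts):

```python
def effective_access_ns(hit_rate, cache_ns, memory_ns):
    """Average access time seen by the CPU through a cache."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

# 95% hits at 80 ns, 5% misses at 480 ns -> about 100 ns on average
print(effective_access_ns(0.95, 80, 480))
```

Even a modest miss rate against slow main memory dominates the average,
which is why cache size (and hence hit rate) matters as much as cache
speed.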

The 68020 provides a virtual cache inside the processor, neatly
avoiding the delays in address translation and main memory cycle time.
A hidden benefit is the fact that the internal cycles are synchronous,
avoiding the need to repeatedly sample the DTACK (actually DSACK)
asynchronous handshake line to prevent metastable states from
propagating into the chip.  ("You are not expected to understand
this.")  In short, an average of one byte is consumed off the
instruction stream on each clock cycle.  (The shortest instructions
require 2 cycles, and are two bytes long.)

Compare this to the 32032.  There the state machine is unchanged from
the 16032.  It contains no cache.  The 16032 already underutilizes its
bus (in fact the 16008 is almost as fast, as the 8-byte prefetch queue
is almost always full).  The 32032 will only go slightly faster than
the 16032 in such circumstances.  It will, however, leave enough bus
time available that one could credibly run two processors on the same
bus!

All this discussion assumes that, with register-rich instruction sets,
most of the effects on system timing are due to the time needed to
access text (programs), and that data cycles have very little effect.
Does anyone have any hard numbers on 16k bus utilization or text/data
access ratios for either of these chips?

bob moore 
ihnp4!tropix!rcm