[comp.sys.amiga] 68040 specs

dtiberio@csserv1.ic.sunysb.edu (David Tiberio) (12/07/90)
  I am going to give everyone the specs file for the 68040 chip. Maybe
we can gouge through it and see what is relevant...and sorry if it comes
up double spaced! It should be about 6k. :)

David Tiberio  SUNY Stony Brook 2-3605  AMIGA  Toto Productions  DDD Men

Captured from the bix network:

==========================
microbytes/features #243, from microbytes, 11743 chars, Mon Jan 22 16:33:53 1990
--------------------------
TITLE:  FIRST IMPRESSION: Motorola's New 68040 Microprocessor


by Tom Thompson



---------------------------
This new CISC microprocessor offers RISC performance
---------------------------

Motorola has officially unwrapped its newest 32-bit
microprocessor, the 68040. Manufactured with 0.8-micron
high-speed CMOS technology, the 68040 packs 1.2 million
transistors on a single silicon die. With 900,000 more
transistors to work with than the roughly 300,000 in a
68030 processor, the 68040's designers added new features
and boosted performance. New features include the following:



-- Optimized 68030 integer unit. While retaining object-code
compatibility with previous 68000-family processors, the IU
has been optimized to execute instructions in fewer clock
cycles (i.e., run faster). The claimed boost in performance is
three times that of a 68030.

-- Integral FPU. The 68020 and 68030 require external FPU
coprocessor chips to handle floating-point math. The 68040,
however, has an FPU built into it, giving it the power to do
serious number crunching. The FPU's data types are
compatible with the ANSI/IEEE 754 standard for binary
floating-point math, and its instruction set is object
code-compatible with Motorola's 68881/68882 FPUs. Like
the IU, the 68040's on-chip FPU has been optimized to
execute frequently used instructions using fewer clock
cycles. The claimed performance boost is 10 times that of a
68882.

-- Large caches. Processor accesses to the system bus are
minimized by storing the most recently used set of
instructions or data in on-chip, 4K-byte caches. Both caches
operate independently but can be accessed at the same time.
Bus snoop logic is used to maintain cache coherency (i.e., it
ensures that the cache's contents match those parts of
memory corresponding to the cache). The bus snooper's design
is fine-tuned to support multiprocessor systems where one
or more bus masters or 68040s might share the same section
of memory.

-- Separate memory units for instructions and data. Each
memory unit consists of a memory management unit, a cache
controller, and bus snoop logic. The MMUs use a subset of the
68030's MMU instruction set. Both memory units function
independently of each other to improve processor throughput.

  The 68040 ships with an initial clock speed of 25 MHz;
higher speeds are to be available in the future, Motorola says.
The 68040 comes in a 179-pin grid-array package. With the
elimination of coprocessor function lines (now that the MMU
and FPU are consolidated onto the processor) and the addition
of snoop control lines, the 68040 is not pin-compatible with
the 68030.

Because of the 68040's software compatibility with its
predecessors, it can tap into the existing software base of
680x0 applications. It does this not only while eliminating a
component (the FPU) from a computer's design, but also while
improving performance. In fact, the 68040 executes nearly one
instruction per clock cycle on average -- the same as a RISC
processor.



Fine-Tuned for Performance

The 68040 was built on the firm foundation of its
predecessors. The design team used the experience garnered
from developing earlier processors to aid in optimizing the
throughput of the 040.

The 040 was designed from the ground up, Motorola engineers
said. It incorporates a high degree of parallelism using a
number of internal buses. An internal Harvard architecture
gives the processor full access to both instructions and data.
Both the IU and FPU have separate pipelines and can operate
concurrently. For example, the FPU can perform
floating-point instructions independently of the IU. Each
stream (instructions or data) has its own dedicated cache and
MU that function independently of each other. A smart bus
controller assigns priorities to bus traffic to and from the
caches.

There were several key areas where Motorola was able to
boost performance. The first was in reducing the clock cycles
needed to execute certain instructions. The next was to
ensure that the processor funnels instructions and data into
itself quickly and constantly, lest it stall while waiting on
information. The processor then gets its results back into the
system without interfering with incoming information.
Finally, as if this wasn't enough, the processor stays off the
system bus to a greater extent than is the case with other
processor designs. This lets DMA transfers and other bus
masters have use of it.



CISC with the Speed of RISC

The IU was optimized so that high-usage instructions execute
in fewer clock cycles, particularly branch instructions.
Motorola said it performed thousands of code traces using
real-world applications to determine which instructions
were used most often. The IU consists of six stages: instruction
prefetch, decode, effective address calculation, operand
fetch, execution, and writeback (i.e., the result is written to
either a register or to memory). Each stage works
concurrently on the instruction pipeline. Dual prefetch and
decode units deal with the branch instructions: One set
processes the instruction taken on the branch, and another
processes the instruction not taken. In this way, no matter
what the outcome, the IU has the next instruction decoded and
ready to go without seriously disrupting the pipeline. This
complex design has a big payoff: Motorola has determined
that the average instruction takes 1.3 clock cycles to
execute. The ability to execute an instruction once per clock
cycle is the performance edge of RISC processors -- yet the
68040's IU accomplishes the same goal while executing
complex-instruction-set computer (CISC) instructions.
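
As a rough cross-check, the 1.3-cycle average can be turned into an
instruction rate at the initial 25-MHz clock. The sketch below is plain
arithmetic on the figures quoted in this article; the function name
mips_from_cpi is made up for illustration, and the result is an estimate,
not a measured benchmark.

```python
def mips_from_cpi(clock_mhz, cycles_per_instruction):
    """Millions of instructions per second for a clock rate and average CPI."""
    return clock_mhz / cycles_per_instruction


# 25 MHz and 1.3 cycles per instruction are the article's figures.
rate = mips_from_cpi(25.0, 1.3)
print(f"{rate:.1f} MIPS")  # prints "19.2 MIPS"
```

That works out to a little over 19 MIPS, close to the preliminary 20-MIPS
figure quoted at the end of the article.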

The FPU adds 11 registers to the 68040 register set: Eight of
them are 80-bit floating-point registers, and three are
status, control, and instruction address registers. The FPU
has a three-stage execution unit, and, like the IU, each stage
operates concurrently. Load and store instructions (FMOVE)
can be performed during other arithmetic operations, and a
64- by 8-bit hardware multiplication unit speeds many
calculations. However, the FPU only implements a subset of
the 68882 instructions on-chip. The transcendental
(trigonometric and exponential) functions are emulated in
software via a software trap. But Motorola claims that even
these instructions should execute 25% to 100% faster on
a 25-MHz 68040 than on a 33-MHz 68882 FPU.



Boosting Throughput

In the area of throughput, each stream is managed by a
separate memory unit that uses an MMU for
logical-to-physical address translations during bus accesses.
These MMUs support demand-paged virtual memory. Both
MMUs have a four-way set-associative address translation
cache (ATC) with 64 entries (versus 22 entries for the 68030).
The ATCs reduce [...] is done, you must examine the
structure of the cache. Each is a four-way set-associative
cache composed of 64 sets of four lines. A line consists of 4
longwords, or 16 bytes. Cache lines are read or written
rapidly using burst-mode access (a type of bus transfer that
moves 16 bytes in a minimum of clock cycles). For read
operations, this fills the cache efficiently and, at the same
time, loads adjacent instructions or data into the cache that
could be used in the near future.
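
The cache geometry described above multiplies out to the 4K-byte size
quoted earlier. The Python sketch below checks that and shows one way an
address could be split into tag, set index, and byte offset. The article
does not say which address bits select the set, so the straightforward
low-order layout used here is an assumption, and split_address is a
hypothetical helper, not a Motorola interface.

```python
WAYS = 4          # four-way set-associative
SETS = 64         # 64 sets of four lines
LINE_BYTES = 16   # one line = 4 longwords of 4 bytes

CACHE_BYTES = WAYS * SETS * LINE_BYTES      # 4 * 64 * 16 = 4096 bytes

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 4 bits address a byte in a line
INDEX_BITS = SETS.bit_length() - 1          # 6 bits select one of 64 sets


def split_address(addr):
    """Split a 32-bit address into (tag, set_index, byte_offset).

    Assumes the set index sits directly above the byte offset.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(CACHE_BYTES)                 # 4096
print(split_address(0x12345678))   # (298261, 39, 8)
```

Under this layout, any two addresses that share a tag and set index fall in
the same 16-byte line, which is why a burst fill also brings in the
neighboring longwords.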



Zen and the Art of Cache Maintenance

As the cache is accessed and data modified, cache-mode bits
in the ATC determine, on a page-by-page basis, the method by
which the information is handled. That is, the ATC entry that
corresponds to the address in main memory whose contents
were copied into the cache decides how the data will be
updated. The modes are cacheable write-through, cacheable
copyback, noncacheable, and noncacheable I/O.

In the cacheable write-through mode, an update to the data
cache forces a write to main memory. While this generates
additional bus activity, this mode is required when working
with a portion of memory that other processors share. The
copyback mode updates the cache line but without updating
main memory. The modified (or "dirty") cache line is copied
back into main memory only when absolutely necessary.
"Noncacheable" indicates that the data shouldn't be cached,
which is typically the situation for shared data structures or
for locked accesses (e.g., an operand access or a translation
table entry update). Noncacheable I/O indicates that the data
can't be cached and must be read or written in the exact order
of instruction execution. This mode is for memory-mapped
I/O devices (typically a serial device) where the information's
order is crucial.
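
The difference between the two cacheable modes can be sketched in a few
lines of Python. This is a toy model only: ToyCache, its fields, and its
flush method are invented for illustration. It tracks single addresses
rather than 16-byte lines, has no eviction, and treats an explicit flush as
the point where copying a dirty line back becomes necessary.

```python
class ToyCache:
    """Toy data cache that counts how many writes reach main memory."""

    def __init__(self, mode):
        assert mode in ("write-through", "copyback")
        self.mode = mode
        self.lines = {}       # address -> value held in the cache
        self.dirty = set()    # addresses modified but not yet in memory
        self.memory = {}      # stand-in for main memory
        self.bus_writes = 0   # writes that actually went out on the bus

    def write(self, addr, value):
        self.lines[addr] = value
        if self.mode == "write-through":
            self._write_memory(addr, value)   # every update goes to memory
        else:
            self.dirty.add(addr)              # copyback: defer the write

    def flush(self):
        """Copy dirty entries back to memory (a no-op in write-through)."""
        for addr in sorted(self.dirty):
            self._write_memory(addr, self.lines[addr])
        self.dirty.clear()

    def _write_memory(self, addr, value):
        self.memory[addr] = value
        self.bus_writes += 1


wt, cb = ToyCache("write-through"), ToyCache("copyback")
for i in range(10):
    wt.write(0x100, i)
    cb.write(0x100, i)
cb.flush()
print(wt.bus_writes, cb.bus_writes)   # 10 1
```

Ten updates to the same location cost ten bus writes in write-through mode
but only one in copyback mode, at the price of main memory being stale until
the flush.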

The bus snooper is used in multiple bus master situations
where a noncaching bus master, such as a DMA controller,
might modify the memory that is mapped into the 68040's
cache. The bus snooper monitors the external bus and updates
the cache as required.

Cache validity is handled on a line-by-line basis (i.e., a cache
miss triggers a burst-mode access that updates 16 bytes
either in the cache or main memory). The copyback mode
minimizes writes to main memory, and the bus controller
prioritizes each cache's external memory requests. Read
requests take priority over writes to ensure that the
pipelines remain filled.
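
The read-before-write policy can be illustrated with a small priority
queue. This is a sketch under assumptions, not the 68040's actual bus
controller: BusQueue and its methods are hypothetical names, and the model
ignores hazards such as a read to an address that still has a write pending.

```python
import heapq
from itertools import count

READ, WRITE = 0, 1   # lower value is served first, so reads beat writes


class BusQueue:
    """Pending external memory requests, reads served ahead of writes."""

    def __init__(self):
        self._heap = []
        self._arrival = count()   # tie-breaker: keeps same-kind requests in order

    def request(self, kind, addr):
        heapq.heappush(self._heap, (kind, next(self._arrival), addr))

    def next_request(self):
        kind, _, addr = heapq.heappop(self._heap)
        return ("read" if kind == READ else "write", addr)


q = BusQueue()
q.request(WRITE, 0x10)
q.request(READ, 0x20)
q.request(WRITE, 0x30)
q.request(READ, 0x40)
served = [q.next_request() for _ in range(4)]
print(served)
# [('read', 32), ('read', 64), ('write', 16), ('write', 48)]
```

Both reads are served before either write, even though a write arrived
first, which is the behavior that keeps the pipelines supplied.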

The caches are critical to the 040's overall throughput. They
keep instructions and data moving into the processor while
satisfying the apparently contradictory role of minimizing
system bus accesses. Motorola estimates that the cache hit
rate is about 93 percent for instruction and data reads and
about 94 percent for data writes.
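
To put those hit rates in terms of bus traffic, the sketch below counts how
many accesses out of a thousand would miss the cache. It is simple
arithmetic on Motorola's estimates; bus_accesses is a hypothetical helper,
and the model ignores copyback traffic and the fact that each miss triggers
a 16-byte burst rather than a single transfer.

```python
def bus_accesses(total_accesses, hit_rate):
    """Accesses that miss the cache and must go to the system bus."""
    return round(total_accesses * (1.0 - hit_rate))


print(bus_accesses(1000, 0.93))   # 70 of 1000 reads miss
print(bus_accesses(1000, 0.94))   # 60 of 1000 data writes miss
```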



A Processor for the 1990s

It is perhaps appropriate that Motorola has introduced the
68040 in the first month of the 1990s. The 040 has the
power to tackle the jobs with large amounts of information
that we will be dealing with regularly in the next ten or so
years.

Preliminary results have a 68040 weighing in at 20 million
instructions per second versus the SPARC's 18 MIPS and the
80486's 15 MIPS, all clocked at 25 MHz. On floating-point
operations, the 68040 antes up 3.5 million floating-point
operations per second versus the SPARC's 2.6 MFLOPS and
the 80486's 1 MFLOPS. If these numbers are accurate, then
the 68040 already outperforms one RISC processor.
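
Expressed as ratios, the preliminary figures look like this. The sketch
only divides the numbers quoted above, which the article itself treats as
unconfirmed; speedup_of is a name made up for this example.

```python
MIPS = {"68040": 20.0, "SPARC": 18.0, "80486": 15.0}
MFLOPS = {"68040": 3.5, "SPARC": 2.6, "80486": 1.0}


def speedup_of(table, chip):
    """Ratio of one chip's score to every chip in the table, to 2 places."""
    base = table[chip]
    return {name: round(base / score, 2) for name, score in table.items()}


print(speedup_of(MIPS, "68040"))    # {'68040': 1.0, 'SPARC': 1.11, '80486': 1.33}
print(speedup_of(MFLOPS, "68040"))  # {'68040': 1.0, 'SPARC': 1.35, '80486': 3.5}
```

On these numbers the integer lead over the SPARC is about 11 percent, while
the floating-point lead over the 80486 is a factor of 3.5.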

But the computer industry doesn't stand still. As we move
into the new decade, we can expect new RISC processors to
once again take the lead in performance. Still, the 68040
shows that owners of CISC systems can have their cake and
eat it, too. They don't have to forsake their software base or
settle for mediocre performance. And Motorola is already
working on the 68050.



------------------------------------------------------

Tom Thompson is a BYTE senior technical editor. He can be
reached here on BIX as tom_thompson.

end of file

Okay, I hope it worked! 

(Reprinted with permission from God).

:)