[comp.arch] hard data on Motorola 88000

andrew@frip.gwd.tek.com (Andrew Klossner) (04/19/88)

The announcement is today, so I guess it's okay to talk hard data on
the Motorola 88000 architecture.

The 88100, the CPU chip, includes a floating point processor.  The
88200 is the CMMU (cache/memory management unit).  The CPU uses a
Harvard architecture (separate memory ports for instruction and data)
so a minimum configuration is one CPU and 2 CMMUs.  It cycles at 20MHz
initially, with 25MHz expected before long.

The CPU itself, excluding the floating point unit, looks much like
everybody else's RISC CPU.  There are 32 registers, with r0 hardwired
to zero.  (No register windows.)  There is hardware stalling on a
register scoreboard.  ALU instructions take three register addresses,
two operands and a destination.  They all execute in one cycle, except
for integer multiply/divide.  (There is result forwarding, so a
destination register can be used in the next instruction without
stalling.)  Load/store instructions can take a 16 bit offset and an
index register, which can be scaled by a factor of 1, 2, 4, or 8.  To
get to an arbitrary 32-bit address, you need two instructions:

	or.u	r2,r0,hi16(address)	; high 16 bits of address to r2
	ld	r2,r2,lo16(address)	; load word into r2

There is a three-deep pipeline for instruction fetch and a three-deep
pipeline for data fetch/store.  Branch instructions have one delay
slot, and each branch instruction has a bit which means execute the
instruction in the delay slot before branching.  Load instructions take
three cycles if the target memory location is already in cache.  Store
instructions get started in one cycle if the data pipeline isn't full,
otherwise they stall.

The on-chip floating point unit implements floating point
add/subtract/multiply/divide/compare and integer multiply/divide.
Floating point instructions can freely mix single and double precision,
which are the usual IEEE format 32- and 64-bit words.  The add/subtract
portion is separate from the multiply portion and both are pipelined,
so, for example, there can be three multiplies going at one time.  But
the divide instruction takes over the whole FP unit and iterates
through it.  Integer multiply takes 4 cycles; integer divide takes 39.
Single precision add/sub/cmp/mul/convert takes 5 cycles; single divide
takes 30; double add/sub/cmp/convert takes 6; double mul takes 10;
double divide takes 60.  Curiously, an integer divide with a negative
operand traps and makes the kernel complete the operation; I guess
Motorola just ran out of silicon.

Each CMMU has 16k bytes of RAM, organized as a 4-way set associative
cache.  You can have as many as 4 CMMUs on each memory port.  The cache
is physically addressed, and the cache lookup, indexed by the offset
within the page, proceeds in parallel with the logical-to-physical
address translation to speed things up.  The MMU is a subset of Motorola's
PMMU chip, with the usual two-level page tables and all the necessary
bits (referenced, dirty, etc) in the page descriptor words.  The CMMU
includes a page address translation cache which can describe 56
entries, and a block address translation cache which can be used to
avoid page table walks for memory that's locked down, like kernel code
and data.

A cache line is 16 bytes.  On a cache miss during fetch, the whole line
must be loaded from memory before the fetch is satisfied.  On a cache
miss during store, the whole line is loaded, then the modified word is
written to memory; a cache hit during store does not cause the word to
be written.

The CMMUs include logic to do bus snooping and maintain cache
coherency, so you can throw several CPU/CMMU lashups onto the same
memory bus.  Motorola is playing this up in their advertising, claiming
17 MIPS for one CPU and 50 MIPS for a multi-CPU system.

Unix system V release 3 is up and running (single-CPU).  A reference
port will be sold by either Motorola or Unisoft.  A binary
compatibility standard, which eventually will be blessed by AT&T and be
an ABI, is coming along.

We at Tektronix have been designing a workstation around this chip set
for several months.  I like it.

Don't ask me what price or availability are; I don't know the answers
for the general public.  As a member of the 88open consortium,
Tektronix negotiated favorable terms.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

fnf@mcdsun.UUCP (Fred Fish) (04/22/88)

In article <9916@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>The announcement is today, so I guess it's okay to talk hard data on
>the Motorola 88000 architecture.

I hope so too, or we will both be in trouble...  :-)

Andrew presents lots of interesting information about our new baby, but
I'd like to elaborate on one point before rumors get started that all
loads and stores in a 32-bit address space require two instructions.

>           Load/store instructions can take a 16 bit offset and an
>index register, which can be scaled by a factor of 1, 2, 4, or 8.  To
>get to an arbitrary 32-bit address, you need two instructions:
>
>	or.u	r2,r0,hi16(address)	; high 16 bits of address to r2
>	ld	r2,r2,lo16(address)	; load word into r2

We recognized early in the development cycle of the C compiler and
associated tools that the 16-bit immediate values in some instructions
had the potential to get us into the same ugly mess that the 80x86 camp
is in, with multiple memory "models" directly visible to the programmer.
We wanted to hide this as much as possible, so that those programming in
a high level language, and to some extent those programming in assembler,
could simply treat the machine as if it had a linear 32-bit address
space, with no special contortions necessary for access to any particular
object, no matter how large.  To demonstrate one of the features of the
tool set that accomplishes this goal, consider the following example program:

	char array[(4 * 64 * 1024) + 1];

	main ()
	{
		array[0 * 64 * 1024] = 1;
		array[1 * 64 * 1024] = 1;
		array[2 * 64 * 1024] = 1;
		array[3 * 64 * 1024] = 1;
		array[4 * 64 * 1024] = 1;
	}

The compiler produces the following assembly code (with comments stripped
by hand for the sake of saving some space):

		global		_main
		text
	_main:
		addu		r20,r0,1
		st.b		r20,r0,_array
		st.b		r20,r0,_array+65536
		st.b		r20,r0,_array+131072
		st.b		r20,r0,_array+196608
		st.b		r20,r0,_array+262144
		jmp		r1
		data
		comm		_array,262145

Note the lack of any hi16/lo16 pseudofunctions.  The compiler just
emits the straightforward, obvious code.  Note that the assembler
does not do any particular magic with this code either.  Any expressions
that do not evaluate to a constant small enough to fit into the allocated
slot in the object code are simply passed on to the linker for evaluation.
Below is a disassembly of the relevant section of the .o file produced by
the above assembly code:

       _main 62800001 addu        r20,r0,$0001
   $00000004 2E800000 st.b        r20,r0,$0000
   $00000008 2E800000 st.b        r20,r0,$0000
   $0000000C 2E800000 st.b        r20,r0,$0000
   $00000010 2E800000 st.b        r20,r0,$0000
   $00000014 2E800000 st.b        r20,r0,$0000
   $00000018 F400C001 jmp         r1 (_main)


Now is where the interesting stuff starts.  The linker is allocated
registers r26-r29, to use in any way it sees fit.  By convention,
the linker is also guaranteed that no user code will ever play with
these registers.  For the example above, the linker decides that its
most efficient use of the registers, based on the final address of the
data section and some other factors, is to segment the data section
into three 64K segments, followed by an "infinite" length segment.
The first three registers, r26, r27, and r28, are set up as base pointers
to these first three segments, and the last linker register, r29, is
reserved for synthesizing 32-bit addresses into the remaining "infinite"
length segment.  In effect, r29 becomes a dynamically changing base
pointer, updated on an instruction-by-instruction basis to point to
the 64K data segment containing the referenced object.
When the linker does its work, it actually patches the object code,
changing register assignments and inserting instructions as necessary,
to produce the following code, which ultimately gets executed:

       _main 62800001 addu        r20,r0,$0001              
    _main+$4 2E9A0028 st.b        r20,r26,$0028
    _main+$8 2E9B0028 st.b        r20,r27,$0028
    _main+$C 2E9C0028 st.b        r20,r28,$0028
   _main+$10 5FA00043 or.u        r29,r0,$0043
   _main+$14 2E9D0028 st.b        r20,r29,$0028
   _main+$18 5FA00044 or.u        r29,r0,$0044
   _main+$1C 2E9D0028 st.b        r20,r29,$0028

Note that the data section for this sample starts at 0x400000.  The
$0028 offset comes from the fact that crt0.o contains $0028 worth of
data that gets linked before our test array; i.e., the address of
_array ends up being 0x400028.  With this strategy, we have the
best of both worlds.  Loads and stores to objects low in the data
space use the more efficient single-instruction form, while loads
and stores to objects far into the data space use the two-instruction
form, and all of this is completely transparent to the programmer.
He did not have to decide in advance whether to use a "small model"
or "huge model" for his program.

This is just the tip of the iceberg; there are lots of other optimizations
that become obvious.  By examining the static and dynamic characteristics
of the program, the data section objects can be sorted to get the most
frequently used objects into low data memory.  The linker might also 
decide that certain sections of the program reference portions of
data memory more often than others, and insert the appropriate code to
change the data mapping on the fly, rather than using a static mapping.

One loose end in our example needs to be tied up.  How do r26, r27, and
r28 get initialized?  The answer lies in crt0, where the linker patches
a section of code to initialize any registers it uses:

   __start     5F400040 or.u        r26,r0,$0040
   __start+$4  5B5A0000 or          r26,r26,$0000
   __start+$8  5F600041 or.u        r27,r0,$0041
   __start+$C  5B7B0000 or          r27,r27,$0000
   __start+$10 5F800042 or.u        r28,r0,$0042              
   __start+$14 5B9C0000 or          r28,r28,$0000             
   __start+$18 5FA00000 or.u        r29,r0,$0000              
   __start+$1C 5BBD0000 or          r29,r29,$0000             

I hope you have found this little example interesting.  I should note
that the general idea of having the linker synthesize necessary instruction
streams to hide the 16-bit literal constant problem was first proposed to
me by a long time Motorolan architecture expert, Bob Greiner.

-Fred
-- 
# Fred Fish    hao!noao!mcdsun!fnf    (602) 438-3614
# Motorola Computer Division, 2900 S. Diablo Way, Tempe, Az 85282  USA

jlw@lznv.ATT.COM (j.l.wood) (04/24/88)

In article <833@mcdsun.UUCP>, fnf@mcdsun.UUCP (Fred Fish) writes:
> In article <9916@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
> >The announcement is today, so I guess it's okay to talk hard data on
> >the Motorola 88000 architecture.
> 
> 
> I hope you have found this little example interesting.  I should note
> that the general idea of having the linker synthesize necessary instruction
> streams to hide the 16-bit literal constant problem was first proposed to
> me by a long time Motorolan architecture expert, Bob Greiner.
> 

Delaying the decision making this long looks like an interesting approach to
the problem.  It would, however, have consequences for two types of routines
used in, for example, UNIX SVR3.1: shared libraries and loadable device
drivers.  The loadable device drivers could be fixed by having the loader
use the same fixup methodology as the linker.  This shouldn't slow things
down too much, except for rebooting, which hopefully is a rare occurrence.
For the shared libraries, unless another register can be "understood" to
be pointing somewhere for them, I don't see how they could have any a
priori knowledge of which register is pointing where.  The whole idea of
reserving registers for this purpose is to avoid having to continually
munge registers; i.e., they'd be set on a load module basis.  But if the
register fixing up were placed in a special (csav/cret) pair for shared
libraries, then you lose a large part of that savings.


Joe Wood
lznv!jlw

aglew@urbsdc.Urbana.Gould.COM (04/26/88)

..> Joe Wood comments about the 88000's link
..> strategy for offsets too large to fit in
..> instruction, and shared libraries.

If your shared libraries are always linked
in at the same address, and make no calls
outside the library, I don't think that there
is any problem.