[comp.arch] architecture/implementation -- 88000

earl@mips.COM (Earl Killian) (05/24/88)

(See previous postings for background.)

(Thanks to Andrew Klossner for his help on this one.)

> Architecture Reference

Where is the architecture fully described?

  -- Technical Summary: 32-Bit Concurrent RISC Microprocessor (27-page
     data sheet)

  -- MC78000 User's Manual Revision 0.4, October 7, 1987, Advanced
     Information (100+ page document, describes registers,
     instructions, exception processing, and timing information in
     detail; it has no doubt been renamed by now)

  -- Technical Summary: 32-Bit Cache/Memory Management Unit (CMMU)
     (19-page data sheet)

  -- MC78200 Cache and Memory Management Unit (CMMU) Architecture Spec.
     version 2.0, November 3, 1986, Advanced Information (80-page
     document, describes pre-production CMMU in detail)

  -- MC78200 User's Manual Revision 0.1, November 29, 1987, Advanced
     Information (80+ page document, like above but includes
     architecture changes which will appear in the production chip)

> Peak native MIPS

What is the clock cycle time? 20MHz (50ns)
What is the peak native MIPS rate? 20mips

> Implementation technology

What are the parameters of the implementation technology? 1.5micron CMOS
How many chips of what kinds to build a cpu subsystem?
	1 88100
	2-8 88200s
How many pins on those chips?
	Each chip is in a 17 pin by 17 pin package, 181 pins apiece.

> Instruction format

What instruction sizes are used? 32 bits
What size are immediate operands? 16 bits
What size are branch displacements? 16 bits (+-128KB)
What size are unconditional branch and call displacements? 26 bits (+-128MB)
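
The two ranges follow from the field widths if the displacement is a
signed count of 32-bit instruction words (an assumption consistent with
the figures above, not something the posting spells out).  A minimal
sketch of the arithmetic; the function name is made up for illustration:

```python
def branch_reach_bytes(disp_bits, insn_bytes=4):
    """Reach, in bytes, of a signed displacement counted in instructions."""
    # A signed field of disp_bits bits spans -2**(disp_bits-1) words and up.
    words = 1 << (disp_bits - 1)
    return words * insn_bytes

KB = 1 << 10
MB = 1 << 20

print(branch_reach_bytes(16) // KB, "KB")   # conditional branches
print(branch_reach_bytes(26) // MB, "MB")   # unconditional branches and calls
```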

> Integer Registers

How are the registers organized [simple, windowed]? simple
How many total integer registers? 32 32-bit registers
Hardwired zero register? yes, r0

4 registers reserved for linker

> Integer Alu

What is the logical latency/issue/repeat? 1/1/1
What is the shift latency/issue/repeat? 1/1/1
What is the add latency/issue/repeat? 1/1/1
What is the compare latency/issue/repeat? 1/1/1
How is 64 bit (signed/unsigned) integer addition supported and how many cycles?
	An "addu.co" instruction followed by an "add.ci" or "addu.ci"
	instruction.  Each is 1/1/1 for a total of 2/2/2.
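
The two-instruction sequence can be modeled as a low-word add that
produces a carry, followed by a high-word add that consumes it.  This is
a sketch of the idea only; the helper names are hypothetical and the
carry is passed explicitly rather than through processor state:

```python
MASK32 = 0xFFFFFFFF

def add_carry_out(a, b):
    """Low-word step: return (32-bit sum, carry out)."""
    total = (a & MASK32) + (b & MASK32)
    return total & MASK32, total >> 32

def add_carry_in(a, b, carry):
    """High-word step: add the carry from the low word."""
    return ((a & MASK32) + (b & MASK32) + carry) & MASK32

def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit add built from the two 32-bit steps above."""
    lo, carry = add_carry_out(a_lo, b_lo)
    hi = add_carry_in(a_hi, b_hi, carry)
    return hi, lo

print(add64(0x00000000, 0xFFFFFFFF, 0x00000000, 0x00000001))
```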

> Branches

Which operand comparisons are implemented in the conditional branch
instruction, and which require a separate instruction?
	branch instructions: = 0, != 0, > 0, < 0, >= 0, <= 0
			bit set, bit clear
	Everything else requires a separate compare instruction.

Where is the result of separate comparisons stored [registers,
condition codes]? registers

Which forms of branch delay are present in instruction set
[execute N if no branch, execute N if branch, execute N always]?
	execute 1 always and execute 1 if no branch
What are the taken and not-taken cycle counts for each branch type,
not including the N delayed instructions, if executed?
	execute 1 always: 1 cycle, taken or not
	execute 1 if no branch: 1 cycle untaken, 2 cycles taken
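
Since compare results live in a general register and the branch
instructions can test a single bit, a general comparison can be built
as "compare, then branch on bit".  A rough sketch of that pairing; the
bit positions below are invented for illustration and are not taken
from the manual:

```python
# Bit positions are hypothetical, chosen only for this illustration.
EQ, NE, GT, LE, LT, GE = range(6)

def compare(a, b):
    """Return a word with one bit set per relation that holds."""
    relations = {
        EQ: a == b, NE: a != b,
        GT: a > b,  LE: a <= b,
        LT: a < b,  GE: a >= b,
    }
    return sum(1 << bit for bit, holds in relations.items() if holds)

def branch_if_bit_set(word, bit):
    """Model of a branch-on-bit-set test against a general register."""
    return ((word >> bit) & 1) == 1

flags = compare(3, 5)
print(branch_if_bit_set(flags, LT))
print(branch_if_bit_set(flags, GE))
```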
	
> Loads/Stores

What addressing mode(s) do load instructions use?
	register + 16-bit unsigned displacement
	register + register
	register + register*size
What addressing mode(s) do store instructions use?
	same
Which load/store sizes are supported [8, 16, 32, 64]? 8, 16, 32, 64
What is the load latency/issue/repeat? 3/1/1 for 8-32, 4/2/2 for 64

What is the store latency/issue/repeat? 1/1/1 for 8-32, 2/2/2 for 64
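
The three addressing modes reduce to three effective-address
calculations.  A minimal sketch, assuming 32-bit wraparound and that
"size" is the access size in bytes (1, 2, 4 or 8); the function names
are hypothetical:

```python
MASK32 = 0xFFFFFFFF

def ea_displacement(base, disp16):
    """register + 16-bit unsigned displacement"""
    if not 0 <= disp16 <= 0xFFFF:
        raise ValueError("displacement must fit in 16 unsigned bits")
    return (base + disp16) & MASK32

def ea_indexed(base, index):
    """register + register"""
    return (base + index) & MASK32

def ea_scaled(base, index, size):
    """register + register*size, size being the access size in bytes"""
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4 or 8 bytes")
    return (base + index * size) & MASK32

print(hex(ea_displacement(0x1000, 0xFFFF)))
print(hex(ea_scaled(0x2000, 3, 8)))
```

Note that an unsigned displacement cannot reach below the base register,
so negative offsets need one of the register + register forms.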

> Integer Multiply/Divide

How is multiply implemented [software, multiply step, hardware]? hardware
How many cycles to perform 32x32->32 multiply? 4/1/1

How is divide implemented [software, divide step, hardware]? hardware
How many cycles to perform 32x32->32 divide? 39/1/39
	Signed divide traps on negative operand.
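
	One way software could complete a signed divide after such a
	trap is to divide the magnitudes and fix up the sign.  This is
	a sketch under the assumption that the quotient truncates
	toward zero; the posting does not say what the handler does,
	and the names are made up:

```python
def unsigned_divide(n, d):
    """Stand-in for the hardware divide on non-negative operands."""
    return n // d

def signed_divide(n, d):
    """Signed divide via magnitudes, truncating toward zero (assumed)."""
    if d == 0:
        raise ZeroDivisionError("divide by zero")
    quotient = unsigned_divide(abs(n), abs(d))
    # The result is negative only when the operand signs differ.
    return -quotient if (n < 0) != (d < 0) else quotient

print(signed_divide(-7, 2), signed_divide(7, -2), signed_divide(-7, -2))
```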

How is 32x32->64 bit integer multiplication supported and how many cycles?
	Software.  No cycle count estimate.

How is 64/32->32,32 bit integer division supported and how many cycles?
	Software.  No cycle count estimate.
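
	For the multiply case, one common software approach (not
	necessarily what any 88000 library does) is to split each
	operand into 16-bit halves, so that every partial product fits
	in the 32-bit result of the hardware multiply.  A sketch for
	unsigned operands, with hypothetical names:

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul32(a, b):
    """Model of a 32x32->32 multiply: only the low 32 bits survive."""
    return (a * b) & MASK32

def umul64(a, b):
    """Unsigned 32x32->64 multiply from four 16x16 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    ll = mul32(a_lo, b_lo)
    lh = mul32(a_lo, b_hi)
    hl = mul32(a_hi, b_lo)
    hh = mul32(a_hi, b_hi)
    # Middle column: bits 16..31 of the result, plus a carry into the top.
    mid = (ll >> 16) + (lh & MASK16) + (hl & MASK16)
    lo = (ll & MASK16) | ((mid & MASK16) << 16)
    hi = (hh + (lh >> 16) + (hl >> 16) + (mid >> 16)) & MASK32
    return hi, lo

hi, lo = umul64(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(hi), hex(lo))
```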

> Floating Point

Are floating point registers separate from integer registers? no
How many 32-bit floating point registers? 32
How many 64-bit floating point registers? 16
How many 80-bit floating point registers? 0

How is floating point implemented [software, coprocessor, on-chip]? on-chip
What are the floating point operation latency/issue/repeats?

		 32-bit		 64-bit		80-bit
	add	 5/ 1/ 1	 6/ 2/ 2	n.a.
	mul	 5/ 1/ 1	10/ 2/ 2	n.a.
	div	30/ 1/30	60/ 2/60	n.a.
	sqrt	n.a.		n.a.		n.a.

Which floating point units can operate in parallel? add and multiply
Can floating point operate in parallel with integer? yes
Are floating point exceptions precise? some but not all
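
To see what the latency/issue/repeat figures imply for throughput, here
is a small sketch.  It assumes "latency" is cycles from start to result
and "repeat" is the minimum spacing between successive operations of the
same kind (the definitions are in the earlier postings, so treat these
readings as assumptions), and that the operations are independent:

```python
# (latency, issue, repeat) copied from the table above.
FP_TIMING = {
    ("add", 32): (5, 1, 1),   ("add", 64): (6, 2, 2),
    ("mul", 32): (5, 1, 1),   ("mul", 64): (10, 2, 2),
    ("div", 32): (30, 1, 30), ("div", 64): (60, 2, 60),
}

def cycles_for_stream(op, width, count):
    """Cycles to finish `count` independent back-to-back operations."""
    if count <= 0:
        return 0
    latency, _issue, repeat = FP_TIMING[(op, width)]
    # The last operation starts (count - 1) * repeat cycles after the first.
    return latency + (count - 1) * repeat

print(cycles_for_stream("add", 32, 10))
print(cycles_for_stream("div", 64, 2))
```

Under these assumptions the adds and multiplies are pipelined and the
divides are not, which is what the repeat column suggests.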

> Memory management

Page size in bytes? 4096
How many bits in a virtual address? 32
What is the size of the user-mode address space? 4G
	There can be two user-mode address spaces, each 4G, if you
	want to split I&D.

How many bits in a physical address? 32
How many bits of address space id are added to virtual addresses, if any? 0
Translation cache [none, off-chip, in-cache, on-chip]? in-cache
Translation cache size in entries? 56
Translation cache associativity [direct-mapped, 2-set, 4-set, full]? full
Translation cache miss handled by [software, hardware]? hardware

Also 10 software-managed translation entries, each mapping 512Kbytes.
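
A quick sketch of what these numbers mean: how a 32-bit virtual address
splits with 4096-byte pages, and how much memory each set of translation
entries can map at once.  The variable names are illustrative only:

```python
PAGE_SIZE = 4096
PAGE_SHIFT = PAGE_SIZE.bit_length() - 1      # 12 offset bits

def split_virtual(address):
    """Return (virtual page number, offset within page)."""
    return address >> PAGE_SHIFT, address & (PAGE_SIZE - 1)

page_entry_reach = 56 * PAGE_SIZE            # hardware-loaded entries
block_entry_reach = 10 * 512 * 1024          # software-managed entries

vpn, offset = split_virtual(0x12345678)
print(hex(vpn), hex(offset))
print(page_entry_reach // 1024, "KB")
print(block_entry_reach // (1024 * 1024), "MB")
```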

> Caches

Instruction cache [none, off-chip, on-chip]? off-chip
Data cache [none, off-chip, on-chip]? off-chip
Are I and D caches separate? yes
I-cache total size in bytes? 16K to 64K
I-cache associativity [direct-mapped, 2-set, 4-set, fully associative]? 4-set
I-cache address block size in bytes (bytes per tag)? 16
I-cache transfer block size in bytes (bytes read on cache miss)? 16
I-cache index [virtual, physical]? virtual
	The distinction only matters when there is more than one CMMU on a
	memory port.  When there's just one, the index is both virtual and
	physical.
I-cache tag [virtual, physical]? physical
D-cache total size in bytes? 16K to 64K
D-cache associativity [direct-mapped, 2-set, 4-set, fully associative]? 4-set
D-cache writes [write-through, write-back]? write-through or write-back
D-cache address block size in bytes (bytes per tag)? 16
D-cache transfer block size in bytes (bytes read on cache miss)? 16
D-cache index [virtual, physical]? virtual
	See comment for I-cache index.
D-cache tag [virtual, physical]? physical
Is there a secondary cache? no
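
The remark about the index being both virtual and physical with one
CMMU can be checked from the figures above.  This sketch assumes the
16K figure is the cache in a single CMMU (4-set, 16-byte blocks) and
uses the 4096-byte page size; the function name is made up:

```python
def cache_geometry(total_bytes, ways, block_bytes):
    """Return (number of sets, block offset bits, index bits)."""
    sets = total_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

PAGE_OFFSET_BITS = 12                         # 4096-byte pages

sets, offset_bits, index_bits = cache_geometry(16 * 1024, 4, 16)
index_fits_in_page_offset = offset_bits + index_bits <= PAGE_OFFSET_BITS

print(sets, offset_bits, index_bits)
print(index_fits_in_page_offset)
```

With 256 sets the index uses address bits 4 through 11, all inside the
page offset, and those bits are the same before and after translation.
Any address bits used to choose among several CMMUs would presumably lie
above bit 11, which is where virtual and physical can differ.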

> Branch Prediction

What form of branch prediction is used, if any? none

> Other

Describe other unique or interesting features of the architecture or
its implementation.
E.g. describe the functional units, with emphasis on non-standard
units.

There are four 32-bit scratch "control" registers available in
supervisor mode.

There's a user-writable "floating point control register" with bits
like "disable divide-by-zero exception", "disable overflow exception",
and so on.  The bits are not interpreted by the hardware; the exception
always occurs, and it's up to the kernel to fix up the imprecise result
and make it appear to the user as though the exception hadn't occurred.
The kernel does all the right IEEE things, including implementing
not-a-number.

There's an instruction to trap on subscript out of range.
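
A bounds trap of this kind is often a single unsigned comparison, which
catches negative subscripts for free because they look like very large
unsigned values.  Whether the 88000 instruction compares exactly this
way is an assumption here; the names below are hypothetical:

```python
MASK32 = 0xFFFFFFFF

class BoundsTrap(Exception):
    """Stands in for the hardware trap."""

def check_subscript(index, upper_bound):
    """Trap unless 0 <= index <= upper_bound, using one unsigned compare."""
    if (index & MASK32) > (upper_bound & MASK32):
        raise BoundsTrap("subscript %d out of range" % index)
    return index

print(check_subscript(3, 9))
```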

A bit in the PSR selects whether the data space is big-endian or
little-endian.
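
For readers unfamiliar with the distinction, this small sketch shows
the byte order the same 32-bit word would have in memory under each
setting.  It uses Python's struct module and says nothing about how the
PSR bit itself is set:

```python
import struct

value = 0x11223344

big = struct.pack(">I", value)       # most significant byte first
little = struct.pack("<I", value)    # least significant byte first

print(big.hex())
print(little.hex())
```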

The instruction and data pipelines are exposed to software.  Exception
handling involves a lot of overhead; the code has to deal with up to
six outstanding user page faults and up to nine outstanding floating
point exceptions.  You can't just duck in and out of a device interrupt
routine and then return with RTE.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

tom@nud.UUCP (Tom Armistead) (05/26/88)

In article <2232@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>
>Describe other unique or interesting features of the architecture or
>its implementation.
>E.g. describe the functional units, with emphasis on non-standard
>units.

    Some other features I think were missed in the original posting:

    It has bit field instructions.  (I'm not aware of other RISC processors
with this feature).
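
    As a generic illustration of what bit field operations do (not the
exact 88100 instruction semantics, which are in the User's Manual), here
is a sketch of extracting and building a field given a width and an
offset.  The function names are made up:

```python
MASK32 = 0xFFFFFFFF

def extract_unsigned(word, width, offset):
    """Pull out `width` bits starting at bit `offset`, zero-extended."""
    return (word >> offset) & ((1 << width) - 1)

def extract_signed(word, width, offset):
    """Same field, sign-extended from its top bit."""
    field = extract_unsigned(word, width, offset)
    sign = 1 << (width - 1)
    return (field ^ sign) - sign

def make_field(value, width, offset):
    """Place the low `width` bits of value at bit `offset`."""
    return ((value & ((1 << width) - 1)) << offset) & MASK32

print(hex(extract_unsigned(0xABCD1234, 8, 8)))
print(extract_signed(0x0000F000, 4, 12))
print(hex(make_field(0x5, 4, 28)))
```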

    Support for multiprocessing via cache coherency features of cache
chip.

    Support for fault tolerant applications.

    
>The instruction and data pipelines are exposed to software.  Exception
>handling involves a lot of overhead; the code has to deal with up to
>six outstanding user page faults and up to nine outstanding floating

    Actually there are only 5 page faults, and handling just 4 of them is
sufficient and optimal for performance.

>point exceptions.  You can't just duck in and out of a device interrupt
>routine and then return with RTE.

    You can write the exception handlers to process the interrupt first
(if interrupt latency is important) before any page faults/FP exceptions
are handled.  The page faults/FP exception handling doesn't have to be done
on every interrupt either - only those faults that occur simultaneously with
the interrupt need to be handled simultaneously.  Simultaneous exceptions/
interrupts are relatively rare in comparison to interrupts that occur without
any other pending exceptions. 

    The FP exceptions can also be ignored until RTE if needed.  However, this
means your interrupt handler cannot use FP (including integer multiply and
divide).  If you can make this restriction (or guarantee no FP exceptions)
and if you can guarantee no page faults, you can indeed "duck in and out and
then RTE" with an 88K interrupt handler. 

    How about a comp.sys.m88k (or equivalent)?
-- 
Just a few more bits in the stream.

The Sneek

df@nud.UUCP (Dale Farnsworth) (05/26/88)

Earl Killian (earl@mips.COM) writes:
> (Thanks to Andrew Klossner for his help on this one.)

I am pleased that Earl and Andrew made this available to the net.

> > Architecture Reference
> Where is the architecture fully described?
> 
>   -- MC78000 User's Manual Revision 0.4, October 7, 1987, Advanced
>      Information (100+ page document, describes registers,
>      instructions, exception processing, and timing information in
>      detail; it has no doubt been renamed by now)

Also marked Motorola Confidential/Proprietary.  The most recent version
is MC88100 User's Manual Revision 0.6, April 6, 1988, Advanced
Information.

>   -- MC78200 User's Manual Revision 0.1, November 29, 1987, Advanced
>      Information (80+ page document, like above but includes
>      architecture changes which will appear in the production chip)

Current version: MC88200 User's Manual Revision 0.4 Preliminary Copy,
April 19, 1988, Advanced Information.

> The instruction and data pipelines are exposed to software.

This could be misunderstood.  The pipelines are only exposed to
the exception handler.  Hardware register scoreboarding is used
by the chip so compilers are *not required* to do pipeline
instruction scheduling.
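
A toy model of the idea, for illustration: each register records the
cycle at which its value becomes available, and an instruction waits
until its sources are ready.  The model assumes in-order issue of at
most one instruction per cycle, uses the load latency of 3 from Earl's
posting, and is not a description of the actual 88100 logic:

```python
LATENCY = {"ld": 3, "add": 1}    # load latency of 3 taken from the posting

def finish_cycle(program):
    """Cycle at which the last result of `program` becomes available."""
    ready = {}          # register -> cycle its value becomes available
    next_issue = 0
    for op, dest, sources in program:
        # Stall until every source register has been written.
        issue = max([next_issue] + [ready.get(r, 0) for r in sources])
        ready[dest] = issue + LATENCY[op]
        next_issue = issue + 1
    return max(ready.values())

naive = [
    ("ld",  "r2", ["r10"]),
    ("add", "r3", ["r2", "r4"]),     # needs the loaded value
    ("add", "r5", ["r6", "r7"]),     # independent
]
scheduled = [naive[0], naive[2], naive[1]]

print(finish_cycle(naive), finish_cycle(scheduled))
```

Both orders compute the same values; scheduling only affects how many
stall cycles the scoreboard inserts.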

Again, thanks for the excellent information.

-Dale

-- 
Dale Farnsworth		602-438-3092	uunet!unisoft!nud!df

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/27/88)

In article <798@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
>    Support for multiprocessing via cache coherency features of cache
Let's hear about these very IMPORTANT features in detail.  Anyone have
this information available to them?

tom@nud.UUCP (Tom Armistead) (06/02/88)

>>    Support for multiprocessing via cache coherency features of cache
>Let's hear about these very IMPORTANT features in detail.  Anyone have
>this information available to them?

In brief:

    The 88200 chip contains logic that allows it to monitor the activities
of other 88200s in the system (including other processors' 88200s).  If
an 88200 attempts to access a location in memory which does not contain
valid data (i.e. its "real" contents are in another 88200's cache), then
the 88200 holding the correct data will preempt the access and update main
memory.  The first 88200 will then continue the access and get the correct
data.  This is referred to as "snooping" and is performed by the 88200
chips themselves - software is not required to take any action (other than
configuring the 88200s in snoop mode) to maintain cache coherency between
multiprocessors.  "Snooping" removes much of the work otherwise required
to implement a multiprocessing system.
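
    A toy model of the behaviour described above, for illustration only.
It tracks single words rather than cache lines, and the invalidate-on-write
step is an assumption of the sketch, not something stated here or taken
from the 88200 documentation.  All names are hypothetical:

```python
class SnoopingCache:
    """Write-back cache that watches accesses made by its peers."""

    def __init__(self, memory, bus):
        self.memory = memory      # shared dict: address -> value
        self.bus = bus            # shared list of all caches
        self.lines = {}           # address -> (value, dirty)
        bus.append(self)

    def _peers(self):
        return [cache for cache in self.bus if cache is not self]

    def snoop_read(self, address):
        """A peer is about to access memory: write back a dirty copy."""
        line = self.lines.get(address)
        if line is not None and line[1]:
            self.memory[address] = line[0]
            self.lines[address] = (line[0], False)

    def snoop_invalidate(self, address):
        """A peer is about to write: flush, then drop our copy (assumed)."""
        self.snoop_read(address)
        self.lines.pop(address, None)

    def read(self, address):
        if address not in self.lines:
            for peer in self._peers():
                peer.snoop_read(address)      # holder updates main memory
            self.lines[address] = (self.memory.get(address, 0), False)
        return self.lines[address][0]

    def write(self, address, value):
        for peer in self._peers():
            peer.snoop_invalidate(address)
        self.lines[address] = (value, True)   # memory is now stale

memory = {0x100: 1}
bus = []
cache_a = SnoopingCache(memory, bus)
cache_b = SnoopingCache(memory, bus)

cache_a.write(0x100, 42)
stale = memory[0x100]             # main memory still holds the old value
seen_by_b = cache_b.read(0x100)   # triggers the write-back by cache_a
print(stale, seen_by_b, memory[0x100])
```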

     Of course semaphoring is supported as well.

     I could post lots more but at this time have to limit my comments to
that information which has been released to the public.  The above is 
also discussed in the Technical Summary of the MC88200.
-- 
Just a few more bits in the stream.

The Sneek