[comp.sys.m88k] Emulating other computers on 88K's and Benchmarks

newton@smoggy.gg.caltech.edu (Mike Newton) (10/08/90)

The m88k has a fundamental advantage over the other RISCs that I know of
regarding:
	Interpretation of code for other processors
	Compilers for certain very high level languages -- (especially: Prolog)
and this is its 31 truly general purpose registers.

For interpreting code for other machines, or for interpreters in general,
it is very helpful to be able to keep all "state" in registers.  Otherwise
memory accesses and updates are a serious problem.  For several projects that
I've worked on, going from 16-->32 registers would make a difference of
30%-250% (the last being a rough guess) in performance.
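
To make the point about register-resident state concrete, here is a minimal
sketch of an interpreter loop in C.  The four-opcode guest machine and every
name in it (run_guest, OP_LOADI, and so on) are hypothetical and not taken
from any real emulator; the only point is that the guest's program counter,
accumulator and counter are plain locals, so a compiler with enough host
registers can keep all of them out of memory across the dispatch loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-opcode guest machine, for illustration only. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, OP_DECJNZ = 3 };

/* Interprets guest code and returns the accumulator at OP_HALT.
   pc, acc and ctr are locals, so the compiler is free to keep the
   whole guest state in host registers for the life of the loop. */
int32_t run_guest(const int32_t *code)
{
    assert(code != NULL);

    size_t  pc  = 0;   /* guest program counter */
    int32_t acc = 0;   /* guest accumulator */
    int32_t ctr = 0;   /* guest loop counter */

    for (;;) {
        switch (code[pc]) {
        case OP_LOADI:               /* LOADI n: ctr = n */
            ctr = code[pc + 1];
            pc += 2;
            break;
        case OP_ADD:                 /* ADD n: acc += n */
            acc += code[pc + 1];
            pc += 2;
            break;
        case OP_DECJNZ:              /* DECJNZ t: if (--ctr != 0) goto t */
            ctr--;
            pc = (ctr != 0) ? (size_t)code[pc + 1] : pc + 2;
            break;
        case OP_HALT:
        default:                     /* unknown opcodes also stop the run */
            return acc;
        }
    }
}
```

A real emulator carries far more guest state (a full register file, flags,
memory-mapping state), which is where a 16-register host starts spilling to
memory and a 32-register host may not.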

There are also certain advantages with the 88k over the MIPS (a great
chip, but only 16 reg.s!) in that more of the paging / address translation
lines are accessible.  This would ease the work of some translate-on-the-fly
interpretation routines as well as overlay and self-modifying code
detection.

That the FP registers are the same as the GP registers might slow down some
FP code.  This is dependent on the register passing conventions, not
the number of registers.  I.e., though you could divide the 31 into 15 normal
and 8 double prec. floating (which I _believe_ is the same as MIPS),
in reality this would have to be modified for register passing of args and
linker registers.

On a related topic that has been talked about...:

Be careful when you look at m88k performance statistics.  In particular,
find out:
	[1] What compiler is being used -- Tom Wood at DG has made _many_
		improvements to gcc over the last year.  Some of my code
		runs noticeably faster.
	[2] The memory model, including wait states.  The lower end DG
		machines have a fair number of wait states -- a fact that
		surprised me, considering their memory is custom.


- mike
--
newton@csvax.cs.caltech.edu   Beach Bums Anonymous, Pasadena President
Caltech 256-80		      (Hilo -- it's not just another rainy day!)
Pasadena CA 91125	      Life's a beach.  Then you graduate.

mash@mips.COM (John Mashey) (10/08/90)

In article <newton.655360878@smoggy> newton@smoggy.gg.caltech.edu (Mike Newton) writes:
>The m88k has a fundamental advantage over the other RISC's that I know
>regarding:
>	Interpretation of code for other processors
>	Compilers for certain very high level languages -- (especially: Prolog)
>and this is 31 truly general purpose registers.

I'm afraid this posting needs a grain of salt.  Most current commercial
RISCs have at least 32 integer registers available at once, plus at least
32 32-bit FP, or 16 64-bit FP.  These include: MIPS, SPARC, AMD 29K,
Intel i860, IBM POWER, HP PA.  Clipper indeed has fewer registers.
All of the rest have at least 2X the general purpose (integer + FP)
register state of the 88K... 
Please read Kane's book "The MIPS RISC Architecture", published by
Prentice-Hall, and available since 1987...
...

>There are also certain advantages with the 88k over the MIPS (a great
>chip, but only 16 reg.s!) in that more of the paging / address translation
Again, 32 integer + 16 64-bit FP.  Maybe you're thinking of the Stanford
MIPS chip, which indeed had 16 integer registers, but is fairly
irrelevant at this point.
>lines are accessible.  This would ease the work of some translate-on-the-fly
>interpretation routines as well as overlay and self-modifying code
>detection.

I'm at a loss to understand what this means.  Could you say more about
which feature is being compared, and with what?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tom@ssd.csd.harris.com (Tom Horsley) (10/08/90)

>>>>> Regarding Emulating other computers on 88K's and Benchmarks; newton@smoggy.gg.caltech.edu (Mike Newton) adds:
newton> 	[2] The memory model, including wait states.  The lower end
newton> 	    DG machines have a fair number of wait states -- a fact
newton> 	    that surprised me, considering their memory is custom.

Before you complain about memory wait states you should figure out where
they are all coming from. A vast number of cycles are consumed by overhead
in the 88200 MMU chip - as a rough example, if a certain configuration of
memory and MMUs takes 16 cycles to fill a cache line, 3 of those are the time
it takes to walk through the data unit pipeline, 2 or 3 of the remaining
cycles are the time it takes to access memory, and the remainder are
consumed by the MMU. Even doubling the speed of memory would only reduce
the 16 cycles to 14 or 15.

Please Note: The above figures are from my memory of one example we worked
out in detail - there are A LOT of different types of memory cycles and
paths through the MMU and this was one specific example we worked through (I
seem to recall it was doing a load from a non-cached memory location). The
specific figures quoted may be wrong, but the approximate percentage speed
improvement from using faster memory chips is about right (in other words,
barely significant :-).
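
The arithmetic behind that estimate can be sketched in a few lines of C.
This assumes a simple model that is not spelled out above: only the
memory-access portion of the fill scales with RAM speed, and the pipeline
and MMU portions stay fixed.  The function name fill_cycles is made up for
the illustration.

```c
#include <assert.h>

/* Cycles for one cache-line fill when only the memory-access part
   shrinks with faster RAM.  total and mem_cycles are the baseline
   counts; speedup is how many times faster the new RAM is. */
double fill_cycles(double total, double mem_cycles, double speedup)
{
    assert(speedup > 0.0 && mem_cycles >= 0.0 && mem_cycles <= total);
    return (total - mem_cycles) + mem_cycles / speedup;
}
```

With the figures quoted above, fill_cycles(16, 2, 2) gives 15 and
fill_cycles(16, 3, 2) gives 14.5, so doubling memory speed buys roughly
6% to 9% on this particular path.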
--
======================================================================
domain: tahorsley@csd.harris.com       USMail: Tom Horsley
  uucp: ...!uunet!hcx1!tahorsley               511 Kingbird Circle
                                               Delray Beach, FL  33444
+==== Censorship is the only form of Obscenity ======================+
|     (Wait, I forgot government tobacco subsidies...)               |
+====================================================================+

sasrer@unx.sas.com (Rodney Radford) (10/12/90)

In article <41965@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <newton.655360878@smoggy> newton@smoggy.gg.caltech.edu (Mike Newton) writes:
>>The m88k has a fundamental advantage over the other RISC's that I know
>>regarding:
>>	Interpretation of code for other processors
>>	Compilers for certain very high level languages -- (especially: Prolog)
>>and this is 31 truly general purpose registers.
>
>I'm afraid this posting needs a grain of salt.  Most current commercial
>RISCs have at least 32 integer registers available at once, plus at least
>32 32-bit FP, or 16 64-bit FP.  These include: MIPS, SPARC, AMD 29K,
>Intel i860, IBM POWER, HP PA.  Clipper indeed has fewer registers.
>All of the rest have at least 2X the general purpose (integer + FP)
>register state of the 88K... 

But there are cases when having the same register set for both the general
purpose registers and the floating point registers can offer improvements
in the code by allowing you to operate on the floating point values with the
same integer instructions (for example: using some of the specialized bit 
manipulation instructions).  Also, the chips listed above that have the 
'extra' FP registers you mention are actually on external floating point 
coprocessor chips, so they should not be included in the register count when
comparing specific RISC processors. The choice of whether to use an external
floating point coprocessor is a system designer's choice, not a specific
RISC chip requirement (it is possible for an 88K to also be hooked to an 
external math coprocessor, although I have not heard of such an arrangement). 


-- 
Rodney E. Radford        SAS Institute, Inc.        sasrer@unx.sas.com
DG/UX AViiON developer   Box 8000, Cary, NC 27512   (919) 677-8000 x7703

jat@xavax.com (John Tamplin) (10/13/90)

In article <TOM.90Oct8065144@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
:>>>>> Regarding Emulating other computers on 88K's and Benchmarks; newton@smoggy.gg.caltech.edu (Mike Newton) adds:
:newton> 	[2] The memory model, including wait states.  The lower end
:newton> 	    DG machines have a fair number of wait states -- a fact
:newton> 	    that surprised me, considering their memory is custom.
:
:Before you complain about memory wait states you should figure out where
:they are all coming from. A vast number of cycles are consumed by overhead
:in the 88200 MMU chip - as a rough example, if a certain configuration of
:memory and MMUs takes 16 cycles to fill a cache line, 3 of those are the time
:it takes to walk through the data unit pipeline, 2 or 3 of the remaining
:cycles are the time it takes to access memory, and the remainder are
:consumed by the MMU. Even doubling the speed of memory would only reduce
:the 16 cycles to 14 or 15.

The speed of memory generally does only make a few cycles difference --
even at 33 MHz the difference between 60ns and 120ns RAMs is 3 cycles.
The time taken in the data pipeline is usually in parallel with the
execution of other instructions -- good pipelining by the compiler will
hide that time.  Also, this time is the same regardless of external
memory architecture, so I ignore it for the purposes of comparison.

:Please Note: The above figures are from my memory of one example we worked
:out in detail - there are A LOT of different types of memory cycles and
:paths through the MMU and this was one specific example we worked through (I
:seem to recall it was doing a load from a non-cached memory location). The
:specific figures quoted may be wrong, but the approximate percentage speed
:improvement from using faster memory chips is about right (in other words,
:barely significant :-).

The AViiON desktop system (not sure of model numbers) fills a cache line
in 16 clock cycles.  The Topgun does the same in 11.  In the 88200 manual,
Motorola gives a circuit (although it has several problems, including
marginal timing) that does it in 8.  The theoretical minimum is 7, since
the 88200 takes 2 clocks to decide it needs to hit the bus, plus 1 address
phase and 4 data phases.  At 20 MHz with 60ns memory 2 way interleaved,
this can be achieved.  With 100ns chips, you can do it in 8.  If you need
to go across a bus to get to the memory or if the capacitance becomes a
problem, it will cost you another cycle.  The Motorola MVME181 board gets
a cache burst in 8 cycles, as does an Opus board.  I don't understand where
DG's time is going -- perhaps so they can sell faster systems?
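
The cycle budget above is easy to tabulate.  The sketch below just adds up
the phases named in this post (2 decision clocks, 1 address phase, 4 data
phases) plus any extra wait clocks, and converts to nanoseconds at a given
clock rate.  The identifiers are invented for the illustration and do not
come from the 88200 manual.

```c
#include <assert.h>

/* Fixed parts of an 88200 line fill, as described in the post above. */
enum { DECIDE_CLOCKS = 2, ADDRESS_PHASES = 1, DATA_PHASES = 4 };

/* Total clocks for one cache-line fill; extra_wait_clocks covers
   slower RAM, a bus crossing, capacitance problems, and so on. */
int line_fill_clocks(int extra_wait_clocks)
{
    assert(extra_wait_clocks >= 0);
    return DECIDE_CLOCKS + ADDRESS_PHASES + DATA_PHASES + extra_wait_clocks;
}

/* Wall-clock time of a fill in nanoseconds at the given clock rate. */
double line_fill_ns(int clocks, double mhz)
{
    assert(clocks >= 0 && mhz > 0.0);
    return clocks * 1000.0 / mhz;
}
```

At 20 MHz the 7-clock minimum is 350 ns and a 16-clock fill is 800 ns, so
the 16-clock figure corresponds to line_fill_clocks(9), i.e. nine clocks
beyond the minimum.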

Of course, these numbers all assume no translation delays, ie. TLB hits.

The AViiON numbers came from a technical person I talked to there (I
don't remember the name) when I was discussing pipelining optimizations.

-- 
John Tamplin						Xavax
jat@xavax.COM						2104 West Ferry Way
...!uunet!xavax!jat					Huntsville, AL 35801

meissner@osf.org (Michael Meissner) (10/14/90)

In article <1990Oct11.174838.7990@unx.sas.com> sasrer@unx.sas.com
(Rodney Radford) writes:

| But there are cases when having the same register set for both the general
| purpose registers and the floating point registers can offer improvements
| in the code by allowing you to operate on the floating point values with the
| same integer instructions (for example: using some of the specialized bit 
| manipulation instructions).  Also, the chips listed above that have the 
| 'extra' FP registers you mention are actually on external floating point 
| coprocessor chips, so they should not be included in the register count when
| comparing specific RISC processors. The choice of whether to use an external
| floating point coprocessor is a system designer's choice, not a specific
| RISC chip requirement (it is possible for an 88K to also be hooked to an 
| external math coprocessor, although I have not heard of such an arrangement). 

I worked on GCC for the 88k for 1 year, and for the MIPS chips for 1
year, so I have a little experience in both sides.  :-)

The statement about operating on floating point values with integer
instructions is a bit of a red herring.  If you have to support a
signaling NaN, the code sequence to check for the NaN wipes out any
savings by using the faster integer instructions.
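
As a rough illustration of that overhead, here is a sketch in C of testing
"x > 0" on a single-precision value using only integer operations on its
bit pattern.  It assumes IEEE 754 single format with the common convention
that the top fraction bit is set for quiet NaNs and clear for signaling
NaNs (some architectures use the opposite convention), and the function
names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern without aliasing problems. */
uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

/* NaN: exponent all ones and a nonzero fraction. */
int is_nan_bits(uint32_t bits)
{
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x007FFFFFu) != 0;
}

/* Signaling NaN, ASSUMING "quiet bit clear" means signaling. */
int is_snan_bits(uint32_t bits)
{
    return is_nan_bits(bits) && (bits & 0x00400000u) == 0;
}

/* "x > 0" done with integer instructions.  *invalid is set when a real
   FP compare would have raised the invalid exception. */
int positive_via_int(uint32_t bits, int *invalid)
{
    assert(invalid != NULL);
    *invalid = is_snan_bits(bits);      /* the extra check that costs time */
    if (is_nan_bits(bits))
        return 0;                       /* NaNs compare unordered */
    return (bits & 0x80000000u) == 0 && bits != 0;  /* sign clear, not +0 */
}
```

The last line is the cheap part; the NaN tests in front of it are the code
sequence that eats the savings.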

Whether or not an FPU is implemented on a separate chip is
immaterial.  I'm not aware of ANY vendor who uses MIPS chips which
does not include an FPU.  The question is whether the instruction set
hinders or helps in running the task at hand.

Separating the register sets is helpful, because you've just doubled
the number of registers without changing instruction formats.  When I
was at Data General, we did some hand checking, and found that
unrolling loops could not keep the machine going at full tilt, because
you run out of registers too quickly.  I wouldn't bet that the 88k
will have a unified register set forever....

The only time that I've wished the MIPS had a unified register set was
in dealing with varargs functions where you would like to be able to
store all unknown arguments on the stack, and walk a pointer.
However, the 88k doesn't win any points in this arena, because the
Greenhills inspired 88K OCS demands that you have two separate arrays,
va_list is a 3 word structure, and the va_arg macro continually has to
check whether or not the argument is in the first 8 words or not....

Like most people, I find the current generation of RISC chips to be
fairly similar.  However, as a compiler writer there are things about
each of the two processors that I like and dislike:

Things that I like about the 88k that aren't in MIPS:

    *	reg+reg, and reg+(reg*base_size) addressing modes.
    *	bit extraction operators (except no bit field set).
    *	pure PC-relative jumps/subroutine calls.
    *	branch insns w/optional delay slot (saves code space, not time).
    *	and.u, or.u, xor.u instructions.
    *	standard calling sequence has 13 saved regs instead of 9.
    *	standard calling sequence passes 8 words in regs instead of 4.
    *	better conversion ops (esp. int<->floating point).
    *	hardware interlocks.
    *	pipeline multiply rather than multiply unit.
    *	add/subtract with carry.
    *	assembler supports creating debug information.

Things I like about the MIPS that aren't in the 88k:

    *	more registers, since FPU regs are separate.
    *	signed division doesn't require branches to fix up sign/avoid traps.
    *	modulus operation without having to do a - ((a/b)*b).
    *	a divided by b gives a/b and a%b at same time.
    *	32x32->64 bit multiply.
    *	small data/bss area (I think this is just coming to the 88k).
    *	standard calling sequence passes structs in regs, not in the stack.
    *	branch on a==b and a!=b are each one instruction.
    *	assembler temporary register.
    *	only 1 cycle delay after load rather than 2.
    *	ECOFF debug format is slightly more expressive than COFF debug format.
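
The division and multiply items can be spelled out in C terms.  The sketch
below shows the remainder computed from the quotient as a - ((a/b)*b), the
standard library's div() as a stand-in for "quotient and remainder from one
operation", and a 32x32->64 bit multiply.  It says nothing about the actual
instruction sequences on either chip, and the helper names are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Remainder when only a divide is available: a - ((a/b)*b). */
int32_t rem_from_quotient(int32_t a, int32_t b)
{
    assert(b != 0 && !(a == INT32_MIN && b == -1));  /* avoid UB cases */
    return a - ((a / b) * b);
}

/* Quotient and remainder together; div() is standard C. */
div_t quot_and_rem(int a, int b)
{
    assert(b != 0);
    return div(a, b);
}

/* Full 64-bit product of two 32-bit operands. */
uint64_t mul_32x32_64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;   /* widen first so no bits are lost */
}
```

For example, rem_from_quotient(17, 5) is 2, and rem_from_quotient(-17, 5)
is -2 because C division truncates toward zero.
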
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Do apple growers tell their kids money doesn't grow on bushes?

meissner@osf.org (Michael Meissner) (10/15/90)

In article <MEISSNER.90Oct13222526@osf.osf.org> meissner@osf.org
(Michael Meissner) writes:

I know, following up on my own article.....

| The only time that I've wished the MIPS had a unified register set was
| in dealing with varargs functions where you would like to be able to
| store all unknown arguments on the stack, and walk a pointer.
| However, the 88k doesn't win any points in this arena, because the
| Greenhills inspired 88K OCS demands that you have two separate arrays,
| va_list is a 3 word structure, and the va_arg macro continually has to
| check whether or not the argument is in the first 8 words or not....

Actually another case just occurred to me, and that is figuring out
where to put a union with pointers/scalars and floating point: you
usually have to do some shuffling between the two register sets.

terryl@osf.osf.org (10/15/90)

In article <MEISSNER.90Oct14225652@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
| The only time that I've wished the MIPS had a unified register set was
| in dealing with varargs functions where you would like to be able to
| store all unknown arguments on the stack, and walk a pointer.
| However, the 88k doesn't win any points in this arena, because the
| Greenhills inspired 88K OCS demands that you have two separate arrays,
| va_list is a 3 word structure, and the va_arg macro continually has to
| check whether or not the argument is in the first 8 words or not....


     Yes, but that's just the way the Greenhills compiler works with varargs.
There's no reason I can think of why the Greenhills compiler can't spill the
register arguments onto the stack; in fact, it appears that it allocates that
space already.....

meissner@osf.org (Michael Meissner) (10/15/90)

In article <14903@paperboy.OSF.ORG> terryl@osf.osf.org writes:

| In article <MEISSNER.90Oct14225652@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
| | The only time that I've wished the MIPS had a unified register set was
| | in dealing with varargs functions where you would like to be able to
| | store all unknown arguments on the stack, and walk a pointer.
| | However, the 88k doesn't win any points in this arena, because the
| | Greenhills inspired 88K OCS demands that you have two separate arrays,
| | va_list is a 3 word structure, and the va_arg macro continually has to
| | check whether or not the argument is in the first 8 words or not....
| 
| 
|      Yes, but that's just the way the Greenhills compiler works with varargs.
| There's no reason I can think of why the Greenhills can't spill the register
| arguments onto the stack; in fact, it appears that it allocates that space
| already.....

No, the 88Open standard mandates that varargs use a 3 word structure,
and NOT save the registers in the spill area (which would have made
more sense IMHO).  Otherwise you can't pass structures to your
variadic function, since structures are NOT passed in registers, and
are passed in the spill area instead.  The way the original Motorola
calling sequence worked was the first 8 words are always passed in the
registers, and that you would save these registers into the spill
area.  If this had been used instead of the current calling sequence,
then va_list would have been a normal char *.
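
The difference between the two schemes can be sketched in C.  What follows
is a simplified model, not the 88Open definition: the structure layout,
field names and the restriction to int-sized arguments are all assumptions
made for the illustration, and a plain int pointer stands in for the
char * mentioned above.

```c
#include <assert.h>
#include <stddef.h>

enum { REG_ARG_WORDS = 8 };   /* first 8 argument words travel in registers */

/* Hypothetical three-word va_list: an index plus two separate arrays. */
typedef struct {
    int        next_word;    /* index of the next argument word */
    const int *reg_area;     /* saved copies of the 8 register words */
    const int *stack_area;   /* words that were passed in memory */
} va_list3;

/* Fetch the next int: every call must test which array to read. */
int next_int_3word(va_list3 *ap)
{
    assert(ap != NULL);
    int i = ap->next_word++;
    return (i < REG_ARG_WORDS) ? ap->reg_area[i]
                               : ap->stack_area[i - REG_ARG_WORDS];
}

/* Contiguous layout (registers spilled next to the stack words):
   the "va_list" is just a pointer that gets walked. */
int next_int_flat(const int **ap)
{
    assert(ap != NULL && *ap != NULL);
    return *(*ap)++;
}
```

The per-argument branch in next_int_3word is the "continually has to check"
cost; next_int_flat has no branch at all.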