[comp.arch] 68040 where is it?

smithw@hamblin.math.byu.edu (William V. Smith) (08/23/90)

 Just out of curiosity, does anyone out there know when Motorola is
supposed to be delivering the '040 in mass quantities??  I heard
there was some problem with the FPU a while back but that it was fixed
and things were ready to roll.  So what's happening now?

mslater@cup.portal.com (Michael Z Slater) (08/24/90)

>Just out of curiosity, does anyone out there know when Motorola is
>supposed to be delivering the '040 in mass quantities??  I heard
>there was some problem with the FPU a while back but that it was fixed
>and things were ready to roll.  So what's happening now?

General sampling will be announced real soon now, with production this
fall.  It appears that the latest rev of the silicon has fixed the
significant problems.

Michael Slater, Microprocessor Report   mslater@cup.portal.com
707/823-4004   fax: 707/823-0504

atk@boulder.Colorado.EDU (Alan T. Krantz) (08/24/90)

In article <33156@cup.portal.com> mslater@cup.portal.com (Michael Z Slater) writes:
>
>>Just out of curiosity, does anyone out there know when Motorola is
>>supposed to be delivering the '040 in mass quantities??  I heard
>>there was some problem with the FPU a while back but that it was fixed
>>and things were ready to roll.  So what's happening now?
>
>General sampling will be announced real soon now, with production this
>fall.  It appears that the latest rev of the silicon has fixed the
>significant problems.
>
>Michael Slater, Microprocessor Report   mslater@cup.portal.com
>707/823-4004   fax: 707/823-0504

Would someone be willing to give me a fairly good contrast between
the following chips in terms of speed:

MIPS R3000
Sparc +
IBM RS6000
80486
68040

Quoted benchmarks don't mean a whole lot. So how about some comments on
what types of problems a specific chip/computer does well? Also,
how does the 68040 obtain its speedup? Does it use pipelining (I
assume this is the case)? Does it have multiple functional units
operating in parallel? What about the future? Can a SPARC chipset be
made as fast as the MIPS without "visible" changes to the user (i.e.,
reducing the number of register windows)? Will a 68050 one day be 10
times faster than the current IBM RS6000 (i.e., is there room in the
basic design for massive improvements)?

If possible could you cc me on any followups to this message. Thanks...

------------------------------------------------------------------
|  Mail:    1830 22nd street      Email: atk@boulder.colorado.edu|
|           Apt 16                Vmail: Home:   (303) 939-8256  |
|           Boulder, Co 80302            Office: (303) 492-8115  |
------------------------------------------------------------------

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/25/90)

In article <25146@boulder.Colorado.EDU> atk@boulder.Colorado.EDU (Alan T. Krantz) writes:

| Would someone be willing to give me a fairly good contrast between
| the following chips in terms of speed:
| 
| MIPS R3000
| Sparc +
| IBM RS6000
| 80486
| 68040

  SPEC just released its benchmark results on some of these; I thought
I could find them but can't. I remember that the 80486-33 did very well,
which I'm sure will be explained away by all of the RISC people in this
group.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (08/25/90)

Alan> In article <25146@boulder.Colorado.EDU> atk@boulder.Colorado.EDU (Alan T. Krantz) writes:
Alan> Would someone be willing to give me a fairly good contrast between
Alan> the following chips in terms of speed:
Alan> 
Alan> MIPS R3000
Alan> Sparc +
Alan> IBM RS6000
Alan> 80486
Alan> 68040

>>>>> On 24 Aug 90 19:51:56 GMT, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) said:
Bill>   SPEC just released its benchmark results on some of these; I thought
Bill> I could find them but can't. I remember that the 80486-33 did very well,
Bill> which I'm sure will be explained away by all of the RISC people in this
Bill> group.

While we're discussing rumors, I've been told (by someone I'd _expect_ to
know) that the 68040 has roughly the same integer throughput as a SPARC at
the same clock speed.  Anyone out there have some hard data on this?

Also, with the 040 trapping, then executing software routines for
transcendental functions, I'd _expect_ at least the transcendentals to be
computed slower than a Weitek or 882, in spite of the PR.  Has anyone
benchmarked this?  Does Moto have plans for a faster (perhaps
transcendental-only) version of the 882?

Bill> bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
Bill>     VMS is a text-only adventure game. If you win you can use unix.

Love yer .sig!
--
Chuck Phillips  MS440
NCR Microelectronics 			Chuck.Phillips%FtCollins.NCR.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		uunet!ncrlnk!ncr-mpd!bach!chuckp

henry@zoo.toronto.edu (Henry Spencer) (08/26/90)

In article <CHUCK.PHILLIPS.90Aug25143508@halley.FtCollins.NCR.COM> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
>While we're discussing rumors, I've been told (by someone I'd _expect_ to
>know) that the 68040 has roughly the same integer throughput as a SPARC at
>the same clock speed.

This should not be an enormous surprise.  The existing SPARCs all do about
one instruction per cycle, and the 68040 designers moved heaven and earth
(at great expense in design time and silicon) to make the 68040 do likewise
for the simpler instructions.  The real question is, which one will scale
to higher clock speeds and more-than-one-instruction-per-cycle execution
schemes better?  Hint:  the simpler one has a decided edge here.
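Henry's point is the usual "iron law": run time = instructions * cycles per
instruction / clock rate. At equal clocks and about one instruction per cycle,
native instruction rates come out the same, and the remaining levers are
instruction count, clock rate, and issuing more than one instruction per cycle.
The Python sketch below only does that arithmetic; the 25 MHz clock and the
instruction counts are hypothetical, not measured 68040 or SPARC figures.

```python
# Iron law of processor performance:
#   time = instruction_count * cycles_per_instruction / clock_rate
# All numbers used here are hypothetical, for illustration only.


def run_time_seconds(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time of a program under the iron law."""
    return instructions * cpi / clock_hz


def native_mips(cpi: float, clock_hz: float) -> float:
    """Millions of native instructions completed per second."""
    return clock_hz / cpi / 1e6


if __name__ == "__main__":
    clock = 25e6  # hypothetical: both chips at 25 MHz
    # ~1 instruction per cycle on both gives the same native MIPS rating...
    print(native_mips(1.0, clock))
    # ...but a program needing 20% more instructions takes 20% longer.
    print(run_time_seconds(1.0e6, 1.0, clock))
    print(run_time_seconds(1.2e6, 1.0, clock))
    # Issuing two instructions per cycle (CPI 0.5) doubles the rate.
    print(native_mips(0.5, clock))
```

Equal native MIPS does not mean equal work done, which is why the scaling
question (clock and multiple issue) is the interesting one.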
-- 
Committees do harm merely by existing. | Henry Spencer at U of Toronto Zoology
                       -Freeman Dyson  |  henry@zoo.toronto.edu   utzoo!henry

pcg@cs.aber.ac.uk (Piercarlo Grandi) (08/29/90)

On 26 Aug 90 02:42:12 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> In article
henry> <CHUCK.PHILLIPS.90Aug25143508@halley.FtCollins.NCR.COM>
henry> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:

Phillips> While we're discussing rumors, I've been told (by someone I'd
Phillips> _expect_ to know) that the 68040 has roughly the same integer
Phillips> throughput as a SPARC at the same clock speed.

henry> This should not be an enormous surprise.  The existing SPARCs all
henry> do about one instruction per cycle, and the 68040 designers moved
henry> heaven and earth (at great expense in design time and silicon) to
henry> make the 68040 do likewise for the simpler instructions.

I don't think it was all that difficult; the RISC subset of the 68K
architecture (in instructions and addressing modes) is not that
complicated. It all depends on whether they wanted to implement the
RISC subset on an underlying load-store architecture, or whether they
wanted to do as the 486 does and play hard tricks with the cache
(treating it as a large register bank).

In theory you can just RISC'ify a small subset of M68K instructions,
and for the non-load/store instructions only the register-register
modes. Some people, as I remember, used this trick to build fast 68k
clones out of MSI components (e.g. EDGE, if I remember well). You then
want to recompile things, though.
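The "RISC'ify a subset, then recompile" idea amounts to a compiler that never
emits the memory-operand forms of ALU instructions. The toy Python sketch below
shows the rewrite such a compiler (or a cracking decoder) would apply to one
68K-style two-operand ALU instruction such as ADD. The textual syntax and the
scratch registers T0/T1 are hypothetical, and real 68K addressing modes are far
richer than the plain register-indirect `(An)` form handled here.

```python
# Toy rewrite of a 68K-style two-operand ALU instruction into a
# load/store sequence that uses only register-register ALU operations.
# Syntax and scratch registers T0/T1 are hypothetical; only the simple
# register-indirect "(An)" memory mode is recognised.
from typing import List


def is_memory(operand: str) -> bool:
    """True for a register-indirect memory operand such as '(A0)'."""
    return operand.startswith("(") and operand.endswith(")")


def risc_ify(op: str, src: str, dst: str) -> List[str]:
    """Rewrite `op src,dst` so the ALU op itself touches only registers."""
    seq: List[str] = []
    if is_memory(src):
        seq.append(f"LOAD {src},T0")  # bring the memory source into T0
        src = "T0"
    if is_memory(dst):
        # Read-modify-write destination: load, operate, store back.
        seq.append(f"LOAD {dst},T1")
        seq.append(f"{op} {src},T1")
        seq.append(f"STORE T1,{dst}")
    else:
        seq.append(f"{op} {src},{dst}")
    return seq


if __name__ == "__main__":
    for ins in [("ADD.L", "D1", "D0"), ("ADD.L", "(A0)", "D0"),
                ("ADD.L", "D1", "(A1)")]:
        print(ins, "->", risc_ify(*ins))
```

Register-register forms pass through untouched, which is the fast path a
simple pipeline can run at one per cycle; everything else becomes a short
sequence of simple operations.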

I think everybody remembers that when the PL.8 compiler was retargeted
to a RISC subset of 370 instructions, using only RR instructions for
everything except loads and stores, the generated code was *faster*
than otherwise; i.e. the 370 is already often implemented internally
as a RISC core with paraphernalia appended.

henry> The real question is, which one will scale to higher clock speeds

Well, things are not that simple. We have three alternatives really:

Pure RISC	You only get simple instructions and load/store.
		Code is big, CPU has low transistor count, instructions
		are fast.

Pure CISC	You only get complex instructions and no special casing.
		Code is small, CPU has medium transistor count, instructions
		are slow.

RISCy CISC	You get simple instructions and addressing modes
		implemented as if they were RISC; complex instructions
		and addressing modes are there for backwards
		compatibility.
		Code is small, CPU has large transistor count, there
		are both slow and fast instructions.

Actually there is another alternative, mostly used in mainframes, e.g.
some 370s and very high-end VAXes:

Super CISC	You have a super parallel CPU that decodes and executes
		complex instructions with lots of internal parallelism.
		Code is small, CPU has colossal transistor count,
		all instructions are fast.

henry> and more-than-one-instruction-per-cycle execution schemes better?
henry> Hint: the simpler one has a decided edge here.

In terms of cost-effectiveness there seems to be evidence that Pure RISC
is better than Pure CISC. The choice between RISCy CISC and Pure RISC is
not that clear, however. Architectural efficiency is comparable, so the
contest, as indicated by Spencer, may be decided by the much lower
transistor count of Pure RISC, which allows the use of more advanced
(faster, if less dense) technology.

There are, however, technical factors that favour RISCy CISC. One is
that higher code density, which conserves memory bandwidth, is not
irrelevant, and the so-called "RISC window", which opens when memory
gets relatively faster than CPUs, may be closing; another is the
ability to support rare but important applications better thanks to
the CISC part of the instruction set.
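The code-density argument is simple arithmetic: instruction-fetch traffic is
instructions executed times average instruction size. The Python sketch below
compares two encodings doing the same job; the instruction counts and the
3-byte CISC average are assumed numbers for illustration, not measurements of
any real chip.

```python
# Instruction-fetch traffic = instructions executed * average bytes each.
# All counts and sizes below are assumptions for illustration only.


def fetch_bytes(instructions: int, avg_instr_bytes: float) -> float:
    """Bytes of instruction stream fetched to run a job (before any cache)."""
    return instructions * avg_instr_bytes


def fetch_bandwidth(instructions: int, avg_instr_bytes: float,
                    job_seconds: float) -> float:
    """Bytes per second of fetch needed to finish the job in job_seconds."""
    return fetch_bytes(instructions, avg_instr_bytes) / job_seconds


if __name__ == "__main__":
    # Hypothetical job: RISC needs more instructions, all 4 bytes long;
    # CISC needs fewer, with an assumed 3-byte average length.
    risc = fetch_bytes(30_000, 4.0)
    cisc = fetch_bytes(25_000, 3.0)
    print(risc, cisc, risc / cisc)
    # Finishing in the same 1 ms, RISC must fetch 1.6x as many bytes/s.
    print(fetch_bandwidth(30_000, 4.0, 1e-3), fetch_bandwidth(25_000, 3.0, 1e-3))
```

Under these assumptions the denser encoding needs less instruction bandwidth
for the same work, which matters only once memory stops keeping up with the CPU.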

Non-technical considerations are that usually the best (fastest or
densest) technology is only available to the largest manufacturers,
which are however wedded to CISC architectures; in a sense RISC
therefore is how smaller players get comparable performance even if they
use less advanced technology (vide SPARC on a gate array).

My opinion is that a million-plus transistor budget would be better
spent on multiple SPARCs/MIPSes/M88Ks/29Ks/ARMs/NOVIXes per chip
rather than on a RISCy CISC, but the players who can afford a
million-plus transistor budget have a vested interest in old CISC
architectures. I also think RISCs had better do something about code
density, because the relative speed of memory and CPU may change again.
Stack-based rather than load-store RISCs are my favourite dream.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (08/31/90)

In article <PCG.90Aug29161206@athene.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> RISCy CISC	You get simple instructions and addressing modes
> 		implemented as if they were RISC; complex instructions
> 		and addressing modes are there for backwards
> 		compatibility.
> 		Code is small, CPU has large transistor count, there
> 		are both slow and fast instructions.

The CLIPPER tries to find a balance by having a RISC core + "macros".
There are nearly 70 of these macros, covering save/restore general
registers, conversions, and string instructions.  Think of them as
common subroutines that are always "in cache" and have a specially cheap
calling protocol.

(Actually, that reminds me a _lot_ of the DEC-10, with its UUOs.)

If I remember correctly, the 29000 does something similar,
except that the CLIPPER requires the arguments of its macros to be
in specific registers, while the 29000 has three registers that say
"your Nth operand, Mr Macro, comes from this user register".

This isn't RISCy CISC; it's CISCy RISC.  The CPU is basically RISC+ROM.
Same benefits as RISCy CISC except for backwards compatibility.
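The two macro operand conventions can be modeled in a few lines. In this toy
Python sketch a block-copy "macro" (standing in for a string macro) reads its
operands either from fixed registers, as described for the CLIPPER, or through
three pointer registers naming the user registers that hold each operand, as
described for the 29000. The register names r0..r2 and ipa/ipb/ipc and the copy
macro itself are hypothetical illustrations, not either chip's actual
definitions.

```python
# Toy model of "RISC + ROM" macro instructions with two operand conventions.
# Register names and the copy macro are hypothetical illustrations.
from typing import Dict, List

Regs = Dict[str, int]


def copy_body(mem: List[int], src: int, dst: int, n: int) -> None:
    """Macro body: a loop of simple core operations (forward copy;
    assumes source and destination do not overlap)."""
    for i in range(n):
        mem[dst + i] = mem[src + i]


def macro_copy_fixed(regs: Regs, mem: List[int]) -> None:
    """Fixed-register convention: r0 = source, r1 = dest, r2 = count."""
    copy_body(mem, regs["r0"], regs["r1"], regs["r2"])


def macro_copy_indirect(regs: Regs, mem: List[int], ptr: Dict[str, str]) -> None:
    """Indirect convention: ptr['ipa'], ptr['ipb'], ptr['ipc'] name the user
    registers holding source, destination and count."""
    copy_body(mem, regs[ptr["ipa"]], regs[ptr["ipb"]], regs[ptr["ipc"]])


if __name__ == "__main__":
    # Operands happen to live in r5..r7 when the macro is needed.
    regs = {"r5": 0, "r6": 4, "r7": 4}

    mem = [1, 2, 3, 4, 0, 0, 0, 0]
    # Fixed convention: three extra register moves before the call.
    regs["r0"], regs["r1"], regs["r2"] = regs["r5"], regs["r6"], regs["r7"]
    macro_copy_fixed(regs, mem)
    print(mem)

    mem = [1, 2, 3, 4, 0, 0, 0, 0]
    # Indirect convention: just point the macro at r5..r7.
    macro_copy_indirect(regs, mem, {"ipa": "r5", "ipb": "r6", "ipc": "r7"})
    print(mem)
```

In this model the trade-off shows up as the register moves a caller needs
before a fixed-register macro, against the extra pointer state the indirect
scheme must carry.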

-- 
You can lie with statistics ... but not to a statistician.

matloff@eeyore.Berkeley.EDU (Norman Matloff) (09/01/90)

In article <3643@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:

>You can lie with statistics ... but not to a statistician.

I couldn't pass this up without comment, since statistics affects our
daily lives so much. :-)

Contrary to your clever line here, I'd like to mention that statisticians 
tend to be among the greatest "statistical liars"  --  not intentionally,
but due to lack of understanding of the subtleties of statistics.  Don't
let someone with statistician credentials intimidate you into abandoning
your common sense.

   Norm

daveh@cbmvax.commodore.com (Dave Haynie) (09/07/90)

In article <CHUCK.PHILLIPS.90Aug25143508@halley.FtCollins.NCR.COM> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:

>Also, with the 040 trapping, then executing software routines for
>transcendental functions, I'd _expect_ at least the transcendentals to be
>computed slower than a Weitek or 882, in spite of the PR.

Why is that?  Weiteks don't have built-in transcendentals, either.  In fact,
at least those based around the most popular Weitek FPU core (the one in the
3167 and ill-fated 3168) don't have as many instructions as the '040 FPU.
And the '040 FPU appears to be roughly 2-3 times as fast as a 3167.  I have
yet to see any numbers on the Motorola transcendental code as compared to a
similar '882, though Motorola has been claiming it to be a tad faster.

>Chuck Phillips  MS440


-- 
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
      Get that coffee outta my face, put a Margarita in its place!

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (09/11/90)

In article <14263@cbmvax.commodore.com> daveh@cbmvax (Dave Haynie) writes:
In article <CHUCK.PHILLIPS.90Aug25143508@halley.FtCollins.NCR.COM> writes:

>>Also, with the 040 trapping, then executing software routines for
>>transcendental functions, I'd _expect_ at least the transcendentals to be
>>computed slower than a Weitek or 882, in spite of the PR.

>Why is that?  Weiteks don't have built-in transcendentals, either.  In
>
>Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"

Dave,
The Weitek has other advantages over the 68040.  No doubt the Weitek uses
64-bit (or thereabouts) registers for general-purpose operations such as
binary arithmetic shifts and adds.  These types of operations are necessary,
for example, in using the CORDIC algorithm to approximate sin, cos, sincos,
exp, log, asin, acos, atan, sinh, and cosh.  The 040 is limited to 32-bit
binary shifts and adds. 
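For readers who haven't met CORDIC: it computes sin and cos by rotating a vector
through a fixed sequence of angles atan(2^-i), so each step needs only shifts,
adds, and a table lookup, and each step yields roughly one more bit of result.
That is why the fixed-point word width matters to this argument. The Python
sketch below is a textbook rotation-mode CORDIC, not Motorola's or Weitek's
code. Python integers are unbounded, so the `frac_bits` parameter stands in for
the register width (about 30 fraction bits would fit a 32-bit register).

```python
# Textbook rotation-mode CORDIC using only integer shifts and adds in the
# main loop. frac_bits emulates the fixed-point register width (Python ints
# are unbounded, so nothing here actually overflows).
import math
from typing import Tuple


def cordic_sin_cos(theta: float, frac_bits: int) -> Tuple[float, float]:
    """Return (sin(theta), cos(theta)) for |theta| <= pi/2."""
    one = 1 << frac_bits
    n = frac_bits  # roughly one bit of result per iteration
    # Constants, precomputed in floating point and then quantised:
    # arctangent table for 2**-i and the reciprocal CORDIC gain K.
    atans = [round(math.atan(2.0 ** -i) * one) for i in range(n)]
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = round(k * one), 0, round(theta * one)
    for i in range(n):
        # Rotate by +/- atan(2**-i); ">>" is an arithmetic shift.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - atans[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + atans[i]
    return y / one, x / one


if __name__ == "__main__":
    for bits in (16, 30, 50):
        s, c = cordic_sin_cos(1.0, bits)
        print(bits, abs(s - math.sin(1.0)), abs(c - math.cos(1.0)))
```

The error shrinks as `frac_bits` grows, so a double-precision result by this
route needs more than 32 bits of working precision.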

I've got a feeling, Dave, that the 040 will do the transcendental functions
with ieeesp (32-bit) real fast (probably quicker than, and just as accurately
as, the 68882 at the same clock), but ieeedp (64-bit) is going to require a bit
of ingenuity (magic), methinks.  Just my opinion at this time.....

Al Aburto
aburto@marlin.nosc.mil

chinds@oakhill.UUCP (Chris Hinds) (09/13/90)

aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>Dave,
>The Weitek has other advantages over the 68040.  No doubt the Weitek uses
>64-bit (or thereabouts) registers for general-purpose operations such as
>binary arithmetic shifts and adds.  These types of operations are necessary,
>for example, in using the CORDIC algorithm to approximate sin, cos, sincos,
>exp, log, asin, acos, atan, sinh, and cosh.  The 040 is limited to 32-bit
>binary shifts and adds. 

>I've got a feeling, Dave, that the 040 will do the transcendental functions
>with ieeesp (32-bit) real fast (probably quicker than, and just as accurately
>as, the 68882 at the same clock), but ieeedp (64-bit) is going to require a bit
>of ingenuity (magic), methinks.  Just my opinion at this time.....

Alfred,

A little bit more information for your opinion...

You are correct in that the 040 is only capable of 32-bit binary shifts and
adds, etc. in integer code.  However, the algorithms chosen for transcendental
emulation make use primarily of the FPU on the 040 and, like the 68882,
are done completely in IEEE extended-precision arithmetic.  Accuracy is
guaranteed within the same bounds, and the emulation is as fast as a 33-MHz
68030 system with a 68882 coprocessor for FPU support.  So your comment about
the speed of single being different from double is not accurate.  The 040 FPU
is optimized for double, but with emulation code that difference disappears,
and the computation takes equal time for all sizes of IEEE floating-point
formats supported.

The 040 takes an unimplemented instruction trap on all transcendental 
instructions, so part of the time to process the instruction is overhead
of the trap, stack, etc.  The same emulation code, if used as a library,
would be faster by as much as 33% over the trap and emulate mode.  
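The "as much as 33%" figure can be turned into the share of a trapped call that
is pure trap/stack overhead, but it depends on how the percentage is read. If
the library call is 4/3 as fast (33% higher throughput), the overhead is
1 - 3/4, about a quarter of the trapped time; if the library call takes 33%
less time, the overhead is a third. Both readings are assumptions, and the
100-unit trapped-call cost below is hypothetical; the Python only does the
arithmetic.

```python
# Share of a trapped transcendental call spent on trap/stack overhead,
# under two readings of "faster by as much as 33%" (both assumptions).


def overhead_share(trap_time: float, lib_time: float) -> float:
    """Fraction of the trapped call that is trap/stack overhead."""
    return (trap_time - lib_time) / trap_time


if __name__ == "__main__":
    trap = 100.0  # hypothetical cost of one trapped call, arbitrary units
    # Reading 1: the library call is 4/3 as fast (33% higher throughput).
    lib_a = trap / (4.0 / 3.0)
    # Reading 2: the library call takes 33% less time.
    lib_b = trap * (1.0 - 0.33)
    print(lib_a, overhead_share(trap, lib_a))
    print(lib_b, overhead_share(trap, lib_b))
```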

Chris

*************************************************
*   Motorola Microprocessor Products Sector     *
*   Austin, Tx                                  *
*                                               *
*   Chris N. Hinds <><     Standard Disclaimers *
*   oakhill!wtkatz!chinds@cs.utexas.edu         *
*   chinds@oakhill.sps.mot.com                  *
*************************************************

daveh@cbmvax.commodore.com (Dave Haynie) (09/14/90)

In article <1477@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>In article <14263@cbmvax.commodore.com> daveh@cbmvax (Dave Haynie) writes:
>In article <CHUCK.PHILLIPS.90Aug25143508@halley.FtCollins.NCR.COM> writes:

>>>Also, with the 040 trapping, then executing software routines for
>>>transcendental functions, I'd _expect_ at least the transcendentals to be
>>>computed slower than a Weitek or 882, in spite of the PR.

>>Why is that?  Weiteks don't have built-in transcendentals, either.  In

>Dave,
>The Weitek has other advantages over the 68040.  No doubt the Weitek uses
>64-bit (or thereabouts) registers for general-purpose operations such as
>binary arithmetic shifts and adds.

Where do you find this information?  My WTL 3167 manual lists only instructions
for movement between registers, format conversions, floating point comparisons, 
floating point add, floating point subtract, floating point multiply, 
floating point multiply-accumulate, floating point division, floating point
square root, floating point sign manipulation, and a couple of paging things.
Any efficient binary manipulation would have to be done in 32 bit '386
registers.  Unless there are some undocumented instructions I'm missing.
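If wide binary manipulation falls to the 32-bit '386 registers, every 64-bit
add or shift costs several 32-bit operations. The Python sketch below models a
64-bit value as two 32-bit halves: the carry handling corresponds to what an
ADD/ADC pair does on a '386, and the cross-half shift to what SHRD followed by
SAR does. The helper names are hypothetical.

```python
# 64-bit add and arithmetic right shift built from 32-bit register halves.
from typing import Tuple

MASK32 = 0xFFFF_FFFF


def split64(v: int) -> Tuple[int, int]:
    """Split a 64-bit unsigned value into (hi, lo) 32-bit halves."""
    return (v >> 32) & MASK32, v & MASK32


def join64(hi: int, lo: int) -> int:
    """Recombine 32-bit halves into a 64-bit unsigned value."""
    return (hi << 32) | lo


def add64(a_hi: int, a_lo: int, b_hi: int, b_lo: int) -> Tuple[int, int]:
    """Low halves add first; the carry out feeds the high add (ADD, then ADC)."""
    lo = a_lo + b_lo
    carry = lo >> 32
    return (a_hi + b_hi + carry) & MASK32, lo & MASK32


def sar64(hi: int, lo: int, n: int) -> Tuple[int, int]:
    """Arithmetic right shift by 0 <= n < 32: bits from hi move into lo
    (SHRD), then hi shifts with sign extension (SAR)."""
    if n == 0:
        return hi, lo
    new_lo = ((lo >> n) | (hi << (32 - n))) & MASK32
    signed_hi = hi - (1 << 32) if hi & 0x8000_0000 else hi
    return (signed_hi >> n) & MASK32, new_lo


if __name__ == "__main__":
    a, b = split64(0x0000_0001_FFFF_FFFF), split64(1)
    print(hex(join64(*add64(*a, *b))))
    print(hex(join64(*sar64(*split64(0x8000_0000_0000_0000), 4))))
```

Each 64-bit step here takes two or more 32-bit operations, which is the cost
any wide shift-and-add scheme pays on 32-bit integer registers.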

>Al Aburto
>aburto@marlin.nosc.mil


-- 
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
      Get that coffee outta my face, put a Margarita in its place!