[comp.arch] 68040

baum@Apple.COM (Allen J. Baum) (02/02/90)

[]
>In article <3426@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>
>And the doubling of the RAM banks from 2 to 4 eats a lot of real estate
>(remember, the 68040 has a 64-bit bus) and gives a few NuBus problems.

Try again. My brand, spanking new 68040 manual shows 32 pins for the data
bus. Try again a lot.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

karl@ficc.uu.net (Karl Lehenbauer) (02/02/90)

In article <7MD152Axds13@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Considering that the 68000 beat the pants off the
>80286, and it was only an accident that IBM used the 8086 family in the PC,
>it looks like Motorola was years ahead all the way. It's not until the 80386
>that Intel produced a chip as desirable as the old 68000.

Yeah, the 68000 was 32-bits internally and 16-bits externally (making it 
roughly equivalent to the 386SX, but of course without an MMU.)

That made the evolution of the 68000 series to full 32-bit parts trivial
from a software standpoint.  Most 68K code ran unmodified on the 68020 and
68030, and new compilers were not required to take advantage of their 
newly widened external busses.

Registers had to grow on the 80x86 series from the 8086 to the 286 to the 386, 
and the fact that almost every 286 and 386 ever sold is still running in 8086 
emulation mode is living proof that the 8086 family is not as easy to evolve 
as the 68000, and that a *lot* of people are still having to pay for 
early MS-DOS design errors.  (For example, there was no BIOS call to write 
a string to the display, so programs did it directly for speed, and there was 
no early, articulated-to-developers plan for evolving to multitasking.)
-- 
-- uunet!ficc!karl "...as long as there is a Legion of super-Heroes, 
   uunet!sugar!karl       all else can surely be made right."   -- Sensor Girl

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/02/90)

In article <38264@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

| >And the doubling of the RAM banks from 2 to 4 eats a lot of real estate
| >(remember, the 68040 has a 64-bit bus) and gives a few NuBus problems.
| 
| Try again. My brand, spanking new 68040 manual shows 32 pins for the data
| bus. Try again a lot.

  Did he mean 32 data + 32 address or something?
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

pkr@maddog.sgi.com (Phil Ronzone) (02/03/90)

In article <38264@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <3426@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>>
>>And the doubling of the RAM banks from 2 to 4 eats a lot of real estate
>>(remember, the 68040 has a 64-bit bus) and gives a few NuBus problems.
>
>Try again. My brand, spanking new 68040 manual shows 32 pins for the data
>bus. Try again a lot.



Ooops oops oops -- wrong Motorola part. Just been hanging around MIPS chips
lately -- how quickly we forget. Duh .....



------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone   Manager Secure UNIX           pkr@sgi.COM   {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc.               "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------

thomas@trane.UUCP (Thomas Driemeyer) (02/03/90)

In <3426@odin.SGI.COM>, pkr@maddog.sgi.com (Phil Ronzone) writes:
> And the doubling of the RAM banks from 2 to 4 eats a lot of real estate
> (remember, the 68040 has a 64-bit bus) and gives a few NuBus problems.

You appear to know a lot more about the 68040 than I do. I haven't seen
a data book yet, but I am interested in this chip. Could someone please
post some data?

Some of my questions are, does it have dynamic bus sizing? Is that 64-bit
bus actually two busses, I/D, or is it a wider path to the caches? Does it
support burst cache fills? How fast is the fpu? If it really has a
six-stage pipeline, how does it deal with branches? I assume its user mode
is 100% compatible with the 68020/30, so branch prediction and delay slots are
out. Does it still have micro/nanocode, other than for complex fpu ops?
User context is some 150 bytes, do registers have dirty bits or something?

And the most difficult one: When will it be available, and how much will
it cost?

Thanks!


Thomas Driemeyer
pyramid!trane!thomas

joey@hpkslx.HP.COM (Joey Trevino contractor) (02/06/90)

Try the latest Byte magazine for a First Look at the 68040.

srg@quick.COM (Spencer Garrett) (02/06/90)

> Some of my questions are, does it have dynamic bus sizing?
	Nope.  That's one of the major differences between
	the 680[23]0 and the 68040 bus.
> Is that 64-bit bus actually two busses, I/D,
> or is it a wider path to the caches?
	The external bus is 32 bits, but there are separate
	I and D caches on the chip, and the instruction unit
	uses Harvard Architecture (ie - separate I/D busses).
> Does it support burst cache fills?
	It much prefers them.
> How fast is the fpu?
	Supposedly quite fast for the operations it supports, but
	I don't have any actual numbers.  The 68040 fpu only does
	add, sub, mul, div, cmp, abs, neg, and sqrt, though it does
	all the data type combinations for the above.  Everything
	else traps and gets emulated in software.
> If it really has a six-stage pipeline, how does it deal with branches?
> I assume its user mode is 100% compatible to the 68020/30,
> so branch prediction and delay slots are out.
	I don't know the details here, but all exceptions are precise,
	and I don't think there's any out-of-order execution.  It is
	fully compatible with its predecessors in user mode,
	and even nearly compatible in supervisor mode.
> Does it still have micro/nanocode, other than for complex fpu ops?
	I don't know for sure, but I'd bet there's no nanocode, and
	probably not much microcode.  The only "complex" fpu op is
	sqrt, so maybe they bit the bullet and hardwired it all.
> User context is some 150 bytes, do registers have dirty bits or something?
	User context varies depending upon the source of irritation.
	For a "normal" context swap (supervisor mode entered by interrupt
	or trap) the integer unit state is 8 bytes on the stack and
	64 bytes of registers.  Fpu state is 4 bytes if you haven't
	used the fpu since it was last reset or 100 bytes if you're
	using it but it isn't unhappy at the moment.  A faulting
	instruction will generate considerably more context, but
	those are (hopefully) rare.
> And the most difficult one: When will it be available, and how much will
> it cost?
	I'm hearing summer and $750.

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (02/06/90)

srg@quick (Spencer Garrett) writes:
] 	I don't know for sure, but I'd bet there's no nanocode, and
] 	probably not much microcode.  The only "complex" fpu op is
] 	sqrt, so maybe they bit the bullet and hardwired it all.

Sqrt is not much more complex than division, actually. Look at
the manual procedures you were taught for each : very similar.
I think I've seen array circuit designs that did mul, div and sqrt. 
I think I remember that once you've built an array divider,
it's not hard to make it do square-root as well.

Unfortunately, I can't find the paper I'm thinking of
in the office. Sorry.
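
In code, the similarity is hard to miss.  Here's a minimal sketch of the
textbook restoring algorithms (function names and the 16-bit width are my
own illustrative choices, not taken from any of the array papers): division
and square root are the same shift/trial-subtract loop, differing only in
what gets subtracted.

```python
def restoring_divide(dividend, divisor, bits=16):
    """One quotient bit per iteration: shift in a dividend bit, trial-subtract."""
    rem, quot = 0, 0
    for i in range(bits - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)  # bring down next dividend bit
        quot <<= 1
        if rem >= divisor:                        # trial subtraction succeeds?
            rem -= divisor
            quot |= 1
    return quot, rem

def restoring_isqrt(radicand, bits=16):
    """One root bit per iteration: shift in a bit PAIR, trial-subtract 4*root+1."""
    rem, root = 0, 0
    for i in range(bits - 2, -2, -2):
        rem = (rem << 2) | ((radicand >> i) & 3)  # bring down next radicand bit pair
        trial = (root << 2) | 1                   # subtrahend is 4*root + 1
        root <<= 1
        if rem >= trial:
            rem -= trial
            root |= 1
    return root, rem
```

The root loop shifts in two radicand bits and trial-subtracts (4*root + 1)
where the divide loop shifts in one bit and trial-subtracts the divisor,
which is why shared div/sqrt hardware is plausible.
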
--
  Dennis O'Connor      OCONNORDM@CRD.GE.COM      UUNET!CRD.GE.COM!OCONNOR
  Science and Religion have this in common : you must take care to
  distinguish both from the people who claim to represent each of them.

henry@utzoo.uucp (Henry Spencer) (02/07/90)

In article <5120@crdgw1.crd.ge.com> oconnordm@CRD.GE.COM (Dennis M. O'Connor) writes:
>] 	... The only "complex" fpu op is sqrt...
>
>Sqrt is not much more complex than division, actually. Look at
>the manual procedures you were taught for each : very similar.
>I think I've seen array circuit designs that did mul, div and sqrt. 

The IBM 801 people mentioned in one paper that one reason they liked having
a divide-step instruction rather than a full divide was that a carefully-
designed divide-step could double as a sqrt-step.

The IEEE floating-point folks thought sufficiently highly of the utility
of sqrt, and sufficiently lowly :-) of its complexity compared to divide,
that it's a required operation for an IEEE FP implementation.  Looks like
the 040's FP box implements exactly what's needed for IEEE conformance,
in fact.
-- 
SVR4:  every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

colwell@mfci.UUCP (Robert Colwell) (02/07/90)

In article <5120@crdgw1.crd.ge.com> oconnordm@CRD.GE.COM (Dennis M. O'Connor) writes:
>srg@quick (Spencer Garrett) writes:
>] 	I don't know for sure, but I'd bet there's no nanocode, and
>] 	probably not much microcode.  The only "complex" fpu op is
>] 	sqrt, so maybe they bit the bullet and hardwired it all.
>
>Sqrt is not much more complex than division, actually. Look at
>the manual procedures you were taught for each : very similar.
>I think I've seen array circuit designs that did mul, div and sqrt. 
>I think I remember that once you've built an array divider,
>it's not hard to make it do square-root as well.
>
>Unfortunately, I can't find the paper I'm thinking of
>in the office. Sorry.

An array divider?  Are you sure?  That's a new one on me.  Who
does division this way?  I know about array multipliers, and folks 
who do division based on Newton Raphson approximation (and sqrts too,
for that matter.)  And there are parts like Weitek's and BIT's that 
generate quotients and roots via iterative methods like radix-4 
non-restoring algorithms.  For those, sqrt is exactly twice as hard, 
because you get only one bit of root per iteration, where division
yields two.

I did see a paper about two years ago in IEEE Trans. on Computers
that showed a way to get two sqrt bits per iteration, but I
recall thinking it would be hard to implement it (then).
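
For the Newton-Raphson route, a minimal sketch (my own code with the classic
48/17 linear seed, not anything out of a real FPU datapath): the recurrence
r <- r*(2 - d*r) converges to 1/d using only multiplies and subtracts, and
each pass roughly doubles the number of correct bits.

```python
import math

def nr_divide(n, d, steps=4):
    # Assumes d > 0; a real implementation also handles signs and specials.
    m, e = math.frexp(d)            # d = m * 2**e, with 0.5 <= m < 1
    r = 48/17 - (32/17) * m         # classic linear seed, good to ~4 bits
    for _ in range(steps):
        r = r * (2.0 - m * r)       # quadratic convergence: bits double
    return n * math.ldexp(r, -e)    # 1/d = (1/m) * 2**-e; one final multiply
```

Four passes take the ~4-bit seed past double precision, which is why a fast
multiplier is the whole ballgame for this approach.
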

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (02/07/90)

] In <> oconnordm@CRD.GE.COM (Dennis M. O'Connor) writes:
] Sqrt is not much more complex than division, actually. Look at
] the manual procedures you were taught for each : very similar.
] I think I've seen array circuit designs that did mul, div and sqrt. 
] I think I remember that once you've built an array divider,
] it's not hard to make it do square-root as well.
] 
] Unfortunately, I can't find the paper I'm thinking of
] in the office. Sorry.

I found the paper :

  "A Generalized Pipeline Array"
  A.K. Kammal, Harpreet Singh, and P.G. Agrawal
  IEEE Transactions on Computers, May 1974, pp 533-536

Quoting the abstract:

  "Abstract - A generalized pipeline cellular array has been proposed
   which can perform all the basic operations such as multiplication,
   division, squaring, and square rooting. The different modes of
   operation are controlled by a single control line. An expression for time
   delay has been obtained. Further, it has been shown that these
   arithmetic operations can be overlapped in the pipe in any desired
   sequence, and thus significant speed improvement can be achieved.
   The array is fully iterative and hence is suitable for large-scale
   integration."

The authors were (and might still be?) with the Department of Electronics
and Communication Engineering, University of Roorkee, India.
--
  Dennis O'Connor      OCONNORDM@CRD.GE.COM      UUNET!CRD.GE.COM!OCONNOR
  Science and Religion have this in common : you must take care to
  distinguish both from the people who claim to represent each of them.

jsweedle@mipos2.intel.com (Jonathan Sweedler) (02/08/90)

In article <1224@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>
>An array divider?  Are you sure?  That's a new one on me.  Who
>does division this way?  

Take a look at "Computer Arithmetic - Principles, Architecture and Design"
by Kai Hwang.  Chapter 8 is entitled "Convergence Division and Cellular
Array Dividers."

Also, if you take a look at the November issue of High Performance
Systems, there is an article called "Advancing the Standard" by Tom
Brightman of Cyrix Corp.  He claims that Cyrix has a method for doing
"radix 128k" division and square root.  This gives 17 bits per iteration
(note, this isn't per cycle).  This method uses the multiply array.
===============================================================================
Jonathan Sweedler, Microprocessor Design, Intel Corp.
UUCP: {decwrl,hplabs,oliveb}!intelca!mipos3!mipos2!jsweedle
ARPA: jsweedle%mipos2.intel.com@relay.cs.net

aglew@oberon.csg.uiuc.edu (Andy Glew) (02/08/90)

>>srg@quick (Spencer Garrett) writes:
>>I think I've seen array circuit designs that did mul, div and sqrt. 
>>I think I remember that once you've built an array divider,
>>it's not hard to make it do square-root as well.
>
>An array divider?  Are you sure?  That's a new one on me.  Who
>does division this way?  I know about array multipliers, and folks 
>who do division based on Newton Raphson approximation (and sqrts too,
>for that matter.)  And there are parts like Weitek's and BIT's that 
>generate quotients and roots via iterative methods like radix-4 
>non-restoring algorithms.  For those, sqrt is exactly twice as hard, 
>because you get only one bit of root per iteration, where division
>yields two.
>
>I did see a paper about two years ago in IEEE Trans. on Computers
>that showed a way to get two sqrt bits per iteration, but I
>recall thinking it would be hard to implement it (then).
>
>Bob Colwell               ..!uunet!mfci!colwell


There are a number of designs for array dividers. (I'll hunt up papers
if you want.  A good selection of references should be in the upcoming
IEEE Transactions on Computers special issue on Computer Arithmetic).

Most array dividers, however, are iterative.  The time through the array
is O(n) (and, of course, the area is O(n^2)).
    Full array multipliers come in the iterative O(n) time variety,
where all the additions propagate linearly across the array.  Much
more interesting are array multipliers with embedded trees to sum the
partial products: 3:2 Wallace trees or 2:1 trees, usually using a
redundant form.  These get O(log n) time performance out of the array.
This sort of thing seems to be the state of the art - M88K, AMD's latest.
(Amusing side note: all the old VLSI papers used to say that Wallace trees
were too hard to lay out, too irregular.  A circuit designer said to me
<<Obviously, nobody who said that ever tried to lay one out. They're easy.>>)
    I have not seen any array dividers with better than O(n) time.
It seems that the division algorithms are inherently "Choose some bits;
Form partial quotient and remainder; Repeat", i.e. the linearity seems inherent
in everything except Newton-Raphson or multiplicative approaches. 


Higher radix square root - quite a few papers report 2 bits/cycle.
Fandrianto has reported 3 (and implied 4). Cyrix, with their 17
bit/cycle divide, could likely use NR to get mongo bits for sqrt - I'm
not sure if they did.
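
For the sqrt side of that, a minimal sketch of the multiplicative approach
(my own code and seed choice; no claim this is what Cyrix does): Newton-Raphson
on y = 1/sqrt(m) also doubles its correct bits per step, and the root comes
back with one final multiply, since m*y approaches sqrt(m).

```python
import math

def nr_sqrt(x, steps=7):
    # Assumes x > 0. Iterates y <- y*(3 - m*y*y)/2, which converges to
    # 1/sqrt(m); each step roughly doubles the number of correct bits.
    m, e = math.frexp(x)                 # x = m * 2**e, 0.5 <= m < 1
    if e % 2:                            # make e even: sqrt of 2**e is a shift
        m, e = m * 2.0, e - 1
    y = 1.0 / m                          # crude seed (my choice), a bit or two
    for _ in range(steps):
        y = y * (1.5 - 0.5 * m * y * y)
    return m * y * math.ldexp(1.0, e // 2)   # m*y ~= sqrt(m); scale by 2**(e/2)
```

Seven steps only because the seed is so crude; a small lookup table for the
first few bits, which is what the high-radix parts effectively have, would cut
that to two or three.
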

References:

%A Kuninnobu
%A Nisiyama
%A Edamatsu
%A Taniguchi
%A Takagi
%T Design of High Speed MOS Multiplier and Divider
Using Redundant Binary Representation
%J ARITH8
%D 1987
%P 80-86
%X Uses redundant {-1,0,1} arithmetic to obtain 2:1 adders;
pre-manipulation to use only n/4 terms.
    I know these guys used an array multiplier. They may have used
an array divider - I'm not sure.


The following papers describe higher radix sqrt algorithms.

%A Bruce Gene De\0Lugish
%T A Class of Algorithms for Automatic Evaluation
of Certain Elementary Functions in a Binary Computer
%J University of Illinois PhD thesis
%D June 1, 1970
%C Urbana-Champaign
%K Lugish
%X Normalization digit selection algorithms for 
multiply, divide, ln, exp,
sqrt, tan, cotan, sin, cos, arctan, arcsin, arccos.
Additive and multiplicative digit based algorithms.
Suggested hardware structure.

%A Catherine Yuk-Fun Chow
%T A Variable Precision Processor Module
%J University of Illinois PhD thesis
%D July 1980
%C Urbana-Champaign
%X Predecessor to Carter's work.

%A Jan Fandrianto
%T Algorithm for High-Speed Shared Radix 4 Division
and Radix 4 Square Root
%K radix4
%J ARITH8
%D 1987
%P 73-79
%X Probably the paper Bob Colwell found "hard to do".

%A Fandrianto
%T Algorithm for High Speed Shared Radix 8 Division and Radix 8 Square Root
%J ARITH9
%K radix8
%P 68-75
%X Doesn't cascade like Taylor's radix 16 = radix 4 squared.
But could be considered a radix 2 cascaded with a radix 4 implementation,
thus avoiding the need for a 3x generation.

%A Montuschi
%A Ciminiera
%T On the Efficient Implementation of Higher Radix Square Root Algorithms
%J ARITH9
%P 154-161
%X Uses 7 bits to select radix 4 digit.

%A Ercegovac
%A Lang
%T Radix 4 Square Root Without Initial PLA
%J ARITH9
%P 162-168
--
Andy Glew, aglew@uiuc.edu

jkrueger@dgis.dtic.dla.mil (Jon) (02/21/90)

daveh@cbmvax.commodore.com (Dave Haynie) writes:

>['040] starts processing both the taken and not-taken path when it
>comes to a branch, at least up to a point. 

Kinda gives "what if analysis" a whole new meaning, eh?

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
The Philip Morris Companies, Inc: without question the strongest
and best argument for an anti-flag-waving amendment.

greg@sce.carleton.ca (Greg Franks) (02/22/90)

In article <9746@cbmvax.commodore.com> daveh@cbmvax.cbm.commodore.com (Dave Haynie) writes:
>The other real interesting thing is that some of the CISCish stuff, like
>address register increment/decrement or offset are handled at their own
>pipeline stages, so many of the non-simple instructions/addressing modes
>still execute in a single effective cycle.  I'm certain more of the
>really weird addressing modes must be handled in several cycles, though
>most compilers don't use these anyway.

The latest issue of IEEE micro is a summary of the hot-chips
symposium.  (Overall, it's a very good issue.)  The one thing that
struck me as interesting with regard to the new 68040 is that the
"optimized effective address" modes are almost the same as those
supported by the PDP-11.  

The article also states that the fancy addressing modes are twice as
fast on the 040 as they are on the 030 and 020.  However, since I
don't have the data books, I can't say how one complex address mode
instruction stacks up against 'n' optimized address mode instructions.

Finally, from my days of kernel writing for embedded real-time
systems, I can state that I didn't find the "quadruple indexed through
pc on full moon" type address modes very necessary at all.

-- 
Greg Franks   (613) 788-5726     Carleton University,             
uunet!mitel!sce!greg (uucp)  	 Ottawa, Ontario, Canada  K1S 5B6.
greg@sce.carleton.ca (bitnet)	 (we're on the internet too. (finally))
Overwhelm them with the small bugs so that they don't see the big ones.

iyengar@grad2.cis.upenn.edu (Anand Iyengar) (02/24/90)

In article <769@dgis.dtic.dla.mil> jkrueger@dgis.dtic.dla.mil (Jon) writes:
>daveh@cbmvax.commodore.com (Dave Haynie) writes:
>>['040] starts processing both the taken and not-taken path when it
>>comes to a branch, at least up to a point. 
>Kinda gives "what if analysis" a whole new meaning, eh?
	Interesting that HP was one of the early people to announce using it...

							Anand.  
--
"It just doesn't pay to make money, any more."
{inter | bit}net: iyengar@eniac.seas.upenn.edu
uucp: !$ | uunet
--- Lbh guvax znlor vg'yy ybbx orggre ebg-guvegrrarg? ---