[comp.sys.amiga.tech] MC68881/2 Support

djh@dragon.metaphor.com (Dallas J. Hodgson) (05/31/90)

If you're like me, you're tired of all the software that doesn't take
advantage of your FFP. The way things work right now, we've got too many
floating point standards, such as:

	mathffp.library
	ieee.library	(double precision)
	ieee.library	(new single precision, recognizes 6888x for 2.0)
	compiler-specific math libraries
	in-line FFP instructions

Since the FFP instructions trap out thru the FLINE vector anyway (if there's
no coprocessor present) why don't we EMULATE a 6888x when the traps occur?
Put this support in the RKM, and let all the compilers generate in-line FFP
instructions. Yes, there's a small amount of overhead for non-FFP users, but
it's no different from the PC way of doing things - and very
system-friendly! Let's say GOODBYE to the math libraries; if a user wants to
install a Weitek in the future, the vendor(s) can supply the driver software
themselves - and everything will still work.
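For the curious, the trap handler's first job is just recognizing the instruction. All 68k coprocessor opcodes carry $F in the top nibble of the operation word, with a coprocessor ID in bits 11-9 (the 6888x is conventionally cpID 1). A minimal sketch of that filter in C (C chosen for illustration only; a real handler would be 68k assembly hanging off the FLINE vector):

```c
#include <stdint.h>

/* An F-line operation word has 1111 in bits 15-12.  For coprocessor
 * instructions, bits 11-9 hold the coprocessor ID (the 68881/2 is
 * conventionally cpID 1). */
int is_fline(uint16_t opword)       { return (opword & 0xF000) == 0xF000; }
int coprocessor_id(uint16_t opword) { return (opword >> 9) & 0x7; }

/* A trap handler would first filter for FPU traffic before decoding
 * any further; anything else gets passed along or raised as an
 * illegal instruction. */
int is_fpu_instruction(uint16_t opword)
{
    return is_fline(opword) && coprocessor_id(opword) == 1;
}
```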

Just a thought.
+----------------------------------------------------------------------------+
| Dallas J. Hodgson               |     "This here's the wattle,             |
| Metaphor Computer Systems       |      It's the emblem of our land.        |
| Mountain View, Ca.              |      You can put it in a bottle,         |
| USENET : djh@metaphor.com       |      You can hold it in your hand."      |
+============================================================================+
| "The views I express are my own, and not necessarily those of my employer" |
+----------------------------------------------------------------------------+

gpsteffler@tiger.uwaterloo.ca (Glenn Steffler) (05/31/90)

In article <1181@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J. Hodgson) writes:
>
>Since the FFP instructions trap out thru the FLINE vector anyway (if there's
>no coprocessor present) why don't we EMULATE a 6888x when the traps occur?

I am not sure about the 680x0 line of processors, but the 80x86 line
definitely slows down when a lot of coprocessor traps occur.  The best
way to do it (IMNSHO) would be to place code in the program which calls
the floating point library function.  If a coprocessor is available, the
library runtime link code would replace the library calls with actual floating
point processor instructions (like, fix up the code man).
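One way to picture "fix up the code": route every call through a pointer that starts out aimed at a resolver; the first call detects the hardware and rewrites the pointer, so every later call goes straight through with no test. A hypothetical C sketch (the detection stub and all names are invented; patching actual call sites into inline FPU opcodes is the same idea taken one step further):

```c
/* Two backends: an emulation path, and what would really be a thin
 * stub around an inline FMUL on coprocessor-equipped machines. */
static double mul_software(double a, double b) { return a * b; }
static double mul_fpu(double a, double b)      { return a * b; }

static int fpu_present(void) { return 0; }  /* stand-in for real detection */

static double mul_resolve(double a, double b);

/* All callers go through this pointer; it starts at the resolver. */
static double (*fp_mul)(double, double) = mul_resolve;

/* First call only: pick the right implementation, patch the pointer,
 * then fall through to it.  Later calls never pay for the test. */
static double mul_resolve(double a, double b)
{
    fp_mul = fpu_present() ? mul_fpu : mul_software;
    return fp_mul(a, b);
}
```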

----
Attention Hazy:

Does the Amiga handle faults and other hardware interrupts faster than
a generic PC?  I know the 80386 is very slow when interrupts start
wailing away in protected mode.
----

(Microsoft Windows 3.0 math libraries are available which do this, and
 in fact I believe Microsoft Excel uses them)

>Put this support in the RKM, and let all the compilers generate in-line FFP
>instructions. Yes, there's a small amount of overhead for non-FFP users, but

Like I said, WAY TOO MUCH OVERHEAD...

>it's no different from the PC way of doing things - and very
>system-friendly! Let's say GOODBYE to the math libraries; if a user wants to
>install a Weitek in the future, the vendor(s) can supply the driver software
>themselves - and everything will still work.

Not necessarily the way the PC does it...
>
>Just a thought.

Just a reply.

----
Co-Op Scum

valentin@cbmvax.commodore.com (Valentin Pepelea) (05/31/90)

In article <1181@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J.
Hodgson) writes:
>
> Since the FFP instructions trap out thru the FLINE vector anyway (if there's
> no coprocessor present) why don't we EMULATE a 6888x when the traps occur?

Too much time would be spent decoding and emulating the coprocessor
instructions. And what if someone plugs in a Weitek math coprocessor? The
math coprocessors execute in 30 cycles what would take 3000 cycles otherwise.
Emulating the instruction in software would take more than that. And what if
the user has merely a 68000? Only the 68010 and higher processors have
instruction suspension-completion mechanisms.

The correct solution is to provide a shared library of math functions, and
that's what we do. Those functions automatically take advantage of whatever
hardware the user has, so the programmer does not have to worry about
the configuration of the platform on which his software is about to run.
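In C terms, such a shared library boils down to a jump table filled in once, at open time, according to the hardware found; callers never know which backend they got. A rough sketch (all names invented; a real Amiga library uses a 68k jump table and register conventions, not a C struct):

```c
/* One slot per library entry point. */
struct mathlib {
    double (*add_fn)(double, double);
    double (*mul_fn)(double, double);
};

static double add_soft(double a, double b) { return a + b; }
static double mul_soft(double a, double b) { return a * b; }

/* Filled in once at "OpenLibrary" time.  With a 6888x present, the
 * slots would instead point at thin stubs around FADD/FMUL. */
static void mathlib_open(struct mathlib *lib, int have_fpu)
{
    (void)have_fpu;          /* only one backend in this sketch */
    lib->add_fn = add_soft;
    lib->mul_fn = mul_soft;
}

/* A caller never knows (or cares) which backend it got: */
static double caller_demo(void)
{
    struct mathlib m;
    mathlib_open(&m, 0);
    return m.add_fn(1.5, m.mul_fn(2.0, 3.0));   /* 1.5 + 6.0 */
}
```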

The 68040 does not have a floating point coprocessor, nor the coprocessor
interface of the 68020 & 68030. But it implements the most used instructions
in hardware, and lets software emulate the remaining, less used instructions.
The 25MHz 68040 will therefore achieve superior performance to a 33MHz
68030/68882. Now that makes sense.

Valentin
-- 
The Goddess of democracy? "The tyrants     Name:    Valentin Pepelea
may distroy a statue,  but they cannot     Phone:   (215) 431-9327
kill a god."                               UseNet:  cbmvax!valentin@uunet.uu.net
             - Ancient Chinese Proverb     Claimer: I not Commodore spokesman be

Martin@pnt.CAM.ORG (Martin Taillefer) (05/31/90)

>The 68040 does not have a floating point coprocessor, nor the coprocessor
>interface of the 68020 & 68030. But it implements the most used instructions
>in hardware, and lets a software emulate the remaining less used instructions.
>The 25MHz 68040 will therefore achieve a superior performance than a 33MHz
>68030/68882. Now that makes sense.
>
>Valentin

So does 2.0 provide the necessary software to support 68040 transcendental
instructions transparently? Or will 040 board vendors (of which C= will surely
be one) have to provide some patch programs to replace the math lib entry points
for the trans functions and provide FLINE trap handlers for those programs
using the FPU directly?

--
-------------------------------------------------------------
Martin Taillefer   INTERNET: martin@pnt.CAM.ORG   BIX: vertex
UUCP: uunet!philmtl!altitude!pnt!martin     TEL: 514/640-5734

daveh@cbmvax.commodore.com (Dave Haynie) (05/31/90)

In article <11996@cbmvax.commodore.com> valentin@cbmvax (Valentin Pepelea) writes:
>In article <1181@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J.
>Hodgson) writes:

>> Since the FFP instructions trap out thru the FLINE vector anyway (if there's
>> no coprocessor present) why don't we EMULATE a 6888x when the traps occur?

>Too much time would be spent decoding and emulating the coprocessor
>instructions. And what if someone plugs in a Weitek math coprocessor? 

Kind of a moot point anyway, since Weitek cancelled the 68030 bus version of
their FPU.  Of course, you'd get better performance out of the floating point
on various AT&T DSPs, the 96002, or these new really fast FPUs from BIT -
if there were a reasonable way to harness this speed in a general way.  Right
now there isn't, but read on...

>The correct solution is to provide a shared library of math functions, and
>that's what we do. Those functions automatically take advantage what ever
>hardware the user has, thus the programmer does not have to worry about
>the configuration of the platform on which his software is about to run.

That's the best general solution, but still not perfect.  You get programs
written for the libraries when speed is not a major issue.  When it is, most
programs currently come in a separate version with direct FPU code.  Even 
the library interface is too slow to do your floating point multiplies in 
the inner loop of a ray trace or something like that.  Same as on the Intel
based systems -- you can get some MS-DOS programs in generic, '387, or
Weitek flavors.

What would really help all of this is a standardized library of useful higher
level math functions.  You're not going to suffer a library call for an
instruction that takes only 20 or 30 clocks if you're worried about speed,
but you might be willing to take a library call to get a routine that takes
a couple thousand clocks to run, removes lots of the work you're trying to
do, works close to theoretical limits on 68000 or 68030/68882, and would get
you going even faster with a faster math engine sitting around.  This is
what I call retargetable mathematics, directly analogous to the type of 
problems folks want to solve with graphics libraries.  Folks complain about
the speed of something like WritePixel() just as they do about the basic
IEEE library's multiply routine.  But you rarely hear complaints about the
higher level graphics functions -- they do enough work for you that you use
them, and they go fast enough.  We really need something to do mathematics
at a high enough level to make faster math coprocessors viable without a
proliferation of math-coprocessor-specific versions of programs running around.
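The economics behind "retargetable mathematics" are easy to see in miniature: one call, N elements of work. A sketch of the kind of entry point such a library might export (the name and signature are invented for illustration); the loop body is the only thing a vendor would swap out for 68882, DSP, or Weitek code, with no caller ever changing:

```c
#include <stddef.h>

/* One library call does O(n) work, so the fixed call overhead is
 * amortized away -- unlike a per-multiply library call in an inner
 * ray-trace loop. */
static double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

static double dot_demo(void)
{
    double a[] = { 1.0, 2.0, 3.0 };
    double b[] = { 4.0, 5.0, 6.0 };
    return dot_product(a, b, 3);    /* 4 + 10 + 18 */
}
```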

-- 
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
	"I have been given the freedom to do as I see fit" -REM

daveh@cbmvax.commodore.com (Dave Haynie) (05/31/90)

In article <1990May31.025529.28370@watdragon.waterloo.edu> gpsteffler@tiger.uwaterloo.ca (Glenn Steffler) writes:
>In article <1181@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J. Hodgson) writes:

>Attention Hazy:

>Does the Amiga handle faults, and other hardware interrupts faster than
>a generic PC?  I know the 80386 is very slow when interrupts start
>wailing away in protect mode.

As in hardware interrupts, I assume you're getting at exceptions here, such
as F-line and A-line exceptions.  In other words, traps for unimplemented
instructions.  Exceptions don't take a great deal of overhead on their
own; they basically store enough context on the stack to get back to where
they occurred, then call a routine at their assigned vector.  But they're
certainly more expensive than a JSR/RTS, or even the JSR/JMP/RTS you get
with Amiga libraries.  And the expense of the exception is just the
beginning.  Once you're in the exception handler, the nature of the emulated
instruction must be determined: instruction type, operands, addressing modes.
With all this computed, and effective addresses calculated and all, the actual
routine for the instruction emulation must be called.  That last part is just
about all that's required for a library call.  So, while I have no idea how
bad '386 exceptions are, you can rest safe in the knowledge that in most cases,
instruction emulation via traps is not the best way to go on a 680x0 if you're
interested in speed.
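To put some flesh on "instruction type, operands, addressing modes": for a 6888x "general" instruction, the emulator must at minimum crack the second (command) word into fields before it can do any arithmetic. The field layout and opmode values below are from memory and should be checked against the MC68881 manual; the extraction code itself is just illustrative C:

```c
#include <stdint.h>

/* Command-word fields of a 68881 "general" instruction, as I recall
 * the layout (verify against the MC68881 User's Manual):
 *   bits 12-10 : source specifier (FP register, or data format)
 *   bits  9-7  : destination FP register
 *   bits  6-0  : opmode (e.g. FMOVE=$00, FDIV=$20, FADD=$22, FMUL=$23)
 */
static int src_spec(uint16_t cmd) { return (cmd >> 10) & 0x7; }
static int dst_reg(uint16_t cmd)  { return (cmd >> 7)  & 0x7; }
static int opmode(uint16_t cmd)   { return cmd & 0x7F; }

/* Every one of these extractions -- plus effective-address
 * calculation, operand fetch, and the final dispatch -- happens
 * before the emulator computes anything; the dispatch alone costs
 * roughly what an entire library call does. */
```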


djh@dragon.metaphor.com (Dallas J. Hodgson) (06/01/90)

Hmm, the Sun 3/50's used to trap out thru Unix for their FFP instructions if
no coprocessor was present, and Unix has a much higher overhead than
AmigaDOS.  Likewise, PC's often do the same.  If you're doing enough FP that
exception vector overhead is an issue, then you sound like a candidate for
buying a math coprocessor.  For businesses, this is certainly the case.
For hobbyists, well, you know we're cheap...

The "shared library" support is not effective the way it stands right now,
because only 1 of the libraries RECOGNIZES the FFP. And it's Not the popular
one! As long as developers keep on using Motorola FFP as their standard,
(and they will unless precision requirements demand otherwise) we're gonna
keep having a lot of useless 68881/2/040 hardware lying around.

The only software I have that makes use of it is Turbo-Silver 3.0 SV. Impulse
solved their problem by supplying two different executables of this product.
What if Dpaint used the chip for Perspective? What if Design-3D or
Modeler-3D did also? Too bad they don't. Now that Design-3D & Modeler-3D are
no longer supported, it looks like they never will.

riley@batcomputer.tn.cornell.edu (Daniel S. Riley) (06/01/90)

In article <1183@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J. Hodgson) writes:
>The "shared library" support is not effective the way it stands right now,
>because only 1 of the libraries RECOGNIZES the FFP. And it's Not the popular
>one! As long as developers keep on using Motorola FFP as their standard,
>(and they will unless precision requirements demand otherwise) we're gonna
>keep having a lot of useless 68881/2/040 hardware lying around.

Doesn't 2.0 include a single precision IEEE library for exactly this
reason?

-Dan Riley (riley@tcgould.tn.cornell.edu, cornell!batcomputer!riley)
-Wilson Lab, Cornell University

valentin@cbmvax.commodore.com (Valentin Pepelea) (06/01/90)

In article <1183@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J.
Hodgson) writes:
>
>Hmm, The Sun 3/50's used to trap out thru Unix for their FFP instructions if
>no comprocessor was present, and Unix has a much higher overhead than
>AmigaDOS.

Unix has a higher overhead than AmigaDOS because it traps on every OS call
(the occasional implementation might differ).  When it comes to emulating F-line
instructions, the overhead incurred by Unix or AmigaDOS would be the same.

Valentin

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (06/05/90)

In article <10341@batcomputer.tn.cornell.edu> riley@tcgould.tn.cornell.edu (Daniel S. Riley) writes:

>In article <1183@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J. Hodgson) writes:
>>As long as developers keep on using Motorola FFP as their standard,
>>(and they will unless precision requirements demand otherwise) we're gonna
>>keep having a lot of useless 68881/2/040 hardware lying around.
>
>Doesn't 2.0 include a single precision IEEE library for exactly this
>reason?

Yes, the IEEESP (single precision) software floating-point libraries
(so I've been told) are being designed with the goal of being as quick
as or quicker than MathFFP, yet with accuracy equivalent to that
of the 68881/68882 (single precision).  Once this goal is achieved (and I
think C-A is getting close), I think MathFFP will be, and should be,
weaned out of the system.  The number of math libraries is then reduced
by two, and we wind up with IEEESP libraries quicker and more accurate than
MathFFP.


Al Aburto
aburto@marlin.nosc.mil

djh@dragon.metaphor.com (Dallas J. Hodgson) (06/06/90)

I should mention something I found recently on a late-model Fish disk; it's
called "mathtrans", and was written by a German programmer as a 68881
replacement for MathTrans.library. Let it be known that even tho' his
library has to translate between IEEE and FFP for Every Call, it STILL
performs 2-7x faster on 68881/2-equipped machines. So much for the "overhead
is too high for this sort of thing to be done" myth. On my A3000, the
average improvement was about 3.4x. Your mileage may vary.
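For reference, the translation such a library pays for on every call is purely mechanical. Motorola FFP packs a normalized 24-bit mantissa in bits 31-8, the sign in bit 7, and an excess-64 exponent in bits 6-0; IEEE single has the sign on top, an excess-127 exponent, and a 23-bit fraction with a hidden leading 1. A sketch of one direction in C, ignoring exponent-range edge cases (my reconstruction from the format definitions, not the Fish-disk author's code):

```c
#include <stdint.h>

/* Convert a Motorola Fast Floating Point single to an IEEE 754
 * single bit pattern.  FFP value = m * 2^(e-64) with m in [0.5, 1.0).
 * Rewriting 0.1xxx * 2^E as 1.xxx * 2^(E-1), the IEEE biased
 * exponent is e - 64 - 1 + 127 = e + 62, and the fraction is the 23
 * mantissa bits below the now-implicit leading 1.  Exponent overflow
 * and underflow are ignored in this sketch. */
static uint32_t ffp_to_ieee(uint32_t ffp)
{
    uint32_t mant = ffp >> 8;           /* 24-bit mantissa, MSB set  */
    if (mant == 0)
        return 0;                       /* FFP zero                  */
    uint32_t sign = (ffp >> 7) & 1;
    uint32_t exp  = (ffp & 0x7F) + 62;  /* rebias excess-64 -> -127  */
    return (sign << 31) | (exp << 23) | (mant & 0x7FFFFF);
}
```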

Benchmark People note : Dhrystone 1.1 runs 5150 on a 25MHz A-3000 using 32-bit
ints, registers enabled, caches (but not bursts) on. 5500 dhrystones using
16-bit ints. These numbers courtesy of Aztec C 5.0.

Now: Run NoFastMem and try again. 5500 dhrystones -> 1200 dhrystones. No
DMA contention going on other than an interlaced 2bitplane WB screen and the
Intuition sprite.

hamish@waikato.ac.nz (06/06/90)

In article <11996@cbmvax.commodore.com>, valentin@cbmvax.commodore.com (Valentin Pepelea) writes:
> In article <1181@metaphor.Metaphor.COM> djh@dragon.metaphor.com (Dallas J.
> Hodgson) writes:
>>
>> Since the FFP instructions trap out thru the FLINE vector anyway (if there's
>> no coprocessor present) why don't we EMULATE a 6888x when the traps occur?
> 
> Too much time would be spent decoding and emulating the coprocessor
> instructions. And what if someone plugs in a Weitek math coprocessor? The
> math coprocessors execute in 30 cycles what would take 3000 cycles otherwise.
> Emulating the instruction in software would take more than that. And what if
> the user has merely a 68000? Only the 68010 and higher processors have
> instruction suspension-completion mechanisms.

So what? We weren't talking about instruction re-start, but rather exception
processing. Any 68000 can emulate an FPU by doing F-line processing for
you. It's the "Oh my god, there's a page fault, quick, grab it from virtual
memory half way through an instruction" that the 68000 can't complete.
Different story altogether. 

> 
> The correct solution is to provide a shared library of math functions, and
> that's what we do. Those functions automatically take advantage what ever
> hardware the user has, thus the programmer does not have to worry about
> the configuration of the platform on which his software is about to run.
> 
> The 68040 does not have a floating point coprocessor, nor the coprocessor
> interface of the 68020 & 68030. But it implements the most used instructions
> in hardware, and lets a software emulate the remaining less used instructions.
> The 25MHz 68040 will therefore achieve a superior performance than a 33MHz
> 68030/68882. Now that makes sense.
> 

The 68040 DOES have a floating point coprocessor. The fact that it's a subset
of the 68882, and shares the same piece of silicon, doesn't matter. It still
has 8 80-bit data registers and 3 control registers, and still looks like an
ordinary 68881/2 to the user (except for the trig stuff, which is emulated via
a trap).

Compare this to the 68030. Are you going to say that it doesn't have an MMU,
even though it's in the same boat? i.e., a subset of the 68851, on the same
piece of silicon.


-- 
==============================================================================
|  Hamish Marson                        |  Internet  hamish@waikato.ac.nz    |
|  Computer Support Person              |  Phone  (071)562889 xt 8181        |
|  Computer Science Department          |  Amiga 3000 for ME!                |
|  University of Waikato                |                                    |
==============================================================================
|Disclaimer:  Anything said in this message is the personal opinion of the   |
|             finger hitting the keyboard & doesn't represent my employers   |
|             opinion in any way. (ie we probably don't agree)               |
==============================================================================

daveh@cbmvax.commodore.com (Dave Haynie) (06/08/90)

In article <664.266d2e99@waikato.ac.nz> hamish@waikato.ac.nz writes:
>In article <11996@cbmvax.commodore.com>, valentin@cbmvax.commodore.com (Valentin Pepelea) writes:

>> The 68040 does not have a floating point coprocessor, nor the coprocessor
>> interface of the 68020 & 68030. 

>The 68040 DOES have a floating point coprocessor. 

You're arguing trivia here, for the most part.  The 68040 doesn't have a 
floating point coprocessor, at least in the traditional sense.  It does that
idea one better by having an internal floating point execution unit which is
a first class processing unit with some of its own buses and everything.  That
makes it much faster than a coprocessor, which is an external device that sits
next to an integer processor and generally relies on that processor's integer
unit to feed it instructions and data via some special protocol.  All of which
makes things slower.

From the programmer's point of view, though, you have the better part of a 
68882 in the 68040, assuming the factor of 10 or so speedup on hardwired
instructions doesn't present a problem.  With an F-line math package to 
cover the missing 68882 codes, you wouldn't know any difference other than
speed.  But in truth, there is no coprocessor.

>Compare this to the 68030. Are you going to say that it doesn't have an MMU,
>even though its in the same boat? ie subset of 68851, on same piece of silicon.

Valentin was saying it didn't have a floating point coprocessor, not that it
didn't have a floating point unit.  The 68851 connected to the 68020 uses the
coprocessor protocol to manage its registers and all, while the 68030 does
this all internally, without dropping through the coprocessor protocol.  So
I would say the 68030 has an MMU, but I wouldn't say the 68030 has an MMU
coprocessor.

Like I said, trivia.
