[comp.sys.hp] 68040 and Floats, is this true?

curt@OCE.ORST.EDU (Curt Vandetta) (06/08/91)

 Hello folks,

 A couple of days ago, I read an article (sorry, I lost it) that someone
 here on the net wrote about their experience with the 68040 upgrade on
 the HP 9000/400t.  I currently have a 68040 upgrade kit sitting on my
 desk waiting for HP-UX 7.05.  Is it true that the Floating Point
 performance suffers as much as the previous post indicated?  I have a
 really uneasy feeling that it is true.  Incorrect posts to the net have
 a habit of being shot down very quickly, and I've yet to see anyone even
 mention this.

 Concerned,
 Curt

tt@euler.jyu.fi (Tapani Tarvainen) (06/09/91)

In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:


> A couple of days ago, I read an article (sorry, I lost it) that someone
> here on the net wrote about their experience with the 68040 upgrade on
> the HP 9000/400t.  I currently have a 68040 upgrade kit sitting on my
> desk waiting for HP-UX 7.05.  Is it true that the Floating Point
> performance suffers as much as the previous post indicated?  I have a
> really uneasy feeling that it is true.


Floating Point performance suffers!?  I'd say the question is how much
it improves ... our experience from the 400t -> 425t upgrade is that
floating-point intensive programs are speeded up by a factor ranging
from around two to almost seven.  My only gripe is that they don't
offer a 33MHz '040 for it, like they do for the 400s (wonder if
the 400s upgrade would work in the 400t?  Or just replacing the
processor and the crystal ... if it works with a 50MHz '030
it just might work with a 50MHz '040, too ... )
--
Tapani Tarvainen    (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)

gritz@seas.gwu.edu (Larry Gritz) (06/10/91)

In article <TT.91Jun9113646@euler.jyu.fi> tt@euler.jyu.fi (Tapani Tarvainen) writes:
>In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:
>
>
>> A couple of days ago, I read an article (sorry, I lost it) that someone
>> here on the net wrote about their experience with the 68040 upgrade on
>> the HP 9000/400t.  I currently have a 68040 upgrade kit sitting on my
>> desk waiting for HP-UX 7.05.  Is it true that the Floating Point
>> performance suffers as much as the previous post indicated?  I have a
>> really uneasy feeling that it is true.
>
>
>Floating Point performance suffers!?  I'd say the question is how much
>it improves ... our experience from the 400t -> 425t upgrade is that
>floating-point intensive programs are speeded up by a factor ranging
>from around two to almost seven.  My only gripe is that they don't
>offer a 33MHz '040 for it, like they do for the 400s (wonder if
>the 400s upgrade would work in the 400t?  Or just replacing the
>processor and the crystal ... if it works with a 50MHz '030
>it just might work with a 50MHz '040, too ... )
>--
>Tapani Tarvainen    (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)

I made the original post.  It turns out that it's not the actual floating
point calculations that take longer, it's that HP's library routines (in
particular, the ones that convert float->ascii such as fprintf) use opcodes
which used to be in hardware (on the 030) but are now emulated in
software (very slowly).

I heard many replies from people who said that NeXT had the same problem
with their 040's, but just fixed the compilers and libraries.  There's
no reason why HP can't do the same.

Also, even if this problem with the libraries is fixed, there ARE some
computations which will take longer on the 040 than on the 030.  Please
note that some TRIG functions which were done in hardware on the 030
could not "fit" on the 040 chip, and are therefore emulated now in software.

If you want more info, feel free to send me email.  If you still have
doubts, I'll be happy to send you a 25 line program which will illustrate
the problem very clearly.

   -- Larry Gritz

-- 
Larry Gritz                                   lg@galileo.usno.navy.mil
US Naval Observatory                          phone: 202-653-1034
Washington, DC 20392-5100                     also: gritz@seas.gwu.edu

robs@hpuamsa.neth.hp.com (Rob Slotemaker CRC) (06/11/91)

Some more information about the performance of systems with a 68040 processor:

  >  - Will his program perform better when using the 7.40 compiler ?
  Maybe.  If he is using a lot of float->integer conversions, yes it
  will really speed up.  If he is using sin/cos/tan, etc. don't expect
  anything from 7.40; you will have only a slight improvement.

  >  - Will his program perform better when using HP-UX 8.0 ?
  Yes.  All emulated instructions are no longer emitted (not completely
  true, but the one case where they are is extremely rare).  This is
  about as fast as code will get.

  >  - Is there a possibility to let the sin/cos functions NOT be emulated
  >    by using an external floating point processor ?  If yes, how should
  >    it be implemented ?
  This has not been considered, since the coprocessor interface which is on
  the 68030 and the 68882 is no longer present on the 040.  Thus, we would
  have to have a physical bus oriented chip, which could get messy.
  Besides, the emulated instructions ARE as fast as a 68882.  It's just
  that the 040 is not running at the clock speed of a 375.  When the 040
  gets up to 50 MHz, then this will become a non-issue.  For now, using
  direct calls instead of emulation gets you about the same performance
  as a 50 MHz 030.

  >  - Do you have any other suggestions to speed it up ?
  Move to 8.0 as quickly as possible. Of course this requires
  recompilation, but this is the best we can do.


Best regards,

Rob Slotemaker, Dutch CRC

tt@euler.jyu.fi (Tapani Tarvainen) (06/11/91)

In article <TT.91Jun9113646@euler.jyu.fi> I wrote:

>In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:

>> A couple of days ago, I read an article (sorry, I lost it) that someone
>> here on the net wrote about their experience with the 68040 upgrade on
>> the HP 9000/400t.  I currently have a 68040 upgrade kit sitting on my
>> desk waiting for HP-UX 7.05.  Is it true that the Floating Point
>> performance suffers as much as the previous post indicated?  I have a
>> really uneasy feeling that it is true.

>Floating Point performance suffers!?  I'd say the question is how much
>it improves ... our experience from the 400t -> 425t upgrade is that
>floating-point intensive programs are speeded up by a factor ranging
>from around two to almost seven.

The original article referred to above arrived here today, and I must
report that I got similar results: the '040 IS much slower with
certain operations.  In particular, *printf()ing floating point
numbers is sloooow.

I dug out HP-UX 7.05 Release Notes, which gives a list of operations
the '040 can't do and which are therefore emulated in software.
I've copied the relevant part here.
(I guess this is technically copyrighted material, but I feel this is
a justified copyright-slaughter if there ever was one.)

! Because there was not enough space on the chip, some instructions were
! chosen to be emulated in software.  That is, instead of having the
! instruction interpreted by the hardware directly, a software trap is taken
! into the kernel, and software in the kernel does the requested operation.
! Because they are done in software, the algorithms used may be slightly
! different than the algorithms that would have been used on the 68882.
! Thus, there are differences in the results of the same instruction on the
! 68882 and 68040.
! 
! Differing results are typically measured in "Units in the Last Place" (ULPs),
! which indicates the distance between the true mantissa and the one
! calculated.  For example, if the real mantissa is 0x4572 and the
! calculated mantissa is 0x456E, the difference is 4 ULPs.
! 
! The MC68882 documentation states that "in general, the worst-case accuracy
! of any transcendental function is one unit in the last place of double
! precision."  The software that emulates these instructions is designed to
! give the same accuracy. This means that, on average, the double precision
! representation should be within one ULP of the true value. This does not
! mean that the 68882 and the 68040 give identical results, only that they
! both should be close to the desired value.
! 
! Emulated Instructions
! ---------------------
! The instructions which are emulated in software are given below.
! Instructions marked with a (*) return exact results, the others are within
! one ULP in double precision.
! 
! 	Instr.	 Description		HP-UX Usage
! 	-------------------------------------------------------------
! 	Trig Functions
! 	 fcos	 Cosine			libm, inline Fortran/C
! 	 facos	 Arc Cosine		libm, inline Fortran/C
! 	 fsincos Sine and Cosine
! 	 ftan	 Tangent		libm, inline Fortran/C
! 	 fsin	 Sine			libm, inline Fortran/C
! 	 fasin	 Arc Sine		libm, inline Fortran/C
! 	 fatan	 Arc Tangent		libm, inline Fortran/C
! 
! 	Hyperbolic Functions
! 	 fsinh	 Hyperbolic Sine	libm, inline Fortran/C
! 	 fcosh	 Hyperbolic Cosine	libm, inline Fortran/C
! 	 ftanh	 Hyperbolic Tangent	libm, inline Fortran/C
! 	 fatanh	 Arc Hyper Tangent
! 
! 	Exponential Functions
! 	 flog2	 Log base 2
! 	 flog10	 Log base 10		libm, inline Fortran/C
! 	 flogn	 Log base e		libm, inline Fortran/C
! 	 flognp1 Log base e of (x+1)
! 	 ftwotox 2 to the x
! 	 ftentox 10 to the x
! 	 fetox	 e to the x		libm, inline Fortran/C
! 	 fetoxm1 (e to the x) - 1
! 
! 	Utility Functions
! 	 fint	 Integer Part (*)	Fortran Library
! 	 fintrz	 Same, Round Zero (*)	All Compiled Code using floats
! 	 fgetexp Get Exponent (*)
! 	 fgetman Get Mantissa (*)
! 	 frem	 IEEE Remainder
! 	 fscale	 Scale Exponent
! 	 fmod	 Modulo Remainder	Fortran Library
! 
! 
! Unsupported Data Types
! ----------------------
! Besides the emulated instructions discussed above, the MC68040 does not
! have support for any kind of denormalized numbers on the chip.  This
! includes denormalized single and double precision numbers, as well as the
! less common denormalized extended precision. In order to handle these
! types, a software trap is taken into the kernel when these data types are
! encountered.
! 
! A denormalized number is a smaller number than could normally be
! represented.  These are included to extend the range around zero.  Since
! they are a minority, and since the data type handler can do exactly what the
! 68882 can do (that is, answers between the two chips should be the same),
! this should not cause any problems for most users.  Because of the trap
! and emulate, dealing with denormalized numbers will be much slower than
! dealing with normalized numbers.
! 
! Another data type which is not supported is packed decimal.  Packed
! decimal is used to convert from binary floating point formats to the usual
! decimal form.  This type is used by scanf() and printf() to input and
! output floating point numbers.  Since the emulator uses the same algorithm
! that the 68882 used, the two chips should give the same result.

Some comments: Cursory testing suggests that for the most part the
emulation is quite effective.  In particular, trigs and logs appear
significantly faster on the 040 even though it's emulating them in
software.

The critical thing in the present case is, I think, revealed in
the last paragraph I quoted above: packed decimal support.

HP: PLEASE do something about this.  If you can't speed up
the packed decimal support emulation then try to rewrite
*printf() and *scanf() without them.
--
Tapani Tarvainen    (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)

hardy@golem.ps.uci.edu (Meinhard E. Mayer (Hardy)) (06/11/91)

As I said in a previous post connected to this, NeXT-Mach 2.1 seems to
have solved this problem (which may have come from Motorola?).  One
should keep the pressure on HP to emulate that solution too.

Greetings,
Hardy 
			  -------****-------
Meinhard E. Mayer (Hardy);  Department of Physics, University of California
Irvine CA 92717; (714) 856 5543; hardy@golem.ps.uci.edu or MMAYER@UCI.BITNET

irf@kuling.UUCP (Bo Thide') (06/14/91)

In order to see how well the 68040 performs in general, and on sprintf()
in particular, I ran the "C Cost" benchmark on the HP9000/425t (68040/25
MHz), HP9000/400t (68030/50 MHz) and the Sun SparcStation 1.  The
results are presented below.  As is seen, the HP-UX 7.05 floating-point
sprintf() on the 68040 is a factor of two *slower* than on the 68030 but
a factor of two *faster* than in SunOS 4.1 on the Sun SparcStation (lower
numbers = faster).

The "C Cost" program is taken from an article titled "An Elementary C
Cost Model" written by Jon Bentley, Brian Kernighan, and Chris Van Wyk
contained within the Volume 9 Number 2 issue of "Unix Review", February
1991.

RESULTS:

-------------------------------------------------------------
Operation                         Mics/N  Mics/N  Mics/N
                                  HP425t  HP400t  Sun Sparc-
                                  (68040) (68030) Station 1
Null Loop (n=1000000)           
 {}                                 0.00    0.43    0.18
Int Operations (n=1000000)              
 i1++                               0.16    0.18    0.34
 i1 = i2                            0.16    0.19    0.35
 i1 = i2 + i3                       0.24    0.35    0.30
 i1 = i2 - i3                       0.24    0.35    0.30
 i1 = i2 * i3                       0.36    1.21    0.30
 i1 = i2 / i3                       2.02    2.11    0.31
 i1 = i2 % i3                       2.02    2.12    0.30
Float Operations (n=1000000)           
 f1 = f2                            0.24    0.19    0.42
 f1 = f2 + f3                       0.40    2.68    0.43
 f1 = f2 - f3                       0.40    2.68    0.42
 f1 = f2 * f3                       0.48    3.29    0.43
 f1 = f2 / f3                       1.78    3.70    0.42
Numeric Conversions (n=1000000)         
 i1 = f1                            1.83    4.92    0.49
 f1 = i1                            0.49    1.92    0.79
Integer Vector Operations (n=1000000)
 v[i] = i                           0.41    0.39    0.38
 v[v[i]] = i                        0.59    0.71    0.62
 v[v[v[i]]] = i                     0.73    0.83    0.82
Control Structures (n=1000000)          
 if (i == 5) i1++                   0.28    0.27    0.12
 if (i != 5) i1++                   0.38    0.39    0.67
 while (i < 0) i1++                 0.32    0.18    0.12
 i1 = sum1(i2)                      0.20    0.82    0.60
 i1 = sum2(i2, i3)                  0.28    1.11    0.67
 i1 = sum3(i2, i3, i4)              0.32    1.42    0.84
Input/Output (n=10000)          
 fputs(s,fp)                       10.00   15.57   15.42
 fgets(s,9,fp)                     11.20   20.37   11.42
 fprintf(fp,sdn,i)                 28.40   48.37   65.82
 fscanf(fp,sd,&i1)                 47.60   80.77   89.42
Malloc (n=20000)                
 free(malloc(8))                    7.60   19.57   28.82
 push(i)                            6.20   15.77   14.02
 i1 = pop()                         0.60    1.97    2.22 
String Functions (n=100000)             
 strcpy(s,s0123456789)              2.08    3.97    5.06
 i1 = strcmp(s,s)                   3.52    4.93    6.14
 i1 = strcmp(s,sa123456789)         1.16    1.49    3.42
String/Number Conversions (n=10000)
 i1 = atoi(s12345)                  5.60    8.37    7.02
 sscanf(s12345,sd,&i1)             48.40   81.57   97.02
 sprintf(s,sd,i)                   23.20   40.77   63.02
 f1 = atof(s123_45)                81.20   56.37  558.62
 sscanf(s123_45,sf,&f1)           148.80  146.37  478.62
 sprintf(s,sf62,123.45)           250.40  127.57  519.02
Math Functions (n=20000)                
 i1 = rand()                        1.60    2.17    6.22
 f1 = log(f2)                      33.20   25.37   13.02
 f1 = exp(f2)                      26.40   19.77   16.42
 f1 = sin(f2)                      24.20   20.37   19.82
 f1 = sqrt(f2)                      4.60   12.57   26.82




----------------------------------------------------------------------

I've cross-posted to comp.benchmarks for possible comments.

Bo

---

   ^   Bo Thide'--------------------------------------------------------------
  |I|       Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|  Phone: (+46) 18-303671.  Telex: 76036 (IRFUPP S).  Fax: (+46) 18-403100 
 /|F|\        INTERNET: bt@irfu.se       UUCP: ...!uunet!sunic!irfu!bt
 ~~U~~ -----------------------------------------------------------------sm5dfw

tim@proton.amd.com (Tim Olson) (06/15/91)

In article <2080@kuling.UUCP> bt@irfu.se (Bo Thide') writes:
| In order to see how well the 68040 performs in general, and on sprintf()
| in particular, I ran the "C Cost" benchmark on the HP9000/425t (68040/25
| MHz), HP9000/400t (68030/50 MHz) and the Sun SparcStation 1.  The
| results are presented below.  As is seen, the HP-UX 7.05 sprintf() on
| the 68040 is a factor of 2 *slower* than on the 68030 but a factor of
| two *faster* than in SunOS4.1 on the Sun SparcStation (lower numbers =
| faster).

I don't trust any of these numbers, as they appear highly suspect for
even the simple operations:

| RESULTS:
| 
| -------------------------------------------------------------
| Operation                         Mics/N  Mics/N  Mics/N
|                                   HP425t  HP400t  Sun Sparc-
|                                   (68040) (68030) Station 1
| Null Loop (n=1000000)           
|  {}                                 0.00    0.43    0.18
				      ^^^^
				      It appears that the HP compiler
				      removed the null loop through
				      dead code elimination.	

| Int Operations (n=1000000)              
|  i1++                               0.16    0.18    0.34
|  i1 = i2                            0.16    0.19    0.35
|  i1 = i2 + i3                       0.24    0.35    0.30
|  i1 = i2 - i3                       0.24    0.35    0.30
|  i1 = i2 * i3                       0.36    1.21    0.30
|  i1 = i2 / i3                       2.02    2.11    0.31
|  i1 = i2 % i3                       2.02    2.12    0.30

Are these register or memory operations (they would appear to be
memory to memory by the times listed)?  Note that the SparcStation
times for multiply and divide are the same as those for the simple
operations, even though it has no hardware MUL or DIV.  In the 68040
column, why does it take the same amount of time to perform an
assignment as it does to increment a variable?  That could only be
if they were register-to-register operations, but then why would it
take 160ns @ 25MHz? Again, these numbers are highly suspect.

| Control Structures (n=1000000)          
|  if (i == 5) i1++                   0.28    0.27    0.12 <-- ??
|  if (i != 5) i1++                   0.38    0.39    0.67 <-- ??
|  while (i < 0) i1++                 0.32    0.18    0.12
|  i1 = sum1(i2)                      0.20    0.82    0.60
|  i1 = sum2(i2, i3)                  0.28    1.11    0.67
|  i1 = sum3(i2, i3, i4)              0.32    1.42    0.84

How can these vary by such a large amount?  They should be equal times.

--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)