curt@OCE.ORST.EDU (Curt Vandetta) (06/08/91)
Hello folks,

A couple of days ago I read an article (sorry, I lost it) that someone here on the net wrote about their experience with the 68040 upgrade on the HP 9000/400t. I currently have a 68040 upgrade kit sitting on my desk waiting for HP-UX 7.05. Is it true that floating-point performance suffers as much as the previous post indicated? I have a really uneasy feeling that it is true. Incorrect posts to the net have a habit of being shot down very quickly, and I've yet to see anyone even mention this.

Concerned,
Curt
tt@euler.jyu.fi (Tapani Tarvainen) (06/09/91)
In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:

> A couple of days ago, I read an article (Sorry I lost it) that someone
> here on the net wrote about their experience with the 68040 upgrade on
> the HP 9000/400t. I currently have a 68040 upgrade kit sitting on my
> desk waiting for HP-UX 7.05. Is it true that the Floating Point
> performance suffers as much as the previous post indicated? I have a
> really uneasy feeling that it is true.

Floating Point performance suffers!? I'd say the question is how much it improves ... our experience from the 400t -> 425t upgrade is that floating-point intensive programs are sped up by a factor ranging from around two to almost seven. My only gripe is that they don't offer a 33MHz '040 for it, like they do for the 400s (wonder if the 400s upgrade would work in the 400t? Or just replacing the processor and the crystal ... if it works with a 50MHz '030 it just might work with a 50MHz '040, too ... )

--
Tapani Tarvainen (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)
gritz@seas.gwu.edu (Larry Gritz) (06/10/91)
In article <TT.91Jun9113646@euler.jyu.fi> tt@euler.jyu.fi (Tapani Tarvainen) writes:
>In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:
>
>> A couple of days ago, I read an article (Sorry I lost it) that someone
>> here on the net wrote about their experience with the 68040 upgrade on
>> the HP 9000/400t. I currently have a 68040 upgrade kit sitting on my
>> desk waiting for HP-UX 7.05. Is it true that the Floating Point
>> performance suffers as much as the previous post indicated? I have a
>> really uneasy feeling that it is true.
>
>Floating Point performance suffers!? I'd say the question is how much
>it improves ... our experience from the 400t -> 425t upgrade is that
>floating-point intensive programs are sped up by a factor ranging
>from around two to almost seven. My only gripe is that they don't
>offer a 33MHz '040 for it, like they do for the 400s (wonder if
>the 400s upgrade would work in the 400t? Or just replacing the
>processor and the crystal ... if it works with a 50MHz '030
>it just might work with a 50MHz '040, too ... )
>--
>Tapani Tarvainen (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)

I made the original post. It turns out that it's not the actual floating point calculations that take longer; it's that HP's library routines (in particular, the ones that convert float->ASCII, such as fprintf) use opcodes which used to be executed in hardware (on the 030 with its 68882 coprocessor) but are now emulated in software (very slowly).

I received many replies from people who said that NeXT had the same problem with their 040's, but just fixed the compilers and libraries. There's no reason why HP can't do the same.

Also, even if this problem with the libraries is fixed, there ARE some computations which will take longer on the 040 than on the 030. Please note that some TRIG functions which were done in hardware on the 030 systems could not "fit" on the 040 chip, and are therefore emulated now in software.
If you want more info, feel free to send me email. If you still have doubts, I'll be happy to send you a 25 line program which will illustrate the problem very clearly.

        -- Larry Gritz

--
Larry Gritz                     lg@galileo.usno.navy.mil
US Naval Observatory            phone: 202-653-1034
Washington, DC 20392-5100       also: gritz@seas.gwu.edu
robs@hpuamsa.neth.hp.com (Rob Slotemaker CRC) (06/11/91)
Some more information about the performance of systems with a 68040 processor:

> - Will his program perform better when using the 7.40 compiler ?

Maybe. If he is using a lot of float->integer conversions, yes, it will really speed up. If he is using sin/cos/tan, etc., don't expect much from 7.40; you will see only a slight improvement.

> - Will his program perform better when using HP-UX 8.0 ?

Yes. Emulated instructions are no longer emitted (not completely true, but the one case where they are is extremely rare). This is about as fast as code will get.

> - Is there a possibility to let the sin/cos functions NOT be emulated
>   by using an external floating point processor ? If yes, how should
>   it be implemented ?

This has not been considered, since the coprocessor interface which is on the 68030 and the 68882 is no longer present on the 040. Thus, we would have to have a physical bus oriented chip, which could get messy. Besides, the emulated instructions ARE as fast as a 68882. It's just that the 040 is not running at the clock speed of a 375. When the 040 gets up to 50 MHz, then this will become a non-issue. For now, using direct calls instead of emulation gets you about the same performance as a 50 MHz 030.

> - Do you have any other suggestions to speed it up ?

Move to 8.0 as quickly as possible. Of course this requires recompilation, but this is the best we can do.

Best regards,
Rob Slotemaker, Dutch CRC
tt@euler.jyu.fi (Tapani Tarvainen) (06/11/91)
In article <TT.91Jun9113646@euler.jyu.fi> I wrote:
>In article <1991Jun07.213219.14174@lynx.CS.ORST.EDU> curt@OCE.ORST.EDU (Curt Vandetta) writes:
>> A couple of days ago, I read an article (Sorry I lost it) that someone
>> here on the net wrote about their experience with the 68040 upgrade on
>> the HP 9000/400t. I currently have a 68040 upgrade kit sitting on my
>> desk waiting for HP-UX 7.05. Is it true that the Floating Point
>> performance suffers as much as the previous post indicated? I have a
>> really uneasy feeling that it is true.

>Floating Point performance suffers!? I'd say the question is how much
>it improves ... our experience from the 400t -> 425t upgrade is that
>floating-point intensive programs are sped up by a factor ranging
>from around two to almost seven.

The original article referred to above arrived here today, and I must report that I got similar results: the '040 IS much slower with certain operations. In particular, *printf()ing floating point numbers is sloooow.

I dug out the HP-UX 7.05 Release Notes, which give a list of operations the '040 can't do and which are therefore emulated in software. I've copied the relevant part here. (I guess this is technically copyrighted material, but I feel this is a justified copyright-slaughter if there ever was one.)

! Because there was not enough space on the chip, some instructions were
! chosen to be emulated in software. That is, instead of having the
! instruction interpreted by the hardware directly, a software trap is taken
! into the kernel, and software in the kernel does the requested operation.
! Because they are done in software, the algorithms used may be slightly
! different than the algorithms that would have been used on the 68882.
! Thus, there are differences in the results of the same instruction on the
! 68882 and 68040.
!
! Differing results are typically measured in "Unit Last Place's" (ULP's),
! which indicates the distance between the true mantissa and the one
! calculated. For example, if the real mantissa is 0x4572 and the
! calculated mantissa is 0x456E, the difference is 4 ULP's.
!
! The MC68882 documentation states that "in general, the worst-case accuracy
! of any transcendental function is one unit in the last place of double
! precision." The software that emulates these instructions is designed to
! give the same accuracy. This means that, on average, the double precision
! representation should be within one ULP of the true value. This does not
! mean that the 68882 and the 68040 give identical results, only that they
! both should be close to the desired value.
!
! Emulated Instructions
! ---------------------
! The instructions which are emulated in software are given below.
! Instructions marked with a (*) return exact results, the others are within
! one ULP in double precision.
!
! Instr.      Description              HP-UX Usage
! -------------------------------------------------------------
! Trig Functions
!   fcos      Cosine                   libm, inline Fortran/C
!   facos     Arc Cosine               libm, inline Fortran/C
!   fsincos   Sine and Cosine
!   ftan      Tangent                  libm, inline Fortran/C
!   fsin      Sine                     libm, inline Fortran/C
!   fasin     Arc Sine                 libm, inline Fortran/C
!   fatan     Arc Tangent              libm, inline Fortran/C
!
! Hyperbolic Functions
!   fsinh     Hyperbolic Sine          libm, inline Fortran/C
!   fcosh     Hyperbolic Cosine        libm, inline Fortran/C
!   ftanh     Hyperbolic Tangent       libm, inline Fortran/C
!   fatanh    Arc Hyper Tangent
!
! Exponential Functions
!   flog2     Log base 2
!   flog10    Log base 10              libm, inline Fortran/C
!   flogn     Log base e               libm, inline Fortran/C
!   flognp1   Log base e of (x+1)
!   ftwotox   2 to the x
!   ftentox   10 to the x
!   fetox     e to the x               libm, inline Fortran/C
!   fetoxm1   (e to the x) - 1
!
! Utility Functions
!   fint      Integer Part (*)         Fortran Library
!   fintrz    Same, Round Zero (*)     All Compiled Code using floats
!   fgetexp   Get Exponent (*)
!   fgetman   Get Mantissa (*)
!   frem      IEEE Remainder
!   fscale    Scale Exponent
!   fmod      Modulo Remainder         Fortran Library
!
!
! Unsupported Data Types
! ----------------------
! Besides the emulated instructions discussed above, the MC68040 does not
! have support for any kind of denormalized numbers on the chip. This
! includes denormalized single and double precision numbers, as well as the
! less common denormalized extended precision. In order to handle these
! types, a software trap is taken into the kernel when these data types are
! encountered.
!
! A denormalized number is a smaller number than could normally be
! represented. These are included to extend the range around zero. Since
! they are a minority, and since the data type handler can do exactly what the
! 68882 can do (that is, answers between the two chips should be the same),
! this should not cause any problems for most users. Because of the trap
! and emulate, dealing with denormalized numbers will be much slower than
! dealing with normalized numbers.
!
! Another data type which is not supported is packed decimal. Packed
! decimal is used to convert from binary floating point formats to the usual
! decimal form. This type is used by scanf() and printf() to input and
! output floating point numbers. Since the emulator uses the same algorithm
! that the 68882 used, the two chips should give the same result.

Some comments:

Cursory testing suggests that for the most part the emulation is quite effective. In particular, trigs and logs appear significantly faster on the 040 even though it's emulating them in software.

The critical thing in the present case is, I think, revealed in the last paragraph I quoted above: packed decimal support.

HP: PLEASE do something about this. If you can't speed up the packed decimal support emulation, then try to rewrite *printf() and *scanf() without it.

--
Tapani Tarvainen (tarvaine@jyu.fi, tarvainen@finjyu.bitnet)
hardy@golem.ps.uci.edu (Meinhard E. Mayer (Hardy)) (06/11/91)
As I said in a previous post connected to this, NeXT-Mach 2.1 seems to have solved this problem (which may have come from Motorola?). One should keep the pressure on HP to emulate that solution too. Greetings, Hardy -------****------- Meinhard E. Mayer (Hardy); Department of Physics, University of California Irvine CA 92717; (714) 856 5543; hardy@golem.ps.uci.edu or MMAYER@UCI.BITNET
irf@kuling.UUCP (Bo Thide') (06/14/91)
In order to see how well the 68040 performs in general, and on sprintf() in particular, I ran the "C Cost" benchmark on the HP9000/425t (68040/25 MHz), HP9000/400t (68030/50 MHz) and the Sun SparcStation 1. The results are presented below. As is seen, the HP-UX 7.05 sprintf() on the 68040 is a factor of 2 *slower* than on the 68030 but a factor of two *faster* than in SunOS 4.1 on the Sun SparcStation (lower numbers = faster).

The "C Cost" program is taken from an article titled "An Elementary C Cost Model" written by Jon Bentley, Brian Kernighan, and Chris Van Wyk contained within the Volume 9 Number 2 issue of "Unix Review", February 1991.

RESULTS:

-------------------------------------------------------------
Operation                      Mics/N   Mics/N   Mics/N
                               HP425t   HP400t   Sun Sparc-
                              (68040)  (68030)   Station 1
Null Loop (n=1000000)
  {}                             0.00     0.43     0.18
Int Operations (n=1000000)
  i1++                           0.16     0.18     0.34
  i1 = i2                        0.16     0.19     0.35
  i1 = i2 + i3                   0.24     0.35     0.30
  i1 = i2 - i3                   0.24     0.35     0.30
  i1 = i2 * i3                   0.36     1.21     0.30
  i1 = i2 / i3                   2.02     2.11     0.31
  i1 = i2 % i3                   2.02     2.12     0.30
Float Operations (n=1000000)
  f1 = f2                        0.24     0.19     0.42
  f1 = f2 + f3                   0.40     2.68     0.43
  f1 = f2 - f3                   0.40     2.68     0.42
  f1 = f2 * f3                   0.48     3.29     0.43
  f1 = f2 / f3                   1.78     3.70     0.42
Numeric Conversions (n=1000000)
  i1 = f1                        1.83     4.92     0.49
  f1 = i1                        0.49     1.92     0.79
Integer Vector Operations (n=1000000)
  v[i] = i                       0.41     0.39     0.38
  v[v[i]] = i                    0.59     0.71     0.62
  v[v[v[i]]] = i                 0.73     0.83     0.82
Control Structures (n=1000000)
  if (i == 5) i1++               0.28     0.27     0.12
  if (i != 5) i1++               0.38     0.39     0.67
  while (i < 0) i1++             0.32     0.18     0.12
  i1 = sum1(i2)                  0.20     0.82     0.60
  i1 = sum2(i2, i3)              0.28     1.11     0.67
  i1 = sum3(i2, i3, i4)          0.32     1.42     0.84
Input/Output (n=10000)
  fputs(s,fp)                   10.00    15.57    15.42
  fgets(s,9,fp)                 11.20    20.37    11.42
  fprintf(fp,sdn,i)             28.40    48.37    65.82
  fscanf(fp,sd,&i1)             47.60    80.77    89.42
Malloc (n=20000)
  free(malloc(8))                7.60    19.57    28.82
  push(i)                        6.20    15.77    14.02
  i1 = pop()                     0.60     1.97     2.22
String Functions (n=100000)
  strcpy(s,s0123456789)          2.08     3.97     5.06
  i1 = strcmp(s,s)               3.52     4.93     6.14
  i1 = strcmp(s,sa123456789)     1.16     1.49     3.42
String/Number Conversions (n=10000)
  i1 = atoi(s12345)              5.60     8.37     7.02
  sscanf(s12345,sd,&i1)         48.40    81.57    97.02
  sprintf(s,sd,i)               23.20    40.77    63.02
  f1 = atof(s123_45)            81.20    56.37   558.62
  sscanf(s123_45,sf,&f1)       148.80   146.37   478.62
  sprintf(s,sf62,123.45)       250.40   127.57   519.02
Math Functions (n=20000)
  i1 = rand()                    1.60     2.17     6.22
  f1 = log(f2)                  33.20    25.37    13.02
  f1 = exp(f2)                  26.40    19.77    16.42
  f1 = sin(f2)                  24.20    20.37    19.82
  f1 = sqrt(f2)                  4.60    12.57    26.82
----------------------------------------------------------------------

I've cross-posted to comp.benchmarks for possible comments.

Bo

---
   ^   Bo Thide'--------------------------------------------------------------
  |I|       Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|  Phone: (+46) 18-303671.  Telex: 76036 (IRFUPP S).  Fax: (+46) 18-403100
 /|F|\        INTERNET: bt@irfu.se       UUCP: ...!uunet!sunic!irfu!bt
 ~~U~~ -----------------------------------------------------------------sm5dfw
tim@proton.amd.com (Tim Olson) (06/15/91)
In article <2080@kuling.UUCP> bt@irfu.se (Bo Thide') writes:
| In order to see how well the 68040 performs in general, and on sprintf()
| in particular, I ran the "C Cost" benchmark on the HP9000/425t (68040/25
| MHz), HP9000/400t (68030/50 MHz) and the Sun SparcStation 1. The
| results are presented below. As is seen, the HP-UX 7.05 sprintf() on
| the 68040 is a factor of 2 *slower* than on the 68030 but a factor of
| two *faster* than in SunOS4.1 on the Sun SparcStation (lower numbers =
| faster).

I don't trust any of these numbers, as they appear highly suspect for even the simple operations:

| RESULTS:
|
| -------------------------------------------------------------
| Operation                      Mics/N   Mics/N   Mics/N
|                                HP425t   HP400t   Sun Sparc-
|                               (68040)  (68030)   Station 1
| Null Loop (n=1000000)
|   {}                             0.00     0.43     0.18
                                   ^^^^

It appears that the HP compiler removed the null loop through dead code elimination.

| Int Operations (n=1000000)
|   i1++                           0.16     0.18     0.34
|   i1 = i2                        0.16     0.19     0.35
|   i1 = i2 + i3                   0.24     0.35     0.30
|   i1 = i2 - i3                   0.24     0.35     0.30
|   i1 = i2 * i3                   0.36     1.21     0.30
|   i1 = i2 / i3                   2.02     2.11     0.31
|   i1 = i2 % i3                   2.02     2.12     0.30

Are these register or memory operations (they would appear to be memory to memory by the times listed)? Note that the SparcStation times for multiply and divide are the same as those for the simple operations, even though it has no hardware MUL or DIV. In the 68040 column, why does it take the same amount of time to perform an assignment as it does to increment a variable? That could only be if they were register-to-register operations, but then why would it take 160ns @ 25MHz? Again, these numbers are highly suspect.

| Control Structures (n=1000000)
|   if (i == 5) i1++               0.28     0.27     0.12  <-- ??
|   if (i != 5) i1++               0.38     0.39     0.67  <-- ??
|   while (i < 0) i1++             0.32     0.18     0.12
|   i1 = sum1(i2)                  0.20     0.82     0.60
|   i1 = sum2(i2, i3)              0.28     1.11     0.67
|   i1 = sum3(i2, i3, i4)          0.32     1.42     0.84

How can these vary by such a large amount? They should be equal times.

--
        -- Tim Olson
        Advanced Micro Devices
        (tim@amd.com)