irf@kuling.UUCP (Bo Thide') (03/27/91)
Now that the Snakes (HP9000/700 series HP-PA 1.1 RISC workstations) are
let loose, the official HP info has become available. Some of this info
follows.

There are three models: the desktop (114mm*508mm*470mm) 720 (Cobra) and
730 (King Cobra), and the deskside (610mm*220mm*595mm) 750 (Coral).
They come initially with HP-UX 8.01, to be upgraded to HP-UX 8.05 in June.
Later OSF/1 will be available.

Clock:      50 MHz (720) or 66 MHz (730, 750)
Cache:      128 kB instr/256 kB data (720, 730), 256 kB instr/256 kB data (750).
Interfaces: SCSI-II, EISA, LAN, RS-232 (to 460.8 kbaud), HP-HIL, Centronics.
            HP-IB optional (via EISA!).
Monitors:   72 Hz, 19" 1280x1024 8-bit grayscale (GRX) or 8+8 color planes (CRX).
Software:   X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
Languages:  C, C++, Pascal, FORTRAN, ANSI C, Assembler.
            FORTRAN compiler with "+800" option for series 800 compatibility.
            Series 800 binaries run on 700 series.

Performance (with HP-UX 8.05) and comparison with other workstations:

----------------------------------------------------------------------------------
                          SPEC  SPEC  SPEC Khorner-        Linp2P    x11-    Dhry-
                          mark   int    fp   stones  MIPS  MFLOPS    perf stone2.0
----------------------------------------------------------------------------------
HP9000/730,750 G/CRX      72.2  51.0  91.0   143974    76      22   10460   114680
HP9000/720 G/CRX          55.5  39.0  70.2   119213    57      17    8244    87000
IBM 6000/550              54.3  34.5  73.5      n/a    56      23     n/a      n/a
IBM 6000/320              24.6  16.3  32.4    54661  29.5     8.5    1520    45250
DECstation 5000/200PXGT   18.5  19.0  18.5    26456  24.2     3.7    3256    38760
DECstation 3100           11.3  11.8  10.9    15285  14.9     1.6    1702    23470
Sun SPARCstation 2GX      21.0  20.2  21.5    27142  28.5     4.2     n/a    35590
Sun SPARCstation IPC      11.8  12.4  11.4    13329  15.7     1.7     n/a    22830
----------------------------------------------------------------------------------
Linp2P  = Linpack Double precision, 100*100 FORTRAN BLAS, rolled.
x11perf = geometric mean of the x11perf1.2 component tests (excluding 1 and
          500 pixel tests).
Selected x11perf Tests:

---------------------------------------------------------------------------
                                  10 pixel    10*10       TR  create & map
                             Dots    lines    rects     text       subwins
                                                                 (50 kids)
---------------------------------------------------------------------------
HP9000/730,750 G/CRX      1630000   911000   278000   273000          6000
HP9000/720 G/CRX          1260000   874000   272000   245000          4500
DECstation 5000/200PXGT    370000   455000   256000    90900          1750
Sun SPARCstation 2GX       101100   147000    83500    49000          1050
---------------------------------------------------------------------------

Graphics Performance:

----------------------------------------------------
                           2D floating   3D floating
                                    pt            pt
                             vectors/s     vectors/s
                                              (peak)
----------------------------------------------------
HP9000/730,750 G/CRX           1120000       1150000
HP9000/720 G/CRX               1120000       1150000
DECstation 5000/200PXGT         300000        300000
Sun SPARCstation 2GX            450000        240000
----------------------------------------------------

Sequential Disk Access Rates:

------------------------------------------------------------
                                    Read (kB/s) Write (kB/s)
------------------------------------------------------------
HP9000/700, 1*210MByte disk                1120         1140
HP9000/700, 1*420MByte disk                1520         1510
HP9000/700, 2*210MByte disk                2070         1800
HP9000/700, 2*420MByte disk                2460         2140
Sun SPARCstation 2, 207MByte disk           744          794
------------------------------------------------------------

ANSYS SP-3 results (smaller = better):

------------------------------------
                         CPU seconds
------------------------------------
Cray 2                            27
HP9000/730,750 G/CRX              49
DEC VAX9000                       65
HP9000/720 G/CRX                  66
IBM 6000/540                      68
DECstation 5000                  145
IBM 6000/320                     107
Sun SPARCstation 1+              311
Sun SPARCstation 2               225
------------------------------------
HP numbers were measured with series 800 compiler code.
No series 700 specific optimizations used.
irf@kuling.UUCP (Bo Thide') (03/27/91)
Now that the Snakes (HP9000/700 series HP-PA 1.1 RISC workstations) are
let loose, the official HP info has become available. Some of this info
follows.

There are three models: the desktop (114mm*508mm*470mm) 720 (Cobra) and
730 (King Cobra), and the deskside (610mm*220mm*595mm) 750 (Coral).
They come initially with HP-UX 8.01, to be upgraded to HP-UX 8.05 in June.
Later OSF/1 will be available.

Clock:      50 MHz (720) or 66 MHz (730, 750)
Cache:      128 kB instr/256 kB data (720, 730), 256 kB instr/256 kB data (750).
Interfaces: SCSI-II, EISA, LAN, RS-232 (to 460.8 kbaud), HP-HIL, Centronics.
            HP-IB optional (via EISA!).
Monitors:   72 Hz, 19" 1280x1024 8-bit grayscale (GRX) or 8+8 color planes (CRX).
Software:   X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
Languages:  C, C++, Pascal, FORTRAN, ANSI C, Assembler.
            FORTRAN compiler with "+800" option for series 800 compatibility.
            Series 800 binaries run on series 700 machines.

Performance (with HP-UX 8.05) and comparison with other workstations:

----------------------------------------------------------------------------------
                          SPEC  SPEC  SPEC Khorner-        Linp2P    x11-    Dhry-
                          mark   int    fp   stones  MIPS  MFLOPS    perf stone2.0
----------------------------------------------------------------------------------
HP9000/730,750 G/CRX      72.2  51.0  91.0   143974    76    22.9   10460   114680
HP9000/720 G/CRX          55.5  39.0  70.2   119213    57    17.2    8244    87000
IBM 6000/550              54.3  34.5  73.5      n/a    56      23     n/a      n/a
IBM 6000/320              24.6  16.3  32.4    54661  29.5     8.5    1520    45250
Sun SPARCstation 2GX      21.0  20.2  21.5    27142  28.5     4.2     n/a    35590
DECstation 5000/200PXGT   18.5  19.0  18.5    26456  24.2     3.7    3256    38760
DECstation 3100           11.3  11.8  10.9    15285  14.9     1.6    1702    23470
Sun SPARCstation IPC      11.8  12.4  11.4    13329  15.7     1.7     n/a    22830
----------------------------------------------------------------------------------
Linp2P  = Linpack Double precision, 100*100 FORTRAN BLAS, rolled.
x11perf = geometric mean of the x11perf1.2 component tests (excluding 1 and
          500 pixel tests).
Selected x11perf Tests:

---------------------------------------------------------------------------
                                  10 pixel    10*10       TR  create & map
                             Dots    lines    rects     text       subwins
                                                                 (50 kids)
---------------------------------------------------------------------------
HP9000/730,750 G/CRX      1630000   911000   278000   273000          6000
HP9000/720 G/CRX          1260000   874000   272000   245000          4500
DECstation 5000/200PXGT    370000   455000   256000    90900          1750
Sun SPARCstation 2GX       101100   147000    83500    49000          1050
---------------------------------------------------------------------------

Graphics Performance:

----------------------------------------------------
                           2D floating   3D floating
                                    pt            pt
                             vectors/s     vectors/s
                                              (peak)
----------------------------------------------------
HP9000/730,750 G/CRX           1120000       1150000
HP9000/720 G/CRX               1120000       1150000
DECstation 5000/200PXGT         300000        300000
Sun SPARCstation 2GX            450000        240000
----------------------------------------------------

Sequential Disk Access Rates:

------------------------------------------------------------
                                    Read (kB/s) Write (kB/s)
------------------------------------------------------------
HP9000/700, 1*210MByte disk                1120         1140
HP9000/700, 1*420MByte disk                1520         1510
HP9000/700, 2*210MByte disk                2070         1800
HP9000/700, 2*420MByte disk                2460         2140
Sun SPARCstation 2, 207MByte disk           744          794
------------------------------------------------------------

ANSYS SP-3 results (smaller = better):

------------------------------------
                         CPU seconds
------------------------------------
Cray 2                            27
HP9000/730,750 G/CRX              49
DEC VAX9000                       65
HP9000/720 G/CRX                  66
IBM 6000/540                      68
DECstation 5000                  145
IBM 6000/320                     107
Sun SPARCstation 1+              311
Sun SPARCstation 2               225
------------------------------------
HP numbers were measured with series 800 compiler code.
No series 700 specific optimizations used.
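The table footnote says the composite x11perf figure is a geometric mean of the component tests. As a minimal sketch of that calculation (the function name is made up for illustration, and the five "selected" 730/750 rates are used only as sample input; the published composite covers many more tests, so the result here is not expected to match it):

```python
import math

def geometric_mean(rates):
    """Geometric mean via logs, which avoids overflow on large products."""
    if not rates:
        raise ValueError("need at least one rate")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive")
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

# Sample input only: the five selected HP9000/730,750 rates from the post.
selected_730 = [1630000, 911000, 278000, 273000, 6000]
print(round(geometric_mean(selected_730)))
```

A geometric mean keeps one very fast test (dots, say) from swamping the slower ones, which is presumably why it is used for a composite of rates that span several orders of magnitude.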
nazgul@alphalpha.com (Kee Hinckley) (03/27/91)
In article <1998@kuling.UUCP> irf@kuling.DoCS.UU.SE (Bo Thide') writes:
>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
                  ^^^^^^^^^^^^
I don't believe this. 1.2 uses the R5 Intrinsics, and while HP is a
consortium member and the contractor doing the 1.2 work I can't believe
that any of that stuff is stable enough to use. It's not even in beta yet
from OSF. If they are releasing it then it's sure to change before the
official release. (And we won't even talk about bugs.)
--
Alfalfa Software, Inc.          |       Poste:  The EMail for Unix
nazgul@alfalfa.com              |       Send Anything... Anywhere
617/646-7703 (voice/fax)        |       info@alfalfa.com

I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/27/91)
The performance numbers listed in your posting are misleading. The 720 will
*not* achieve 17 Mflops on the double precision Linpack and/or single
precision Linpack benchmarks with the Fortran compilers that are being
shipped with the machines. I have tested a 720 using Jack Dongarra's Fortran
code, and the machine runs at 13.5 Mflops. The 17 Mflop speed is claimed by
HP for compilers that will ship in June, according to the fine print on the
product release notes we've received from HP. Adjust your expectations
accordingly.

The integer performance (MIPS, Dhrystones) is achievable with the current
compilers. The numbers listed for the 720 are in the range I measured during
my tests.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)
One additional note on the performance numbers ...

The benchmarks used for the Mflop numbers fit within the 256 KB data cache
of the 720/730/750 for both single and double precision versions. If your
application does *not* fit within the data cache, and if it is also a 64-bit
floating point arithmetic application, then your performance will fall by a
factor of 2. The official 100x100 Linpack benchmark fits entirely within the
data cache for both single and double precision versions, and both versions
achieved 13.5 Mflops in my testing. However, a 300x300 LU decomposition
benchmark (Jack Dongarra's LU benchmark program testing the effects of loop
unrolling and parallel vector code) had a quite different result: the single
precision version ran at twice the speed of the double precision version.
Neither version fit within the cache with the 300x300 problem (360 KB single
precision, 720 KB double).

It should be noted that the data caches on the Sparcstations and most of the
DEC machines are smaller than even the 100x100 Linpack benchmark (the double
precision version), so that the Mflop numbers for these machines are the
not-in-cache, 64-bit arithmetic results, while the HP700 numbers are for the
in-cache 64-bit arithmetic.

Caveat Emptor! Know Your Benchmarks!

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
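The working-set sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: it counts just the n-by-n matrix (no pivot vectors or right-hand sides), assumes 4-byte single and 8-byte double precision elements, and takes the cache as 256 * 1024 bytes. The function names are illustrative, not from any benchmark code.

```python
CACHE_BYTES = 256 * 1024  # 256 KB data cache quoted for the 720/730/750

def matrix_bytes(n, elem_bytes):
    """Storage for a dense n-by-n matrix, ignoring any workspace vectors."""
    return n * n * elem_bytes

def fits_in_cache(n, elem_bytes, cache_bytes=CACHE_BYTES):
    """True if the matrix alone is no larger than the data cache."""
    return matrix_bytes(n, elem_bytes) <= cache_bytes

for n in (100, 300):
    for name, size in (("single", 4), ("double", 8)):
        total = matrix_bytes(n, size)
        print(f"{n}x{n} {name}: {total} bytes, fits={fits_in_cache(n, size)}")
```

Under these assumptions the 100x100 cases come to 40,000 and 80,000 bytes (both in cache), and the 300x300 cases to 360,000 and 720,000 bytes (both out of cache), which agrees with the figures in the post.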
iyengar@gradient.cis.upenn.edu (Anand Iyengar) (03/28/91)
In article <1998@kuling.UUCP> irf@kuling.DoCS.UU.SE (Bo Thide') writes:
>Now that the Snakes (HP9000/700 series HP-PA 1.1 RISC workstations) are let
>...
>Cache: 128 kB instr/256 kB data (720, 730), 256 kB instr/256 kB data.

Are these external caches (sound too big to be on chip)? How much (if any)
delay does a cache access cost?

Anand.
--
"The nearer your destination, the more you're slip-sliding away..."
iyengar@grad1.cis.upenn.edu
--- Lbh guvax znlor vg'yy ybbx orggre ebg-guvegrrarg? ---
Disclaimer: It's a forgery.
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)
Nope! You are correct. They are *fast* machines even if the application does
not fit in cache ... but not nearly as fast as the published numbers imply.
The 55 MIPS of the 720 versus the 28 MIPS of the Sparc 2 is a real
performance edge. The 6.5 Mflops of the 720 on a 300x300 LU decomposition is
a real performance edge over the 2.6 Mflops of the Sparc 2 on the same test
... it's a factor of roughly 2.5. However, the *published* numbers being
spread about are 17 Mflops for the 720 vs 4.2 Mflops for the Sparc 2 ...
which is a factor of 4 performance edge, one that is only achievable with
compilers that are not shipping for another several months and only for
smaller data sets.

A 256 KB data cache is sufficient for many tasks (not any of ours,
unfortunately -- geophysics applications tend to consider 500x500 systems of
equations as *small*, 1000x1000 as moderate, and 5000x5000 as
what-you-really-want-to-do-for-your-thesis ;-0 ). It is critical, however,
for people to understand the conditions of a benchmark run. Because most of
the benchmarks HP quotes run in-cache on the 700 series, they tend to
represent best-case results. Because most of the benchmarks do *not* run
in-cache on the DEC and Sun machines, the results tend to be closer to the
achievable performance levels for a wider range of problems -- both large
*and* small applications run at a mere 4.2 Mflops on the Sparc 2, but only
small applications run at 13.5 Mflops on the 720 ... the large ones run at
6.5 Mflops (unless it's single precision, 32-bit, in which case it still
runs at 13.5).

The key to this all is to *know* YOUR application and to *know* the
benchmark's characteristics and to *know* what compilers and/or tuning was
used. It makes up to a factor of 2 difference in the results.

== Dave
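For anyone who wants to redo the comparison, here is a small sketch of the ratios discussed above. The Mflops values are the ones quoted in this thread; the dictionary keys and the helper name are just labels chosen for illustration.

```python
# Mflops figures quoted in the thread (720 vs Sparc 2).
mflops = {
    "720_published": 17.0,       # HP figure, compilers due in June
    "720_in_cache": 13.5,        # measured, 100x100 Linpack
    "720_out_of_cache": 6.5,     # measured, 300x300 LU, double precision
    "sparc2_published": 4.2,
    "sparc2_out_of_cache": 2.6,  # 300x300 LU
}

def speedup(fast, slow):
    """How many times faster 'fast' is than 'slow'."""
    return fast / slow

print(round(speedup(mflops["720_published"], mflops["sparc2_published"]), 2))
print(round(speedup(mflops["720_in_cache"], mflops["sparc2_published"]), 2))
print(round(speedup(mflops["720_out_of_cache"], mflops["sparc2_out_of_cache"]), 2))
```

The three ratios come out near 4, a little over 3, and 2.5 respectively, so which pair of numbers is compared changes the apparent edge considerably.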
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (03/28/91)
On cache-dominated machines, the Linpack benchmarks (100x100 or 300x300) are
not very good tests of realistic performance with any current compilers I am
aware of. This is because many large problems, including 2D FFT's and large
matrix multiplies, can be re-written using "strip mining" to maximize cache
hits and re-use of data. On an IBM R6000/730, you get only 2 megaflops for a
1000x1000 compiled matrix multiply, but with a simple modification of the
loop to maximize cache hits, you get about 40 megaflops (and this is on a
25 MHz machine). Compilers aren't yet smart enough to do strip mining.

By the way, for the IBM 6000 series, I have been able to learn this sort of
stuff about how to get performance out of the thing. On my DN10000, I have
to use the vector library and BLAS library to get any performance; with the
machine-coded matrix multiply, I get about 25 megaflops, 1 processor. With
Fortran, the results go down to about 2-3 megaflops. I have no idea how to
get performance from inside Fortran, despite having had the machine for
almost two years now. You'd almost think HP/A considered performance tuning
techniques a closely guarded secret!
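To make the "strip mining" idea concrete, here is a minimal sketch of a blocked (tiled) matrix multiply next to the plain triple loop. It is written in Python purely to show the loop restructuring; it is not the poster's actual modification, the function names and the default block size of 32 are assumptions, and an interpreter will not reproduce the cache speedups described above (those need compiled Fortran or C).

```python
def matmul_naive(a, b):
    """Plain triple loop: walks a whole column of b for every output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b, block=32):
    """Same product, computed tile by tile so each tile is reused while resident."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work on one block-sized tile of a, b and c at a time.
                for i in range(ii, min(ii + block, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(a, b) == matmul_blocked(a, b, block=2))
```

In a compiled language the block size would be chosen so that the three tiles in use fit in the data cache together; the arithmetic is unchanged, only the order in which elements are visited.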
krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)
The HP9000 series 700 CPU is a 3-chip set implementing the PA-RISC
architecture (this is the same RISC architecture used by HP in the 800
series and in their RISC minicomputer lines). HP did *not* attempt to put
the entire CPU on a single chip for a number of reasons:

1) Power dissipation -- cramming that much circuitry running at that speed
   onto a single chip would melt the chip. As it is, each of the 3 chips is
   mounted under a massive heat sink.

2) Cache size. One of the things which has held back the Intel i860 as a CPU
   chip is the limited size of the i-cache and the d-cache.

In addition, on-chip caches make it difficult to implement parallel
processors due to the difficulty of maintaining cache-coherency among the
on-chip caches. Not that I'm aware of any HP parallel processor plans ...
I've just watched the problems that Alliant has had in getting their FX2800
(28 i860's running in a shared-memory parallel processor) to run at its
maximum potential speed.

To answer your question directly, there is no on-chip cache ... it's all
external, but there is no penalty since the system was designed as a
multi-chip CPU. As with all CPU's, accessing data in a register (the
ultimate on-chip cache) is always faster than accessing data in the memory,
cache or otherwise, because you can eliminate a load/store instruction. The
new LAPACK linear algebra libraries are explicitly designed around this
principle and run a *lot* faster than LINPACK/EISPACK.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
steve-t@hpfcso.FC.HP.COM (Steve Taylor) (03/31/91)
/comp.sys.apollo/ krowitz@RICHTER.MIT.EDU (David Krowitz) //
| In addition, on-chip caches make it difficult to implement parallel
| processors due to the difficulty of maintaining cache-coherency among
| the on-chip caches. Not that I'm aware of any HP parallel processor plans

They're not workstations, but see page 11 of the January 1991 issue of
_Workstation_ magazine (Vol. 7, No. 1) about the 9000 870/x00 models. Then
there's the 3000 980/200 (PA-RISC, but running MPE).

Regards, Steve Taylor

NOT A STATEMENT, OFFICIAL OR OTHERWISE, OF THE HEWLETT-PACKARD COMPANY.
burdick@hpspdra.HP.COM (Matt Burdick) (04/02/91)
>>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
>                  ^^^^^^^^^^^^
>I don't believe this. 1.2 uses the R5 Intrinsics, and while HP is a
>consortium member and the contractor doing the 1.2 work I can't believe
>that any of that stuff is stable enough to use.

You're right - Jim Byers posted this to comp.sys.hp a few days ago:

>>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
>                  ^^^^^^^^^^^^^^^^^^^^^^
>The Series 700s will have Motif 1.1. We will not have delivered 1.2
>into OSF's hands in that time frame.
>
>Jim Byers
>Interface Technology Operation
>Hewlett Packard

-matt
--
Matt Burdick               | Hewlett-Packard
burdick@hpspd.spd.hp.com   | Intelligent Networks Operation