[comp.sys.apollo] Snakebytes

irf@kuling.UUCP (Bo Thide') (03/27/91)

Now that the Snakes (HP9000/700 series HP-PA 1.1 RISC workstations) are let
loose, the official HP info has become available.  Some of this info follows.

There are three models, the desktop (114mm*508mm*470mm) 720 (Cobra) and
730 (King Cobra) and the deskside (610mm*220mm*595mm) 750 (Coral). They
come initially with HP-UX 8.01 to be upgraded to HP-UX 8.05 in June. Later
OSF/1 will be available.

Clock: 50 MHz (720) or 66 MHz (730, 750)

Cache: 128 kB instr/256 kB data (720, 730), 256 kB instr/256 kB data (750).

Interfaces: SCSI-II, EISA, LAN, RS-232 (to 460.8 kbaud), HP-HIL, Centronics.
            HP-IB optional (via EISA!).

Monitors: 72 Hz, 19" 1280x1024 8-bit grayscale (GRX) or 8+8 color planes (CRX).

Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.

Languages: C, C++, Pascal, FORTRAN, ANSI C, Assembler.  FORTRAN compiler
	   with "+800" option for series 800 compatibility. Series 800
	   binaries run on 700 series.


Performance (with HP-UX 8.05) and comparison with other workstations:
-----------------------------------------------------------------------------
                            SPEC        Khorner-       Linp2P  x11-  Dhry-
                        mark int  fp    stones   MIPS  MFLOPS  perf  stone2.0
-----------------------------------------------------------------------------
HP9000/730,750 G/CRX    72.2 51.0 91.0  143974   76    22      10460  114680
HP9000/720 G/CRX        55.5 39.0 70.2  119213   57    17       8244   87000
IBM 6000/550            54.3 34.5 73.5   n/a     56    23       n/a    n/a
IBM 6000/320            24.6 16.3 32.4   54661   29.5   8.5     1520   45250
DECstation 5000/200PXGT 18.5 19.0 18.5   26456   24.2   3.7     3256   38760
DECstation 3100         11.3 11.8 10.9   15285   14.9   1.6     1702   23470
Sun SPARCstation 2GX    21.0 20.2 21.5   27142   28.5   4.2     n/a    35590
Sun SPARCstation IPC    11.8 12.4 11.4   13329   15.7   1.7     n/a    22830
-----------------------------------------------------------------------------
Linp2P = Linpack Double precision, 100*100 FORTRAN BLAS, rolled.
x11perf = geometric mean of the x11perf1.2 component tests (excluding 1
	  and 500 pixel tests).
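
For readers unfamiliar with how a composite figure like the x11perf column is formed, here is a minimal sketch of a geometric mean over per-test rates. The function name and the sample rates are hypothetical and are not taken from any x11perf run; only the idea of taking a geometric mean of the component tests comes from the note above.

```python
import math


def geometric_mean(rates):
    """Geometric mean of positive rates, computed via logarithms."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be non-empty and strictly positive")
    # exp(mean(log(x))) is the n-th root of the product of the rates.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Hypothetical per-test rates in operations per second (illustration only).
sample_rates = [100.0, 1000.0, 10000.0]
print(round(geometric_mean(sample_rates), 3))
```

Unlike an arithmetic mean, a geometric mean is not dominated by the one or two tests with the largest raw rates, which is presumably why it is used for a composite of tests whose rates differ by orders of magnitude.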


Selected x11perf Tests:
-----------------------------------------------------------------------------
			         10 pixel  10*10   TR      create & map
			Dots     lines     rects   text    subwins (50 kids)
-----------------------------------------------------------------------------
HP9000/730,750 G/CRX    1630000  911000    278000  273000  6000
HP9000/720 G/CRX        1260000  874000    272000  245000  4500
DECstation 5000/200PXGT  370000  455000    256000   90900  1750
Sun SPARCstation 2GX     101100  147000     83500   49000  1050
-----------------------------------------------------------------------------


Graphics Performance:
-----------------------------------------------------------------------------
                          2D floating       3D floating pt
		    	pt vectors/s      vectors/s (peak)
-----------------------------------------------------------------------------
HP9000/730,750 G/CRX      1120000           1150000
HP9000/720 G/CRX          1120000           1150000
DECstation 5000/200PXGT    300000            300000
Sun SPARCstation 2GX       450000            240000
-----------------------------------------------------------------------------


Sequential Disk Access Rates:
-----------------------------------------------------------------------------
                                       Read (kB/s)       Write (kB/s)
-----------------------------------------------------------------------------
HP9000/700, 1*210MByte disk            1120              1140
HP9000/700, 1*420MByte disk            1520              1510
HP9000/700, 2*210MByte disk            2070              1800
HP9000/700, 2*420MByte disk            2460              2140
Sun SPARCstation 2, 207MByte disk       744               794
-----------------------------------------------------------------------------


ANSYS SP-3 results (smaller = better):
-----------------------------------------------------------------------------
                            CPU seconds
-----------------------------------------------------------------------------
Cray 2                       27
HP9000/730,750 G/CRX         49
DEC VAX9000                  65
HP9000/720 G/CRX             66
IBM 6000/540                 68
DECstation 5000             145
IBM 6000/320                107
Sun SPARCstation 1+         311
Sun SPARCstation 2          225
-----------------------------------------------------------------------------
HP numbers were measured with series 800 compiler code. No series 700 
specific optimizations used.

nazgul@alphalpha.com (Kee Hinckley) (03/27/91)

In article <1998@kuling.UUCP> irf@kuling.DoCS.UU.SE (Bo Thide') writes:
>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
		  ^^^^^^^^^^^^
I don't believe this.  1.2 uses the R5 Intrinsics, and while HP is a
consortium member and the contractor doing the 1.2 work I can't believe
that any of that stuff is stable enough to use.  It's not even in beta
yet from OSF.  If they are releasing it then it's sure to change before
the official release.  (And we won't even talk about bugs.)

-- 
Alfalfa Software, Inc.          |       Poste:  The EMail for Unix
nazgul@alfalfa.com              |       Send Anything... Anywhere
617/646-7703 (voice/fax)        |       info@alfalfa.com

I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.

krowitz@RICHTER.MIT.EDU (David Krowitz) (03/27/91)

The performance numbers listed in your posting are misleading.
The 720 will *not* achieve 17 Mflops on the double precision
Linpack and/or single precision Linpack benchmarks with the
Fortran compilers that are being shipped with the machines. I
have tested a 720 using Jack Dongarra's fortran code, and 
the machine runs at 13.5 Mflops. The 17 Mflop speed is claimed
by HP for compilers that will ship in June according to the
fine print on the product release notes we've received from HP.
Adjust your expectations accordingly.

The integer performance (MIPS, Dhrystones) is achievable
with the current compilers. The numbers listed for the 720
are in the range I measured during my tests.


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)

One additional note on the performance numbers ...

The benchmarks used for the Mflop numbers fit within the
256 KB data cache of the 720/730/750 for both single and
double precision versions. If your application does *not*
fit within the data cache, and if it is also a 64-bit
floating point arithmetic application, then your performance
will fall by a factor of 2. The official 100x100 Linpack
benchmark fits entirely within the data cache for both
single and double precision versions; and both versions
achieved 13.5 Mflops in my testing. However a 300x300 LU
decomposition benchmark (Jack Dongarra's LU benchmark
program testing the effects of loop unrolling and
parallel vector code) had a quite different result: the
single precision version ran at twice the speed of the
double precision version. Neither benchmark fit within
the cache with the 300x300 problem (360 KB single precision,
720 KB double).
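
The sizes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes only the N x N matrix counts toward the working set (the real benchmarks also carry a few vectors) and that the data cache is 256 * 1024 bytes; the function names are hypothetical.

```python
CACHE_BYTES = 256 * 1024  # 256 KB data cache quoted for the 720/730/750


def matrix_bytes(n, bytes_per_element):
    """Storage for a dense n x n matrix, in bytes."""
    return n * n * bytes_per_element


def fits_in_cache(n, bytes_per_element, cache_bytes=CACHE_BYTES):
    """True if the matrix alone fits in the data cache (assumption: nothing else competes)."""
    return matrix_bytes(n, bytes_per_element) <= cache_bytes


for n in (100, 300):
    for name, size in (("single", 4), ("double", 8)):
        nbytes = matrix_bytes(n, size)
        print(f"{n}x{n} {name}: {nbytes} bytes, fits: {fits_in_cache(n, size)}")
```

The 300x300 figures come out as 360000 and 720000 bytes, which matches the 360 KB and 720 KB quoted if KB is read as 1000 bytes.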

It should be noted that the data caches on the Sparcstations
and most of the DEC machines are smaller than even the 100x100
Linpack benchmark (the double precision version), so that the
Mflop numbers for these machines are the not-in-cache, 64-bit
arithmetic results; while the HP700 numbers are for the
in-cache 64-bit arithmetic. 

Caveat Emptor! Know Your Benchmarks!


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

iyengar@gradient.cis.upenn.edu (Anand Iyengar) (03/28/91)

In article <1998@kuling.UUCP> irf@kuling.DoCS.UU.SE (Bo Thide') writes:
>Now that the Snakes (HP9000/700 series HP-PA 1.1 RISC workstations) are let
>...
>Cache: 128 kB instr/256 kB data (720, 730), 256 kB instr/256 kB data.
	Are these external caches (sound too big to be on chip)?  How much
(if any) delay does a cache access cost?  

					Anand.  
--
"The nearer your destination, the more you're slip-sliding away..."
iyengar@grad1.cis.upenn.edu
--- Lbh guvax znlor vg'yy ybbx orggre ebg-guvegrrarg? ---
Disclaimer:  It's a forgery.  

krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)

Nope! You are correct. They are *fast* machines even if
the application does not fit in cache ... but not nearly
as fast as the published numbers imply. The 55 MIPS of
the 720 versus the 28 MIPS of the Sparc 2 is a real
performance edge. The 6.5 Mflops of the 720 on a 300x300
LU decomposition is a real performance edge over the
2.6 Mflops of the Sparc 2 on the same test ... it's a
factor of roughly 2.5.

However, the *published* numbers being spread about are
17 Mflops for the 720 vs 4.2 Mflops for the Sparc 2 ...
which is a factor of 4 performance edge which is only
achievable with compilers that are not shipping for
another several months and which is only achievable
for smaller data sets.
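
The two ratios can be worked out directly from the Mflops figures quoted in this thread. A minimal sketch; the variable names are just for illustration.

```python
def speedup(fast_mflops, slow_mflops):
    """Ratio of two Mflops figures."""
    return fast_mflops / slow_mflops


# Measured, out of cache: 300x300 LU decomposition, 720 vs Sparc 2.
measured = speedup(6.5, 2.6)

# Published 100x100 Linpack figures: 720 vs Sparc 2.
published = speedup(17.0, 4.2)

# In-cache vs out-of-cache double precision on the 720 itself.
cache_penalty = speedup(13.5, 6.5)

print(f"measured edge:  {measured:.2f}x")
print(f"published edge: {published:.2f}x")
print(f"cache penalty:  {cache_penalty:.2f}x")
```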

A 256 KB data cache is sufficient for many tasks (not
any of ours, unfortunately -- geophysics applications
tend to consider 500x500 systems of equations as *small*,
1000x1000 as moderate, and 5000x5000 as what-you-really-
want-to-do-for-your-thesis ;-0 ) ). It is critical, however,
for people to understand the conditions of a benchmark
run. Because most of the benchmarks HP quotes run in-cache
on the 700 series, they tend to represent best-case
results. Because most of the benchmarks do *not* run
in-cache on the DEC and Sun machines, the results tend
to be closer to the achievable performance levels for a
wider range of problems -- both large *and* small applications
run at a mere 4.2 Mflops on the Sparc 2, but only small
applications run at 13.5 Mflops on the 720 ... the large
ones run at 6.5 Mflops (unless it's single precision, 32-bit,
in which case it still runs at 13.5).

The key to this all is to *know* YOUR application and to
*know* the benchmark's characteristics and to *know* what
compilers and/or tuning was used. It makes up to a factor
of 2 difference in the results.

== Dave

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (03/28/91)

On cache-dominated machines, the Linpack benchmarks (100x100 or 300x300)
are not very good tests of realistic performance with any current
compilers I am aware of.  This is because many large problems, including
2D FFT's and large matrix multiplies, can be re-written using 
"strip mining" to maximize cache hits and re-use of data. On an IBM
R6000/730, you get only 2 megaflops for a 1000x1000 compiled matrix
multiply, but with a simple modification of the loop to maximize
cache hits, you get about 40 megaflops (and this is on a 25 MHz machine).
Compilers aren't yet smart enough to do stripmining.
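
To make the loop modification concrete, here is a minimal sketch of a strip-mined (blocked) matrix multiply next to the plain triple loop. It is not the code used on the IBM machine; the function names and the block size of 32 are assumptions, and Python is used only to show the loop structure (the cache benefit itself shows up in compiled Fortran or C, not in an interpreter).

```python
def matmul_naive(a, b):
    """Plain triple loop: walks a whole column of b for every element of c."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=32):
    """Same product, but computed tile by tile so that each block x block
    tile of a, b and c is reused many times while it is still in cache."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):            # strip of rows of c
        for kk in range(0, m, block):        # strip of the inner dimension
            for jj in range(0, p, block):    # strip of columns of c
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

The block size would normally be chosen so that three tiles fit in the data cache together; 32 is only a placeholder here.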

By the way, for the IBM 6000 series, I have been able to learn this
sort of stuff about how to get performance out of the thing.  On
my DN10000, I have to use the vector library and BLAS library to
get any performance;  with the machine-coded matrix multiply, 
I get about 25 megaflops, 1 processor.  With Fortran, the results
go down to about 2-3 megaflops.  I have no idea how to get
performance from inside fortran, despite having had the machine
for almost two years now.  You'd almost think HP/A considered 
performance tuning techniques a closely guarded secret!

krowitz@RICHTER.MIT.EDU (David Krowitz) (03/28/91)

The HP9000 series 700 CPU is a 3-chip set implementing the
PA-RISC architecture (this is the same RISC architecture
used by HP in the 800 series and in their RISC minicomputer
lines). HP did *not* attempt to put the entire CPU on a
single chip for a number of reasons:

1) Power dissipation -- cramming that much circuitry running
   at that speed onto a single chip would melt the chip. As
   it is, each of the 3 chips is mounted under a massive
   heat sink.

2) Cache size. One of the things which has held back the Intel
   i860 as a CPU chip is the limited size of the i-cache and
   the d-cache. In addition, on-chip caches make it difficult
   to implement parallel processors due to the difficulty of
   maintaining cache-coherency among the on-chip caches. Not
   that I'm aware of any HP parallel processor plans ... I've
   just watched the problems that Alliant has had in getting
   their FX2800 (28 i860's running in a shared-memory parallel
   processor) to run at its maximum potential speed.

To answer your question directly, there is no on-chip cache ...
it's all external, but there is no penalty since the system
was designed as a multi-chip CPU. As with all CPU's, accessing
data in a register (the ultimate on-chip cache) is always faster
than accessing data in the memory, cache or otherwise, because
you can eliminate a load/store instruction. The new LAPACK linear
algebra libraries are explicitly designed around this principle
and run a *lot* faster than LINPACK/EISPACK.


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

steve-t@hpfcso.FC.HP.COM (Steve Taylor) (03/31/91)

/comp.sys.apollo/ krowitz@RICHTER.MIT.EDU (David Krowitz) //
| In addition, on-chip caches make it difficult to implement parallel
| processors due to the difficulty of maintaining cache-coherency among
| the on-chip caches. Not that I'm aware of any HP parallel processor plans 

They're not workstations, but see page 11 of the January 1991 issue of
_Workstation_ magazine (Vol. 7, No. 1) about the 9000 870/x00 models.
Then there's the 3000 980/200 (PA-RISC, but running MPE).

						Regards, Steve Taylor

NOT A STATEMENT, OFFICIAL OR OTHERWISE, OF THE HEWLETT-PACKARD COMPANY.

burdick@hpspdra.HP.COM (Matt Burdick) (04/02/91)

>>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
>		  ^^^^^^^^^^^^
>I don't believe this.  1.2 uses the R5 Intrinsics, and while HP is a
>consortium member and the contractor doing the 1.2 work I can't believe
>that any of that stuff is stable enough to use.

You're right - Jim Byers posted this to comp.sys.hp a few days ago:

>>Software: X11R4, OSF/Motif1.2 (not 1.1!), VUE, NCS, NFS, 4.3BSD TCP/IP, ARPA.
>                  ^^^^^^^^^^^^^^^^^^^^^^
>The Series 700s will have Motif 1.1.  We will not have delivered 1.2
>into OSF's hands in that time frame.
>
>Jim Byers
>Interface Technology Operation
>Hewlett Packard

							-matt
-- 
Matt Burdick                |   Hewlett-Packard
burdick@hpspd.spd.hp.com    |   Intelligent Networks Operation