[comp.sys.hp] 9000/850 BLAS wanted

howard@wasatch.utah.edu (Walt Howard) (08/29/89)

We recently acquired a 9000/850S with the floating point accelerator option.
According to the Hardware Technical Data sheet, this combination should crank
numbers at 1.86 MFLOPS (DP Linpack benchmark, using coded BLAS).  It only gets
about 1.3 MFLOPS with the generic Fortran BLAS, so I asked our SE if we could
get the coded BLAS.  Well, he came up with a set of HPPA-assembly-language
BLAS, and they looked like they should run faster than the output from
"f77 -O2 -S blas.f", but the difference is actually quite small and the
machine is still quite a ways short of 1.86 MFLOPS.
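
For readers following along: most of the Linpack benchmark's time goes
into DAXPY (y := a*x + y), so that is the loop the "coded" versions are
trying to speed up. Below is a minimal sketch in C, not HP's code and not
the SE's assembly, of the plain loop next to a version unrolled by four
with a clean-up loop, which is the same pattern the reference Fortran BLAS
uses for unit increments. The names daxpy_plain and daxpy_unrolled are
hypothetical, and only the unit-increment case is handled. Whether the
unrolled form is actually faster depends on the compiler and the machine.

```c
#include <stddef.h>

/* Plain loop: y[i] = y[i] + a*x[i], unit increments only.
   This mirrors a generic DO loop; names are illustrative. */
void daxpy_plain(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same operation, unrolled by four.  The first (n mod 4) elements are
   handled in a clean-up loop so the main loop needs no remainder test. */
void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    size_t m = n % 4;

    for (; i < m; i++)              /* clean-up loop */
        y[i] += a * x[i];

    for (; i < n; i += 4) {         /* main loop, four elements per trip */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both routines compute each element with the same single multiply and add,
so they should give identical results; only the loop overhead differs.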

Even worse, these coded BLAS were a small subset of the total collection of
BLAS.  I will grant that some of the BLAS (like xROTG) run so fast anyway that
hand-coding them is unlikely to make much difference, and some others (like
xNRM2 and the multiple-precision routines) could be a *lot* of work to
hand-code.  Still, having only IDAMAX and DAXPY and DDOT seems short of a
useful set.  (For instance, is there a really clever way to do either DCOPY
or DSWAP, especially if the increments are 1?)
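
As a point of comparison for the DCOPY/DSWAP question, here is a hedged
sketch in C of the unit-increment cases only. It assumes the vectors do
not overlap, and the names dcopy_unit and dswap_unit are hypothetical.
With increments of 1 a copy is just a block move, so the sketch hands it
to memcpy; the swap is unrolled by three with a clean-up loop, as the
reference Fortran DSWAP does. This shows the general idea and makes no
claim about what is fastest on the series 800.

```c
#include <stddef.h>
#include <string.h>

/* Copy x into y, unit increments.  Assumes x and y do not overlap,
   so the whole vector can be moved as one block. */
void dcopy_unit(size_t n, const double *x, double *y)
{
    memcpy(y, x, n * sizeof *x);
}

/* Exchange x and y, unit increments, unrolled by three.
   The first (n mod 3) elements go through a clean-up loop. */
void dswap_unit(size_t n, double *x, double *y)
{
    size_t i = 0;
    size_t m = n % 3;

    for (; i < m; i++) {            /* clean-up loop */
        double t = x[i];
        x[i] = y[i];
        y[i] = t;
    }

    for (; i < n; i += 3) {         /* main loop, three elements per trip */
        double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2];
        x[i]     = y[i];
        x[i + 1] = y[i + 1];
        x[i + 2] = y[i + 2];
        y[i]     = t0;
        y[i + 1] = t1;
        y[i + 2] = t2;
    }
}
```

Neither routine does any arithmetic, so the running time is essentially
load/store traffic plus loop overhead.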

We have a lot of students running programs that would benefit if the BLAS were
significantly faster than plain old DO loops, and the "-S" output from f77
suggests that f77 isn't assigning variables to registers very well.
Curiously, assembling the output from "f77 -S" gives much slower results
than using f77 by itself and skipping the assembly language step, regardless
of optimizer level setting.

So, does anyone know where HP keeps a copy of the *really fast* BLAS for the
series 800, whether most of the BLAS functions are in it, and whether it can
be made available to customers?  I'll do them myself if I absolutely have to,
but I'm not a guru in assembly language and I have other useful things to do
if these are already done somewhere.

	Walt <howard@ee.utah.edu>