howard@wasatch.utah.edu (Walt Howard) (08/29/89)
We recently acquired a 9000/850S with the floating point accelerator option. According to the Hardware Technical Data sheet, this combination should crank numbers at 1.86 MFLOPS (DP Linpack benchmark, using coded BLAS). It only gets about 1.3 MFLOPS with the generic Fortran BLAS, so I asked our SE if we could get the coded BLAS.

Well, he came up with a set of HPPA-assembly-language BLAS, and they looked like they should run faster than the output from "f77 -O2 -S blas.f", but the difference is actually quite small and the machine is still quite a ways short of 1.86 MFLOPS.

Even worse, these coded BLAS were a small subset of the total collection of BLAS. I will grant that some of the BLAS (like xROTG) run so fast anyway that hand-coding them is unlikely to make much difference, and some others (like xNRM2 and the multiple-precision routines) could be a *lot* of work to hand-code. Still, having only IDAMAX and DAXPY and DDOT seems short of a useful set. (For instance, is there a really clever way to do either DCOPY or DSWAP, especially if the increments are 1?)

We have a lot of students running programs that would benefit if the BLAS were significantly faster than plain old DO loops, and the "-S" output from f77 suggests that f77 isn't assigning variables to registers very well. Curiously, assembling the output from "f77 -S" gives much slower results than using f77 by itself and skipping the assembly language step, regardless of optimizer level setting.

So, does anyone know where HP keeps a copy of the *really fast* BLAS for the series 800, and if most of the BLAS functions are in it, and if it can be made available to customers? I'll do them myself if I absolutely have to, but I'm not a guru in assembly language and I have other useful things to do if these are already done somewhere.

Walt <howard@ee.utah.edu>
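
To make the DCOPY/DSWAP question above concrete, here is a minimal C sketch of the usual approach: branch to a unit-increment fast path, and fall back to a plain strided loop otherwise. This is only an illustration, not HP's coded BLAS. The names sketch_dcopy and sketch_dswap are hypothetical, the sketch assumes positive increments and non-overlapping vectors (the reference BLAS also accept negative increments), and whether the 4-way unrolling actually helps on a given machine is something that would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of DCOPY: y := x.
 * Assumes incx > 0, incy > 0 and that x and y do not overlap. */
void sketch_dcopy(int n, const double *x, int incx, double *y, int incy)
{
    if (n <= 0)
        return;
    assert(x != NULL && y != NULL && incx > 0 && incy > 0);

    if (incx == 1 && incy == 1) {
        /* Unit increments: one contiguous block copy. */
        memcpy(y, x, (size_t)n * sizeof *x);
        return;
    }
    /* General case: plain strided loop. */
    for (int i = 0; i < n; i++)
        y[(size_t)i * incy] = x[(size_t)i * incx];
}

/* Hypothetical sketch of DSWAP: exchange x and y.
 * Same assumptions as sketch_dcopy. */
void sketch_dswap(int n, double *x, int incx, double *y, int incy)
{
    if (n <= 0)
        return;
    assert(x != NULL && y != NULL && incx > 0 && incy > 0);

    if (incx == 1 && incy == 1) {
        int i = 0;
        /* Unrolled by 4: the temporaries give the compiler four
         * independent values it can keep in registers. */
        for (; i + 4 <= n; i += 4) {
            double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
            x[i] = y[i];
            x[i + 1] = y[i + 1];
            x[i + 2] = y[i + 2];
            x[i + 3] = y[i + 3];
            y[i] = t0;
            y[i + 1] = t1;
            y[i + 2] = t2;
            y[i + 3] = t3;
        }
        /* Clean up the remaining 0..3 elements. */
        for (; i < n; i++) {
            double t = x[i];
            x[i] = y[i];
            y[i] = t;
        }
        return;
    }
    /* General case: plain strided loop. */
    for (int i = 0; i < n; i++) {
        size_t ix = (size_t)i * incx, iy = (size_t)i * incy;
        double t = x[ix];
        x[ix] = y[iy];
        y[iy] = t;
    }
}
```

Both routines do no arithmetic, only loads and stores, so any gain from hand-coding would come from reducing loop overhead and keeping the memory pipeline busy rather than from the floating point unit.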