rick@uicsg.UUCP (02/15/87)
New Benchmark A new benchmark has been written to overcome some of the deficiencies of Dhrystone. This benchmark, Dhampstone (written in C), was designed primarily for evaluation of multiple-register-set architectures (such as RISC), but is useful for general types of architectures as well. The benchmark is a modification of Dhrystone designed to incorporate several important statistics that were not used in the design of Dhrystone. For multiple-register-set architectures, the number of local variables and number of parameters are very important. Statistics for these distributions were included. Running Dhrystone on RISC would cause no overflow of the register file; the depth of nesting is too shallow. Dhampstone has overflow characteristics which are derived from measurements of several C programs (nroff, pi, sort, tbl, ccom, etc.). Other statistics useful for evaluating general architectures were also included. The distribution of register declarations was included in Dhampstone. Dhampstone also includes a representation of I/O processing overhead. No actual I/O is performed but the movement of data to an I/O buffer is included. I/O measurements were made in several programs and Dhampstone has a typical number of bytes of I/O per instruction executed. Other program statistics of Dhampstone are similar to those in Dhrystone. Neither synthetic benchmark uses statistics relevant for cache evaluation or code optimization evaluation. The code of Dhampstone follows. Rick Eickemeyer ihnp4!uiucdcs!uicsg!rick -------------------------------------------------------------------------- /* * DHAMPSTONE * * Synthetic Systems Program Benchmark for * Multiple Register Set Architectures * * AUTHORS: Richard J. Eickemeyer and Janak H. Patel * Computer Systems Group * Coordinated Science Laboratory * University of Illinois at Urbana-Champaign * 1101 W. Springfield * Urbana, IL 61801 * * DATE: September 1985 * * DHAMPSTONE is a synthetic benchmark program designed to match dynamic * program statistics of systems programs. The program is useful for * evaluating performance of general purpose computers. In designing * the program, a special emphasis was made in making the program * suitable for evaluating processors with multiple register sets. * Dhampstone, written in C, was created by modifying Dhrystone, * originally written in Ada, as described below. (Dhrystone is * described in: "Dhrystone: A Synthetic Systems Programming Benchmark" * by R. P. Weicker in _Communications_of_the_ACM_ Vol. 27, pp. 1013- * 1030.) Two copies of a C version of Dhrystone were used as a starting * point. This allowed greater precision in adjusting the number of * local variables and parameters. The two versions are the functions * Proc0 and Proc0a, which call each other in the proper sequence to * achieve the desired overflow statistics. As in Dhrystone, start and * stop marks are given to separate the initialization section from the * part of the program that meets the specified statistics. Measurements * should include only parts of the program executed between these two * points. The type names in the original have been preserved. * Dhampstone incorporates statistics on number of parameters and number * of local variables per procedure; multiple register set overflow; I/O; * and register declarations in addition to the statement and operand * type and locality distributions of Dhrystone. * * I/O is implemented with function "string_io" using a string of length 22. * No actual I/O is done, only the memory operations necessary for the * I/O. I/O is to blocks of memory. The input is the string * "DHAMPSTONE PROG INPUT" repeated 10 times (without intervening blanks * or new-lines) * * COMPILE Dhampstone as: * cc -o Dhampstone Dhampstone.c * If it is desired to run the program in a loop, compile as: * cc -DLOOP=n -o Dhampstone Dhampstone.c * with n the number of iterations. * Dhampstone was not designed with statistics relevant for optimization. * Results obtained using an optimizing compiler may not be indicative * of behavior of real programs. * * For each execution of both Proc0 and Proc0a all functions are * executed 2 times, unless indicated otherwise. */ /* definitions similar to those in stdio.h */ struct _iobuf { int _cnt; char *_ptr; char *_base; int _bufsiz }; # define FILE struct _iobuf # define NULL 0 # define getc(p) \ (--(p)->_cnt>=0? *(p)->_ptr++&0377:printf("I/O ERROR\n")) # define putc(x,p) \ (--(p)->_cnt>=0?((int)(*(p)->_ptr++=(unsigned)(x))):printf("I/O ERROR\n")) # define FALSE 0 # define TRUE 1 # define SLEN 22 /* I/O string length */ # define IOLEN 210 /* = 10 * (SLEN - 1) */ typedef enum {Id1, Id2, Id3, Id4, Id5} Enumeration; typedef int OneToThirty, OneToFifty; typedef char CapitalLetter; typedef char String30[30]; typedef int Array1DimInteger[50]; typedef int Array2DimInteger[50][50]; typedef int boolean; typedef struct RecordExample { Enumeration Discr; struct RecordExample *PtrComp; Enumeration EnumComp; OneToFifty IntComp; String30 StringComp; } RecordType, *RecordPtr; boolean BoolG; char CharG1, CharG2, CharG3; Array1DimInteger ArrG1; Array2DimInteger ArrG2; String30 StringG1; RecordPtr PtrG1, PtrG2; int IntG1, IntG2, IntG3, IntG4; FILE *File_In, *File_Out; char input[IOLEN+1] = "DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\ DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\ DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\ DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT"; char output[IOLEN+1]; int Proc0_Flag; /* indicates if in Proc0 or Proc0a */ Enumeration Func1(), Func1a(); boolean Func2(), Func3(); char *malloc(); main () { #ifdef LOOP int loop; #endif LOOP PtrG2 = (RecordPtr) malloc (sizeof (RecordType)); PtrG1 = (RecordPtr) malloc (sizeof (RecordType)); PtrG1->Discr = Id1; PtrG1->PtrComp = PtrG2; PtrG1->EnumComp = Id3; PtrG1->IntComp = 40; strcpy (PtrG1->StringComp, "DHAMPSTONE PROGRAM ONE STRING"); strcpy (StringG1, "DHAMPSTONE PROGRAM 3RD STRING"); File_In = (FILE *) malloc (sizeof (FILE)); File_Out = (FILE *) malloc (sizeof (FILE)); File_In->_ptr = File_In->_base = input; File_Out->_ptr = File_Out->_base = output; File_In->_cnt = File_Out->_cnt = IOLEN; File_In->_bufsiz = File_Out->_bufsiz = IOLEN; ArrG2[6][5] = 47; ArrG2[10][9] = 32; CharG3 = 'C'; IntG4 = 8; /* *------- Start Timer ------- */ #ifdef LOOP for (loop = 1; loop <= LOOP; loop++) { #endif LOOP Proc0a(2); Proc0(9); Proc0a(2); Proc0a(3); Proc0(2); Proc0a(2); #ifdef LOOP File_In->_ptr = File_In->_base; File_In->_cnt = File_In->_bufsiz; File_Out->_ptr = File_Out->_base; File_Out->_cnt = File_Out->_bufsiz; } #endif LOOP /* *------- Stop Timer -------* */ } Proc0 (IntParI) register int IntParI; /* IntParI is a no. to control calling depth (called from main and Proc0a) */ { register OneToFifty IntL1, IntL2; OneToFifty IntL3; char CharIndex; Enumeration EnumL; String30 StringL1, StringL2; if (IntParI >= 2) Proc0a (IntParI - 1); Proc0_Flag = 1; IntL1 = 2; IntL2 = 3; string_io (StringL1, SLEN); strcpy (StringL2, "DHAMPSTONE PROGRAM 2ND STRING"); EnumL = Id2; BoolG = ! Func2 ('E', &CharG3, 'R', StringL2, StringL1); while (IntL1 < IntL2) { /* body executed once */ IntL3 = 5 * IntL1 - IntL2; Proc7 (); IntL1 = IntL1 + 1; } Proc8 (ArrG1, ArrG2, IntL1, IntL3 - 5); Proc1 (PtrG1); for (CharIndex = 'A'; CharIndex <= CharG2; CharIndex++) if (EnumL == Func1a(Id1, CharIndex, 'Z')) /* body executed twice */ Proc6 (&EnumL, Id1, 20, IntL2); /* not executed */ IntL3 = IntL2 * IntL3; IntL2 = IntL3 / IntL1; IntL2 = 7 * (IntL3 - IntL2) - IntL1; Proc2 (&IntL3, 'C'); } Proc0a (IntParI) int IntParI; /* IntParI is a no. to control calling depth (called from main and Proc0a) */ { register OneToFifty IntL1, IntL2, IntL3; OneToFifty IntL4; char CharIndex, CharL1, CharL2; Enumeration EnumL1; String30 StringL; IntL1 = 2; IntL2 = 5; CharL2 = 'A'; CharL1 = 'C'; if (IntParI > 1) Proc0(IntParI - 1); Proc0_Flag = 0; Proc4 (CharL2); strcpy (StringL, "DHAMPSTONE PROGRAM 1ST STRING"); EnumL1 = Id2; BoolG = ! Func2 (CharL1, &CharL2, CharG1, StringG1, StringL); while (IntL1 < IntL2) { /* body executed 3 times */ if (IntL1 == 2) { IntL3 = 5 * IntL1 - IntL2; /* executed 1 time */ Proc7a (); } IntL1 = IntL1 + 1; } Proc8a (IntL1, ArrG1, ArrG2, IntL3, IntL2, &IntL4); Proc1 (PtrG1); for (CharIndex = 'A'; CharIndex <= CharG2; CharIndex++) if (EnumL1 == Func1a (Id1, CharIndex, 'Z')) { /* body executed twice */ EnumL1 = Id3; /* not executed */ Proc6 (&EnumL1, Id1, 4, 5); } IntL2 = IntL4 * IntL3; IntL3 = IntL3 / IntL4; IntL2 = 7 * (IntL4 - IntL3) - IntL2; Proc2 (&IntL4, CharL1); /* CharL1 & CharG3 are both 'C' */ } Proc1 (PtrParI) RecordPtr PtrParI; /* argument is PtrG1 (called from Proc0 and Proc0a) */ { # define NextRecord (*(PtrParI->PtrComp)) register RecordPtr PtrL; register int IntL; NextRecord = *PtrG1; PtrParI->IntComp = 5; NextRecord.IntComp = PtrParI->IntComp; PtrL = PtrParI->PtrComp; NextRecord.PtrComp = PtrL; Proc3 (&NextRecord.PtrComp); if (NextRecord.Discr == Id1) { /* executed */ Proc6 (&NextRecord.EnumComp, PtrParI->EnumComp, PtrParI->IntComp, 10); NextRecord.IntComp = 6; Proc7 (); NextRecord.PtrComp = PtrG1->PtrComp; } else { *PtrParI = NextRecord; /* not executed */ if (PtrL == NULL) IntL = 10; else { IntL = 1; PtrParI->IntComp = IntL; } } # undef NextRecord } Proc2 (IntParIO, CharParI) register OneToFifty *IntParIO; char CharParI; /* arguments are 21 and 'C' (called from Proc0). Returns 50 as IntParIO arguments are 7 and 'C' (called from Proc0a). Returns 22 as IntParIO */ { register OneToFifty IntL1, IntL2, IntL3; Enumeration EnumL1, EnumL2; IntL2 = *IntParIO + 10; do /* body executed once */ if (CharG3 == CharParI) { IntL3 = IntL2 - 2; /* executed */ *IntParIO = *IntParIO + IntL3 - IntG3; EnumL2 = Id1; } else { IntL1 = IntL2 + 4; /* not executed */ *IntParIO = IntL1 - IntG3; EnumL1 = Id1; EnumL2 = EnumL1; } while (EnumL2 != Id1); } Proc3 (PtrParO) RecordPtr *PtrParO; /* (called from Proc1). Returns address of PtrG2 as PtrParO */ { if (PtrG1 != NULL) *PtrParO = PtrG1->PtrComp; /* executed */ else IntG1 = 100; /* not executed */ Proc7a (); } Proc4 (CharParI) char CharParI; /* argument is 'C' (called from Proc8--from Proc0) argument is 'A' (called from Proc0a) */ { register boolean BoolL; if (Proc0_Flag) Proc5 (); BoolL = CharG1 == CharParI; BoolL = BoolL || BoolG; CharG2 = 'B'; } Proc5 () { /* executed 3 times per Proc0--Proc0a pair */ CharG1 = 'A'; BoolG = FALSE; } Proc6 (EnumParO, EnumParI, IntParI1, IntParI2) Enumeration *EnumParO; Enumeration EnumParI; OneToFifty IntParI1, IntParI2; /* arguments are address, Id3, 5, 10 (called from Proc1). Returns Id2 for EnumParO */ { if (IntParI1 > 10) IntG1 = 4; /* not executed */ else if (! Func3(EnumParI)) *EnumParO = Id4; /* not executed */ switch (EnumParI) { case Id1 : *EnumParO = Id1; IntG2 = IntParI2 + 2; IntG3 = IntParI1 + IntParI2; break; case Id2 : if (IntG3 > IntParI1) *EnumParO = Id1; else *EnumParO = Id4; break; case Id3 : *EnumParO = Id2; /* executed */ break; case Id4 : break; case Id5 : *EnumParO = Id3; break; } } Proc7 () { /* executed 3 times per Proc0--Proc0a pair */ register OneToFifty IntL; IntL = IntG1 + 2; IntG2 = IntG1 + IntL; } Proc7a () { /* executed 3 times per Proc0--Proc0a pair */ if (! Proc0_Flag) Proc5 (); if (IntG4 > 4) IntG2 = IntG3 + 2; /* executed */ else IntG2 = IntG3 + 1; /* not executed */ } Proc8 (ArrParIO1, ArrParIO2, IntParI1, IntParI2) Array1DimInteger ArrParIO1; Array2DimInteger ArrParIO2; register int IntParI1; int IntParI2; /* arguments are 2 arrays, 3, 2 (called from Proc0) */ { /* executed 1 time per Proc0--Proc0a pair */ OneToFifty IntL1; register OneToFifty IntL2; OneToFifty IntL3; register OneToFifty IntIndex; OneToFifty IntL4, IntL5; Proc4 ('C'); IntL1 = IntParI1 + 5; IntL2 = IntL1 + 1; if (IntParI2 == 2) { IntL5 = IntL1; /* executed */ IntL4 = IntL1 + 1 - IntParI1; } else { IntL3 = IntL2; /* not executed */ IntL5 = IntL3 + 3; IntL4 = IntL3 + 2 - IntParI1; } ArrParIO1[IntL5] = IntParI2; ArrParIO1[IntL1 + 1] = ArrParIO1[IntL1]; ArrParIO1[IntL1 + 30] = IntL4; for (IntIndex = IntL1; IntIndex <= IntL2; IntIndex++) ArrParIO2[IntL2][IntIndex] = IntParI1; /* body executed twice */ ArrParIO2[IntL4][IntL2 - 1] = ArrParIO2[IntL4][IntL4 - 1] + 1; ArrParIO2[IntL5 + 20][IntL5] = ArrParIO1[IntL1]; if (IntParI2 != 2) IntL3 = IntL3 + 4; /* not executed */ IntG1 = 5; } Proc8a (IntParI1, ArrParIO1, ArrParIO2, IntParI2, IntParI3, IntParO) OneToFifty IntParI1; register OneToFifty IntParI2; OneToFifty IntParI3, *IntParO; Array1DimInteger ArrParIO1; Array2DimInteger ArrParIO2; /* arguments are 5, two arrays, 5, 5, address (called from Proc0a). Returns 7 for IntParO */ { /* executed 1 time per Proc0--Proc0a pair */ register OneToFifty IntL1; OneToFifty IntL2; register OneToFifty IntIndex; OneToFifty IntL3, IntL4; IntL2 = IntParI1 + 5; IntL1 = IntL2 + 1; if (IntParI3 <= 23) IntL3 = IntL2; /* executed */ else { IntL4 = IntL2; /* not executed */ IntL3 = IntL4 + 3; } ArrParIO1[IntL3] = IntParI2; ArrParIO1[IntL2 + 1] = ArrParIO1[IntL2]; ArrParIO1[IntL3 + 30] = IntParI3; for (IntIndex = IntL2; IntIndex <= IntL1; IntIndex++) ArrParIO2[IntL1][IntIndex] = IntParI2; /* body executed twice */ ArrParIO2[IntL1][IntL2 - 1] = ArrParIO2[IntL3][IntL2 - 1] + 1; if (IntParI3 > 17) IntL4 = IntL4 + 4; /* not executed */ else ArrParIO2[IntL3 + 20][IntL3] = ArrParIO1[IntL2];/* executed */ *IntParO = IntParI2 + IntParI3 + 2 - IntParI1; } Enumeration Func1 (CharParI1, CharParI2) CapitalLetter CharParI1, CharParI2; /* arguments are 'M', 'M' (called from Func2). Returns Id1 */ { /* executed 2 times per Proc0--Proc0a pair */ CapitalLetter CharL1, CharL2; Enumeration EnumL; CharL1 = CharParI1; CharL2 = CharL1; if (CharL2 != CharParI2) { EnumL = Id2; /* not executed */ return (EnumL); } else return (Id1); /* executed */ } Enumeration Func1a (EnumParI, CharParI1, CharParI2) Enumeration EnumParI; CapitalLetter CharParI1, CharParI2; /* arguments are Id1, 'A' or 'B', 'Z' (called from Proc0 and Proc0a) Returns Id1 */ { /* executed 4 times per Proc0--Proc0a pair */ CapitalLetter CharL1, CharL2; CharL1 = CharParI1; CharL2 = CharL1; if (CharL2 != CharParI2) return (EnumParI); /* executed */ else return (Id3); /* not executed */ } boolean Func2 (CharParI1, CharParIO, CharParI2, StringParI1, StringParI2) char CharParI1; register char *CharParIO; char CharParI2; String30 StringParI1, StringParI2; /* arguments are 'E', 'C', 'R', 2 strings (called from Proc0) arguments are 'C', 'C', 'A', 2 strings (called from Proc0a) Returns FALSE, CharParIO is not changed */ { register OneToThirty IntL1, IntL2; CapitalLetter CharL1, CharL2; IntL1 = 2; CharG1 = StringParI1[IntL1]; while (IntL1 <= 2) /* body executed once */ if (Func1 (StringParI2[IntL1 + 1], 'M') == Id1) { CharL1 = 'A'; /* executed */ IntL1 = IntL1 + 1; } if (CharL1 >= 'W' && CharL1 < 'Z') { CharL2 = CharParI1; /* not executed */ IntL2 = 3; } if (CharL1 == 'X') { CharL2 = *CharParIO; /* not executed */ *CharParIO = 'J'; return TRUE; } else if (strcmp (StringParI2, StringParI1) > 0) { IntL2 = IntG3; /* not executed */ IntL2 = IntL1 + IntL2; CharL2 = CharParI2; CharL1 = CharL2; return TRUE; } else return FALSE; /* executed */ } boolean Func3 (EnumParI) Enumeration EnumParI; /* argument is Id3 (called from Proc6). Returns TRUE */ { Enumeration EnumL; EnumL = EnumParI; if (EnumL == Id3) return TRUE; /* executed */ return FALSE; /* not executed */ } string_io(string, length) register char *string; int length; /* arguments are string, SLEN (called from Proc0) */ { /* executed 1 time per Proc0--Proc0a pair */ register int index; --length; for (index = 0; index < length; index++) { string[index] = getc(File_In); /* body executed 21 times */ putc(string[index],File_Out); } string[length] = '\0'; }
pmontgom@oak.math.ucla.edu (Peter Montgomery) (01/07/90)
In article <4057@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes: > >So what's your favorite application that does lots of >integer multiplies and divides? Code it up and pass it out >for people to run. > During the recent controversy about the SPARC and integer multiply/divide, multiple individuals have requested benchmarks which compare machine performances on applications needing these operations. In another posting is my response to that challenge, written primarily in FORTRAN 77. It times eleven algorithms (some integer, some floating point, but is biased towards ones needing integer multiplication and division). A machine's rating is defined to be the geometric mean of the times on these algorithms. So far I have run it on the SUN 3/50, SUN 3/280, SPARC, and Alliant. Here are the outputs (each row corresponds to a machine and compiler; each column corresponds to an algorithm): --------------------------------------------------------------------------- Gener = Generate matrices (integer arithmetic, trigonometry) Dble = Double precision floating point matrix multiplication Sngl = Single precision floating point matrix multiplication Cmplx = Complex (single precision) matrix multiplication Intgr = Integer matrix multiplication Modlr = Matrix multiplication modulo a prime number TrlDiv = Primality test, using trial division through square root PrbPrm = Probable prime test, using Fermat's Little Theorem GCD = Euclidean GCD algorithm BinDec = Binary-to-Decimal conversion Cfrac = Continued fraction expansion of quadratic irrational Geo mean = Geometric mean of above times SUN 3/50 with 68881, optimized (f77 -O -f68881) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.86 3.76 3.54 43.70 1.56 1.86 12.56 1.48 3.30 4.82 2.60 3.55 sec SUN 3/280 with 68881, optimized (f77 -O -f68881) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.56 2.84 2.56 30.16 0.94 1.04 7.44 1.12 1.96 2.38 1.54 2.24 sec SUN 3/280 with inline 68881, optimized (f77 -O -f68881 /usr/lib/f68881.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.48 2.82 2.54 18.54 0.86 1.10 7.58 1.16 1.94 2.38 1.54 2.11 sec SUN 3/280 with inline FPA, optimized (f77 -O -ffpa /usr/lib/ffpa.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.34 0.66 0.62 4.66 0.90 1.06 7.38 0.38 1.92 2.32 1.54 1.25 sec SPARC station optimized (f77 -O) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.30 0.38 0.25 3.25 1.06 0.86 17.79 0.44 1.24 1.56 1.07 1.03 sec SPARC station optimized, inline (f77 -O /usr/lib/libm.il) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.37 0.35 0.27 2.00 1.05 0.87 17.76 0.44 1.24 1.56 1.07 1.00 sec Alliant FX/80, one processor, optimized, unvectorized (fortran -Og) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.09 0.91 0.85 1.47 0.86 1.21 8.14 0.37 1.90 1.14 1.72 1.02 sec Alliant FX/80, one processor, optimized, vectorized (fortran -Ovg -AS) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.11 0.18 0.16 0.58 0.20 0.37 8.04 0.37 1.84 1.07 1.44 0.54 sec Alliant FX/80, six processors, optimized, vectorized (fortran -O -AS) Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean 0.02 0.04 0.03 0.11 0.04 0.07 8.02 0.37 1.84 1.07 1.38 0.22 sec ------------------------------------------------------------------------------- Comments about results: The SUN 3s have OS version 4.2. The SPARCs have SUN OS 4.0. The Alliant has Concentrix version 5.0 and FX/Fortran version V4.2.40. The Alliant has eight advanced computational elements (ACEs), configured into one cluster of six and two detached processors. The timings were made during the daytime, and may be affected by the load averages. In particular, our Alliant is rarely idle. The entries are approximately in decreasing order by geometric mean time. Geometric mean is fairer than arithmetic mean, because the relative order of the geometric means will remain unchanged if one application is run twice as far (and takes twice as long on all machines), whereas the relative order of the arithmetic means may change. When we compare the SUN 3/50 and SUN 3/280 with 68881 (first two entries), the improvements for the eleven timings are rather uniform: each takes 50%-75% as long on the 3/280 as on the 3/50. The use of an inline expansion template (/usr/lib/f68881.il) speeds the complex matrix multiplication by 30% and helps the trigonometric calculations (Gener), while making little change elsewhere. Using the FPA on the 280 speeds the non-trigonometric floating point computations by a factor of 4, while leaving integer times almost unchanged (N.B. PrbPrm uses floating point arithmetic to estimate an integer quotient). Next compare the SPARC station to the SUN 3/280 with FPA. The floating point times are further improved, as are the Euclidean GCD, binary-to-decimal conversion, and continued fraction times. The integer matrix multiplication is 20% worse but the modular matrix multiplication is 20% faster than on the 3/280. The primality test which uses repeated division (TrlDiv) is much worse on the SPARC, taking 30% longer than it did on the SUN 3/50. On the SPARC, modular matrix multiplication is faster than integer matrix multiplication even though the former inner loop is more complicated (it has an IF); this is presumably because the entries in the integer matrix are pseudorandom values modulo 2**32 whereas those in the modular matrices require only 15 bits. Both the TrlDiv and GCD codes have two remainder operations in the inner loop; the TrlDiv/GCD ratio is between 3.5 and 4.5 on SUN 3 and Alliant, but is 14 on a SPARC. A possible explanation is the high frequency of quotients of 1 and 2 (59%) in the Euclidean algorithm [Cfrac also has a high frequency of low quotients]. As on the SUN 3, the use of an inline expansion template affects primarily the SPARC time for complex multiplication (the trigonometric times get worse rather than better). In another comp.arch posting, Bo Thide' reports a 6-fold speedup in integer multiply time using /usr/lib/libm.il, but .mul, .div, and .rem are missing from my /usr/lib/libm.il (which is dated 88/02/08). The unvectorized Alliant (with one processor) is faster than the SPARC on six of the eleven tests, and is slower on five tests; the overall rating is too close to call. The unvectorized Alliant beats the SUN 3/280 on three tests, loses on five, with three (Intgr, PrbPrm, GCD) too close to call; the Alliant beats the SUN 3/280 overall. The Alliant has an -autoinline option to insert user procedures inline. Its use had minimal effect on these tests. When -autoinline was used with multiple processors on a draft version of this benchmark, the program aborted with arithmetic exception. It is interesting to compare the ratios Sngl/Intgr for real vs. integer matrix multiplication on various machines. All matrix entries have the same size, and the indexing is identical, so the memory subsystem should be equally fast with both, and the compiler should be equally good at optimizing the loops (e.g., assigning variables to registers). On the SUN 3s with 68881, the Sngl/Intgr ratio is between 2 and 3, with integer being faster. On the 3/280 with FPA, the ratio is 0.7, with real being faster. On the SPARC, the ratio is about 0.25, with real being faster. On the Alliant, the ratio is about 0.9, with real being slightly faster. Another interesting ratio Cmplx/Sngl. A complex multiplication uses 4 real multiplication and 2 real additions; a complex addition uses 2 real additions. The inner loop of the complex matrix multiplication will therefore have 4 real multiplications and 4 real additions, whereas the single precision matrix multiplication has one of each. The ratio Cmplx/Sngl should be less than 4, since loop overhead, indexing, and memory references are not four times as complicated; also there are many opportunities for parallelism and pipelining. The SUN 3 and SPARC ratios are between 7 and 8 with inline expansion templates, and between 11 and 13 without. The Alliant ratios (1.6 to 1.9 unvectorized, 3 to 4 vectorized) seem more reasonable. All of these machines have hardware double precision arithmetic, so the Dble/Sngl ratios are all between 1.0 and 1.5. In many cases, the penalty for double precision is very slight. The ratio of worst-to-best single-processor time is 70 for complex matrix multiplication and about 20 for single and double precision matrix multiplication. On the other hand, this ratio is between 2 and 3 for TrlDiv, Cfrac and GCD. Some comments on the object code of these compilers: (1) The code printed by the Alliant compiler is much easier to read than the others. For example, it retains the user variable names. It also prints the hexadecimal equivalent of the object code, which is unimportant here but useful on other occasions. (2) All recognizes the common subexpression in MOD(n,6).ne.1 .and. MOD(n,6).ne.5 within TrlDiv, computing the remainder only once. (3) In GCD, we need both MIN(ABS(n1), ABS(n2)) and MAX(ABS(n1), ABS(n2)). None of these compilers detects that MIN and MAX have identical arguments (although all compiled ABS, MIN, and MAX inline). Indeed, the Alliant computes ABS(n1) and ABS(n2) twice. (4) All generate inline code for statement function MODMUL within PrbPrm (function MODEXP). (5) The lines string(place:place) = DIGITS(MOD(left,10)) and string(place:place) = DIGITS(left) in BINDEC transfer a single character. The Alliant compiler recognizes these and generates movb instructions. The SUN 3 and SPARC compilers generate a subroutine call. The Alliant also compiles the LEN (character string length) intrinsic function inline; SUN 3 and SPARC do not. (6) Subroutine BINDEC needs both the quotient left/10 and the remainder MOD(left,10). The SUN 3 compiler generates two divide instructions (one for quotient, one for remainder), even though the hardware makes both available at once. The SPARC calls both .div and .rem . The Alliant (which is also MC68020-based) generates only one divide instruction, but computes the remainder from the quotient via a multiply and subtract (here and elsewhere). (7) The SUN 3 and SPARC compilers generate special code for MOD(n,2) and n/2 within MODEXP. The Alliant uses the same instructions as for other divisors, except to optimize the multiply by 2 into an add when computing the remainder. None recognize that the MOD(n,2).ne.0 test (i.e., the ODD function of PASCAL) requires testing only the bottom bit of n, whether n is positive or negative (on a twos' complement machine). -------- Peter Montgomery pmontgom@MATH.UCLA.EDU Department of Mathematics, UCLA, Los Angeles, CA 90024
pmontgom@oak.math.ucla.edu (01/07/90)
program bchmrk C C This benchmark times eleven codes, many of C which require integer multiplication and/or division. C Integer addition, subtraction, and multiplication are C assumed not to abort on overflow, but rather to return C the correct result modulo 2**32 or other high power of 2. C The benchmark includes the following: C (1) Multiplication of square matrices. Matrix entries C can be real, double precision, complex, integer, C or integers modulo a prime number. C (2) Check whether an integer is prime, using trial division C by integers up to the square root of the argument. C (3) Modular exponentiation. This routine requires a function C MODMUL(m1,m2) which computes the product m1*m2 modulo C local variable mdls, but where m1*m2 may overflow. C (4) Euclidean GCD (greatest common divisor) algorithm. C (5) Binary-to-decimal conversion. C (6) Continued fraction expansion of quadratic irrational. C An installation is allowed to change the following C parts of the source code when compiling the benchmark: C (1) Subroutine CPUTIM (return elapsed user CPU time). C (2) Statement function MODMUL within MODEXP C (may be optimized to exploit machine properties, C or be replaced by an assembly language subroutine). C (3) Convert entire source to upper case. C If this source code is modified in any other way, C please identify the modifications when reporting the time C on your system. An optimizing compiler is allowed. C If multiple CPUs are used, also report the times with a single CPU. C The program reads no input data. The memory required C is approximately 300000 bytes (decimal), if each real or C integer variable occupies 4 bytes. C Benchmark written by C Peter L. Montgomery C Department of Mathematics C University of California C Los Angeles, CA 90024 USA C C pmontgom@math.ucla.edu C January, 1990 C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - C Constants integer NCONV, PMODLS, PRMHGH, PRMLOW, SZ parameter (NCONV = 2000) parameter (PMODLS = 30011) parameter (PRMLOW = 2*10**9, PRMHGH = PRMLOW + 2000) parameter (SZ = 40) C Local arrays logical prm(0 : PRMHGH - PRMLOW) complex c1(SZ,SZ), c2(SZ,SZ), c3(SZ,SZ) complex ctmp(SZ,SZ), cp1(SZ,SZ), cp2(SZ,SZ) double precision d1(SZ,SZ), d2(SZ,SZ), d3(SZ,SZ) double precision dtmp(SZ,SZ), dp1(SZ,SZ), dp2(SZ,SZ) real s1(SZ,SZ), s2(SZ,SZ), s3(SZ,SZ) real stmp(SZ,SZ), sp1(SZ,SZ), sp2(SZ,SZ) integer convs(NCONV) integer i1(SZ,SZ), i2(SZ,SZ), i3(SZ,SZ) integer itmp(SZ,SZ), ip1(SZ,SZ), ip2(SZ,SZ) integer p1(SZ,SZ), p2(SZ,SZ), p3(SZ,SZ) integer ptmp(SZ,SZ), pp1(SZ,SZ), pp2(SZ,SZ) equivalence(prm, ctmp, dtmp, stmp, convs, itmp, ptmp) C Other local variables integer i, ierr, j, nprime, p, perr, rem, sumcnv, sumgcd real cerr, serr real t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13 real tc, tcfrac, td, tdcml, tgcd, tgen, ti real tovral, tp, tprbpr, tprime, ts double precision derr character*10 string C Functions referenced intrinsic ABS, CMPLX, COS, DBLE, MAX, REAL, SIN logical PRIME integer GCD, MODEXP external GCD, MODEXP, PRIME C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - C Select test matrices (not intended to be meaningful). C i1, i2, and i3 are modulo 2**32 or other power of 2. C If this causes trouble on your system, use smaller C values in the assignments and report the change. call CPUTIM(t1) do 100 j = 1, SZ do 90 i = 1, SZ d1(i,j) = COS(DBLE(i+j)) d2(i,j) = SIN(DBLE(i*j)) d3(i,j) = COS(DBLE(i-j)) s1(i,j) = SIN(REAL(i))*SIN(REAL(j)) s2(i,j) = COS(REAL(3*i - j)) s3(i,j) = SIN(REAL(3*i - j)) i1(i,j) = 23321*i*j + 4135151*i + 123321*j + 231235325 i2(i,j) = 231421*i*j + 632153*i + 3215312*j + 887231323 i3(i,j) = 98136*i*j + 4532355*i - 2231413*j + 343232231 p1(i,j) = MOD(23*i**2 + 21434*i*j + 3556*j + 56, PMODLS) p2(i,j) = MOD(8453*i*j + 24353*i + 765*j + 3145, PMODLS) p3(i,j) = MOD(124*(i-j)**2 + 4354*i + 56*j + 67, PMODLS) c1(i,j) = CMPLX(s1(i,j), REAL(d1(i,j))) c2(i,j) = CMPLX(s2(i,j), REAL(d2(i,j))) c3(i,j) = CMPLX(s3(i,j), REAL(d3(i,j))) 90 continue 100 continue C Perform double precision matrix multiplication. C Compute both (d1*d2)*d3 and d1*(d2*d3). call CPUTIM(t2) call DMTMUL(d1, d2, dtmp, SZ) call DMTMUL(dtmp, d3, dp1, SZ) call DMTMUL(d2, d3, dtmp, SZ) call DMTMUL(d1, dtmp, dp2, SZ) C Perform single-precision (real) matrix multiplication. call CPUTIM(t3) call SMTMUL(s1, s2, stmp, SZ) call SMTMUL(stmp, s3, sp1, SZ) call SMTMUL(s2, s3, stmp, SZ) call SMTMUL(s1, stmp, sp2, SZ) C Perform single-precision (complex) matrix multiplication. call CPUTIM(t4) call CMTMUL(c1, c2, ctmp, SZ) call CMTMUL(ctmp, c3, cp1, SZ) call CMTMUL(c2, c3, ctmp, SZ) call CMTMUL(c1, ctmp, cp2, SZ) C Perform integer matrix multiplication. call CPUTIM(t5) call IMTMUL(i1, i2, itmp, SZ) call IMTMUL(itmp, i3, ip1, SZ) call IMTMUL(i2, i3, itmp, SZ) call IMTMUL(i1, itmp, ip2, SZ) C Perform matrix multiplication modulo PMODLS. call CPUTIM(t6) call PMTMUL(p1, p2, ptmp, SZ, PMODLS) call PMTMUL(ptmp, p3, pp1, SZ, PMODLS) call PMTMUL(p2, p3, ptmp, SZ, PMODLS) call PMTMUL(p1, ptmp, pp2, SZ, PMODLS) C Check that products agree C (matrix multiplication should be associative). call CPUTIM(t7) cerr = 0.0 derr = 0.0D0 serr = 0.0 ierr = 0 perr = 0 do 200 j = 1, SZ do 190 i = 1, SZ cerr = MAX(cerr, ABS(cp1(i,j) - cp2(i,j))) derr = MAX(derr, ABS(dp1(i,j) - dp2(i,j))) serr = MAX(serr, ABS(sp1(i,j) - sp2(i,j))) ierr = MAX(ierr, ABS(ip1(i,j) - ip2(i,j))) perr = MAX(perr, ABS(pp1(i,j) - pp2(i,j))) 190 continue 200 continue if (ierr.ne.0 .or. perr.ne.0 * .or. derr .ge. 1.0E-12 * .or. serr .ge. 1.0E-5 * .or. cerr .ge. 2.0E-4 ) then print 210, ierr, perr, derr, serr, cerr 210 format(' ERRORS IN MATRIX MULTIPLY: ', 2I10, 3E10.3) end if call CPUTIM(t8) C Search for primes in the interval [PRMLOW, PRMHGH]. nprime = 0 do 300 j = 0, PRMHGH-PRMLOW p = j + PRMLOW prm(j) = PRIME(p) if (prm(j)) nprime = nprime + 1 300 continue C Use a probable prime test to verify these findings. C Recall, if p is an odd prime, then C 2**((p-1)/2) == +-1 mod p by Fermat's Theorem. C Most integers p passing this test are prime. C Counterexamples exist (e.g., 341, 561), but are rare. C call CPUTIM(t9) do 400 j = MOD(PRMLOW+1,2), PRMHGH-PRMLOW, 2 p = j + PRMLOW rem = MODEXP(2, (p-1)/2, p) if (prm(j) .neqv. (rem.eq.1 .or. rem.eq.p-1)) then print *, 'Probable prime test fails for ', p, rem end if 400 continue call CPUTIM(t10) sumgcd = 0 do 500 i = 10000000, 10000100 do 490 j = 11111111, 140000000, 500001 sumgcd = sumgcd + GCD(i,j) 490 continue 500 continue if (nprime.ne.100 .or. sumgcd.ne.407895) then print 600, nprime, sumgcd 600 format(1x, 'ERROR -- ', i7, ' primes found.', * ' Sum of GCDs is ', i10) end if call CPUTIM(t11) C Do binary-to-decimal conversion. do 700 i = 10000, 10000000, 537 call BINDEC(i, string) 700 continue if (string.ne.' 9999811') then print *, 'BINDEC Error - Final string is ', string end if call CPUTIM(t12) C Try continued fraction of quadratic irrational. sumcnv = 0 do 800 j = 1000003, 1001003, 10 call CFRAC(j, convs, NCONV) do 790 i = 1, NCONV sumcnv = sumcnv + convs(i) 790 continue 800 continue if (sumcnv.ne.7563088) then print *, ' CFRAC ERROR - Sum of convergents is ', sumcnv end if call CPUTIM(t13) C Compute times used. tgen = t2 - t1 td = t3 - t2 ts = t4 - t3 tc = t5 - t4 ti = t6 - t5 tp = t7 - t6 tprime = t9 - t8 tprbpr = t10 - t9 tgcd = t11 - t10 tdcml = t12 - t11 tcfrac = t13 - t12 C Overall rating = Geometric mean of above times. C (Clock resolution should be such that all times are nonzero). tovral = (tgen*td*ts*tc*ti*tp*tprime*tprbpr*tgcd*tdcml*tcfrac) * **(1.0/11.0) C Print timings. print 1000, tgen, td, ts, tc, ti, tp, tprime, tprbpr, tgcd, * tdcml, tcfrac, tovral 1000 format(1x, 5HGener, 2x,4HDble, 2x,4HSngl, 1x,5HCmplx, * 1x,5HIntgr, 1x,5HModlr, 1x,6HTrlDiv, * 1x,6HPrbPrm, 3x, 3HGCD, 1x, 6HBinDec, * 1x,5HCfrac, 2x,'Geo mean'/ * 1x, F5.2, 5F6.2, 2F7.2, F6.2, F7.2, 2F6.2, ' sec') end ************************************************************************ subroutine BINDEC(num, string) C Binary-to-decimal conversion of nonnegative integer. character*(*) string integer num C Local variables integer left, place character DIGITS(0:9) intrinsic LEN, MOD data DIGITS/'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'/ C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - left = num place = LEN(string) 10 continue if (left.ge.10) then string(place:place) = DIGITS(MOD(left, 10)) left = left/10 place = place - 1 if (place.ne.0) go to 10 return end if string(place:place) = DIGITS(left) if (place.gt.1) string(1:place-1) = ' ' end ************************************************************************ subroutine CFRAC(D, convs, nconv) C Return the first nconv convergents in the continued fraction C expansion of SQRT(D), where D > 0 and D is not a perfect square. integer D, nconv, convs(1:nconv) C Local variables integer droot, dadd, daddnw, denom, dinv, dinvnw, iconv, q intrinsic INT, REAL, SQRT C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - droot = INT(SQRT(REAL(D) + 0.5)) dadd = 0 denom = 1 dinv = D do 100 iconv = 1, nconv C At this time, it remains to expand X = (SQRT(D) + dadd)/denom. C We know D = dadd**2 + denom*dinv, where all variables are nonnegative. C Estimate q = INT(X) using droot = INT(SQRT(D)). C Replace X by 1/(X - q) = denom/(SQRT(D) + dadd - denom*q) C = (SQRT(D) + denom*q - dadd)/((D - (dadd - denom*q)**2)/denom). C Since D = dadd**2 + denom*dinv, the denominator is an integer. C Make special check for case q = 1. if (droot + dadd .ge. 2*denom) then q = (droot + dadd)/denom convs(iconv) = q daddnw = denom*q - dadd dinvnw = denom denom = dinv + q*(dadd - daddnw) dadd = daddnw dinv = dinvnw else convs(iconv) = 1 daddnw = denom - dadd dinvnw = denom denom = dinv + dadd - daddnw dadd = daddnw dinv = dinvnw end if 100 continue end ************************************************************************ subroutine CMTMUL(m1, m2, m3, SZ) C Complex single precision matrix multiply (square matrices). integer SZ complex m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ) C Local variables complex sum integer i, j, k C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - do 100 k = 1, SZ do 90 i = 1, SZ sum = (0.0, 0.0) do 80 j = 1, SZ sum = sum + m1(i,j)*m2(j,k) 80 continue m3(i,k) = sum 90 continue 100 continue end ************************************************************************ subroutine CPUTIM(tused) C Return elapsed user CPU time (seconds) since program began execution. C This routine will likely need replacement on other machines. C In particular, if the times are too big or too small for the C FORMAT statement, then they can be scaled by altering this routine. real tused real tarray(2) real etime external etime C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - tused = etime(tarray) tused = tarray(1) end ************************************************************************ subroutine DMTMUL(m1, m2, m3, SZ) C Double precision matrix multiply (square matrices). integer SZ double precision m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ) double precision sum integer i, j, k C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - do 100 k = 1, SZ do 90 i = 1, SZ sum = 0.0D0 do 80 j = 1, SZ sum = sum + m1(i,j)*m2(j,k) 80 continue m3(i,k) = sum 90 continue 100 continue end ************************************************************************ integer function GCD(n1, n2) C Euclidean GCD algorithm (greatest common divisor of n1 and n2). integer n1, n2 C Local variables integer m1, m2 intrinsic ABS, MAX, MIN, MOD C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - m1 = MIN(ABS(n1), ABS(n2)) m2 = MAX(ABS(n1), ABS(n2)) 10 continue if (m1.gt.3) then C Ensure m2 is not zero, to avoid divide by zero. m2 = 1 + MOD(m2-1,m1) m1 = MOD(m1,m2) go to 10 end if C Special codes for m1 = 0, 1, 2, 3. go to (100, 101, 102, 103), m1+1 stop 'GCD - m1 out of range' 100 continue GCD = m2 return 101 continue GCD = 1 return 102 continue GCD = 2 - MOD(m2,2) return 103 continue if (MOD(m2,3).eq.0) then GCD = 3 else GCD = 1 end if return end ************************************************************************ subroutine IMTMUL(m1, m2, m3, SZ) C Integer matrix multiply (square matrices). integer SZ, m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ) integer i, j, k, sum C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - do 100 k = 1, SZ do 90 i = 1, SZ sum = 0 do 80 j = 1, SZ sum = sum + m1(i,j)*m2(j,k) 80 continue m3(i,k) = sum 90 continue 100 continue end ************************************************************************ integer function MODEXP(base, exp, mdls) C Compute base**exp mod mdls, where exp >= 0 and 0 <= base < mdls. C Use binary algorithm, starting from the right. integer base, exp, mdls C Local variables integer baslft, explft double precision mdlsnv C Functions referenced intrinsic MOD, INT, DBLE integer m1, m2, MODMUL C Statement function MODMUL computes m1*m2 mod mdls. As written, C it allows the operands to be in the open interval (-n, n), C with the result guaranteed to be there. The estimated quotient, C INT(DBLE(m1)*DBLE(m2)*mdlsnv), should be either the true C integer quotient m1*m2/mdls, or 1 (in absolute value) too large, C because the double precision value DBLE(m1)*DBLE(m2)*mdlsnv will C be an upper bound for the rational number m1*m2/mdls C (since mdlsnv is biased away from 0), but will be less than the C rational numberm1*m2/mdls + 1 (if floating point is accurate enough). C C Users of this benchmark are allowed to replace this definition C by another one which performs better in their environment, C or to change the initialization of mdlsnv (below). C Be aware that mdls is barely below 2**31, so m1*m2 will need 63 bits. C If single precision floating point arithmetic is accurate to C 35 bits or so, then it may be used when estimating the quotient. C If the compiler and machine support the 64-bit product of two C 32-bit operands, and the 32-bit quotient and remainder when dividing C 64 bits by 32 bits, then that feature may be used. MODMUL(m1,m2) = m1*m2 - mdls*INT(DBLE(m1)*DBLE(m2)*mdlsnv) C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - C Get estimate for 1.0/mdls, biased away from 0. mdlsnv = 1.0D0/DBLE(mdls) mdlsnv = mdlsnv*(1.0D0 + 0.1D0*mdlsnv) if (exp.lt.0) then stop 'MODEXP - exponent negative' end if MODEXP = 1 explft = exp baslft = base if (explft.eq.0) return 10 continue if (MOD(explft, 2).ne.0) then MODEXP = MODMUL(MODEXP, baslft) explft = explft - 1 if (explft.eq.0) then if (MODEXP.lt.0) MODEXP = MODEXP + mdls return end if end if explft = explft/2 baslft = MODMUL(baslft, baslft) go to 10 end ************************************************************************ subroutine PMTMUL(m1, m2, m3, SZ, mdls) C Matrix multiply (square matrices, entries modulo a prime number). integer SZ, m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ), mdls integer i, j, k, mdsq, sum intrinsic MOD C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - mdsq = mdls**2 do 100 k = 1, SZ do 90 i = 1, SZ sum = 0 do 80 j = 1, SZ sum = sum + m1(i,j)*m2(j,k) if (sum .ge. mdsq) sum = sum - mdsq 80 continue m3(i,k) = MOD(sum, mdls) 90 continue 100 continue end ************************************************************************ logical function PRIME(n) C Check whether an integer n is prime. C Test for divisibility by 2, 3, and by 6k +- 1, for k = 1, 2, ... integer n C Local variables integer j intrinsic MOD, NINT, REAL, SQRT C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - if (n.lt.5) then PRIME = n.eq.2 .or. n.eq.3 else if (MOD(n,6).ne.1 .and. MOD(n,6).ne.5) then PRIME = .FALSE. else do 100 j = 5, NINT(SQRT(REAL(n))), 6 if (MOD(n,j).eq.0 .or. MOD(n,j+2).eq.0) then PRIME = .FALSE. return end if 100 continue PRIME = .TRUE. end if end ************************************************************************ subroutine SMTMUL(m1, m2, m3, SZ) C Single precision matrix multiply (square matrices). integer SZ real m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ) real sum integer i, j, k C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - do 100 k = 1, SZ do 90 i = 1, SZ sum = 0.0 do 80 j = 1, SZ sum = sum + m1(i,j)*m2(j,k) 80 continue m3(i,k) = sum 90 continue 100 continue end -------- Peter Montgomery pmontgom@MATH.UCLA.EDU Department of Mathematics, UCLA, Los Angeles, CA 90024
mark@mips.COM (Mark G. Johnson) (01/09/90)
Email failed me; therefore ...
Here are the results of running Peter Montgomery's "program bchmark" for
Integer Multiply and Divide, on a MIPS RC3240. This is a deskside machine
in a PC-sized cabinet (23" x 7" x 18"); essentially it's an upgrade of the
earlier M/120 machine.
I didn't change the source code; just took it as-is and applied the naive
command line "f77 -O montgomery.f". Running the executable file a.out gave:
MIPS RC3240 compiled with f77 -O
Gener Dble Sngl Cmplx Intgr Modlr TrlDiv PrbPrm GCD BinDec Cfrac Geo mean
0.08 0.26 0.23 0.34 0.39 0.46 2.84 0.10 0.63 0.60 0.46 0.36 sec
{This falls between the single-processor Alliant with vectorization and
optimization, and the 6-processor Alliant with V&O, in Geo mean performance.}
For those who desire to bicker about the benchmark's representative-ness
of "real code", or potential lack thereof, here is a teeny bit of profiling
data. A single line of source, marked below with >>>>, is responsible for
over 1/4 of all cycles consumed by the program (dynamically, on RC3240):
do 100 j = 5, NINT(SQRT(REAL(n))), 6 [575]
>>>> if (MOD(n,j).eq.0 .or. MOD(n,j+2).eq.0) then [576]
PRIME = .FALSE. [577]
return [578]
end if [579]
100 continue [580]
Here's the profiler output {for an RC3240, compiled with f77 -O):
procedure (file) line bytes cycles %
-----------------------------------------------------------------------
prime_ (montgomery.f) 576 108 15677188 25.67
... etc ...
Apologies in advance if I did the measurement incorrectly.,
--
-- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
(408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}
tihor@acf4.NYU.EDU (Stephen Tihor) (01/10/90)
Curiously I ran this on a VMS VAX under VAX FORTRAN with full error checking enabled and optimizations disabled and got a reasonable integer overflow in the inner loop of INTMUL multiplying IMTMUL\I: 1 IMTMUL\J: 1 IMTMUL\K: 1 IMTMUL\M1(1,1): 235517118 IMTMUL\M2(1,1): 891310209 Those do look large for a 32 integer product. Did I mess something up here?
tihor@acf4.NYU.EDU (Stephen Tihor) (01/10/90)
Curious to try certain VAX and DECsystem copiler options I ran this on a VMS VAX under VAX FORTRAN with full error checking enabled and optimizations disabled and got a reasonable integer overflow in the inner loop of INTMUL multiplying IMTMUL\I: 1 IMTMUL\J: 1 IMTMUL\K: 1 IMTMUL\M1(1,1): 235517118 IMTMUL\M2(1,1): 891310209 Those do look large for a 32 integer product. Did I mess something up here?