[comp.arch] New benchmark

rick@uicsg.UUCP (02/15/87)

New Benchmark

A new benchmark has been written to overcome some of the deficiencies of
Dhrystone.  This benchmark, Dhampstone (written in C), was designed primarily
for evaluation of multiple-register-set architectures (such as RISC), but is
useful for general types of architectures as well.

The benchmark is a modification of Dhrystone designed to incorporate several
important statistics that were not used in the design of Dhrystone.  For
multiple-register-set architectures, the number of local variables and
number of parameters are very important.  Statistics for these distributions
were included.  Running Dhrystone on RISC would cause no overflow of the
register file; the depth of nesting is too shallow.  Dhampstone has overflow
characteristics which are derived from measurements of several C programs
(nroff, pi, sort, tbl, ccom, etc.).

Other statistics useful for evaluating general architectures were also
included.  The distribution of register declarations was included in
Dhampstone.  Dhampstone also includes a representation of I/O processing
overhead.  No actual I/O is performed but the movement of data to an I/O
buffer is included.  I/O measurements were made in several programs and
Dhampstone has a typical number of bytes of I/O per instruction executed.
Other program statistics of Dhampstone are similar to those in Dhrystone.

Neither synthetic benchmark uses statistics relevant for cache evaluation or
code optimization evaluation.

The code of Dhampstone follows.

Rick Eickemeyer
ihnp4!uiucdcs!uicsg!rick

--------------------------------------------------------------------------

/*
 *				DHAMPSTONE
 *
 *		Synthetic Systems Program Benchmark for
 *		Multiple Register Set Architectures
 *
 *  AUTHORS: Richard J. Eickemeyer and Janak H. Patel
 *		Computer Systems Group
 *		Coordinated Science Laboratory
 *		University of Illinois at Urbana-Champaign
 *		1101 W. Springfield
 *		Urbana, IL 61801
 *
 *  DATE: September 1985
 *
 *  DHAMPSTONE is a synthetic benchmark program designed to match dynamic
 *	program statistics of systems programs.  The program is useful for
 *	evaluating performance of general purpose computers.  In designing
 *	the program, a special emphasis was made in making the program
 *	suitable for evaluating processors with multiple register sets.
 *	Dhampstone, written in C, was created by modifying Dhrystone,
 *	originally written in Ada, as described below.  (Dhrystone is
 *	described in: "Dhrystone: A Synthetic Systems Programming Benchmark"
 *	by R. P. Weicker in _Communications_of_the_ACM_ Vol. 27, pp. 1013-
 *	1030.)  Two copies of a C version of Dhrystone were used as a starting
 *	point.  This allowed greater precision in adjusting the number of
 *	local variables and parameters.  The two versions are the functions
 *	Proc0 and Proc0a, which call each other in the proper sequence to
 *	achieve the desired overflow statistics.  As in Dhrystone, start and
 *	stop marks are given to separate the initialization section from the
 *	part of the program that meets the specified statistics.  Measurements
 *	should include only parts of the program executed between these two
 *	points.  The type names in the original have been preserved.
 *	Dhampstone incorporates statistics on number of parameters and number
 *	of local variables per procedure; multiple register set overflow; I/O;
 *	and register declarations in addition to the statement and operand
 *	type and locality distributions of Dhrystone.
 *
 *  I/O is implemented with function "string_io" using a string of length 22.
 *	No actual I/O is done, only the memory operations necessary for the
 *	I/O.  I/O is to blocks of memory.  The input is the string
 *	"DHAMPSTONE PROG INPUT" repeated 10 times (without intervening blanks
 *	or new-lines)
 *
 *  COMPILE Dhampstone as:
 *			cc -o Dhampstone Dhampstone.c
 *	If it is desired to run the program in a loop, compile as:
 *			cc -DLOOP=n -o Dhampstone Dhampstone.c
 *	with n the number of iterations.
 *	Dhampstone was not designed with statistics relevant for optimization.
 *	Results obtained using an optimizing compiler may not be indicative
 *	of behavior of real programs.
 *
 *	For each execution of both Proc0 and Proc0a all functions are
 *	executed 2 times, unless indicated otherwise.
 */

/* definitions similar to those in stdio.h */
    struct _iobuf {
      int _cnt;
      char *_ptr;
      char *_base;
      int _bufsiz
      };
#   define FILE struct _iobuf
#   define NULL 0
#   define getc(p)	\
    (--(p)->_cnt>=0? *(p)->_ptr++&0377:printf("I/O ERROR\n"))
#   define putc(x,p)	\
    (--(p)->_cnt>=0?((int)(*(p)->_ptr++=(unsigned)(x))):printf("I/O ERROR\n"))
# define FALSE 0
# define TRUE 1
# define SLEN 22			/* I/O string length */
# define IOLEN 210			/* = 10 * (SLEN - 1) */

typedef enum {Id1, Id2, Id3, Id4, Id5} Enumeration;
typedef int OneToThirty, OneToFifty;
typedef char CapitalLetter;
typedef char String30[30];
typedef int Array1DimInteger[50];
typedef int Array2DimInteger[50][50];
typedef int boolean;
typedef struct RecordExample {
    Enumeration Discr;
    struct RecordExample *PtrComp;
    Enumeration EnumComp;
    OneToFifty IntComp;
    String30 StringComp;
    } RecordType, *RecordPtr;
boolean BoolG;
char CharG1, CharG2, CharG3;
Array1DimInteger ArrG1;
Array2DimInteger ArrG2;
String30 StringG1;
RecordPtr PtrG1, PtrG2;
int IntG1, IntG2, IntG3, IntG4;

FILE *File_In, *File_Out;
char input[IOLEN+1] = "DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\
DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\
DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT\
DHAMPSTONE PROG INPUTDHAMPSTONE PROG INPUT";

char output[IOLEN+1];
int Proc0_Flag;	/* indicates if in Proc0 or Proc0a */

Enumeration Func1(), Func1a();
boolean Func2(), Func3();
char *malloc();

main ()	
{
#ifdef LOOP
  int loop;
#endif LOOP
  PtrG2 = (RecordPtr) malloc (sizeof (RecordType));
  PtrG1 = (RecordPtr) malloc (sizeof (RecordType));
  PtrG1->Discr = Id1;
  PtrG1->PtrComp = PtrG2;
  PtrG1->EnumComp = Id3;
  PtrG1->IntComp = 40;
  strcpy (PtrG1->StringComp, "DHAMPSTONE PROGRAM ONE STRING");
  strcpy (StringG1, "DHAMPSTONE PROGRAM 3RD STRING");
  File_In = (FILE *) malloc (sizeof (FILE));
  File_Out = (FILE *) malloc (sizeof (FILE));
  File_In->_ptr = File_In->_base = input;
  File_Out->_ptr = File_Out->_base = output;
  File_In->_cnt = File_Out->_cnt = IOLEN;
  File_In->_bufsiz = File_Out->_bufsiz = IOLEN;
  ArrG2[6][5] = 47;
  ArrG2[10][9] = 32;
  CharG3 = 'C';
  IntG4 = 8;

	/*
	 *------- Start Timer -------
	 */

#ifdef LOOP
  for (loop = 1; loop <= LOOP; loop++) {
#endif LOOP

  Proc0a(2);
  Proc0(9);
  Proc0a(2);
  Proc0a(3);
  Proc0(2);
  Proc0a(2);

#ifdef LOOP
  File_In->_ptr = File_In->_base;
  File_In->_cnt = File_In->_bufsiz;
  File_Out->_ptr = File_Out->_base;
  File_Out->_cnt = File_Out->_bufsiz;
  }
#endif LOOP

	/*
	 *------- Stop  Timer -------*
	 */

  }

Proc0 (IntParI)
register int IntParI;
/* IntParI is a no. to control calling depth (called from main and Proc0a) */
{
  register OneToFifty IntL1, IntL2;
  OneToFifty IntL3;
  char CharIndex;
  Enumeration EnumL;
  String30 StringL1, StringL2;

  if (IntParI >= 2) Proc0a (IntParI - 1);
  Proc0_Flag = 1;
  IntL1 = 2;
  IntL2 = 3;
  string_io (StringL1, SLEN);
  strcpy (StringL2, "DHAMPSTONE PROGRAM 2ND STRING");
  EnumL = Id2;
  BoolG = ! Func2 ('E', &CharG3, 'R', StringL2, StringL1);
  while (IntL1 < IntL2) {	/* body executed once */
    IntL3 = 5 * IntL1 - IntL2;
    Proc7 ();
    IntL1 = IntL1 + 1;
    }
  Proc8 (ArrG1, ArrG2, IntL1, IntL3 - 5);
  Proc1 (PtrG1);
  for (CharIndex = 'A'; CharIndex <= CharG2; CharIndex++)
    if (EnumL == Func1a(Id1, CharIndex, 'Z'))   /* body executed twice */
      Proc6 (&EnumL, Id1, 20, IntL2);	/* not executed */
  IntL3 = IntL2 * IntL3;
  IntL2 = IntL3 / IntL1;
  IntL2 = 7 * (IntL3 - IntL2) - IntL1;
  Proc2 (&IntL3, 'C');
  }

Proc0a (IntParI)
int IntParI;
/* IntParI is a no. to control calling depth (called from main and Proc0a) */
{
  register OneToFifty IntL1, IntL2, IntL3;
  OneToFifty IntL4;
  char CharIndex, CharL1, CharL2;
  Enumeration EnumL1;
  String30 StringL;

  IntL1 = 2;
  IntL2 = 5;
  CharL2 = 'A';
  CharL1 = 'C';
  if (IntParI > 1) Proc0(IntParI - 1);
  Proc0_Flag = 0;
  Proc4 (CharL2);
  strcpy (StringL, "DHAMPSTONE PROGRAM 1ST STRING");
  EnumL1 = Id2;
  BoolG = ! Func2 (CharL1, &CharL2, CharG1, StringG1, StringL);
  while (IntL1 < IntL2) {	/* body executed 3 times */
    if (IntL1 == 2) {
      IntL3 = 5 * IntL1 - IntL2;	/* executed 1 time */
      Proc7a ();
      }
    IntL1 = IntL1 + 1;
    }
  Proc8a (IntL1, ArrG1, ArrG2, IntL3, IntL2, &IntL4);
  Proc1 (PtrG1);
  for (CharIndex = 'A'; CharIndex <= CharG2; CharIndex++)
    if (EnumL1 == Func1a (Id1, CharIndex, 'Z')) {   /* body executed twice */
      EnumL1 = Id3;	/* not executed */
      Proc6 (&EnumL1, Id1, 4, 5);
      }
  IntL2 = IntL4 * IntL3;
  IntL3 = IntL3 / IntL4;
  IntL2 = 7 * (IntL4 - IntL3) - IntL2;
  Proc2 (&IntL4, CharL1); /* CharL1 & CharG3 are both 'C' */
  }

Proc1 (PtrParI)
RecordPtr PtrParI;
/* argument is PtrG1 (called from Proc0 and Proc0a) */
{
# define NextRecord	(*(PtrParI->PtrComp))
  register RecordPtr PtrL;
  register int IntL;

  NextRecord = *PtrG1;
  PtrParI->IntComp = 5;
  NextRecord.IntComp = PtrParI->IntComp;
  PtrL = PtrParI->PtrComp;
  NextRecord.PtrComp = PtrL;
  Proc3 (&NextRecord.PtrComp);
  if (NextRecord.Discr == Id1) {	/* executed */
    Proc6 (&NextRecord.EnumComp, PtrParI->EnumComp, PtrParI->IntComp, 10);
    NextRecord.IntComp = 6;
    Proc7 ();
    NextRecord.PtrComp = PtrG1->PtrComp;
    }
  else {
    *PtrParI = NextRecord;	/* not executed */
    if (PtrL == NULL) IntL = 10;
    else {
      IntL = 1;
      PtrParI->IntComp = IntL;
      }
    }
# undef NextRecord
  }

Proc2 (IntParIO, CharParI)
register OneToFifty *IntParIO;
char CharParI;
/* arguments are 21 and 'C' (called from Proc0). Returns 50 as IntParIO
   arguments are 7 and 'C' (called from Proc0a). Returns 22 as IntParIO */
{
  register OneToFifty IntL1, IntL2, IntL3;
  Enumeration EnumL1, EnumL2;

  IntL2 = *IntParIO + 10;
  do	/* body executed once */
    if (CharG3 == CharParI) {
      IntL3 = IntL2 - 2;	/* executed */
      *IntParIO = *IntParIO + IntL3 - IntG3;
      EnumL2 = Id1;
      }
    else {
      IntL1 = IntL2 + 4;	/* not executed */
      *IntParIO = IntL1 - IntG3;
      EnumL1 = Id1;
      EnumL2 = EnumL1;
      }
  while (EnumL2 != Id1);
  }

Proc3 (PtrParO)
RecordPtr *PtrParO;
/* (called from Proc1). Returns address of PtrG2 as PtrParO */
{
  if (PtrG1 != NULL)
    *PtrParO = PtrG1->PtrComp;	/* executed */
  else IntG1 = 100;		/* not executed */
  Proc7a ();
  }

Proc4 (CharParI)
char CharParI;
/* argument is 'C' (called from Proc8--from Proc0)
   argument is 'A' (called from Proc0a) */
{
  register boolean BoolL;

  if (Proc0_Flag) Proc5 ();
  BoolL = CharG1 == CharParI;
  BoolL = BoolL || BoolG;
  CharG2 = 'B';
  }

Proc5 ()
{
/* executed 3 times per Proc0--Proc0a pair */
  CharG1 = 'A';
  BoolG = FALSE;
  }

Proc6 (EnumParO, EnumParI, IntParI1, IntParI2)
Enumeration *EnumParO;
Enumeration EnumParI;
OneToFifty IntParI1, IntParI2;
/* arguments are address, Id3, 5, 10 (called from Proc1).
  Returns Id2 for EnumParO */
{
  if (IntParI1 > 10) IntG1 = 4;	/* not executed */
  else if (! Func3(EnumParI)) *EnumParO = Id4;	/* not executed */
  switch (EnumParI) {
    case Id1 :
      *EnumParO = Id1;
      IntG2 = IntParI2 + 2;
      IntG3 = IntParI1 + IntParI2;
      break;
    case Id2 :
      if (IntG3 > IntParI1) *EnumParO = Id1;
      else *EnumParO = Id4;
      break;
    case Id3 :
      *EnumParO = Id2;	/* executed */
      break;
    case Id4 : break;
    case Id5 :
      *EnumParO = Id3;
      break;
    }
  }

Proc7 ()
{
/* executed 3 times per Proc0--Proc0a pair */
  register OneToFifty IntL;

  IntL = IntG1 + 2;
  IntG2 = IntG1 + IntL;
  }

Proc7a ()
{
/* executed 3 times per Proc0--Proc0a pair */
  if (! Proc0_Flag) Proc5 ();
  if (IntG4 > 4) IntG2 = IntG3 + 2;	/* executed */
  else IntG2 = IntG3 + 1;	/* not executed */
  }

Proc8 (ArrParIO1, ArrParIO2, IntParI1, IntParI2)
Array1DimInteger ArrParIO1;
Array2DimInteger ArrParIO2;
register int IntParI1;
int IntParI2;
/* arguments are 2 arrays, 3, 2 (called from Proc0) */
{
/* executed 1 time per Proc0--Proc0a pair */
  OneToFifty IntL1;
  register OneToFifty IntL2;
  OneToFifty IntL3;
  register OneToFifty IntIndex;
  OneToFifty IntL4, IntL5;

  Proc4 ('C');
  IntL1 = IntParI1 + 5;
  IntL2 = IntL1 + 1;
  if (IntParI2 == 2) {
    IntL5 = IntL1;	/* executed */
    IntL4 = IntL1 + 1 - IntParI1;
    }
  else {
    IntL3 = IntL2;	/* not executed */
    IntL5 = IntL3 + 3;
    IntL4 = IntL3 + 2 - IntParI1;
    }
  ArrParIO1[IntL5] = IntParI2;
  ArrParIO1[IntL1 + 1] = ArrParIO1[IntL1];
  ArrParIO1[IntL1 + 30] = IntL4;
  for (IntIndex = IntL1; IntIndex <= IntL2; IntIndex++)
    ArrParIO2[IntL2][IntIndex] = IntParI1;	/* body executed twice */
  ArrParIO2[IntL4][IntL2 - 1] = ArrParIO2[IntL4][IntL4 - 1] + 1;
  ArrParIO2[IntL5 + 20][IntL5] = ArrParIO1[IntL1];
  if (IntParI2 != 2) IntL3 = IntL3 + 4; /* not executed */
  IntG1 = 5;
  }

Proc8a (IntParI1, ArrParIO1, ArrParIO2, IntParI2, IntParI3, IntParO)
OneToFifty IntParI1;
register OneToFifty IntParI2;
OneToFifty IntParI3, *IntParO;
Array1DimInteger ArrParIO1;
Array2DimInteger ArrParIO2;
/* arguments are 5, two arrays, 5, 5, address (called from Proc0a).
   Returns 7 for IntParO */
{
/* executed 1 time per Proc0--Proc0a pair */
  register OneToFifty IntL1;
  OneToFifty IntL2;
  register OneToFifty IntIndex;
  OneToFifty IntL3, IntL4;

  IntL2 = IntParI1 + 5;
  IntL1 = IntL2 + 1;
  if (IntParI3 <= 23) IntL3 = IntL2;	/* executed */
  else {
    IntL4 = IntL2;	/* not executed */
    IntL3 = IntL4 + 3;
    }
  ArrParIO1[IntL3] = IntParI2;
  ArrParIO1[IntL2 + 1] = ArrParIO1[IntL2];
  ArrParIO1[IntL3 + 30] = IntParI3;
  for (IntIndex = IntL2; IntIndex <= IntL1; IntIndex++)
    ArrParIO2[IntL1][IntIndex] = IntParI2;	/* body executed twice */
  ArrParIO2[IntL1][IntL2 - 1] = ArrParIO2[IntL3][IntL2 - 1] + 1;
  if (IntParI3 > 17) IntL4 = IntL4 + 4;	/* not executed */
  else  ArrParIO2[IntL3 + 20][IntL3] = ArrParIO1[IntL2];/* executed */
  *IntParO = IntParI2 + IntParI3 + 2 - IntParI1;
  }

Enumeration Func1 (CharParI1, CharParI2)
CapitalLetter CharParI1, CharParI2;
/* arguments are 'M', 'M' (called from Func2). Returns Id1 */
{
/* executed 2 times per Proc0--Proc0a pair */
  CapitalLetter CharL1, CharL2;
  Enumeration EnumL;

  CharL1 = CharParI1;
  CharL2 = CharL1;
  if (CharL2 != CharParI2) {
    EnumL = Id2;	/* not executed */
    return (EnumL);
    }
  else return (Id1);	/* executed */
  }

Enumeration Func1a (EnumParI, CharParI1, CharParI2)
Enumeration EnumParI;
CapitalLetter CharParI1, CharParI2;
/* arguments are Id1, 'A' or 'B', 'Z' (called from Proc0 and Proc0a)
   Returns Id1 */
{
/* executed 4 times per Proc0--Proc0a pair */
  CapitalLetter CharL1, CharL2;

  CharL1 = CharParI1;
  CharL2 = CharL1;
  if (CharL2 != CharParI2) return (EnumParI);	/* executed */
  else return (Id3);	/* not executed */
  }

boolean Func2 (CharParI1, CharParIO, CharParI2, StringParI1, StringParI2)
char CharParI1;
register char *CharParIO;
char CharParI2;
String30 StringParI1, StringParI2;
/* arguments are 'E', 'C', 'R', 2 strings (called from Proc0)
   arguments are 'C', 'C', 'A', 2 strings (called from Proc0a)
   Returns FALSE, CharParIO is not changed */
{
  register OneToThirty IntL1, IntL2;
  CapitalLetter CharL1, CharL2;

  IntL1 = 2;
  CharG1 = StringParI1[IntL1];
  while (IntL1 <= 2)	/* body executed once */
    if (Func1 (StringParI2[IntL1 + 1], 'M') == Id1) {
      CharL1 = 'A';	/* executed */
      IntL1 = IntL1 + 1;
      }
  if (CharL1 >= 'W' && CharL1 < 'Z') {
    CharL2 = CharParI1;	/* not executed */
    IntL2 = 3;
    }
  if (CharL1 == 'X') {
    CharL2 = *CharParIO;	/* not executed */
    *CharParIO = 'J';
    return TRUE;
    }
  else if (strcmp (StringParI2, StringParI1) > 0) {
    IntL2 = IntG3;	/* not executed */
    IntL2 = IntL1 + IntL2;
    CharL2 = CharParI2;
    CharL1 = CharL2;
    return TRUE;
    }
  else return FALSE;	/* executed */
  }

boolean Func3 (EnumParI)
Enumeration EnumParI;
/* argument is Id3 (called from Proc6). Returns TRUE */
{
  Enumeration EnumL;

  EnumL = EnumParI;
  if (EnumL == Id3) return TRUE;	/* executed */
  return FALSE;	/* not executed */
  }

string_io(string, length)
register char *string;
int length;
/* arguments are string, SLEN (called from Proc0) */
{
/* executed 1 time per Proc0--Proc0a pair */
  register int index;

  --length;
  for (index = 0; index < length; index++) {
    string[index] = getc(File_In);	/* body executed 21 times */
    putc(string[index],File_Out);
    }
  string[length] = '\0';
  }

pmontgom@oak.math.ucla.edu (Peter Montgomery) (01/07/90)

In article <4057@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>
>So what's your favorite application that does lots of
>integer multiplies and divides?  Code it up and pass it out
>for people to run.
>
	During the recent controversy about the SPARC and integer 
multiply/divide, multiple individuals have requested benchmarks 
which compare machine performances on applications needing these
operations.  In another posting is my response to that challenge, 
written primarily in FORTRAN 77.  It times eleven algorithms 
(some integer, some floating point, but is biased towards ones needing 
integer multiplication and division).  A machine's rating is defined 
to be the geometric mean of the times on these algorithms.
So far I have run it on the SUN 3/50, SUN 3/280, SPARC, and Alliant.
Here are the outputs (each row corresponds to a machine and compiler;
each column corresponds to an algorithm):
---------------------------------------------------------------------------
Gener  = Generate matrices (integer arithmetic, trigonometry)
Dble   = Double precision floating point matrix multiplication
Sngl   = Single precision floating point matrix multiplication
Cmplx  = Complex (single precision) matrix multiplication
Intgr  = Integer matrix multiplication
Modlr  = Matrix multiplication modulo a prime number
TrlDiv = Primality test, using trial division through square root
PrbPrm = Probable prime test, using Fermat's Little Theorem
GCD    = Euclidean GCD algorithm
BinDec = Binary-to-Decimal conversion
Cfrac  = Continued fraction expansion of quadratic irrational
Geo mean = Geometric mean of above times

SUN 3/50 with 68881, optimized (f77 -O -f68881)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.86  3.76  3.54 43.70  1.56  1.86  12.56   1.48  3.30   4.82  2.60  3.55 sec
SUN 3/280 with 68881, optimized (f77 -O -f68881)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.56  2.84  2.56 30.16  0.94  1.04   7.44   1.12  1.96   2.38  1.54  2.24 sec
SUN 3/280 with inline 68881, optimized (f77 -O -f68881 /usr/lib/f68881.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.48  2.82  2.54 18.54  0.86  1.10   7.58   1.16  1.94   2.38  1.54  2.11 sec
SUN 3/280 with inline FPA, optimized (f77 -O -ffpa /usr/lib/ffpa.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.34  0.66  0.62  4.66  0.90  1.06   7.38   0.38  1.92   2.32  1.54  1.25 sec
SPARC station optimized (f77 -O)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.30  0.38  0.25  3.25  1.06  0.86  17.79   0.44  1.24   1.56  1.07  1.03 sec
SPARC station optimized, inline (f77 -O /usr/lib/libm.il)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.37  0.35  0.27  2.00  1.05  0.87  17.76   0.44  1.24   1.56  1.07  1.00 sec
Alliant FX/80, one processor, optimized, unvectorized (fortran -Og)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.09  0.91  0.85  1.47  0.86  1.21   8.14   0.37  1.90   1.14  1.72  1.02 sec
Alliant FX/80, one processor, optimized, vectorized (fortran -Ovg -AS)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.11  0.18  0.16  0.58  0.20  0.37   8.04   0.37  1.84   1.07  1.44  0.54 sec
Alliant FX/80, six processors, optimized, vectorized (fortran -O -AS)
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.02  0.04  0.03  0.11  0.04  0.07   8.02   0.37  1.84   1.07  1.38  0.22 sec

-------------------------------------------------------------------------------

Comments about results:

	The SUN 3s have OS version 4.2.  The SPARCs have SUN OS 4.0.
The Alliant has Concentrix version 5.0 and FX/Fortran version V4.2.40.
The Alliant has eight advanced computational elements (ACEs),
configured into one cluster of six and two detached processors.

	The timings were made during the daytime, and may be affected 
by the load averages.  In particular, our Alliant is rarely idle.

	The entries are approximately in decreasing order by geometric 
mean time.  Geometric mean is fairer than arithmetic mean, because the 
relative order of the geometric means will remain unchanged if one 
application is run twice as far (and takes twice as long on all machines), 
whereas the relative order of the arithmetic means may change.

	When we compare the SUN 3/50 and SUN 3/280 with 68881 
(first two entries), the improvements for the eleven timings are rather
uniform: each takes 50%-75% as long on the 3/280 as on the 3/50.  
The use of an inline expansion template (/usr/lib/f68881.il)
speeds the complex matrix multiplication by 30% and helps the
trigonometric calculations (Gener), while making little change elsewhere.
Using the FPA on the 280 speeds the non-trigonometric floating point 
computations by a factor of 4, while leaving integer times almost unchanged 
(N.B. PrbPrm uses floating point arithmetic to estimate an integer quotient).

	Next compare the SPARC station to the SUN 3/280 with FPA.
The floating point times are further improved, as are the Euclidean GCD,
binary-to-decimal conversion, and continued fraction times.
The integer matrix multiplication is 20% worse but the modular matrix
multiplication is 20% faster than on the 3/280.
The primality test which uses repeated division (TrlDiv) is much worse 
on the SPARC, taking 30% longer than it did on the SUN 3/50.  

	On the SPARC, modular matrix multiplication is faster than 
integer matrix multiplication even though the former inner loop is more 
complicated (it has an IF); this is presumably because the entries in 
the integer matrix are pseudorandom values modulo 2**32 whereas
those in the modular matrices require only 15 bits.
Both the TrlDiv and GCD codes have two remainder operations in the 
inner loop; the TrlDiv/GCD ratio is between 3.5 and 4.5 on SUN 3 and 
Alliant, but is 14 on a SPARC.  A possible explanation is the 
high frequency of quotients of 1 and 2 (59%) in the Euclidean 
algorithm [Cfrac also has a high frequency of low quotients].

	As on the SUN 3, the use of an inline expansion template affects
primarily the SPARC time for complex multiplication (the trigonometric 
times get worse rather than better).  In another comp.arch posting, 
Bo Thide' reports a 6-fold speedup in integer multiply time
using /usr/lib/libm.il, but .mul, .div, and .rem are missing from 
my /usr/lib/libm.il (which is dated 88/02/08).

        The unvectorized Alliant (with one processor) is faster than 
the SPARC on six of the eleven tests, and is slower on five tests;
the overall rating is too close to call.
The unvectorized Alliant beats the SUN 3/280 on three tests, loses
on five, with three (Intgr, PrbPrm, GCD) too close to call; the Alliant
beats the SUN 3/280 overall.

	The Alliant has an -autoinline option to insert user procedures 
inline.  Its use had minimal effect on these tests.  When -autoinline 
was used with multiple processors on a draft version of this
benchmark, the program aborted with arithmetic exception.

	It is interesting to compare the ratios Sngl/Intgr for
real vs. integer matrix multiplication on various machines.  
All matrix entries have the same size, and the indexing is identical, 
so the memory subsystem should be equally fast with both, 
and the compiler should be equally good at optimizing the loops 
(e.g., assigning variables to registers).  On the SUN 3s with 68881, 
the Sngl/Intgr ratio is between 2 and 3, with integer being faster.  
On the 3/280 with FPA, the ratio is 0.7, with real being faster.
On the SPARC, the ratio is about 0.25, with real being faster.
On the Alliant, the ratio is about 0.9, with real being slightly faster.

	Another interesting ratio Cmplx/Sngl.  A complex multiplication 
uses 4 real multiplication and 2 real additions; a complex addition uses 
2 real additions.  The inner loop of the complex matrix multiplication 
will therefore have 4 real multiplications and 4 real additions, 
whereas the single precision matrix multiplication has one of each.
The ratio Cmplx/Sngl should be less than 4, since loop overhead, 
indexing, and memory references are not four times as complicated; 
also there are many opportunities for parallelism and pipelining. 
The SUN 3 and SPARC ratios are between 7 and 8 with inline expansion 
templates, and between 11 and 13 without.  The Alliant ratios 
(1.6 to 1.9 unvectorized, 3 to 4 vectorized) seem more reasonable.

	All of these machines have hardware double precision arithmetic,
so the Dble/Sngl ratios are all between 1.0 and 1.5.  In many cases,
the penalty for double precision is very slight.

	The ratio of worst-to-best single-processor time is 70
for complex matrix multiplication and about 20 for single and double 
precision matrix multiplication.  On the other hand, this ratio is 
between 2 and 3 for TrlDiv, Cfrac and GCD.

	Some comments on the object code of these compilers:

	(1) The code printed by the Alliant compiler is much
	    easier to read than the others.  For example, it
	    retains the user variable names.  It also prints the 
	    hexadecimal equivalent of the object code, which is 
	    unimportant here but useful on other occasions.

	(2) All recognizes the common subexpression in
	    MOD(n,6).ne.1 .and. MOD(n,6).ne.5 
	    within TrlDiv, computing the remainder only once.

	(3) In GCD, we need both MIN(ABS(n1), ABS(n2)) and
	    MAX(ABS(n1), ABS(n2)).  None of these compilers
	    detects that MIN and MAX have identical arguments
	    (although all compiled ABS, MIN, and MAX inline).
	    Indeed, the Alliant computes ABS(n1) and ABS(n2) twice.

	(4) All generate inline code for statement function 
	    MODMUL within PrbPrm (function MODEXP). 

	(5) The lines  string(place:place) = DIGITS(MOD(left,10)) 
            and string(place:place) = DIGITS(left) in BINDEC transfer
            a single character.  The Alliant compiler recognizes these 
            and generates movb instructions.
	    The SUN 3 and SPARC compilers generate a subroutine call.
	    The Alliant also compiles the LEN (character string length)
	    intrinsic function inline; SUN 3 and SPARC do not.

	(6) Subroutine BINDEC needs both the quotient left/10
	    and the remainder MOD(left,10).  The SUN 3 compiler 
	    generates two divide instructions (one for quotient, 
	    one for remainder), even though the hardware makes 
	    both available at once.  The SPARC calls both .div and .rem .
	    The Alliant (which is also MC68020-based) generates only
	    one divide instruction, but computes the remainder from the 
	    quotient via a multiply and subtract (here and elsewhere).

	(7) The SUN 3 and SPARC compilers generate special code for
	    MOD(n,2) and n/2 within MODEXP.  The Alliant uses the same
	    instructions as for other divisors, except to optimize
	    the multiply by 2 into an add when computing the remainder.
	    None recognize that the MOD(n,2).ne.0 test (i.e., the ODD
	    function of PASCAL) requires testing only the bottom bit of n,
	    whether n is positive or negative (on a twos' complement machine).  
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024

pmontgom@oak.math.ucla.edu (01/07/90)

        program bchmrk
C
C               This benchmark times eleven codes, many of 
C       which require integer multiplication and/or division.
C       Integer addition, subtraction, and multiplication are
C       assumed not to abort on overflow, but rather to return 
C       the correct result modulo 2**32 or other high power of 2.
C       The benchmark includes the following:
       
C       (1)     Multiplication of square matrices.  Matrix entries
C               can be real, double precision, complex, integer,
C               or integers modulo a prime number.
C       (2)     Check whether an integer is prime, using trial division
C               by integers up to the square root of the argument.
C       (3)     Modular exponentiation.  This routine requires a function 
C               MODMUL(m1,m2) which computes the product m1*m2 modulo 
C               local variable mdls, but where m1*m2 may overflow.
C       (4)     Euclidean GCD (greatest common divisor) algorithm.
C       (5)     Binary-to-decimal conversion.
C       (6)     Continued fraction expansion of quadratic irrational.

C               An installation is allowed to change the following
C       parts of the source code when compiling the benchmark:

C       (1)     Subroutine CPUTIM (return elapsed user CPU time).
C       (2)     Statement function MODMUL within MODEXP 
C               (may be optimized to exploit machine properties,
C               or be replaced by an assembly language subroutine).
C       (3)     Convert entire source to upper case.

C               If this source code is modified in any other way,
C       please identify the modifications when reporting the time 
C       on your system.  An optimizing compiler is allowed.  
C       If multiple CPUs are used, also report the times with a single CPU.

C               The program reads no input data.   The memory required
C       is approximately 300000 bytes (decimal), if each real or
C       integer variable occupies 4 bytes.

C               Benchmark written by 

C                       Peter L. Montgomery
C                       Department of Mathematics
C                       University of California
C                       Los Angeles, CA 90024   USA
C
C                       pmontgom@math.ucla.edu
C                       January, 1990

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

C               Constants

        integer NCONV, PMODLS, PRMHGH, PRMLOW, SZ
        parameter (NCONV = 2000)
        parameter (PMODLS = 30011)
        parameter (PRMLOW = 2*10**9, PRMHGH = PRMLOW + 2000)
        parameter (SZ = 40)

C               Local arrays

        logical prm(0 : PRMHGH - PRMLOW)
        complex c1(SZ,SZ), c2(SZ,SZ), c3(SZ,SZ)
        complex ctmp(SZ,SZ), cp1(SZ,SZ), cp2(SZ,SZ)
        double precision d1(SZ,SZ), d2(SZ,SZ), d3(SZ,SZ)
        double precision dtmp(SZ,SZ), dp1(SZ,SZ), dp2(SZ,SZ)
        real s1(SZ,SZ), s2(SZ,SZ), s3(SZ,SZ)
        real stmp(SZ,SZ), sp1(SZ,SZ), sp2(SZ,SZ)
        integer convs(NCONV)
        integer i1(SZ,SZ), i2(SZ,SZ), i3(SZ,SZ)
        integer itmp(SZ,SZ), ip1(SZ,SZ), ip2(SZ,SZ)
        integer p1(SZ,SZ), p2(SZ,SZ), p3(SZ,SZ)
        integer ptmp(SZ,SZ), pp1(SZ,SZ), pp2(SZ,SZ)
        equivalence(prm, ctmp, dtmp, stmp, convs, itmp, ptmp)

C               Other local variables

        integer i, ierr, j, nprime, p, perr, rem, sumcnv, sumgcd
        real cerr, serr
        real t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13
        real tc, tcfrac, td, tdcml, tgcd, tgen, ti
        real tovral, tp, tprbpr, tprime, ts
        double precision derr
        character*10 string

C               Functions referenced

        intrinsic ABS, CMPLX, COS, DBLE, MAX, REAL, SIN
        logical PRIME
        integer GCD, MODEXP
        external GCD, MODEXP, PRIME

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

C               Select test matrices (not intended to be meaningful).
C               i1, i2, and i3 are modulo 2**32 or other power of 2.
C               If this causes trouble on your system, use smaller
C               values in the assignments and report the change.

        call CPUTIM(t1)
        do 100 j = 1, SZ
            do 90 i = 1, SZ
                d1(i,j) = COS(DBLE(i+j))
                d2(i,j) = SIN(DBLE(i*j))
                d3(i,j) = COS(DBLE(i-j))
                s1(i,j) = SIN(REAL(i))*SIN(REAL(j))
                s2(i,j) = COS(REAL(3*i - j))
                s3(i,j) = SIN(REAL(3*i - j))
                i1(i,j) = 23321*i*j + 4135151*i + 123321*j + 231235325
                i2(i,j) = 231421*i*j + 632153*i + 3215312*j + 887231323 
                i3(i,j) = 98136*i*j + 4532355*i - 2231413*j + 343232231
                p1(i,j) = MOD(23*i**2 + 21434*i*j + 3556*j + 56, PMODLS)
                p2(i,j) = MOD(8453*i*j + 24353*i + 765*j + 3145, PMODLS)
                p3(i,j) = MOD(124*(i-j)**2 + 4354*i + 56*j + 67, PMODLS)
                c1(i,j) = CMPLX(s1(i,j), REAL(d1(i,j)))
                c2(i,j) = CMPLX(s2(i,j), REAL(d2(i,j)))
                c3(i,j) = CMPLX(s3(i,j), REAL(d3(i,j)))
90          continue
100     continue

C               Perform double precision matrix multiplication.
C               Compute both (d1*d2)*d3 and d1*(d2*d3).

        call CPUTIM(t2)
        call DMTMUL(d1,   d2,   dtmp, SZ)
        call DMTMUL(dtmp, d3,   dp1,  SZ)
        call DMTMUL(d2,   d3,   dtmp, SZ)
        call DMTMUL(d1,   dtmp, dp2,  SZ)

C               Perform single-precision (real) matrix multiplication.

        call CPUTIM(t3)
        call SMTMUL(s1,   s2,   stmp, SZ)
        call SMTMUL(stmp, s3,   sp1,  SZ)
        call SMTMUL(s2,   s3,   stmp, SZ)
        call SMTMUL(s1,   stmp, sp2,  SZ)

C               Perform single-precision (complex) matrix multiplication.

        call CPUTIM(t4)
        call CMTMUL(c1,   c2,   ctmp, SZ)
        call CMTMUL(ctmp, c3,   cp1,  SZ)
        call CMTMUL(c2,   c3,   ctmp, SZ)
        call CMTMUL(c1,   ctmp, cp2,  SZ)

C               Perform integer matrix multiplication.

        call CPUTIM(t5)
        call IMTMUL(i1,   i2,   itmp, SZ)
        call IMTMUL(itmp, i3,   ip1,  SZ)
        call IMTMUL(i2,   i3,   itmp, SZ)
        call IMTMUL(i1,   itmp, ip2,  SZ)

C               Perform matrix multiplication modulo PMODLS.

        call CPUTIM(t6)
        call PMTMUL(p1,   p2,   ptmp, SZ, PMODLS)
        call PMTMUL(ptmp, p3,   pp1,  SZ, PMODLS)
        call PMTMUL(p2,   p3,   ptmp, SZ, PMODLS)
        call PMTMUL(p1,   ptmp, pp2,  SZ, PMODLS)

C               Check that products agree
C               (matrix multiplication should be associative).

        call CPUTIM(t7)
        cerr = 0.0
        derr = 0.0D0
        serr = 0.0      
        ierr = 0
        perr = 0
        do 200 j = 1, SZ
            do 190 i = 1, SZ
                cerr = MAX(cerr, ABS(cp1(i,j) - cp2(i,j)))
                derr = MAX(derr, ABS(dp1(i,j) - dp2(i,j)))
                serr = MAX(serr, ABS(sp1(i,j) - sp2(i,j)))
                ierr = MAX(ierr, ABS(ip1(i,j) - ip2(i,j)))
                perr = MAX(perr, ABS(pp1(i,j) - pp2(i,j)))
190         continue
200     continue
        if (ierr.ne.0 .or. perr.ne.0 
     *      .or. derr .ge. 1.0E-12 
     *      .or. serr .ge. 1.0E-5
     *      .or. cerr .ge. 2.0E-4   ) then
            print 210, ierr, perr, derr, serr, cerr
210         format(' ERRORS IN MATRIX MULTIPLY: ', 2I10, 3E10.3)
        end if

        call CPUTIM(t8)

C               Search for primes in the interval [PRMLOW, PRMHGH].

        nprime = 0
        do 300 j = 0, PRMHGH-PRMLOW
            p = j + PRMLOW
            prm(j) = PRIME(p)
            if (prm(j)) nprime = nprime + 1
300     continue

C               Use a probable prime test to verify these findings.
C               Recall, if p is an odd prime, then
C               2**((p-1)/2) == +-1 mod p by Fermat's Theorem.
C               Most integers p passing this test are prime.
C               Counterexamples exist (e.g., 341, 561), but are rare.
C
        call CPUTIM(t9)
        do 400 j = MOD(PRMLOW+1,2), PRMHGH-PRMLOW, 2
            p = j + PRMLOW
            rem = MODEXP(2, (p-1)/2, p)
            if (prm(j) .neqv. (rem.eq.1 .or. rem.eq.p-1)) then
                print *, 'Probable prime test fails for ', p, rem
            end if
400     continue
        call CPUTIM(t10)

        sumgcd = 0
        do 500 i = 10000000, 10000100
            do 490 j = 11111111, 140000000, 500001
                sumgcd = sumgcd + GCD(i,j)
490         continue
500     continue
        if (nprime.ne.100 .or. sumgcd.ne.407895) then
            print 600, nprime, sumgcd
600         format(1x, 'ERROR -- ', i7, ' primes found.',
     *                '  Sum of GCDs is ', i10)
        end if
        call CPUTIM(t11)

C           Do binary-to-decimal conversion.

        do 700 i = 10000, 10000000, 537
            call BINDEC(i, string)
700     continue
        if (string.ne.'   9999811') then
            print *, 'BINDEC Error - Final string is ', string
        end if
        call CPUTIM(t12)

C               Try continued fraction of quadratic irrational.

        sumcnv = 0
        do 800 j = 1000003, 1001003, 10
            call CFRAC(j, convs, NCONV)
            do 790 i = 1, NCONV
                sumcnv = sumcnv + convs(i)
790         continue
800     continue
        if (sumcnv.ne.7563088) then
            print *, ' CFRAC ERROR - Sum of convergents is ', sumcnv
        end if
        call CPUTIM(t13)

C               Compute times used.

        tgen   = t2  - t1
        td     = t3  - t2
        ts     = t4  - t3
        tc     = t5  - t4
        ti     = t6  - t5
        tp     = t7  - t6

        tprime = t9  - t8
        tprbpr = t10 - t9
        tgcd   = t11 - t10
        tdcml  = t12 - t11
        tcfrac = t13 - t12
C Overall rating = Geometric mean of above times.
C (Clock resolution should be such that all times are nonzero).
        tovral = (tgen*td*ts*tc*ti*tp*tprime*tprbpr*tgcd*tdcml*tcfrac)
     *         **(1.0/11.0)

C               Print timings.

        print 1000, tgen, td, ts, tc, ti, tp, tprime, tprbpr, tgcd,
     *  tdcml, tcfrac, tovral
1000    format(1x, 5HGener, 2x,4HDble, 2x,4HSngl, 1x,5HCmplx,
     *             1x,5HIntgr, 1x,5HModlr, 1x,6HTrlDiv,
     *             1x,6HPrbPrm, 3x, 3HGCD, 1x, 6HBinDec,
     *             1x,5HCfrac, 2x,'Geo mean'/
     *         1x, F5.2, 5F6.2, 2F7.2, F6.2, F7.2, 2F6.2, ' sec')
        end
************************************************************************
        subroutine BINDEC(num, string)
C Binary-to-decimal conversion of nonnegative integer.
        character*(*) string
        integer num

C           Local variables

        integer left, place
        character DIGITS(0:9)
        intrinsic LEN, MOD
        data DIGITS/'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'/

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        left = num
        place = LEN(string)

10      continue
        if (left.ge.10) then
            string(place:place) = DIGITS(MOD(left, 10))
            left = left/10
            place = place - 1
            if (place.ne.0) go to 10
            return
        end if

        string(place:place) = DIGITS(left)
        if (place.gt.1) string(1:place-1) = ' '
        end
************************************************************************
        subroutine CFRAC(D, convs, nconv)
C Return the first nconv convergents in the continued fraction
C expansion of SQRT(D), where D > 0 and D is not a perfect square.
        integer D, nconv, convs(1:nconv)

C               Local variables

        integer droot, dadd, daddnw, denom, dinv, dinvnw, iconv, q
        intrinsic INT, REAL, SQRT

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        droot = INT(SQRT(REAL(D) + 0.5))
        dadd = 0
        denom = 1
        dinv = D
        do 100 iconv = 1, nconv
C At this time, it remains to expand X = (SQRT(D) + dadd)/denom.
C We know D = dadd**2 + denom*dinv, where all variables are nonnegative.
C Estimate q = INT(X) using droot = INT(SQRT(D)).  
C Replace X by 1/(X - q) = denom/(SQRT(D) + dadd - denom*q)
C   = (SQRT(D) + denom*q - dadd)/((D - (dadd - denom*q)**2)/denom).
C Since D = dadd**2 + denom*dinv, the denominator is an integer.

C Make special check for case q = 1.
            if (droot + dadd .ge. 2*denom)  then
                q = (droot + dadd)/denom
                convs(iconv) = q

                daddnw = denom*q - dadd
                dinvnw = denom
                denom  = dinv + q*(dadd - daddnw)
                dadd   = daddnw
                dinv   = dinvnw
            else
                convs(iconv) = 1
                daddnw = denom - dadd
                dinvnw = denom
                denom  = dinv + dadd - daddnw
                dadd   = daddnw
                dinv   = dinvnw  
            end if
100     continue
        end
************************************************************************
        subroutine CMTMUL(m1, m2, m3, SZ)
C  Complex single precision matrix multiply (square matrices).
        integer SZ
        complex m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ)

C               Local variables

        complex sum
        integer i, j, k

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        do 100 k = 1, SZ
            do 90 i = 1, SZ
                sum = (0.0, 0.0)
                do 80 j = 1, SZ
                    sum = sum + m1(i,j)*m2(j,k)
80              continue
                m3(i,k) = sum
90          continue
100     continue
        end
************************************************************************
        subroutine CPUTIM(tused)
C Return elapsed user CPU time (seconds) since program began execution.
C This routine will likely need replacement on other machines.
C In particular, if the times are too big or too small for the 
C FORMAT statement, then they can be scaled by altering this routine.
        real tused
        real tarray(2)
        real etime
        external etime

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        tused = etime(tarray)
        tused = tarray(1)
        end
************************************************************************
        subroutine DMTMUL(m1, m2, m3, SZ)
C  Double precision matrix multiply (square matrices).
        integer SZ
        double precision m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ)
        double precision sum
        integer i, j, k

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        do 100 k = 1, SZ
            do 90 i = 1, SZ
                sum = 0.0D0
                do 80 j = 1, SZ
                    sum = sum + m1(i,j)*m2(j,k)
80              continue
                m3(i,k) = sum
90          continue
100     continue
        end
************************************************************************
        integer function GCD(n1, n2)
C Euclidean GCD algorithm (greatest common divisor of n1 and n2).
        integer n1, n2

C               Local variables

        integer m1, m2
        intrinsic ABS, MAX, MIN, MOD

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        m1 = MIN(ABS(n1), ABS(n2))
        m2 = MAX(ABS(n1), ABS(n2))
10      continue
        if (m1.gt.3) then
C Ensure m2 is not zero, to avoid divide by zero.
            m2 = 1 + MOD(m2-1,m1)
            m1 = MOD(m1,m2)
            go to 10
        end if
C Special codes for m1 = 0, 1, 2, 3.
        go to (100, 101, 102, 103), m1+1
        stop 'GCD - m1 out of range' 
100     continue
            GCD = m2
            return
101     continue
            GCD = 1
            return
102     continue
            GCD = 2 - MOD(m2,2)
            return 
103     continue
            if (MOD(m2,3).eq.0) then
                GCD = 3
            else
                GCD = 1
            end if
            return
        end
************************************************************************
        subroutine IMTMUL(m1, m2, m3, SZ)
C Integer matrix multiply (square matrices).
        integer SZ, m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ)
        integer i, j, k, sum

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        do 100 k = 1, SZ
            do 90 i = 1, SZ
                sum = 0
                do 80 j = 1, SZ
                    sum = sum + m1(i,j)*m2(j,k)
80              continue
                m3(i,k) = sum
90          continue
100     continue
        end
************************************************************************
        integer function MODEXP(base, exp, mdls)
C Compute base**exp mod mdls, where exp >= 0 and 0 <= base < mdls.
C Use binary algorithm, starting from the right.
        integer base, exp, mdls

C               Local variables

        integer baslft, explft
        double precision mdlsnv

C               Functions referenced

        intrinsic MOD, INT, DBLE
        integer m1, m2, MODMUL
C Statement function MODMUL computes m1*m2 mod mdls.  As written,
C it allows the operands to be in the open interval (-n, n),
C with the result guaranteed to be there.  The estimated quotient,
C INT(DBLE(m1)*DBLE(m2)*mdlsnv), should be either the true
C integer quotient m1*m2/mdls, or 1 (in absolute value) too large,
C because the double precision value DBLE(m1)*DBLE(m2)*mdlsnv will
C be an upper bound for the rational number m1*m2/mdls 
C (since mdlsnv is biased away from 0), but will be less than the
C rational numberm1*m2/mdls + 1 (if floating point is accurate enough).
C
C Users of this benchmark are allowed to replace this definition 
C by another one which performs better in their environment,
C or to change the initialization of mdlsnv (below).  
C Be aware that mdls is barely below 2**31, so m1*m2 will need 63 bits.
C If single precision floating point arithmetic is accurate to 
C 35 bits or so, then it may be used when estimating the quotient.
C If the compiler and machine support the 64-bit product of two
C 32-bit operands, and the 32-bit quotient and remainder when dividing
C 64 bits by 32 bits, then that feature may be used.
        MODMUL(m1,m2) = m1*m2 - mdls*INT(DBLE(m1)*DBLE(m2)*mdlsnv)

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

C Get estimate for 1.0/mdls, biased away from 0.
        mdlsnv = 1.0D0/DBLE(mdls)
        mdlsnv = mdlsnv*(1.0D0 + 0.1D0*mdlsnv)
        if (exp.lt.0) then
            stop 'MODEXP - exponent negative'
        end if
        
        MODEXP = 1
        explft = exp
        baslft = base
        if (explft.eq.0) return
10      continue
        if (MOD(explft, 2).ne.0) then
            MODEXP = MODMUL(MODEXP, baslft)
            explft = explft - 1
            if (explft.eq.0) then
                if (MODEXP.lt.0) MODEXP = MODEXP + mdls
                return
            end if
        end if
        explft = explft/2
        baslft = MODMUL(baslft, baslft)
        go to 10
        end 
************************************************************************
        subroutine PMTMUL(m1, m2, m3, SZ, mdls)
C Matrix multiply (square matrices, entries modulo a prime number).
        integer SZ, m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ), mdls
        integer i, j, k, mdsq, sum
        intrinsic MOD

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        mdsq = mdls**2
        do 100 k = 1, SZ
            do 90 i = 1, SZ
                sum = 0
                do 80 j = 1, SZ
                    sum = sum + m1(i,j)*m2(j,k)
                    if (sum .ge. mdsq) sum = sum - mdsq
80              continue
                m3(i,k) = MOD(sum, mdls)
90          continue
100     continue
        end
************************************************************************
        logical function PRIME(n)
C Check whether an integer n is prime.
C Test for divisibility by 2, 3, and by 6k +- 1, for k = 1, 2, ...
        integer n

C               Local variables

        integer j
        intrinsic MOD, NINT, REAL, SQRT 

C  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        if (n.lt.5) then
            PRIME = n.eq.2 .or. n.eq.3
        else if (MOD(n,6).ne.1 .and. MOD(n,6).ne.5) then
            PRIME = .FALSE.
        else
            do 100 j = 5, NINT(SQRT(REAL(n))), 6
                if (MOD(n,j).eq.0 .or. MOD(n,j+2).eq.0) then
                    PRIME = .FALSE.
                    return
                end if
100         continue
            PRIME = .TRUE.
        end if
        end
************************************************************************
        subroutine SMTMUL(m1, m2, m3, SZ)
C Single precision matrix multiply (square matrices).
        integer SZ
        real m1(SZ,SZ), m2(SZ,SZ), m3(SZ,SZ)
        real sum
        integer i, j, k

C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

        do 100 k = 1, SZ
            do 90 i = 1, SZ
                sum = 0.0
                do 80 j = 1, SZ
                    sum = sum + m1(i,j)*m2(j,k)
80              continue
                m3(i,k) = sum
90          continue
100     continue
        end
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024

mark@mips.COM (Mark G. Johnson) (01/09/90)

Email failed me; therefore ...

Here are the results of running Peter Montgomery's "program bchmark" for
Integer Multiply and Divide, on a MIPS RC3240.  This is a deskside machine
in a PC-sized cabinet (23" x 7" x 18"); essentially it's an upgrade of the
earlier M/120 machine.

I didn't change the source code; just took it as-is and applied the naive
command line "f77 -O montgomery.f".  Running the executable file a.out gave:

MIPS RC3240 compiled with f77 -O
 Gener  Dble  Sngl Cmplx Intgr Modlr TrlDiv PrbPrm   GCD BinDec Cfrac  Geo mean
  0.08  0.26  0.23  0.34  0.39  0.46   2.84   0.10  0.63   0.60  0.46  0.36 sec

{This falls between the single-processor Alliant with vectorization and
optimization, and the 6-processor Alliant with V&O, in Geo mean performance.}

For those who desire to bicker about the benchmark's representative-ness
of "real code", or potential lack thereof, here is a teeny bit of profiling
data.  A single line of source, marked below with >>>>, is responsible for
over 1/4 of all cycles consumed by the program (dynamically, on RC3240):


            do 100 j = 5, NINT(SQRT(REAL(n))), 6                      [575]
      >>>>      if (MOD(n,j).eq.0 .or. MOD(n,j+2).eq.0) then          [576]
                    PRIME = .FALSE.                                   [577]
                    return                                            [578]
                end if                                                [579]
100         continue                                                  [580]


Here's the profiler output {for an RC3240, compiled with f77 -O):

    procedure (file)                           line bytes     cycles      %
    -----------------------------------------------------------------------
    prime_ (montgomery.f)                       576   108   15677188  25.67
    ... etc ...

Apologies in advance if I did the measurement incorrectly.,
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

tihor@acf4.NYU.EDU (Stephen Tihor) (01/10/90)

Curiously I ran this on a VMS VAX under VAX FORTRAN with full error checking enabled and optimizations disabled and got a reasonable integer overflow in the 
inner loop of INTMUL multiplying 

IMTMUL\I:       1
IMTMUL\J:       1
IMTMUL\K:       1
IMTMUL\M1(1,1): 235517118
IMTMUL\M2(1,1): 891310209

Those do look large for a 32 integer product.  Did I mess something up here?
   

tihor@acf4.NYU.EDU (Stephen Tihor) (01/10/90)

Curious to try certain VAX and DECsystem copiler options I ran this on a VMS 
VAX under VAX FORTRAN with full error checking enabled and optimizations 
disabled and got a reasonable integer overflow in the inner loop of INTMUL 
multiplying 

IMTMUL\I:       1
IMTMUL\J:       1
IMTMUL\K:       1
IMTMUL\M1(1,1): 235517118
IMTMUL\M2(1,1): 891310209

Those do look large for a 32 integer product.  Did I mess something up here?