[comp.sys.apollo] DSP9000 Benchmarks

K538911@CZHRZU1A.BITNET (Patrik Eschle) (06/17/87)

(This letter was also sent to info-alliant, so you may get it twice.)

This April we visited Apollo Frankfurt to run benchmarks on an
Alliant FX/8, which will (hopefully!) be the new computer of the
physics institute of the University of Zuerich (you may remember
our postings some months ago about the DEC lobby fighting the
project). The Alliant FX/8 is known in the Apollo world as the
DSP9000.

Please send all comments to Patrik Eschle (address at the end of
this letter).

----------------------------------------------------------------

 Performance Test of an Alliant FX/8
 ===================================

 Contents: 1. Introduction
           2. Diagonalisation of large matrices
           3. GEANT Simulation
           4. Data Analysis directly from magnetic tape
           5. Computation of Fractal Dimensions
           6. Operating System and Documentation
           7. Conclusions
           Acknowledgements



 G. Broggi, M. Doser, S. Eichenberger, P.  Eschle, S.  Poole,
 W. Reichart, U. Straumann, D.  Vermeulen, S.  Vogel, Physics
 Institute, University of Zurich

 April 24th, 1987

 1. Introduction

 During the evaluation  process for our institute's  new main
 computer we received an    invitation from Apollo  Computer
 (the representative of  Alliant in Europe)  to  test such a
 machine at the company's headquarters in Frankfurt.

 Our goals were to simulate as closely as possible the
 expected typical working situation and to see how far we
 could get, in two days' time, in porting 4 programs from
 various environments to the Alliant. The selected
 applications cover the whole range of computational
 problems of our institute.

 The following 4  sections describe briefly each  of them and
 the progress we made.  Eventually,  all the applications ran
 successfully.

 The  last  section  reports  some  impressions  obtained  by
 studying the operating system and  the documentation in more
 detail.

 We were working simultaneously in 5 groups on 5 Apollo
 workstations running VT100 emulators, on a system configured
 as described in section 6.1 of this report. Dr. Butscher
 and Dr. Fruehwein of Apollo Computer were assisting us; we
 were nonetheless consulting the manuals extensively.

 About half of us had already  had some experience with UNIX,
 while  the  others  were used  to  various  other  operating
 systems like VMS,  VM/CMS,  TSO  etc.   Comments on the UNIX
 implementation are found in section 6.



 2. Diagonalisation of large matrices

 The program TETRA is used to calculate the polarization of a
 muon sitting at the tetrahedral site in a copper lattice
 immersed in a magnetic field. It was developed to
 investigate the "level crossing resonance". Essentially, it
 diagonalises a large matrix (512x512 Hermitian), i.e. it
 computes all the eigenvalues and the corresponding
 eigenvectors, in order to build the so-called "W-matrix"
 (512x512 real symmetric), which contains all the relevant
 information on the time dependence of the muon polarization.
 It achieves this by using a package of routines called
 "EISPACK". The main demands of this program are
 computational power and memory. On the CRAY 1S in Lausanne
 it needed about 1 cpu minute and about 900 Kwords (8 bytes
 each) of main memory. Further, to keep the W-matrix for
 subsequent investigation, another 1 MByte of secondary
 storage was necessary.

 Porting this program to the Alliant FX/8 in Frankfurt was
 easy, as far as getting it to run was concerned. After about
 3 hours of work the first, non-optimized version was running
 and produced correct results. Included in this time is
 becoming acquainted with the operating system, the editor
 and the compiler. In getting the program running the
 following problems occurred:

 - Since the program was developed on a CRAY with a natural
 word size of 8 bytes, corrections to the declarations of
 arrays and variables were necessary. The implicit typing
 mechanism of FORTRAN caused some difficulties here!
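
 For example (a hedged illustration of the kind of fix
 involved, not the actual TETRA source): on the CRAY the
 default REAL already occupies 8 bytes, so an implicitly
 typed variable is "double" for free; on the FX/8 the 8-byte
 width must be declared explicitly.

      program wsize
c     make all implicitly typed variables 8 bytes wide, as they
c     effectively were under the CRAY's 8-byte default REAL
      implicit double precision (a-h,o-z)
c     arrays likewise get an explicit 8-byte declaration
      double precision w(4,4)
      w(1,1) = 1.0d0
      s = w(1,1) / 3.0d0
      print *, s
      end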

 - Another feature of the CRAY FORTRAN compiler is the
 treatment of subroutine/function and common names. In CRAY
 FORTRAN a common block and a subroutine or function may have
 the same name. This is allowed neither in standard FORTRAN
 77 nor in many extended dialects, such as VAX FORTRAN.
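
 A hedged sketch of the construct and its fix (the names are
 ours, for illustration): under CRAY FORTRAN a common block
 and a subroutine may both be called WORK; the FX/8 compiler
 rejects this, so one of the two names has to change.

      program noclsh
c     the common block, formerly also called WORK, was renamed
c     (here to /wrkdat/) so that it no longer collides with the
c     subroutine name
      common /wrkdat/ x
      x = 1.0
      call work
      end

      subroutine work
      common /wrkdat/ x
      print *, x
      end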

 - The most serious problem was the fact that the name SYSTEM
 is reserved in CONCENTRIX. A common block with this name in
 TETRA led to abortion of the program with a "Bus Error"
 message. This would have meant some hours of searching, if
 the Apollo people had not recognized the problem
 immediately.
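
 A hedged sketch of the cure (the replacement name /syspar/
 is our invention): the offending common block is simply
 renamed.

      program sysfix
c     a common block named /SYSTEM/ collides with the reserved
c     name SYSTEM under CONCENTRIX and aborts with "Bus Error";
c     renaming it (here to /syspar/) avoids the crash
      common /syspar/ b0, temp
      b0   = 0.1
      temp = 4.2
      print *, b0, temp
      end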

 The execution of the first version of TETRA took about 1
 hour of cpu time. An analysis of the running times showed
 that most of the time was spent in the EISPACK routines. In
 a second phase we allowed the compiler to produce vector,
 concurrent and vector-concurrent code. The benchmarks are
 presented at the end of this section.

 It is important to note that these speed improvements were
 achieved without changing a single line of FORTRAN source,
 but only by switching compiler options. The results of runs
 with enabled and disabled options were compared to detect
 possible optimization errors.

 As a next step we introduced compiler directives into the
 FORTRAN source to allow the compiler to ignore apparent
 obstacles to optimization that it could not recognize as
 irrelevant. This of course requires an understanding of the
 FORTRAN source, in order to perceive the compiler's problems
 and to ascertain that ignoring the obstacle will not change
 the results. As a further step we even tried to clear away
 actual obstacles by changing the FORTRAN source. This is
 obviously only meaningful for the most time-intensive
 routines. The compiler aids in this process by explaining
 why it is not able to optimize a loop, and what directives
 would cause it to ignore apparent problems.
 This was successful mainly for the routines HTRID3 and
 HTRIB3, as far as speed is concerned. A sore point is,
 however, that the output files of the latter runs were not
 identical to the former ones on a byte-by-byte comparison.
 The results nevertheless agreed within the expected
 round-off errors.
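
 A hedged sketch of the directive technique (the directive
 spelling follows the FX/Fortran comment-directive convention
 as we understood it and should be checked against the
 manual; the index array IDX is a hypothetical example of
 ours, not TETRA code):

      subroutine scatr(n, idx, src, dst)
      integer n, idx(n), i
      double precision src(n), dst(*)
c     the directive below asks the compiler to skip the vector
c     dependence check for the next loop: idx is known, from the
c     physics of the problem, never to repeat a value, so the
c     apparent dependence through dst is harmless
cvd$  nodepchk
      do 10 i = 1, n
         dst(idx(i)) = dst(idx(i)) + src(i)
   10 continue
      end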

 For a comparison we executed TETRA on a VAX 8650 running
 VMS. After removing some obstacles of a bureaucratic nature
 (such as getting enough page file quota to link the
 program!) it ran successfully. The size of the working set
 was left at the default on this machine (this was done to
 avoid creating a special benchmarking environment). TETRA
 was executed on a NAS AS/XL V60 as well. There it ran in
 about 6 min of cpu time, independent of vector or scalar
 mode, presumably due to large strides that cause heavy
 paging. The same phenomenon may cause the speed decrease in
 vector modes compared to non-vector modes (g -> gv and
 gc -> gvc) on the Alliant.

 Benchmark Results:

 Routine     Time[s]
             g      gv    gc   gvc  opt   VAX 8650

 DEFSYS      0      0     0     0    0       0
 HAM         1      0     1     0    1       1
 HTRID3   1008   1217   769   804  225    1343   (EISPACK)
 TQL2      743    237   342    88   95     847   (EISPACK)
 HTRIB3   1431   1521   526   542  334    2064   (EISPACK)
 COMPW     637    415   185   184  183     985
 SAVEW      10      5     6     7    6       1

 Total    3830   3395  1829   1625  844   5241

  [g]   : global optimisation
  [v]   : vector code
  [c]   : concurrent code
  [opt] : hand optimisation (directives and source changes)
  [VAX] : working set size: 2000 pages, program size: 18000 pages

  (A column heading such as gvc denotes the corresponding
  combination of compiler options.)



 3. GEANT Simulation

 GEANT is  a program  package which can  be used  to simulate
 electromagnetic showers. By means of it, we have implemented
 a  simulation  of the  SINDRUM  II  (The  SINDRUM II  is  an
 experiment which looks for rare  decays at the medium-energy
 particle  accelerator of  the  Swiss  Institute for  Nuclear
 Research (SIN)) setup. The CERNLIB and ZEBRA routines called
 by GEANT were locally available.   The source (roughly 50000
 FORTRAN lines )  was copied from tape in standard VAX format
 without any problems. The compilation was successful without
 modifying the code except for tab characters (Cntrl I) which
 are not  accepted by  the compiler.   The only  problem that
 occurred at run time was  an incompatibility between the VAX
 and the FX/8 compiler involving incomplete argument lists in
 function or subroutine calls.  The FX/8 compiler is not able
 to  handle  non  matching argument  lists  whereas  the  VAX
 compiler can.    The results of  the test runs  with various
 compiler switches (g = global optimization, c = concurrency)
 are  shown  in  the   table.   The  CPU  time
 consumption on  the Venus (VAX 8650)   at SIN was 85  ms per
 event.   In parallel  processing  the  CPU time  consumption
 decreases linearly with the number of CE's involved.

  GEANT runs with several compiler options:

  run-number  remarks                        time/event
     1     no optimization                       240 ms
     2     optimized compilation on one CE       140 ms
     3     optimized and concurrent on one CE    140 ms
     4     4 jobs in parallel on 4 CEs            35 ms
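
 The argument-list incompatibility mentioned above, in a
 hedged illustration (the names are ours, not GEANT's): the
 VAX compiler tolerates a call that omits trailing arguments,
 the FX/8 compiler does not, so every declared argument has
 to be supplied.

      program argfix
      double precision x, y, wgt
      x   = 1.0d0
      y   = 2.0d0
      wgt = 1.0d0
c     "call gstep(x, y)" compiles and runs on the VAX despite the
c     missing third argument; the FX/8 requires every declared
c     argument to be passed:
      call gstep(x, y, wgt)
      print *, x, y
      end

      subroutine gstep(x, y, wgt)
      double precision x, y, wgt
      x = x + y*wgt
      end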

 Conclusions and remarks:

 - Fortran programs written under  VMS are easily portable to
 the FX/8.

 - Simulations of single-particle histories can be processed
 economically simply by using the appropriate compiler
 switches.

 -  The source  code  debugger  dbx is  easy  to  use and  is
 equivalent in its characteristics to the VMS debugger.



 4. Data Analysis directly from magnetic tape

 One of  the reasons for the  purchase of an Alliant  for the
 physics  institute was  to be  able to  analyse High  Energy
 Physics (HEP)  data which are written  on magnetic tape in a
 non-standard format.  The possibility  of reading/writing an
 arbitrary number of  records from/to an arbitrary  file on a
 tape was tested. The tape used for this test was a 1600
 bpi, non-ANSI-labeled tape consisting of 3 files of about
 100 records, each record having a length of 12960 words
 (25920 bytes).

 Since it was not possible to use the Alliant Fortran tape
 handling routines TOPEN, TREAD, TWRITE... (their input
 buffer, being of type character, is limited to a record size
 of 2000 characters (8000 bytes)), the tape unit was treated
 like an ordinary file. An example of the code used is:

      integer*2 area(12960)
c     open the raw tape device as an unformatted file with fixed
c     records of 25920 bytes (12960 16-bit words)
      open(unit=17, file='/dev/xmt00m', form='unformatted',
     &     recl=25920, recordtype='fixed')
c     read one tape record into area ...
      read(17) area
c     ... or write one record from area back to tape
      write(17) area

 With this code it was possible to read successfully several
 consecutive records from any of the three files on tape and
 to copy these records to a file on disk. Similarly, it was
 possible to read the records from the disk file and to write
 them to a file on tape. All files successfully passed a data
 integrity check. Tape positioning (file and record skipping)
 was done with C-shell commands.

 Conclusion/comments:

 - the feasibility of handling non-standard tapes on the
 Alliant was shown.

 - timing tests were performed for an average computer load
 (10 interactive sessions, 5 batch jobs). The (real) time
 needed to read in one 12960-word record was of the order of
 1 s. This is satisfactory for applications where the
 computation time is large in comparison to the I/O time.

 - We were not able to find any Fortran tape positioning
 routines. The available C-shell tape handling commands
 (rl, mt) however worked very well, so the corresponding
 system calls must be available (see the hedged sketch
 below).
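
 A hedged sketch, not verified on CONCENTRIX: most BSD-derived
 Fortran run-time libraries offer a SYSTEM call through which
 a shell command can be issued, so the mt positioning
 commands could be driven from within Fortran. The no-rewind
 device name below is our guess; the local /dev entries
 should be checked.

      program tpos
c     skip forward over two tape files, leaving the tape
c     positioned at the start of the third file
      call system('mt -f /dev/nrxmt00m fsf 2')
      end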



 5. Computation of Fractal Dimensions

 The algorithm calculates the fractal dimension of an
 experimental set, reconstructed in an "embedding" space, by
 using the "fixed mass" method. (See: R. Badii and G.
 Broggi, Physik-Institut der Universitaet Zuerich, CAP
 Software Report Nr. 6, May 1987, and references therein.)
 The input data points are signed integers of 5 digits,
 obtained through an ADC from a fluid-dynamics experiment.
 Most of the data processing consists of integer arithmetic.

 The code was written and optimised (safeguarding readability
 and ease of use on different experimental systems)  on a DEC
 uVAX II.

 The program was tested on a Cray 1-S, yielding a poor
 performance, which can be explained by the absence of an
 integer processor on that machine and by the presence of
 very short inner DO loops. These, when vectorisation is
 enabled, cause so much overhead that a better performance is
 obtained in scalar mode (see table).
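
 A hedged illustration (ours, not the fractal code) of the
 kind of loop involved: with an inner trip count this short,
 the fixed start-up cost of a vector operation exceeds the
 work it saves, and the scalar code wins.

      program short
      integer i, k, d(5), p(5)
      do 5 i = 1, 5
         p(i) = i
    5 continue
c     the inner loop has only 5 iterations, so vector start-up
c     overhead dominates the time it could save
      do 20 k = 1, 1000000
         do 10 i = 1, 5
            d(i) = p(i) + k
   10    continue
         p(1) = d(5)
   20 continue
      print *, p(1)
      end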

 At this point it was already clear that the program did not
 lend itself to a simple adaptation to a vector machine, but
 that, on the contrary, it needed a complete redesign.
 Nevertheless, it was decided to run it on the Alliant in
 order to test the efficiency of integer operations, the
 amount of overhead created by vectorisation, and the
 advantages and disadvantages of concurrent execution on
 several CEs.

 With reference to the table, the results can be summarized
 in the following way:

 - Vectorisation is in this case clearly disadvantageous,
 introducing an overhead which is further amplified by the
 concurrent operation of several CEs.

 - The concurrent execution of parts of the code succeeds in
 speeding up some of the loops, but, again due to the
 overhead introduced, the global result is still slightly
 worse than in scalar mode. (A limited modification of the
 program actually allowed the concurrent execution of longer
 stretches of code, showing clearly that, when a more
 complete redesign is not possible, this is an approach worth
 pursuing.)

 - A purely scalar execution on a single CE yields an
 execution time which is a factor of three longer than on a
 VAX 8650, and the same factor shorter than on a uVAX II. It
 should nevertheless be noted that, in the configuration
 tested, three quarters of the machine's computational power
 were left free for the other users while the program was
 being executed in this mode.

 In conclusion, one should expect programs which do not
 perform well on a Cray to exhibit the same problems in
 vector mode on an Alliant, and vice versa. The results
 obtained by means of concurrent execution are more difficult
 to predict, and can be improved by reasonably small
 modifications of the code. The execution time of integer
 operations is not impressive, probably due to intrinsic
 aspects of the architecture, but is nevertheless comparable
 to that of a VAX of the 8000 series.

 Performance of the program: 100 reference points,
                             28000 data points.

    Company            Computer       Mode                Ex. Time

 Digital Equipment   uVAX II        scalar                  303 s
 Digital Equipment   VAX 782        scalar                  202 s
 Digital Equipment   VAX 8650       scalar                   36 s
 Cray Research       CRAY 1-S       vector                   36 s
 Cray Research       CRAY 1-S       scalar                   19 s
 Alliant             Domain 4 CEs   vector-concurrent       205 s
 Alliant             Domain 4 CEs   vector                  139 s
 Alliant             Domain 4 CEs   concurrent              106 s
 Alliant             Domain 4 CEs   scalar, 1 CE            103 s



 6. Operating System and Documentation

 6.1 Configuration and behavior of the machine


 - The configuration was: 4 CE, 3 IP, 24 MB core memory,
                          2 x 378 MB disk (ca. 30 MB
                          available free space),
                          magnetic tape, line printer,
                          5 Apollo workstations connected
                          through Ethernet,
                          1 system console terminal

 - The typical  load during the tests was  12 users (multiple
 logins)   with  19  processes.    There  was  no  noticeable
 degradation of interactive performance,  except for problems
 with the vt100 emulator on some of the workstations.

 - A heavy load of 13 processes, each filling an array of 8
 MByte, which first filled up core memory and then the paging
 area on the disk, caused a slowdown of the machine due to
 the high paging rate, but did not crash it.
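
 A hedged sketch of the kind of load generator used (the
 actual test program is not reproduced here): each instance
 repeatedly sweeps an 8 MByte array, so that 13 concurrent
 copies exhaust the 24 MB of core and force paging.

      program fill
c     1000000 double precision elements occupy 8 MByte
      integer n
      parameter (n = 1000000)
      double precision a(n)
      integer i, k
c     sweep the whole array repeatedly so that its pages stay
c     active; 13 concurrent copies exceed the core memory
      do 20 k = 1, 100
         do 10 i = 1, n
            a(i) = dble(i + k)
   10    continue
   20 continue
      print *, a(n)
      end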



 6.2 The user's point of view

 - The operating system CONCENTRIX is a port of UNIX BSD 4.2.
 (Most of the utilities were present in the current version;
 exceptions are mentioned below.)

 - The possible user interfaces (shells) are the C-shell, the
 Bourne shell, and the VMS shell, which has been announced
 for the next version of CONCENTRIX. (All of us used the
 C-shell without problems. Command line editing is possible,
 even if not as comfortable as under VMS.)

 - Generally, those users unfamiliar with the C-shell found
 it fairly easy to become familiar with. Especially
 pipelining, input and output redirection, and the easy
 foreground/background operation were considered very useful
 features.

 - Online help is available through the info command or the
 UNIX man(ual) pages. The manual pages, however, were not
 implemented, and info only had help on the emacs editor.

 - The manuals we were able to access in their final form
 (some were only available as preliminary versions) made a
 good impression as regards content and typesetting. Some
 problems remain, however. The binding of the manuals is
 poor: besides being mechanically bad, it does not allow the
 manuals to be updated. Some of them had no index, and we
 discovered some inconsistencies and errors.

 - For editing we used CCA-emacs. The EDT emulation was not
 installed. (As installed, CCA-emacs has a very annoying way
 of updating a VT100 screen when scrolling.)

 - Most of the "standard" BSD 4.2 utilities were available
 on the tested system. Notable exceptions were:

    -- apropos (we hope this will be installed together
       with man)
    -- man (mentioned above)
    -- quota (status not clear: the quota command and the
       corresponding system-manager commands quotaon and
       quotacheck were present on the system, but it was
       not possible to check whether they actually worked)


 - CONCENTRIX offers a source level debugger (for FORTRAN)
 and a timing utility, lprof, that displays the consumed time
 for each line of source (lprof crashes with a divide-by-0
 error if the time consumed is less than 1 timing unit...)

 - Languages: FORTRAN, C and Pascal were installed. Comments
 on the FORTRAN compiler are found in previous sections.
 Pascal is very poorly documented (about 50 pages) and does
 not support vectorization and parallelism. (We don't
 understand why. It's easy for Alliant to say that no one
 uses Pascal - no wonder with this implementation!) We did
 not manage to execute a Pascal program using floating point
 numbers on an IP, i.e. to force its execution on one of the
 various computing resources available (cause: emulator trap
 EMT).

   For the C language only a preliminary version of the
 manual was available (and a test program could not find the
 math library). Finally a program was able to fork into
 several children that ran in parallel on several CEs.

   We tried to call FORTRAN subroutines from within Pascal
 and C. The linker could not find the FORTRAN libraries (and,
 vice versa, did not find the Pascal library when calling
 Pascal from FORTRAN). The Pascal manual neither describes
 the calling sequences for calls to FORTRAN, nor are there
 any Pascal libraries acting as interfaces to FORTRAN
 libraries.

 - A batch processing system has been announced for Version
 3.0. (A batch system is not as important under UNIX as
 under VMS, since UNIX allows jobs to be run in the
 background, scheduled to run at any time, and their
 priorities changed interactively.)

 - We did not find any documentation on error messages.



 6.3 The system administrator's point of view

 We  were not  able to  try any  of the  system-administrator
 commands on-line, nor did we boot the machine (it refused to
 crash). We studied  the preliminary  version of  the system-
 administrator   manual  and   the   documentation  from   an
 introductory course for system-administrators.

 - No special tools are available for user administration.
 Insertion and deletion of users etc. is done in the standard
 UNIX way: by editing the password file.

 - The mon  command allows a detailed monitoring  of the load
 of the system.

 -  The system  can  be tailored  at boot  time  to meet  the
 specific  needs  of  an  installation  (e.g.   size  of  the
 computational complex). The manuals describe this in detail.




 7. Conclusions

 We conclude that the Alliant FX/8 is very well suited to
 cover both the general needs of a small physics institute
 and the requirements of advanced applications, ranging from
 numerical analysis to high energy physics. It provides very
 convenient tools for program development in FORTRAN, while
 other languages are not particularly well supported. Several
 mathematical libraries and a limited but growing number of
 applications are available. In comparison to more
 traditional solutions, it has a much lower cost to
 computational-power ratio. The interesting architecture,
 which combines parallel computing and vector features,
 introduces new concepts of data processing into the
 university and scientific environments.


 Acknowledgements
 ================
 We would like to thank the staff members of Apollo Computer,
 who  kindly  assisted  us  and  explained  some  interesting
 details of the implementation.


                                 -------------

Alliant, Concentrix are Trademarks of Alliant Computer Systems
                                      Corporation
Apollo is a Trademark of Apollo Computer Inc.
UNIX is a Trademark of Bell Laboratories
CCA EMACS is a Trademark of Computer Corporation of America
VAX, VMS, VT100 are Trademarks of Digital Equipment Corporation
VM/CMS, TSO, MVS are Trademarks of IBM


---------------------------------------------------------------
 Patrik Eschle
 E-Mail    :  K538911@CZHRZU1A.BITNET
 Private   :  Kronwiesenstr. 82, CH-8051 Zuerich (Switzerland)
              Phone : 1-40 72 39
 Institute :  Physikinstitut der Universitaet Zuerich
              Schoenberggasse 9, CH-8001 Zuerich
              Phone : 1-257 29 44
---------------------------------------------------------------