K538911@CZHRZU1A.BITNET (Patrik Eschle) (06/17/87)
(This letter was also sent to info-alliant, so you may get it twice.)

This April we visited Apollo Frankfurt to run benchmarks on an Alliant
FX/8, which will (hopefully!) be the new computer of the physics
institute of the University of Zuerich (you may remember our postings
some months ago about the DEC lobby fighting the project). The Alliant
FX/8 is known in the Apollo world as the DSP9000.

Please send all comments to Patrik Eschle (address at the end of this
letter).
----------------------------------------------------------------
          Performance Test of an Alliant FX/8
          ===================================

Contents:
1. Introduction
2. Diagonalisation of large matrices
3. GEANT Simulation
4. Data Analysis directly from magnetic tape
5. Computation of Fractal Dimensions
6. Operating System and Documentation
7. Conclusions
Acknowledgements

G. Broggi, M. Doser, S. Eichenberger, P. Eschle, S. Poole,
W. Reichart, U. Straumann, D. Vermeulen, S. Vogel,
Physics Institute, University of Zurich

April 24th, 1987

1. Introduction

During the evaluation process for our institute's new main computer we
received an invitation from Apollo Computer (the representative of
Alliant in Europe) to test such a machine at the company's
headquarters in Frankfurt. Our goals were to simulate as closely as
possible the expected typical working situation and to see how far we
could get, in two days' time, in porting four programs from various
environments to the Alliant.

The selected applications cover the whole range of computational
problems of our institute. The following four sections briefly
describe each of them and the progress we made. Eventually, all the
applications ran successfully. The last section reports some
impressions obtained by studying the operating system and the
documentation in more detail.

We worked simultaneously in five groups on five Apollo workstations
running VT100 emulators, on a system configured as described in
section 6.1 of this report. Dr. Butscher and Dr. Fruehwein of Apollo
Computer assisted us; we nonetheless consulted the manuals
extensively. About half of us had already had some experience with
UNIX, while the others were used to various other operating systems
such as VMS, VM/CMS, TSO etc. Comments on the UNIX implementation are
found in section 6.

2. Diagonalisation of large matrices

The program TETRA is used to calculate the polarization of a muon
sitting at the tetrahedral site in a copper lattice immersed in a
magnetic field. It was developed to investigate the "level crossing
resonance". Essentially, it diagonalises a large matrix (512x512
Hermitian), i.e. it computes all the eigenvalues and the corresponding
eigenvectors, in order to build the so-called "W-matrix" (512x512 real
symmetric), which contains all the relevant information on the time
dependence of the muon polarization. It achieves this by using the
"EISPACK" package of routines. The typical problems with this program
are computational power and memory consumption. On the CRAY 1S in
Lausanne it needed about 1 cpu minute and about 900 Kwords (8 bytes
each) of main memory. Further, to keep the W-matrix for subsequent
investigation, another 1 MByte of secondary storage was necessary.

Porting this program to the Alliant FX/8 in Frankfurt was easy, as far
as runnability was concerned. After about 3 hours of work the first,
non-optimized version was running and produced correct results.
Included in this time is becoming acquainted with the operating
system, the editor and the compiler. In getting the program running
the following problems occurred:

- Since the program was developed on a CRAY with a natural word size
  of 8 bytes, corrections in the declaration of arrays and variables
  were necessary. The implicit declaration mechanism in FORTRAN gave
  some difficulties!

- Another feature of the CRAY FORTRAN compiler is the treatment of
  subroutine/function and common names.
  In CRAY FORTRAN a common block and a subroutine or function may have
  the same name. This is allowed neither in standard FORTRAN 77 nor in
  many extended dialects, such as VAX FORTRAN.

- The most serious problem was the fact that the name SYSTEM is
  reserved in CONCENTRIX. A common block with this name in TETRA made
  the program abort with a "Bus Error" message. This would have meant
  some hours of searching, had the Apollo people not recognized the
  problem immediately.

The execution of the first version of TETRA took about 1 hour of cpu
time. An analysis of the running times showed that most of the time
was spent in the EISPACK routines.

In a second phase we allowed the compiler to produce vector,
concurrent and vector-concurrent code. The benchmarks are presented at
the end of this section. It is important to note that these speed
improvements were achieved without changing a single line of FORTRAN
source, but only by switching compiler options. The results of runs
with enabled and disabled options were compared to detect possible
optimization errors.

In a next step we introduced compiler directives into the FORTRAN
source to let the compiler ignore apparent obstacles to optimization
that it could not recognize as irrelevant. This of course requires an
understanding of the FORTRAN source, in order to perceive the
compiler's problems and to ascertain that ignoring the obstacle will
not change the results. As a further step we even tried to clear away
actual obstacles by changing the FORTRAN source. This is obviously
only worthwhile for the most time-intensive routines. The compiler
aids in this process by explaining why it is not able to optimize a
loop, and what directives would cause it to ignore apparent problems.

This was successful mainly for the routines HTRID3 and HTRIB3, as far
as speed is concerned. A sore point, however, is that the output files
of the latter runs were not identical to the former ones on a
byte-by-byte comparison. The results nevertheless agreed within the
expected round-off errors.

For comparison we executed TETRA on a VAX 8650 running VMS. After
removing some obstacles of a bureaucratic nature (such as getting
enough page file quota to link the program!) it ran successfully. The
size of the working set was left at the default on this machine, to
avoid creating a special benchmarking environment.

TETRA was executed on a NAS AS/XL V60 as well. There it ran in about
6 min of cpu time, independent of vector or scalar mode, presumably
due to large strides that cause heavy paging. The same phenomenon may
cause the speed decrease in vector modes compared to non-vector modes
(g -> gv and gc -> gvc) on the Alliant.

Benchmark Results:

Routine        Time[s]
               g     gv     gc    gvc    opt   VAX 8650
DEFSYS         0      0      0      0      0      0
HAM            1      0      1      0      1      1
HTRID3      1008   1217    769    804    225   1343    (EISPACK)
TQL2         743    237    342     88     95    847    (EISPACK)
HTRIB3      1431   1521    526    542    334   2064    (EISPACK)
COMPW        637    415    185    184    183    985
SAVEW         10      5      6      7      6      1
Total       3830   3395   1829   1625    844   5241

[g]   : global optimisation
[v]   : vector-code
[c]   : concurrent-code
[opt] : hand optimisation
[VAX] : working set size: 2000 pages, program size: 18000 pages

3. GEANT Simulation

GEANT is a program package which can be used to simulate
electromagnetic showers. By means of it, we have implemented a
simulation of the SINDRUM II setup. (SINDRUM II is an experiment which
looks for rare decays at the medium-energy particle accelerator of the
Swiss Institute for Nuclear Research (SIN).) The CERNLIB and ZEBRA
routines called by GEANT were locally available.

The source (roughly 50000 FORTRAN lines) was copied from tape in
standard VAX format without any problems. The compilation was
successful without modifying the code, except for tab characters
(Ctrl-I), which are not accepted by the compiler. The only problem
that occurred at run time was an incompatibility between the VAX and
the FX/8 compiler involving incomplete argument lists in function or
subroutine calls.
The FX/8 compiler is not able to handle non-matching argument lists,
whereas the VAX compiler can.

The results of the test runs with various compiler switches (g =
global optimization, c = concurrency) are shown in the table. The CPU
time consumption on the Venus (VAX 8650) at SIN was 85 ms per event.
In parallel processing the CPU time consumption decreases linearly
with the number of CE's involved.

GEANT runs with several compiler options:

run-number   remarks                              time/event
    1        no optimization                        240 ms
    2        optimized compilation on one CE        140 ms
    3        optimized and concurrent on one CE     140 ms
    4        4 jobs parallel on 4 CE's               35 ms

Conclusions and remarks:

- Fortran programs written under VMS are easily portable to the FX/8.
- Simulation of single particle histories can be processed
  economically simply by using the appropriate compiler switches.
- The source code debugger dbx is easy to use and is equivalent in its
  characteristics to the VMS debugger.

4. Data Analysis directly from magnetic tape

One of the reasons for the purchase of an Alliant for the physics
institute was to be able to analyse High Energy Physics (HEP) data
which are written on magnetic tape in a non-standard format. The
possibility of reading/writing an arbitrary number of records from/to
an arbitrary file on a tape was tested.

The tape used for this test was a 1600 bpi, non-ANSI-labeled tape
consisting of 3 files of about 100 records, each record having a
length of 12960 words (25920 bytes). Since it was not possible to use
the Alliant Fortran tape handling routines TOPEN, TREAD, TWRITE...
(the input buffer, being of type character, is limited to a record
size of 2000 characters (8000 bytes)), the tape unit was treated like
an ordinary file. An example of the code used is:

      integer*2 area(12960)
      open(unit=17, file='/dev/xmt00m', form='unformatted',
     &     recl=25920, recordtype='fixed')
      read(17) area

or

      write(17) area

With this code, it was possible to successfully read several
consecutive records from any of the three files on the tape and copy
these records to a file on disk. Similarly, it was possible to read
the records from the disk file and to write them to a file on tape.
All files successfully passed a data integrity check. Tape positioning
(file and record skipping) was done with C-shell commands.

Conclusion/comments:

- The feasibility of handling non-standard tapes on the Alliant was
  shown.
- Timing tests were performed under an average computer load (10
  interactive sessions, 5 batch jobs). The (real) time needed to read
  in one 12960-word record was of the order of 1 s. This is
  satisfactory for applications where the computation time is large in
  comparison to the I/O time.
- We were not able to find any Fortran tape positioning routines. The
  available C-shell tape handling commands (rl, mt) however worked
  very well, so the corresponding system calls must be available.

5. Computation of Fractal Dimensions

The algorithm calculates the fractal dimension of an experimental set,
reconstructed in an "Embedding" space, by using the "Fixed Mass"
method. (See: R. Badii and G. Broggi, Physik-Institut der Universitaet
Zuerich, CAP Software Report Nr. 6, May 1987, and references therein.)
The input data points are signed integers of 5 digits, obtained
through an ADC from a fluid-dynamics experiment. Most of the data
processing consists of integer arithmetic.

The code was written and optimised (safeguarding readability and ease
of use on different experimental systems) on a DEC uVAX II. The
program was tested on a Cray 1-S, yielding poor performance, which can
be explained by the absence on that machine of an integer processor
and by the presence of very short inner DO-loops.
These, when vectorisation is enabled, cause so much overhead that
better performance is obtained in scalar mode (see table). At this
point it was already clear that the program did not lend itself to a
simple adaptation to a vector machine but, on the contrary, needed a
complete re-design. Nevertheless, it was decided to run it on the
Alliant in order to test the efficiency of integer operations, the
amount of overhead created by vectorisation, and the advantages and
disadvantages of concurrent execution on several CEs.

With reference to the table, the results can be summarized in the
following way:

- Vectorisation is in this case clearly disadvantageous, introducing
  an overhead which is amplified by the concurrent operation of
  several CEs.

- The concurrent execution of parts of the code succeeds in speeding
  up some of the loops but, again due to the overhead introduced, the
  global result is still slightly worse than in scalar mode. (A
  limited modification of the program actually allowed the concurrent
  execution of longer stretches of code, showing clearly that, when a
  more complete redesign is not possible, this is an approach worth
  following.)

- A purely scalar execution on a single CE yields an execution time a
  factor of three longer than on a VAX 8650, and the same factor
  shorter than on a uVAX II. It should nevertheless be noted that, in
  the configuration tested, three quarters of the machine's
  computational power were left free for the other users while the
  program was being executed in this mode.

In conclusion, one should expect programs which do not perform well on
a Cray to exhibit the same problems in vector mode on an Alliant, and
vice versa. The results obtained by means of concurrent execution are
more difficult to predict, and can be improved by reasonably small
modifications of the code.
The execution time of integer operations is not brilliant, probably
due to intrinsic aspects of the architecture, but is nevertheless
comparable to that of a VAX of the 8000 series.

Performance of the program: 100 reference points, 28000 data points.

Company             Computer         Mode                Ex. Time
Digital Equipment   uVAX II          scalar                 303 s
Digital Equipment   VAX 782          scalar                 202 s
Digital Equipment   VAX 8650         scalar                  36 s
Cray Research       CRAY 1-S         vector                  36 s
Cray Research       CRAY 1-S         scalar                  19 s
Alliant             Domain, 4 CEs    vector-concurrent      205 s
Alliant             Domain, 4 CEs    vector                 139 s
Alliant             Domain, 4 CEs    concurrent             106 s
Alliant             Domain, 4 CEs    scalar, 1 CE           103 s

6. Operating System and Documentation

6.1 Configuration and behavior of the machine

- The configuration was: 4 CE, 3 IP, 24 MB core memory, 2 x 378 MB
  disk (ca. 30 MB available free space), magnetic tape, line printer,
  5 Apollo workstations connected through Ethernet, 1 system console
  terminal.

- The typical load during the tests was 12 users (multiple logins)
  with 19 processes. There was no noticeable degradation of
  interactive performance, except for problems with the vt100 emulator
  on some of the workstations.

- A heavy load of 13 processes, each filling an array of 8 MByte,
  which first filled up core memory and then the paging area on the
  disk, slowed the machine down due to the high paging rate, but did
  not crash it.

6.2 The user's point of view

- The operating system CONCENTRIX is a port of UNIX BSD 4.2. (Most of
  the utilities were present in the current version; exceptions are
  mentioned below.)

- The possible user interfaces (shells) are the C-shell, the
  Bourne-shell and the VMS-shell, which has been announced for the
  next version of CONCENTRIX. (All of us used the C-shell without
  problems. Command line editing is possible, even if not as
  comfortable as with VMS.)

- Generally, those users unfamiliar with the C-shell found it fairly
  easy to become familiar with.
  Especially pipelining, input and output redirection and easy
  foreground/background operation were considered very useful
  features.

- Online help is available through the info command or the UNIX
  man(ual) pages. The manual pages were not implemented, and info only
  had help on the emacs editor.

- The manuals we were able to access in their final form (some were
  only available as preliminary versions) made a good impression as
  far as contents and typesetting are concerned. Some problems remain
  unsolved: the binding of the manuals is poor (besides being
  mechanically bad, it makes the manuals impossible to update), some
  of them had no index, and we discovered some inconsistencies and
  errors.

- For editing we used CCL-emacs. The EDT emulation was not installed.
  (As installed, CCL-emacs has a very annoying way of updating a VT100
  screen when scrolling.)

- Most of the "standard" BSD 4.2 utilities were available on the
  tested system. Notable exceptions were:
  -- apropos (we hope this will be installed together with man)
  -- man (mentioned above)
  -- quota (status not clear: the quota command and the corresponding
     system-manager commands quotaon and quotacheck were present on
     the system, but it was not possible to check whether they
     actually worked)

- CONCENTRIX offers a source level debugger (for FORTRAN) and a timing
  utility, lprof, that displays the consumed time for each line of
  source. (lprof crashes with a divide-by-0 error if the time consumed
  is less than 1 timing unit...)

- Languages: FORTRAN, C and Pascal were installed. Comments on the
  FORTRAN compiler are found in previous sections. Pascal is very
  poorly documented (about 50 pages) and supports neither
  vectorization nor parallelism. (We don't understand why. It's easy
  for Alliant to say that no one uses Pascal - no wonder with this
  implementation!) We did not manage to execute (i.e. force execution
  on one of the various computing resources available) a Pascal
  program using floating point numbers on an IP (cause: emulator trap
  EMT).

- For the C language only a preliminary version of the manual was
  available. (A test program could not find the math lib.) Finally, a
  program was able to fork into several children that ran in parallel
  on several CEs.

- We tried to call FORTRAN subroutines from within Pascal and C. The
  linker could not find the FORTRAN libraries (and, vice versa, did
  not find the Pascal library when calling Pascal from FORTRAN). The
  Pascal manual neither describes the calling sequences for calls to
  FORTRAN, nor are there any Pascal libraries acting as interfaces to
  FORTRAN libraries.

- A batch processing system has been announced for Version 3.0. (A
  batch system in UNIX is not as important as in VMS, since UNIX
  allows one to run jobs in the background, schedule them to run at
  any time and change priorities interactively.)

- We did not find any documentation on error messages.

6.3 The system administrator's point of view

We were not able to try any of the system-administrator commands
on-line, nor did we boot the machine (it refused to crash). We studied
the preliminary version of the system-administrator manual and the
documentation from an introductory course for system administrators.

- No special tools are available for user administration. Insertion
  and deletion of users etc. is done in the standard UNIX way: by
  editing the password file.

- The mon command allows detailed monitoring of the load of the
  system.

- The system can be tailored at boot time to meet the specific needs
  of an installation (e.g. the size of the computational complex). The
  manuals describe this in detail.

7. Conclusions

We conclude that the Alliant FX/8 is very well suited to cover both
the general needs of a small physics institute and the requirements of
advanced applications, ranging from numerical analysis to high energy
physics. It provides very convenient tools for program development in
FORTRAN, while other languages are not particularly well supported.
Several mathematical libraries and a limited but growing number of
applications are available. In comparison with more traditional
solutions, it has a much lower cost-to-computational-power ratio. Its
interesting architecture, which combines parallel computing and vector
features, introduces new concepts of data processing into the
university and scientific environments.

Acknowledgements
================

We would like to thank the staff members of Apollo Computer, who
kindly assisted us and explained some interesting details of the
implementation.

-------------
Alliant, Concentrix are trademarks of Alliant Computer Systems
  Corporation.
Apollo is a trademark of Apollo Computer Inc.
UNIX is a trademark of Bell Laboratories.
CCA EMACS is a trademark of Computer Corporation of America.
VAX, VMS, VT100 are trademarks of Digital Equipment Corporation.
VM/CMS, TSO, MVS are trademarks of IBM.
---------------------------------------------------------------
Patrik Eschle
E-Mail    : K538911@CZHRZU1A.BITNET
Private   : Kronwiesenstr. 82, CH-8051 Zuerich (Switzerland)
            Phone : 1-40 72 39
Institute : Physikinstitut der Universitaet Zuerich
            Schoenberggasse 9, CH-8001 Zuerich
            Phone : 1-257 29 44
---------------------------------------------------------------