[comp.arch] QCDPAX

aglew@oberon.csg.uiuc.edu (Andy Glew) (06/14/90)

Path: ux1.cso.uiuc.edu!brutus.cs.uiuc.edu!apple!sun-barr!ccut!titcca!etlcom!gama!oyanagi
From: oyanagi@gama.is.tsukuba.ac.jp (Yoshio Oyanagi)
Newsgroups: comp.sys.super
Subject: QCDPAX attained 12.25 GFLOPS peak speed.
Message-ID: <5074@gama.is.tsukuba.ac.jp>
Date: 31 May 90 06:41:07 GMT
Reply-To: oyanagi@gama.is.tsukuba.JUNET (Yoshio Oyanagi)
Organization: Info Sci & Elec, Univ of Tsukuba, Tsukuba-City, Ibaraki 305, JAPAN
Lines: 99


       ===QCDPAX attained 12.25 GFLOPS peak speed===

The parallel computer QCDPAX has reached what is probably the world's
fastest effective speed in scientific calculations.   If any computer can 
exceed the speed of QCDPAX, please let us know.

QCDPAX was made public on April 6, 1990, at the University of Tsukuba, 
Tsukuba Science City, Japan.

QCDPAX is a parallel computer with 432 PU's (Processing Units).   
Each PU runs at a peak speed of 28.7 MFLOPS, and the system at 
about 12.38 GFLOPS peak.  A speed of 12.25 GFLOPS was measured for
the summation of squares of 500,000 elements within each PU.
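
As a rough check on these figures, the sketch below estimates how long
the sum-of-squares measurement would run at 12.25 GFLOPS.  It assumes
one multiply and one add per element; the post does not give the
operation count, so that value and all variable names are illustrative.

```python
# Rough timing estimate for the sum-of-squares measurement.
# Assumption (not stated in the post): 2 floating-point operations
# (one multiply, one add) per element.
PU_COUNT = 432
ELEMENTS_PER_PU = 500_000
FLOPS_PER_ELEMENT = 2          # assumed
MEASURED_GFLOPS = 12.25        # quoted in the post
PEAK_GFLOPS = 12.38            # quoted in the post

total_flops = PU_COUNT * ELEMENTS_PER_PU * FLOPS_PER_ELEMENT
elapsed_s = total_flops / (MEASURED_GFLOPS * 1e9)
fraction_of_peak = MEASURED_GFLOPS / PEAK_GFLOPS

print(f"total operations: {total_flops:,}")
print(f"estimated run time: {elapsed_s * 1e3:.1f} ms")
print(f"fraction of peak: {fraction_of_peak:.1%}")
```

Under that assumption the whole measurement lasts only a few tens of
milliseconds, and the measured speed sits within about one percent of
the quoted peak.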

The machine is a torus-shaped PU array (2-D nearest-neighbor mesh 
with end-around connections), enhanced by a global hardware barrier 
synchronizer, wide (32-bit) nearest-neighbor links, broadcast from 
any PU to all PU's, and feedback of the logical AND of the status 
registers of all PU's to all PU's.   
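
To illustrate the end-around connections, here is a minimal sketch of
nearest-neighbor addressing on a 2-D torus, together with a global AND
of status flags.  The 24 by 18 array shape is only an assumption chosen
so that the product is 432; the post does not state the array
dimensions, and the function names are hypothetical.

```python
# Sketch of 2-D torus addressing. ROWS x COLS = 432 is an assumed shape;
# the post gives only the total PU count.
ROWS, COLS = 24, 18  # assumed

def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the four nearest neighbors with end-around wrapping."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }

def global_and(status_flags):
    """Logical AND of every PU's status flag, as fed back to all PUs."""
    return all(status_flags)

# A corner PU wraps around to the opposite edges of the array.
print(torus_neighbors(0, 0))
# Example use: keep iterating until every PU reports that it is done.
print(global_and([True] * (ROWS * COLS)))
```

A global AND of this kind is a natural fit for convergence tests, where
the iteration may stop only when every PU agrees.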

T. Shirakawa et al., "QCDPAX - An MIMD array of vector processors 
for the numerical simulation of quantum chromodynamics", in Proceedings 
of Supercomputing '89, Nov. 13-17, 1989, Reno, Nevada, pp. 495-504.
 
T. Hoshino, "PAX Computer, High-Speed Parallel Processing and 
Scientific Computing", Addison-Wesley, 1989.

Each PU is a single-board vector processor, and employs 
	M68020 (25 MHz) as the CPU, 
	L64133 (60 ns scalar floating-point processor with ALU and MPY, 
	       LSI Logic Corp., actually run with a 69.8 ns clock), 
	50K-gate ASIC controller for the L64133 (also LSI Logic's), 
	2 MB SRAM for vector data storage (35 ns, Japanese), 
	4 MB DRAM for program and archive data storage (100 ns, Japanese).  
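
The quoted per-PU peak appears consistent with the 69.8 ns clock if the
ALU and the multiplier each complete one operation per cycle.  That
reading is an assumption, not something the post states; the arithmetic
is sketched below.

```python
# Peak speed implied by the clock, assuming (not stated in the post) that
# the L64133's ALU and multiplier each finish one operation per cycle.
CLOCK_NS = 69.8
OPS_PER_CYCLE = 2              # assumed: one add + one multiply
PU_COUNT = 432

pu_peak_mflops = OPS_PER_CYCLE / (CLOCK_NS * 1e-9) / 1e6
system_peak_gflops = pu_peak_mflops * PU_COUNT / 1e3

print(f"per-PU peak: {pu_peak_mflops:.2f} MFLOPS")
print(f"system peak: {system_peak_gflops:.2f} GFLOPS")
```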

QCDPAX was designed by us at the University of Tsukuba and manufactured
by Anritsu Corporation.  The project is funded by the Ministry of 
Education, Science and Culture of the Japanese Government under the
Grant-in-Aid for Specially Promoted Research (#62060001).

QCDPAX is dedicated to Quantum Chromodynamics simulation (lattice 
gauge theory) as the budget required, though its functions are not 
restricted to that purpose.   It directly extends the four earlier 
prototype PAX machines, in the sense that QCDPAX is widely applicable 
to scientific computation.

The machine was benchmarked on a QCD model.   In the most time-
consuming part, the product of 3 by 3 unitary matrices, QCDPAX with 
432 PU's recorded a speed nearly 4 times that of the CM-2 (the CM-2
measurement was reported at Supercomputing '89 in Reno by C. F. Baillie,
pp. 2-9).  The single-link update time for the subspace heat bath method
with 8 hits is 1.8 microseconds, which is three times faster than
the HITAC S820/80 at KEK (peak 3 GFLOPS).
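
For readers unfamiliar with the kernel, the sketch below multiplies two
3 by 3 complex matrices and tallies the real floating-point operations
under a common counting convention (6 per complex multiply, 2 per
complex add).  It only illustrates the operation being benchmarked; it
is not the QCDPAX code, and the function name is hypothetical.

```python
# Illustrative 3x3 complex matrix product with a real-flop tally.
# Convention assumed: complex multiply = 6 flops, complex add = 2 flops.
def matmul3(a, b):
    """Return (a @ b, flop_count) for 3x3 matrices of complex numbers."""
    flops = 0
    c = [[0j] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            acc = a[i][0] * b[0][j]
            flops += 6
            for k in (1, 2):
                acc += a[i][k] * b[k][j]
                flops += 6 + 2
            c[i][j] = acc
    return c, flops

identity = [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]
# A simple unitary matrix: a diagonal of unit-modulus phases.
diag = [1j, -1j, 1 + 0j]
phases = [[diag[i] if i == j else 0j for j in range(3)] for i in range(3)]

product, flops = matmul3(phases, identity)
print(flops)
```

Each of the nine output entries needs three complex multiplies and two
complex adds, which is why this small kernel dominates the run time when
it is applied at every link of the lattice.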

The benchmark used consistently throughout past PAX development is the 
Poisson equation solved by the red-black point-SOR method.   This is a 
typical but quite communication-intensive scientific calculation.   We 
believe that a parallel computer that cannot process this point-SOR 
well is of no use in scientific applications.
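
For reference, a red-black point-SOR sweep can be sketched as below on
a tiny grid.  This is a generic textbook formulation in Python, not the
program run on QCDPAX; the grid size, the relaxation factor `omega`, the
sign convention, and the function names are all illustrative
assumptions.  Points are colored by the parity of i+j+k, so all six
neighbors of a point have the opposite color and every point of one
color can be updated independently.

```python
# Generic red-black point-SOR sweep for the 3-D Poisson equation
# (laplacian(u) = -rho, mesh spacing 1). Illustrative only.
def sor_sweep(u, rho, omega=1.5):
    """One full sweep (red then black), updating u in place.

    Periodic in x and y; the planes k = 0 and k = nz - 1 stay at zero.
    nx and ny must be even so the coloring is consistent across the wrap.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    for color in (0, 1):                      # 0 = red, 1 = black
        for i in range(nx):
            for j in range(ny):
                for k in range(1, nz - 1):
                    if (i + j + k) % 2 != color:
                        continue
                    nb = (u[(i - 1) % nx][j][k] + u[(i + 1) % nx][j][k]
                          + u[i][(j - 1) % ny][k] + u[i][(j + 1) % ny][k]
                          + u[i][j][k - 1] + u[i][j][k + 1])
                    gauss_seidel = (nb + rho[i][j][k]) / 6.0
                    u[i][j][k] += omega * (gauss_seidel - u[i][j][k])

def make_grid(nx, ny, nz):
    """Allocate an nx x ny x nz grid of zeros as nested lists."""
    return [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]

# Tiny demonstration grid with two opposite point sources.
nx, ny, nz = 4, 4, 6
u, rho = make_grid(nx, ny, nz), make_grid(nx, ny, nz)
rho[1][1][1] = 1.0
rho[3][3][4] = -1.0
for _ in range(200):
    sor_sweep(u, rho)
print(u[1][1][1], u[3][3][4])
```

On a machine where the grid is split across PU's, each half-sweep needs
the freshly updated boundary values of the other color from neighboring
PU's, which is the source of the communication load mentioned above.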

The following measurement is for the largest problem size 
that QCDPAX can solve.
    Definition: 3-D Poisson equation in a pillar-shaped region with 
	the size of 408 (in X), 414 (in Y), and 408 (in Z).   Mesh 
	spacing is 1.  Periodic boundary conditions are set in the X 
	and Y directions, and Dirichlet (zero) boundary conditions in 
	the Z direction.  Two point sources of intensity +1 and -1
	are located at (102, 103, 102) and (306, 311, 306),
	respectively.
    Measurement: A single update sweep of all points (both red and 
	black) took 175 msec, which is equivalent to 8.99 MFLOPS/PU 
	and 3.88 GFLOPS for the system.  The nearest-neighbor 
	communication took 158 msec for the boundary points between 
	a PU and its 4 neighboring PU's.  The efficiency defined by 
		(update)/(update + communication) 
	is 52.46%.   The overall effective speed is 2.04 GFLOPS.
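
The reported figures can be cross-checked with a few lines of
arithmetic.  The sketch below uses only the rounded values quoted
above, so a small difference from the reported 52.46% is to be
expected; it is a sanity check, not the authors' own calculation, and
the operations-per-point figure is a derived estimate.

```python
# Cross-check of the Poisson benchmark figures using the rounded
# values quoted in the post.
PU_COUNT = 432
UPDATE_MS = 175.0
COMM_MS = 158.0
MFLOPS_PER_PU = 8.99

system_gflops = MFLOPS_PER_PU * PU_COUNT / 1e3
efficiency = UPDATE_MS / (UPDATE_MS + COMM_MS)
effective_gflops = system_gflops * efficiency

# Work per grid point implied by the figures (a derived estimate).
points_per_pu = 408 * 414 * 408 / PU_COUNT
flops_per_point = MFLOPS_PER_PU * 1e6 * (UPDATE_MS / 1e3) / points_per_pu

print(f"system speed during update: {system_gflops:.2f} GFLOPS")
print(f"efficiency: {efficiency:.2%}")
print(f"effective speed: {effective_gflops:.2f} GFLOPS")
print(f"operations per grid point: {flops_per_point:.1f}")
```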

The program was coded in a compiled language, "psc".   
Communication was done by calling a function coded in 
assembly language.   The M68020's cache was disabled.
Time was measured by a hardware timer installed in each PU,
and the MFLOPS value was obtained as the total number of +, -, and * 
operations divided by the measured time.

If any computer can exceed this speed, please let us know.
We would like to know whether our machine is really the world's 
fastest or not.

		T. Hoshino (hoshino@qcdpax.kz.tsukuba.ac.jp)
		Y. Iwasaki (iwasaki@quark.ph.tsukuba.ac.jp)
		Y. Oyanagi (oyanagi@gama.is.tsukuba.ac.jp)
		T. Shirakawa (shirakaw@qcdpax.kz.tsukuba.ac.jp)
		K. Kanaya (kanaya@quark.ph.tsukuba.ac.jp)
		T. Yoshie (yoshie@quark.ph.tsukuba.ac.jp)
		S. Ichii (ichii@kek.ac.jp)
		T. Kawai (kawai@kz.phys.keio.ac.jp)

		
--
Andy Glew, aglew@uiuc.edu