[comp.arch] 1000000x1000000 Matrix

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/23/89)

In article <46500082@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on 
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!

Perhaps you could use a different algorithm to diagonalize
many 20x20 matrices.  It sounds like the kind of thing that
could be re-written with significant improvement.  (A rough
sketch of the batched idea appears below.)
Like I was going to say in response to Eugene, you can't get
behind the wheel of a Ferrari, go to the school zone, and
complain that you can't go faster than 25 MPH!
If you have lots of scalar code (or code that doesn't run very fast
on a super), and if it can't be rewritten, all you are doing
is announcing *your* departure from supercomputers.  There will
always be applications that need the combination of speed, memory size,
memory bandwidth, I/O bandwidth, disk size, the number
of users supported, etc. that only supers provide.  Yes, as advances
are made, a super will be just a smart way of integrating off-the-shelf
components rather than designing the whole machine from scratch
(although, as someone else pointed out, there is something to be said
for total re-designs).
And don't assume for a second that those who make supers have not
noticed this!
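
Here, for what it's worth, is a rough sketch of the batched idea
(NumPy, purely for illustration; the sizes and random test matrices
are made up).  The useful parallelism lives in the batch dimension,
not inside any single 20x20 problem:

# Sketch: diagonalize a large batch of small (20x20) real symmetric
# matrices.  Illustrative only -- the parallelism lives in the batch
# dimension, not inside any single 20x20 eigenproblem.
import numpy as np

def diagonalize_batch(batch):
    """batch has shape (nmat, 20, 20); returns eigenvalues and eigenvectors."""
    # numpy.linalg.eigh treats leading dimensions as independent problems.
    return np.linalg.eigh(batch)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((1000, 20, 20))
    sym = (a + a.transpose(0, 2, 1)) / 2.0     # make each matrix symmetric
    vals, vecs = diagonalize_batch(sym)
    print(vals.shape, vecs.shape)              # (1000, 20), (1000, 20, 20)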

But that's not really the point of this posting!!   My real question is:

1) Is your 1000000x1000000 matrix dense?  If not, how sparse is it?
2) How did you solve it?
3) On what machine did you solve it?
4) What is the application?  (It's not a complex matrix, is it?!)
 
Finally, I don't see why *any* micro would be better at this one
than a super?!

mcdonald@uxe.cso.uiuc.edu (10/23/89)

In article <46500082@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on 
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!
Somebody replies:

>Perhaps you could use a different algorithm to diagonalize
>many 20x20 matrices.  It sounds like the kind of thing that
>could be re-written with significant improvement.  

The Cray does this just fine.  It is just that cheaper computers are more
cost-effective (even with the Cray doing the vectorizable part
efficiently) - but, at the present time, the more important reason
not to use the Cray for things like this is accessibility,
which is better for lesser machines.

>But that's not really the point of this posting!!   My real question is:

>1) Is your 1000000x1000000 matrix dense?  If not, how sparse is it?
It is dense. Totally dense. No element is (in principle) zero.

>2) How did you solve it?
Are you kidding? All I get is uproarious laughter. But, now
and then, it begins to look like there may be hope in the next few years!

On the other  hand, we have a theoretician here who works on
equally large, but very sparse (I don't know how "very") matrices. These
he does just fine. I don't really know how, as he is rather secretive,
but he does say that the methods are well known ones for sparse
matrices.
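
If I had to guess, the "well known methods" are Lanczos-type
iterations that need only matrix-vector products.  A minimal sketch
of that kind of thing (SciPy, with a made-up tridiagonal test matrix;
certainly not his actual code):

# Sketch: a few eigenpairs of a large sparse symmetric matrix via an
# iterative solver.  Illustrative only; assumes SciPy and a matrix
# that fits in sparse storage.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 100_000
# A sparse symmetric test matrix: tridiagonal, like a 1-D Laplacian.
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
a = sp.diags([off, main, off], [-1, 0, 1], format="csr")

# Shift-invert Lanczos: only a sparse factorization and matrix-vector
# products are needed; the matrix is never densified.
vals, vecs = eigsh(a, k=6, sigma=0.0)      # six eigenvalues nearest zero
print(vals)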

>3) On what machine did you solve it?
The guy with the sparse one does fine on Vaxes, Crays, and MIPS machines.

>4) What is the application?  (It's not a complex matrix, is it?!)
Mine is quantum mechanics of vibrations. No, it is real symmetric.
But I need at least some of the eigenvectors, though the eigenvalues
would be good enough to start with. Actually, generating the elements
is a worse problem than the diagonalization, as it requires
doing as many multidimensional integrals as there are independent
matrix elements. I have done it up to 1000x1000, but the results
are too coarse grained to be comparable to experiments.
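
To show the shape of the problem, here is a toy one-dimensional
version (illustrative only -- the basis, potential, and sizes are
made up, and the real problem is multidimensional and vastly larger):
each matrix element is a quadrature over basis functions, and the
resulting real symmetric matrix is then diagonalized.

# Sketch: build a small vibrational-style Hamiltonian by numerical
# quadrature, one integral per independent matrix element, then
# diagonalize it.  Toy 1-D example, illustrative only.
import numpy as np

n_basis, L = 50, 1.0
x = np.linspace(0.0, L, 2001)
dx = x[1] - x[0]

# Particle-in-a-box basis functions, one per row.
basis = np.array([np.sqrt(2.0 / L) * np.sin(k * np.pi * x / L)
                  for k in range(1, n_basis + 1)])
v = 50.0 * (x - 0.5 * L) ** 2                    # an arbitrary smooth potential

h = np.zeros((n_basis, n_basis))
for i in range(n_basis):
    h[i, i] += 0.5 * ((i + 1) * np.pi / L) ** 2  # kinetic term, diagonal here
    for j in range(i, n_basis):
        # One quadrature per independent element -- the expensive step.
        vij = np.sum(basis[i] * v * basis[j]) * dx
        h[i, j] += vij
        h[j, i] = h[i, j]                        # real symmetric

vals, vecs = np.linalg.eigh(h)
print(vals[:5])                                  # lowest few levels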

I used to do the same problem in classical mechanics. Now THERE
is a good problem for parallel computers: doing many systems
of differential equations in parallel - you get as many sets of data
as one wants, all executing the same instructions at the same time.
In other words, one would have say 100 or more different initial
conditions, all running the same set of 20-50 simultaneous 
first order linear equations. This was done way back when, 1973-1975,
on the Illiac IV, which suited it perfectly. Only problem
was, the results only proved that classical mechanics won't
give correct results.
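
A sketch of that structure (NumPy, with a made-up linear system and
sizes): the same small set of equations is advanced in lockstep for
a whole batch of initial conditions.

# Sketch: many copies of the same small ODE system advanced in
# lockstep, one copy per initial condition -- the data-parallel
# structure that suited the Illiac IV.  Illustrative only.
import numpy as np

n_eq = 32              # equations per system (the "20-50" range)
n_traj = 128           # independent initial conditions
dt, n_steps = 0.01, 1000

rng = np.random.default_rng(1)
a = rng.standard_normal((n_eq, n_eq)) * 0.1    # same linear system for all
a -= np.eye(n_eq)                              # keep it comfortably stable
y = rng.standard_normal((n_traj, n_eq))        # one row per trajectory

def deriv(y):
    # Same right-hand side applied to every trajectory at once.
    return y @ a.T

for _ in range(n_steps):
    # Classical RK4, vectorized over the trajectory dimension.
    k1 = deriv(y)
    k2 = deriv(y + 0.5 * dt * k1)
    k3 = deriv(y + 0.5 * dt * k2)
    k4 = deriv(y + dt * k3)
    y = y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

print(y.shape)         # (128, 32): all trajectories advanced together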

Doug McDonald 

brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)

In article <9118@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>Like I was going to say in response to Eugene, you can't get
>behind the wheel of a Ferrari, go to the school zone, and
>complain that you can't go faster than 25 MPH!
I have never complained about not being able to go faster than 25MPH in a
school zone when driving to work in my 71 Vette.  Last time I checked, my
Vette dusted them thar stinking Italian cars from a stop light.

>If you have lots of scalar code (or code that doesn't run very fast
>on a super), and if it can't be rewritten, all you are doing
>is announcing *your* departure from supercomputers.
The series of supercomputers made by Cray was for some time the fastest you
could get for either scalar or vector coding.  LLNL in fact has preferred them
for their superior scalar and short vector performance.  (These days, of course,
we are not so pure in our decision making.  We prefer Cray machines for their
software compatibility with the Cray-1.  Although we are starting to work on
the problem, we have not fully embraced the notion of portable operating
systems or even standard high level language models.  We use carbon dating
to keep track of some of our dusty decks.) The Cyber 205 could provide more
performance on long vectors, so I guess by your reasoning that you would call
it a supercomputer and accuse anyone buying a Cray machine of departing from
the real supercomputers.  I guess the world has decided not to use enough
REAL SUPERCOMPUTERS, because CDC could not sell enough of them to keep the 205
and its children on the market.  Japanese machines blow the doors off the Cray
machines these days (scalar or vector), but CRI is not worried about this.

           Their real nightmares have KILLER MICROS in them.

JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE FLOATING
POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER COMPUTER, OR THAT
IT IS THE "REAL SUPERCOMPUTER".  THE BOTTOM LINE IS HOW FAST DOES A MACHINE
RUN YOUR APPLICATION.  MY APPLICATION HAPPENS TO FARE POORLY ON SIMD MACHINES
AND HAPPENS TO BE VERY EFFICIENT AND HIGHLY PARALLEL ON MIMD MACHINES WITH
PROCESSORS OPTIMIZED FOR SCALAR PERFORMANCE.  BECAUSE OF THIS I AM ACCUSED
OF DEPARTING FROM SUPERCOMPUTING.  THIS IS GIBBERISH.  IF I FIND A MIMD MACHINE
WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS MY APPLICATION
100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF SUPERCOMPUTER.

			   FOR MY APPLICATION


brooks@maddog.llnl.gov, brooks@maddog.uucp

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/24/89)

In article <36553@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <9118@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>>If you have lots of scalar code (or code that doesn't run very fast
>>on a super), and if it can't be rewritten, all you are doing
>>is announcing *your* departure from supercomputers.
>The series of supercomputers made by Cray was for some time the fastest you
>could get for either scalar or vector coding.  LLNL in fact has preferred them
>for their superior scalar and short vector performance.  (These days, of course

Well, yes.  These days you don't always buy a vector machine for its scalar
speed.  The problem is really a lack of modularity in some
vector machines: you have to get the vector unit whether or not you like
it.  There is a definite trend towards more modularity, where the vector
unit is an option.

In fact, I see no reason why a single scalar processor can't support more than one vector unit!

>we are not so pure in our decision making.  We prefer Cray machines for their
>software compatibility with the Cray-1.  Although we are starting to work on
>the problem, we have not fully embraced the notion of portable operating

Like I said, it's a good idea to
REWRITE the code.
The LLNL problem is not a problem.  If you are on the cutting edge of
things, sometimes you get cuts!

>to keep track of some of our dusty decks.) The Cyber 205 could provide more
>performance on long vectors, so I guess by your reasoning that you would call
>it a supercomputer and accuse anyone buying a Cray machine of departing from
>the real supercomputers.  I guess that world has decided to not use enough

I will give you the benefit of the doubt and not even respond to this.
Surely you don't really think that's what I meant.
I only used the YMP because you had used the XMP as an example. 

>REAL SUPERCOMUTERS, because CDC could not sell enough of them to keep the 205
>and its children on the market.  Japanese machines blow the doors off the Cray
>machines these days (scalar or vector), but CRI is not worried about this.

The ETA machine was/is a great machine and absolutely first-rate technology.
The problem of ETA was not the machine or the architecture.  It boiled down
to management, even though they had a lot of good people at all levels.
It was too little, too late, and a problem of not having deep-enough
pockets and a lack of belief, in the business community, in the long term....

>           Their real nightmares have KILLER MICROS in them.
YOUR nightmares, maybe.

>
>JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE FLOATING
>POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER COMPUTER, OR THAT
>IT IS THE "REAL SUPERCOMPUTER".  THE BOTTOM LINE IS HOW FAST DOES A MACHINE

All I asked was to request everyone to stop using the 
single processor XMP as the definition.  Indeed, a supercomputer has
many definitions and all of them are relative to some base characteristic.
As you well know.

>PROCESSORS OPTIMIZED FOR SCALAR PERFORMANCE.  BECAUSE OF THIS I AM ACCUSED
>OF DEPARTING FROM SUPERCOMPUTING.  THIS IS GIBBERISH.  IF I FIND A MIMD MACHINE
>WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS MY APPLICATION
>100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF SUPERCOMPUTER.
>			   FOR MY APPLICATION

BINGO!  For  Y O U R  application!  That's the whole point.
Funny we should start from a general red alert about the attack of
killer micros and end up where we all knew we would:  "It depends on
the application".

Yes, if your application is PageMaker, you redefine supercomputing
by going to a Mac IIci!!

Now in this case, there is something to be said for highly parallel
machines.  Indeed, there is a lot of that happening right here.
My own experience in supercomputing started with loosely coupled systems
and scalar-looking codes.
But it's too early for the supers to pack up and go home.  WAY too early.
Universities are usually several years ahead of the market and they
haven't got it quite right yet.

By the time micros become killers, they won't be micros anymore.

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/24/89)

In article <46500084@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>>1) Is your 1000000x1000000 matrix dense?  If not, how sparse is it?
>It is dense. Totally dense. No element is (in principle) zero.
 
Wow!

>>2) How did you solve it?
>Are you kidding? All I get is uproarious laughter. But, now

There are out-of-core packages that seem to do a good job.  If you can
get the disk space!

>but he does say that the methods are well known ones for sparse
>matrices.

He probably doesn't have the storage problem.  If he does statistics,
I have only seen very sparse ones in that field, so far.
(A side comment: an application that also has huge sparse matrices
is 'animal science'!  An obvious one, right?  I say that only to
point out that real supercomputing applications come up all the time
as people start tackling more realistic problems, etc...)

>>4) What is the application?  (It's not a complex matrix, is it?!)
>Mine is quantum mechanics of vibrations. No, it is real symmetric.
 
How come the tricks used in ab-initio quantum chemistry can't be used?
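
The trick I have in mind is iterative (Lanczos/Davidson-style)
diagonalization: you never store H, only a routine that forms H*v,
and you pull out the few lowest eigenvalues you actually need.  A
hedged sketch, with a made-up matrix-vector routine standing in for
the real matrix elements:

# Sketch: iterative diagonalization that needs only matrix-vector
# products H*v, never the stored matrix.  The matvec below is a
# stand-in; in a real code it would rebuild matrix elements (or read
# them from disk) on the fly.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 200_000                        # far smaller than 10^6, but the same idea

d = np.full(n, 100.0)              # pretend diagonal ("zeroth-order energies")
d[:10] = np.arange(10, dtype=float)   # ten low-lying states of interest

def matvec(v):
    # Stand-in for "apply H to v": the diagonal plus a weak rank-one coupling.
    v = np.asarray(v).reshape(-1)
    return d * v + (v.sum() / n) * np.ones_like(v)

h = LinearOperator((n, n), matvec=matvec, dtype=np.float64)
vals = eigsh(h, k=4, which="SA", return_eigenvectors=False)
print(np.sort(vals))               # a few lowest eigenvalues, no n*n storage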

>on the Illiac IV, which suited it perfectly. Only problem
>was, the results only proved that classical mechanics won't
>give correct results.

I find this a fascinating subject.
Here's a real supercomputing application with use and need for
everything, it seems.
 
*What* is an ideal architecture for dealing with a 1,000,000x1,000,000
dense matrix??

brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)

In article <9123@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>>           Their real nightmares have KILLER MICROS in them.
>YOUR nightmares, maybe.
I have pleasant dreams about KILLER MICROS.  Let me give you a quote from
an engineer at CRI; I will withhold the name to protect the innocent.

"We have lost the scalar performance battle to the microprocessor...
                 We will never recover it."

Clearly, you are not worried, but the fellow from CRI is quite worried.


>All I asked was to request everyone to stop using the 
>single processor XMP as the definition.
Comparing CPU to CPU, the YMP is only 30% faster than the XMP.
Twice as many are supplied in a box.  There is no significant difference
between their performances on scalar code that won't be completely overrun
by the next wave of killer micros in 6 months.  Any crap about 8 YMP
CPUs being lots faster than one XMP CPU will only be answered by
having to deal with 1024 KILLER MICROS.

>>			   FOR MY APPLICATION
>
>BINGO!  For  Y O U R  application!  That's the whole point.
>Funny we should start from a general red alert about the attack of
>killer micros and end up where we all knew we would:  "It depends on
>the application".
Whether or not the current KILLER MICROS actually OUTRUN current
supercomputers is application dependent.  It is not, however, a set
of measure zero and the set of such applications is growing. KILLER
MICROS are more cost effective for ALL APPLICATIONS, however, with
the supercomputers only getting close to BREAK EVEN on codes which run
both the adder and multiplier on every cycle.  Highly parallel machines are
also applicable to these embarrassingly parallel problems, and the KILLER
MICRO powered multiprocessors will provide both more total horsepower
and more total main memory.

>
>Yes, if your application is PageMaker, you redefine supercomputing
>by going to a Mac IIci!!
I see no need to attempt an insult; my application is not PageMaker.
It is getting difficult to carry on a sensible discussion with such
stuff leaking in.  I will not reply to postings containing this sort
of abusive rhetoric in the future.



>But it's too early for the supers to pack up and go home.  WAY too early.
Supers have been sent home with their tails between their legs for
scalar code by the KILLER MICROS.  It will only be a matter of 5 years
before the same thing happens for vectorized code.  We probably won't
call them micros, out of respect for the dearly departed, but micros
is what they will be.


brooks@maddog.llnl.gov, brooks@maddog.uucp

gillies@m.cs.uiuc.edu (10/25/89)

Re: KILLER MICROS

At some point I heard that MIPS pulled out just about every stop to
speed up the floating-point performance of the R2000/R3000.  In other words,
they hardwired all the FPU ops, provided 32- or 64-bit circuitry
wherever it was needed, and used all the optimal designs (like carry
lookahead and Wallace trees -- please excuse my ignorance of
arithmetic circuitry) in their arithmetic unit?

So now they just wait for device technology and heavy pipelining to
speed up their chip?  Couldn't Crays make a small comeback by
exploiting their 1's complement arithmetic, which is supposed to be an
inherently faster number system for digital implementation?

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)

Here is a posting on the 1Mx1M matrix problem that I got from 
my friend Brad Carlile at FPS.  It is a nice analysis and
I have permission to post it for him.  

As users (i.e. customers!) move to more realistic 3D analysis,
finer mesh sizes, more realistic models, etc., I think we will
start seeing more problems like this.  A super still has to do
small problems fast, but its competitiveness shows better on
these "real"istic problems!
---------

 From fpssun!@tektronix.tek.com:!bradc Fri Oct 27 16:34:25 1989
 To: kahn@tcgould.TN.CORNELL.EDU
 Subject: Re: 1000000x1000000 Matrix (was: li
 In-Reply-To: your article <9125@batcomputer.tn.cornell.edu>
 Status: R

 Solving a 1M x 1M matrix, a very good problem indeed.

 First, let's consider storage.

 Storage of a real symmetric matrix = N(N+1)/2 elements, where N is the
 size of the matrix.  If the matrix is general, then storage of N*N
 elements is required.  The table below gives the number of bytes required
 to store the (1M x 1M) matrix in 32-bit and 64-bit floating-point numbers.
 We will look at both Real Symmetric (RS) and Real
 non-symmetric (General).

 Storage(in bytes)     RS        General
 -----------------  -------      -------
 32-bit             2*10^12      4*10^12
 64-bit             4*10^12      8*10^12
 
 At this point one can consider the number of address bits required for
 this storage.  
 
 Address bits needed     RS        General
 -------------------  -------      -------
 32-bit                  41          42
 64-bit                  42          43
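
 A few lines of Python (purely illustrative) reproduce both tables
 above from the element counts and word sizes:

# Sketch: reproduce the storage and address-bit tables above.
import math

N = 1_000_000
for word_bytes in (4, 8):                      # 32-bit and 64-bit words
    rs = N * (N + 1) // 2 * word_bytes         # real symmetric: N(N+1)/2 elements
    gen = N * N * word_bytes                   # general: N*N elements
    print(f"{word_bytes * 8}-bit  RS: {rs:.1e} bytes "
          f"({math.ceil(math.log2(rs))} address bits), "
          f"General: {gen:.1e} bytes "
          f"({math.ceil(math.log2(gen))} address bits)")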
 
 At this size of matrix we are left with two options.  First, we are
 outside the standard 32-bit address space of most computers.  In the
 future the industry should consider using more address bits.  Our second
 option is to solve the problem in an out-of-core manner.  This option
 requires us to store most of the matrix on disk and bring in blocks of
 data to operate on (these blocks must be small enough to fit in the
 32-bit address space, leaving room for the program, os, etc.).  Even
 though it sounds like this may be an I/O-limited problem, it is not.  The
 reuse in the sub-matrix partitioning is enough to double buffer and hide
 most of the I/O to disk.
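
 Here is a sketch of the blocking idea (illustrative only, not any
 particular vendor's library): the arrays live on disk as memory-mapped
 files, one block is pulled in at a time, and each block's I/O is
 amortized over O(block^3) arithmetic.  A plain blocked matrix multiply
 shows the reuse; an out-of-core factorization is organized the same way.

# Sketch of out-of-core blocking (sizes kept tiny so it actually runs).
import numpy as np
import tempfile, os

n, blk = 1024, 256
tmp = tempfile.mkdtemp()

def on_disk(name, init=None):
    m = np.memmap(os.path.join(tmp, name), dtype=np.float64,
                  mode="w+", shape=(n, n))
    if init is not None:
        m[:] = init
    return m

rng = np.random.default_rng(2)
a = on_disk("a.dat", rng.standard_normal((n, n)))
b = on_disk("b.dat", rng.standard_normal((n, n)))
c = on_disk("c.dat", 0.0)

for i in range(0, n, blk):
    for j in range(0, n, blk):
        acc = np.zeros((blk, blk))
        for k in range(0, n, blk):
            # Each pair of blocks read from disk feeds 2*blk^3 flops.
            acc += np.array(a[i:i+blk, k:k+blk]) @ np.array(b[k:k+blk, j:j+blk])
        c[i:i+blk, j:j+blk] = acc

print(np.allclose(c, np.asarray(a) @ np.asarray(b)))   # True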
 
 Now let's turn to the computation.
 
 The problem has two phases: factor and solve.  One thing not mentioned,
 or I missed it, was the number of vectors one is solving for in this
 problem.  
 
 Flops needed            RS        General 
 -------------------  -------      -------
 Factor               (N^3)/3     (2/3)*N^3 
 Solve                2*X*N^2      2*X*N^2 
 
 Where X is the number of vectors that one is solving for.
 
 Flops needed (1M*1M)    RS        General 
 -------------------  -------      -------
 Factor               3*10^17      6*10^17
 Solve               X*2*10^12    X*2*10^12   
 
 Factoring is the dominant computational cost.  The table below
 estimates the time required for the factor step on machines of
 different speeds.
 
 Machine                 RS
 ----------           --------
 100 MFlops           96.1 years
 1 GFlops             9.52 years
 10 GFlops            348 days
 100 GFlops           34.7 days
 1 TFlops             3.47 days
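
 The same numbers can be checked in a few lines (illustrative; the
 table above evidently used the rounded 3*10^17 flop figure, so its
 times come out slightly lower than N^3/3 gives):

# Sketch: reproduce the flop counts and run-time estimates above.
N = 1_000_000
x = 1                                     # number of right-hand-side vectors

factor_rs, factor_gen = N**3 / 3, 2 * N**3 / 3
solve = 2 * x * N**2
print(f"factor: RS {factor_rs:.1e} flops, general {factor_gen:.1e} flops; "
      f"solve ({x} vector): {solve:.1e} flops")

year = 365.25 * 24 * 3600
for rate in (100e6, 1e9, 10e9, 100e9, 1e12):      # 100 MFlops ... 1 TFlops
    secs = factor_rs / rate
    print(f"{rate:9.0e} flop/s: {secs / year:8.2f} years "
          f"({secs / 86400:10.1f} days)")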
 
 I like problems like this; they are fun.
 
 A couple of other notes.
 
 When you solve large matrices you are working with matrix-multiply-based
 kernels.  The adder and multiplier operations are balanced, and this
 is a compute-bound problem.  It is also not memory-bound like the
 100x100 LINPACK benchmark.  [Do not attempt this size of problem using
 the LINPACK 100 code :) -- I've always wanted to use one of those ":)"]
 The real-symmetric case does not require pivoting for numerical
 stability.  I would suggest doing this calculation in 64-bit, or
 larger, floating-point numbers.
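
 To make the matrix-multiply point concrete, here is a sketch of a
 blocked factorization (a blocked Cholesky, which assumes the matrix is
 positive definite -- the symmetric indefinite LDL^T case has the same
 GEMM-dominated structure).  Nearly all of the flops land in the
 trailing update, a matrix multiply with adds and multiplies in balance:

# Sketch: blocked Cholesky factorization; the trailing update is a
# matrix multiply and does nearly all the arithmetic.  Assumes a
# symmetric positive-definite matrix; illustrative only.
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(a, blk=64):
    """Return lower-triangular L with L @ L.T ~= a (a must be SPD)."""
    l = np.tril(a).astype(float)
    n = l.shape[0]
    for k in range(0, n, blk):
        e = min(k + blk, n)
        # 1. Small dense factorization of the diagonal block.
        l[k:e, k:e] = np.linalg.cholesky(l[k:e, k:e])
        if e < n:
            # 2. Triangular solve for the panel below it.
            l[e:, k:e] = solve_triangular(l[k:e, k:e], l[e:, k:e].T, lower=True).T
            # 3. Trailing update: the matrix multiply that dominates the flops.
            l[e:, e:] -= np.tril(l[e:, k:e] @ l[e:, k:e].T)
    return l

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    m = rng.standard_normal((512, 512))
    spd = m @ m.T + 512 * np.eye(512)       # symmetric positive-definite test matrix
    l = blocked_cholesky(spd)
    print(np.allclose(l @ l.T, spd))        # True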
 
 I hope I haven't made any errors in my calculations; I am doing this
 from memory.
 
 Brad Carlile