kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/23/89)
In article <46500082@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!

Perhaps you could use a different algorithm to diagonalize many 20x20
matrices.  It sounds like the kind of thing that could be rewritten with
significant improvement.

Like I was going to say in response to Eugene, you can't get behind the
wheel of a Ferrari, go to the school zone, and complain that you can't go
faster than 25 MPH!  If you have lots of scalar code (or code that doesn't
run very fast on a super), and if it can't be rewritten, all you are doing
is announcing *your* departure from supercomputers.

There will always be applications that need the combination of speed,
memory size, memory bandwidth, I/O bandwidth, disk size, number of users
supported, etc. that only supers provide.  Yes, as advances are made, a
super will be just a smart way of integrating off-the-shelf components
rather than designing the whole machine from scratch (although, as someone
else pointed out, there is something to be said for total re-designs).
And don't assume for a second that those who make supers have not noticed
this!

But that's not really the point of this posting!!  My real question is:

1) Is your 1000000x1000000 matrix dense?  If not, how sparse is it?
2) How did you solve it?
3) On what machine did you solve it?
4) What is the application?  (It's not a complex matrix, is it?!)

Finally, I don't see why *any* micro would be better at this one than a
super?!
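The rewrite Kahn suggests is essentially batching.  A sketch in modern
Python with NumPy (which, of course, did not exist in 1989; the sizes and
names are invented for illustration): many small symmetric eigenproblems
are handled in one stacked call, which amortizes the per-matrix overhead
that dominates at 20x20.

```python
import numpy as np

# Diagonalize a batch of small real symmetric matrices in one call
# rather than one at a time.  numpy.linalg.eigh accepts a stacked
# (batch, n, n) array and iterates internally.

rng = np.random.default_rng(0)
batch, n = 1000, 20

# Build 1000 random real symmetric 20x20 matrices.
a = rng.standard_normal((batch, n, n))
a = (a + a.transpose(0, 2, 1)) / 2.0

# w: (1000, 20) eigenvalues; v: (1000, 20, 20) eigenvector columns.
w, v = np.linalg.eigh(a)

print(w.shape, v.shape)
```

On a vector machine the analogous trick is to run the *same* step of the
algorithm across all matrices at once, so the vector length is the batch
size rather than 20.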
mcdonald@uxe.cso.uiuc.edu (10/23/89)
In article <46500082@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>I heartily agree with that: I have two projects that I tried on
>a big supercomputer: one diagonalized zillions of 20x20 matrices,
>the other wants to diagonalize a single 1000000x1000000 matrix.
>Neither was suitable for a Cray!

Somebody replies:
>Perhaps you could use a different algorithm to diagonalize
>many 20x20 matrices. It sounds like the kind of thing that
>could be re-written with significant improvement.

The Cray does this just fine.  It is just that cheaper computers are more
cost effective (even with the Cray doing the vectorizable part
efficiently) - but, at the present time, the more important reason not to
use the Cray for things like this is accessibility, which is better for
lesser machines.

>But that's not really the point of this posting!! My real question is:
>1) Is your 1000000x1000000 matrix dense? If not, how sparse is it?

It is dense.  Totally dense.  No element is (in principle) zero.

>2) How did you solve it?

Are you kidding?  All I get is uproarious laughter.  But, now and then,
there begins to look like hope in the next few years!

On the other hand, we have a theoretician here who works on equally
large, but very sparse (I don't know how "very") matrices.  These he does
just fine.  I don't really know how, as he is rather secretive, but he
does say that the methods are well-known ones for sparse matrices.

>3) On what machine did you solve it?

The guy with the sparse one does fine on VAXes, Crays, and MIPSes.

>4) What is the application? (It's not a complex matrix, is it?!)

Mine is quantum mechanics of vibrations.  No, it is real symmetric.  But
I need at least some of the eigenvectors, though the eigenvalues would be
good enough to start with.  Actually, generating the elements is a worse
problem than the diagonalization, as it requires doing as many
multidimensional integrals as there are independent matrix elements.
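A toy of this problem class, sketched in modern Python with NumPy: a
vibrational Hamiltonian built in a harmonic-oscillator basis.  The 1-D
quartic potential and coupling constant here are invented for the
example; its matrix elements happen to be analytic, which is exactly the
shortcut a realistic multidimensional potential (requiring those
multidimensional integrals) does not allow.

```python
import numpy as np

# H = p^2/2 + x^2/2 + lam*x^4 in the harmonic-oscillator basis,
# with hbar = m = omega = 1.

n = 60        # basis size; the lowest levels are well converged
lam = 0.1     # quartic coupling (assumed value for the example)

# Position operator in the HO basis: x = (a + a^dagger)/sqrt(2),
# so x[k, k+1] = x[k+1, k] = sqrt((k+1)/2).
x = np.zeros((n, n))
for k in range(n - 1):
    x[k, k + 1] = x[k + 1, k] = np.sqrt((k + 1) / 2.0)

# Harmonic part is diagonal (n + 1/2); quartic part is x matrix to the 4th.
h = np.diag(np.arange(n) + 0.5) + lam * np.linalg.matrix_power(x, 4)

w = np.linalg.eigvalsh(h)   # lowest vibrational levels, ascending
print(w[:4])
```

The point of the sketch: once the elements exist, the diagonalization is
routine; the hard part McDonald describes is that each element of his
real matrix costs a multidimensional integral.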
I have done it up to 1000x1000, but the results are too coarse-grained to
be comparable to experiments.

I used to do the same problem in classical mechanics.  Now THERE is a
good problem for parallel computers: doing many systems of differential
equations in parallel - you get as many sets of data as one wants, all
executing the same instructions at the same time.  In other words, one
would have, say, 100 or more different initial conditions, all running
the same set of 20-50 simultaneous first-order linear equations.  This
was done way back when, 1973-1975, on the Illiac IV, which suited it
perfectly.  Only problem was, the results only proved that classical
mechanics won't give correct results.

Doug McDonald
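The lockstep-ensemble structure McDonald describes can be sketched in
modern Python with NumPy (sizes and the random system matrix are invented
for illustration): the ensemble of initial conditions is just an extra
array axis, so a single Runge-Kutta step advances every trajectory with
the same instructions at the same time, much as the Illiac IV did it.

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One classical Runge-Kutta step, applied to every ensemble member."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

n_eq, n_traj = 20, 100          # 20 equations, 100 initial conditions
rng = np.random.default_rng(1)
a_mat = rng.standard_normal((n_eq, n_eq)) * 0.1

def f(t, y):
    # Linear system dy/dt = A y; y has shape (n_eq, n_traj), so the
    # matrix product advances all 100 trajectories at once.
    return a_mat @ y

y = rng.standard_normal((n_eq, n_traj))
t, h = 0.0, 0.01
for _ in range(100):
    y = rk4_step(f, t, y, h)
    t += h

print(y.shape)
```

This is data parallelism in its purest form: the trajectory axis carries
no dependences, so it maps equally well onto SIMD hardware or onto many
independent "killer micros".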
brooks@vette.llnl.gov (Eugene Brooks) (10/23/89)
In article <9118@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>Like I was going to say in response to Eugene, you can't get
>behind the wheel of a Ferrari, go to the school zone, and
>complain that you can't go faster than 25 MPH!

I have never complained about not being able to go faster than 25 MPH in
a school zone when driving to work in my 71 Vette.  Last time I checked,
my Vette dusted them thar stinking Italian cars from a stop light.

>If you have lots of scalar code (or code that doesn't run very fast
>on a super), and if it can't be rewritten, all you are doing
>is announcing *your* departure from supercomputers.

The series of supercomputers made by Cray was for some time the fastest
you could get for either scalar or vector coding.  LLNL in fact has
preferred them for their superior scalar and short-vector performance.
(These days, of course, we are not so pure in our decision making.  We
prefer Cray machines for their software compatibility with the Cray-1.
Although we are starting to work on the problem, we have not fully
embraced the notion of portable operating systems or even standard
high-level language models.  We use carbon dating to keep track of some
of our dusty decks.)

The Cyber 205 could provide more performance on long vectors, so I guess
by your reasoning you would call it a supercomputer and accuse anyone
buying a Cray machine of departing from the real supercomputers.  I guess
the world has decided not to use enough REAL SUPERCOMPUTERS, because CDC
could not sell enough of them to keep the 205 and its children on the
market.  Japanese machines blow the doors off the Cray machines these
days (scalar or vector), but CRI is not worried about this.  Their real
nightmares have KILLER MICROS in them.

JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE
FLOATING POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER
COMPUTER, OR THAT IT IS THE "REAL SUPERCOMPUTER".
THE BOTTOM LINE IS HOW FAST DOES A MACHINE RUN YOUR APPLICATION.  MY
APPLICATION HAPPENS TO FARE POORLY ON SIMD MACHINES AND HAPPENS TO BE
VERY EFFICIENT AND HIGHLY PARALLEL ON MIMD MACHINES WITH PROCESSORS
OPTIMIZED FOR SCALAR PERFORMANCE.  BECAUSE OF THIS I AM ACCUSED OF
DEPARTING FROM SUPERCOMPUTING.  THIS IS GIBBERISH.  IF I FIND A MIMD
MACHINE WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS
MY APPLICATION 100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF
SUPERCOMPUTER.

	FOR MY APPLICATION

brooks@maddog.llnl.gov, brooks@maddog.uucp
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/24/89)
In article <36553@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <9118@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>>If you have lots of scalar code (or code that doesn't run very fast
>>on a super), and if it can't be rewritten, all you are doing
>>is announcing *your* departure from supercomputers.
>The series of supercomputers made by Cray was for some time the fastest you
>could get for either scalar or vector coding. LLNL in fact has preferred them
>for their superior scalar and short-vector performance. (These days, of course

Well, yes.  These days you don't always buy a vector machine for its
scalar speed.  The problem is really a lack of modularity in some vector
machines: you have to get the vector unit whether or not you like it.
There is a definite trend towards more modularity, where the vector unit
is an option.  In fact, I see no reason why a single scalar unit can't
support more than one vector unit!

>we are not so pure in our decision making. We prefer Cray machines for their
>software compatibility with the Cray-1. Although we are starting to work on
>the problem, we have not fully embraced the notion of portable operating

Like I said, it's a good idea to REWRITE the code.  The LLNL problem is
not a problem.  If you are on the cutting edge of things, sometimes you
get cuts!

>to keep track of some of our dusty decks.) The Cyber 205 could provide more
>performance on long vectors, so I guess by your reasoning you would call
>it a supercomputer and accuse anyone buying a Cray machine of departing from
>the real supercomputers. I guess the world has decided not to use enough

I will give you the benefit of the doubt and not even respond to this.
Surely you don't really think that's what I meant.  I only used the YMP
because you had used the XMP as an example.

>REAL SUPERCOMPUTERS, because CDC could not sell enough of them to keep the 205
>and its children on the market.
>Japanese machines blow the doors off the Cray
>machines these days (scalar or vector), but CRI is not worried about this.

The ETA machine was/is a great machine and absolutely first-rate
technology.  The problem of ETA was not the machine or the architecture.
It boiled down to management, even though they had a lot of good people
at all levels.  It was too little too late, and a problem of not having
deep-enough pockets and a lack of belief, in the business community, in
the long term....

>Their real nightmares have KILLER MICROS in them.

YOUR nightmares, maybe.

>JUST BECAUSE ONE FRIGGING COMPUTER HAS A HIGHER UPPER BOUND ON THE FLOATING
>POINT RATE THAN ANOTHER DOES NOT MEAN THAT IT IS A BETTER COMPUTER, OR THAT
>IT IS THE "REAL SUPERCOMPUTER". THE BOTTOM LINE IS HOW FAST DOES A MACHINE

All I asked was for everyone to stop using the single-processor XMP as
the definition.  Indeed, a supercomputer has many definitions, and all of
them are relative to some base characteristic.  As you well know.

>PROCESSORS OPTIMIZED FOR SCALAR PERFORMANCE. BECAUSE OF THIS I AM ACCUSED
>OF DEPARTING FROM SUPERCOMPUTING. THIS IS GIBBERISH. IF I FIND A MIMD MACHINE
>WITH 100 "KILLER MICROS" COSTING THE SAME AS A YMP AND WHICH RUNS MY APPLICATION
>100 TIMES FASTER, I HAVE SIMPLY REDEFINED THE NOTION OF SUPERCOMPUTER.
> FOR MY APPLICATION

BINGO!  For Y O U R application!  That's the whole point.  Funny we
should start from a general red alert about the attack of killer micros
and end up where we all knew: "It depends on the application".  Yes, if
your application is "pagemaker" you redefine supercomputing by going to a
Mac IIci!!

Now in this case, there is something to be said for highly parallel
machines.  Indeed, there is a lot of that happening right here.  My own
experience in supercomputing started with loosely coupled systems and
scalar-looking codes.  But it's too early for the supers to pack up and
go home.  WAY too early.  Universities are usually several years ahead of
the market and they haven't got it quite right yet.  By the time micros
become killers, they won't be micros anymore.
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/24/89)
In article <46500084@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>>1) Is your 1000000x1000000 matrix dense? If not, how sparse is it?
>It is dense. Totally dense. No element is (in principle) zero.

Wow!

>>2) How did you solve it?
>Are you kidding? All I get is uproarious laughter. But, now

There are out-of-core packages that seem to do a good job.  If you can
get the disk space!

>but he does say that the methods are well-known ones for sparse
>matrices.

He probably doesn't have the storage problem.  If he does statistics, I
have only seen very sparse ones in that field, so far.  (A side comment:
an application also with huge sparse matrices is 'animal science'!  An
obvious one, right?  I say that only to point out that real
supercomputing applications come up all the time as people start tackling
more realistic problems, etc...)

>>4) What is the application? (It's not a complex matrix, is it?!)
>Mine is quantum mechanics of vibrations. No, it is real symmetric.

How come the tricks used in ab initio quantum chemistry can't be used?

>on the Illiac IV, which suited it perfectly. Only problem
>was, the results only proved that classical mechanics won't
>give correct results.

I find this a fascinating subject.  Here's a real supercomputing
application with use and need for everything, it seems.  *What* is an
ideal architecture for dealing with a 1,000,000x1,000,000 dense matrix??
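The "well-known methods for sparse matrices" the theoretician alludes to
are almost certainly Lanczos-type iterations, which need only
matrix-vector products y = A x, so the matrix never has to be stored
densely.  A minimal sketch in modern Python with NumPy (the operator,
sizes, and step count are invented for illustration; a production code
would add reorthogonalization):

```python
import numpy as np

def lanczos_extreme_eigs(matvec, n, m=60, seed=0):
    """Approximate the extreme eigenvalues of a symmetric operator via
    m steps of Lanczos tridiagonalization (no reorthogonalization)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    q_prev = np.zeros(n)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(m):
        w = matvec(q) - beta * q_prev        # three-term recurrence
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, w / beta
    # Eigenvalues of the small tridiagonal matrix (Ritz values)
    # approximate the extreme eigenvalues of the big operator.
    t = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    theta = np.linalg.eigvalsh(t)
    return theta[0], theta[-1]

# A sparse symmetric test operator, applied matrix-free: tridiagonal,
# so only O(n) storage instead of O(n^2).
n = 5000
diag = np.linspace(1.0, 2.0, n)

def matvec(x):
    y = diag * x
    y[:-1] += -0.25 * x[1:]
    y[1:] += -0.25 * x[:-1]
    return y

lo, hi = lanczos_extreme_eigs(matvec, n)
print(lo, hi)
```

Because only `matvec` touches the matrix, the same code works whether the
nonzeros live in memory, on disk, or are recomputed on the fly, which is
why these methods scale to matrices that could never be stored densely.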
brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)
In article <9123@batcomputer.tn.cornell.edu> kahn@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
>> Their real nightmares have KILLER MICROS in them.
>YOUR nightmares, maybe.

I have pleasant dreams about KILLER MICROS.  Let me give you a quote from
an engineer at CRI; I will withhold the name to protect the innocent.
"We have lost the scalar performance battle to the microprocessor...  We
will never recover it."  Clearly, you are not worried, but the fellow
from CRI is quite worried.

>All I asked was for everyone to stop using the
>single-processor XMP as the definition.

Comparing CPU to CPU, the YMP is only 30% faster than the XMP.  Twice as
many are supplied in a box.  There is no significant difference between
their performances on scalar code that won't be completely overrun by the
next wave of killer micros in 6 months.  Any crap about 8 YMP CPUs being
lots faster than one XMP CPU will only be answered by having to deal with
1024 KILLER MICROS.

>> FOR MY APPLICATION
>
>BINGO! For Y O U R application! That's the whole point.
>Funny we should start from a general red alert about the attack of
>killer micros and end up where we all knew: "It depends on
>the application".

Whether or not the current KILLER MICROS actually OUTRUN current
supercomputers is application dependent.  It is not, however, a set of
measure zero, and the set of such applications is growing.  KILLER MICROS
are more cost effective for ALL APPLICATIONS, however, with the
supercomputers only getting close to BREAK EVEN on codes which run both
the adder and the multiplier on every cycle.  Highly parallel machines
are also applicable to these embarrassingly parallel problems, and the
KILLER MICRO powered multiprocessors will provide both more total
horsepower and more total main memory.

>Yes, if your application is "pagemaker" you redefine supercomputing
>by going to a Mac IIci!!

I see no need to attempt an insult; my application is not "pagemaker".
It is getting difficult to carry on a sensible discussion with such stuff
leaking in.  I will not reply to postings containing this sort of abusive
rhetoric in the future.

>But it's too early for the supers to pack up and go home. WAY too early.

Supers have been sent home with their tails between their legs for scalar
code by the KILLER MICROS.  It will only be a matter of 5 years before
the same thing happens for vectorized code.  We probably won't call them
micros, out of respect for the dearly departed, but micros is what they
will be.

brooks@maddog.llnl.gov, brooks@maddog.uucp
gillies@m.cs.uiuc.edu (10/25/89)
Re: KILLER MICROS

At some point I heard that MIPS pulled out just about every stop to speed
up the floating-point performance of the R2000/R3000.  In other words,
they hardwired all the FPU ops, provided 32- or 64-bit circuitry wherever
it was needed, and used all the optimal designs (like carry lookahead and
Wallace trees -- please excuse my ignorance of arithmetic circuitry) in
their arithmetic unit.  So now they just wait for device technology and
heavy pipelining to speed up their chip?

Couldn't Cray make a small comeback by exploiting their 1's complement
arithmetic, which is supposed to be an inherently faster number system
for digital implementation?
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (10/28/89)
Here is a posting on the 1Mx1M matrix problem that I got from my friend
Brad Carlile at FPS.  It is a nice analysis and I have permission to post
it for him.  As users (i.e. customers!) move to more realistic 3D
analysis, finer mesh sizes, more realistic models, etc., I think we will
start seeing more problems like this.  A super still has to do small
problems fast, but its competitiveness shows better for these "real"istic
problems!

---------
From fpssun!@tektronix.tek.com:!bradc Fri Oct 27 16:34:25 1989
To: kahn@tcgould.TN.CORNELL.EDU
Subject: Re: 1000000x1000000 Matrix (was: li
In-Reply-To: your article <9125@batcomputer.tn.cornell.edu>
Status: R

Solving a 1M x 1M matrix, a very good problem indeed.

First, let's consider storage.  A real symmetric matrix needs N(N+1)/2
elements of storage, where N is the size of the matrix.  If the matrix is
general, then storage of N*N elements is required.  The table below gives
the number of bytes required to store the (1M x 1M) matrix in 32-bit and
64-bit floating-point numbers, for both the real symmetric (RS) and the
real non-symmetric (general) case.

    Storage (in bytes)   RS         General
    ------------------   -------    -------
    32-bit               2*10^12    4*10^12
    64-bit               4*10^12    8*10^12

At this point one can consider the number of address bits required for
this storage.

    Address bits needed  RS         General
    -------------------  -------    -------
    32-bit               41         42
    64-bit               42         43

At this size of matrix we are left with two options.  First, we are
outside the standard 32-bit address space of most computers.  In the
future the industry should consider using more address bits.  Our second
option is to solve the problem in an out-of-core manner.  This option
requires us to store most of the matrix on disk and bring in blocks of
data to operate on (these blocks must be small enough to fit in the
32-bit address space, leaving room for the program, OS, etc.).  Even
though it sounds like this may be an I/O-limited problem, it is not.
The reuse in the sub-matrix partitioning is enough to double-buffer and
hide most of the I/O to disk.

Now let's turn to the computation.  The problem has two phases: factor
and solve.  One thing not mentioned, or I missed it, was the number of
vectors one is solving for in this problem.

    Flops needed         RS         General
    -------------------  -------    ---------
    Factor               (N^3)/3    (2/3)*N^3
    Solve                2*X*N^2    2*X*N^2

where X is the number of vectors one is solving for.

    Flops needed (1M*1M) RS         General
    -------------------  -------    ---------
    Factor               3*10^17    6*10^17
    Solve                X*2*10^12  X*2*10^12

Factoring is the major computational cost.  The table below estimates the
time required to factor the RS problem on machines of different speeds.

    Machine              RS
    ----------           ----------
    100 MFlops           96.1 years
    1 GFlops             9.52 years
    10 GFlops            348 days
    100 GFlops           34.7 days
    1 TFlops             3.47 days

I like problems like this; they are fun.  A couple of other notes.  When
you solve large matrices you are working with matrix-multiply-based
kernels.  The adder and multiplier operations are balanced, and this is a
compute-bound problem.  It is also not memory-bound like the 100x100
LINPACK benchmark.  [Do not attempt this size of problem using the
LINPACK 100 code :), I've always wanted to use one of those ":)"s.]  The
real-symmetric case does not require pivoting for numerical stability.  I
would suggest doing this calculation in 64-bit, or larger, floating-point
numbers.

I hope I haven't made any errors in my calculations; I am doing this from
memory.

Brad Carlile
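Carlile's estimates are easy to check by direct arithmetic.  A sketch in
modern Python (using the exact N^3/3, which comes out slightly above the
rounded 3*10^17 his time table is based on):

```python
import math

N = 10**6

# Storage: N(N+1)/2 elements for real symmetric (RS), N*N for general.
for word_bytes in (4, 8):
    rs = N * (N + 1) // 2 * word_bytes
    gen = N * N * word_bytes
    print(f"{word_bytes * 8}-bit  RS = {rs:.1e} bytes  general = {gen:.1e} bytes")

# Address bits needed to span the 64-bit general case (8*10^12 bytes).
print("64-bit general needs", math.ceil(math.log2(8 * N * N)), "address bits")

# Flops to factor the RS case, and wall-clock time at various rates.
factor_flops = N**3 / 3
year = 365.25 * 24 * 3600
for rate in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{rate:.0e} flop/s -> {factor_flops / rate / year:7.2f} years")
```

The exact factor count, 3.33*10^17 flops, gives about 106 years at 100
MFlops rather than 96.1, so Carlile's figures are slight underestimates;
the conclusion (teraflops-class hardware or weeks of runtime) is
unchanged.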