pmk@hall.cray.com (Peter Klausler) (03/25/88)
In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >As long as programmers are taught to think in terms of a language, rather
> >than the machine, it will be difficult to get vendors to allow the forcing
> >of inline.
>
> As long as programmers are taught to think in terms of a machine,
> rather than the language, it will be difficult to get portable code
> that can be moved to that 1 teraflop computer that will come out
> tomorrow, then to the 10 teraflop computer that will come out the day
> after, and then to the 100 teraflop computer that will come out a week
> from Monday.

A routine written for optimal performance on one architecture may be coded
so that it is portable to others, but it'll be optimal on only the one.
Write for optimal performance on a VAX and your code will crawl on a Cray.
Code for maximal portability and you'll be suboptimal everywhere, having
reduced yourself to a lowest common denominator of machine. Compilers can
hide lots of annoying differences, but they can't change a nifty linked
VAX-optimal data structure into a simple vectorizable Cray-optimal array.

This is not to say that performance goals necessarily rule out portability,
just that portability may be restricted to a smaller range of systems.
Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your local
Mimsy Data sales office :-)) are not likely to be much like an 8088 or 6502.
But they could look similar to each other - and their differences in
instruction timings, functional units, and processor counts would be
differences that one can expect a reasonable compiler to handle. It's
reasonable to write code that's portable to both the Cray-1/A and the
Cray-3 (with minimal machine-dependent tweaking) and still give maximal
performance on each, but you'll sacrifice PC/AT portability in the process.
What I'm getting at with all this rambling is this: Knowing your target
architecture is a GOOD THING if you're after optimal performance. Knowing
your processor's assembly language - and how to optimize it - will make you
a better user of your "high-level" language (FORTRAN, C, etc.). It will
also frustrate you, as it has me, and as I fear it has frustrated Herman.

(Could a language be designed that would preserve some measure of
portability across some set of supercomputer systems while allowing maximal
control and performance? C and C++ are great at this amongst scalar
byte-addressable architectures but seem inadequate (to me!) to the task of
extracting the most speed from a multiprocessor vector box with only
large-word addressing. I've been trying (on and off) to devise such a
language. Not easy, and probably not desirable, but an interesting
experiment nonetheless. Mail ideas, please.)

[The above is solely my opinion, and is not to be construed as the
viewpoint of CRI or anyone else, including my other personalities. Have a
nice weekend.]
edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/26/88)
This really doesn't belong in comp.lang.c but.....

> Compilers can
> hide lots of annoying differences, but they can't change a nifty linked
> VAX-optimal data structure into a simple vectorizable Cray-optimal array.

I ask you to prove the above statement.

I think this debate is really over who is responsible for optimizing code:
the compiler or the programmer. It's the compiler's job, in my opinion. I'm
not going to list the many reasons why. However, I have to admit I do hand
tune my code sometimes to the particular machine that I'm working on, but
only because the compiler I use "ccp" is not a very good optimizing
compiler.

> This is not to say that performance goals necessarily rule out portability,
> just that portability may be restricted to a smaller range of systems.
> Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your
> local Mimsy Data sales office :-)) are not likely to be much like
> an 8088 or 6502.

You laugh, but a representative of IBM Yorktown came to CMU and gave a talk
on the TF3, a 3 teraflop machine that they plan to deliver in two years. It
consists of 4096 processing elements interconnected by an omega ring. BTW,
if I remember correctly, the machine will take on the order of 30 Megawatts
of power to run. Does anyone know how much money that is per hour?

> (Could a language be designed that would preserve some measure of portability
> across some set of supercomputer systems while allowing maximal control and
> performance? C and C++ are great at this amongst scalar byte-addressable
> architectures but seem inadequate (to me!) to the task of extracting the most
> speed from a multiprocessor vector box with only large-word addressing.
> I've been trying (on and off) to devise such a language. Not easy, and probably
> not desirable, but an interesting experiment nonetheless. Mail ideas, please.)

How about data-flow languages?
--
Eddie Wyatt
e-mail: edw@ius1.cs.cmu.edu
tada@athena.mit.edu (Michael Zehr) (03/26/88)
In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> As long as programmers are taught to think in terms of a machine,
>> rather than the language, it will be difficult to get portable code
>> that can be moved to that 1 teraflop computer that will come out
>> tomorrow, then to the 10 teraflop computer that will come out the day
>> after, and then to the 100 teraflop computer that will come out a week
>> from Monday.
>
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine.

Other people write:
[other stuff about portability vs. performance]

When we have 1, 10, or 100 TFLOP machines, it's likely that no one will
care about the software for it being portable to 1-5 MIP machines, cuz no
one wants to wait 100,000 times as long for the program to finish. Also,
if we port software from current 1-5 MIP machines, who's going to care
that it's sub-optimal on a 20 TFLOP machine? If it needs to be optimal on
one of those in order to run fast enough, then no one has spent time
trying to make it optimal for a 5 MIP machine, cuz there wouldn't be any
use.

It's much more important that code written for one 5 MIP machine is
portable to (and fairly easily optimized on) another 5 MIP machine, and
likewise for TFLOP machines, but who cares about going between the
different categories?

Hopefully by the time we have machines of that speed, we'll have much
better languages to work with, and software that will convert C code into
a language that is portable-but-has-fantastically-optimizing-compilers-
for-lots-of-different-architectures. At least I hope so!
:-)
-------
michael j zehr
"My opinions are my own ... as is my spelling."
chris@mimsy.UUCP (Chris Torek) (03/26/88)
>>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>>As long as programmers are taught to think in terms of a language, rather
>>>than the machine, it will be difficult to get vendors to allow the forcing
>>>of inline.

In article <10763@mimsy.UUCP> I answered with
>>As long as programmers are taught to think in terms of a machine,
>>rather than the language, it will be difficult to get portable code
>>that can be moved to that 1 teraflop computer that will come out
>>tomorrow, then to the 10 teraflop computer that will come out the day
>>after, and then to the 100 teraflop computer that will come out a week
>>from Monday.

In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine. Compilers can
>hide lots of annoying differences, but they can't change a nifty linked
>VAX-optimal data structure into a simple vectorizable Cray-optimal array.

This is true. That is, it is true now, and it has been true in the past,
and no doubt it will be true in the future. Yet I claim that whenever it
is true, that means only that you are working in too low level a language.

For instance, it is certainly the case that if the language were at the
absolute top level of the problem (`compute the answer to this question'),
and the `compiler' for this produced decent `code' (referenced the program
written specifically for that machine that solves that problem), the
program would be portable (`find the input-value-th digit of e' is
perfectly portable) yet still optimal.

As another example, it has always surprised me that supercomputers do not
come with good APL systems.
APL is one of the few languages where one can write what one means when
dealing with matrices. (It has other problems, but I think it has a place
here.) I would hope the matrix multiply routine for the Cray would be
reasonably close to optimal, even though it would no doubt be quite
different from the one on the VAX. The programmer does not have to deal
with the difference:

	A <- B * +/ C

portably says to multiply B by the (scalar) sum of all the elements of
vector C and put the result in A.

>... What I'm getting at with all this rambling is this: Knowing your target
>architecture is a GOOD THING if you're after optimal performance. Knowing
>your processor's assembly language - and how to optimize it - will make you
>a better user of your "high-level" language (FORTRAN, C, etc.). It will also
>frustrate you, as it has me, and as I fear it has frustrated Herman.

Sure. But maybe you can make it `good enough' without sacrificing
portability; or perhaps you will have to throw portability to the
wind---but maybe not entirely! The VAX assembly bcopy() routine is not
portable, but use of it is. Perhaps you can hide your fast but machine
specific code in a subroutine. More likely you will need a smarter
compiler and/or higher level language.

>(Could a language be designed that would preserve some measure of portability
> across some set of supercomputer systems while allowing maximal control and
> performance? ...)

Answer: Yes. It will not be easy; it will certainly not be C; and it may
have a small value for `maximal control'. But there is no theoretical bar,
just lots of practical ones.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris
cik@l.cc.purdue.edu (Herman Rubin) (03/27/88)
In article <10808@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> >>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
......
> >In article <10763@mimsy.UUCP> I answered with
......
> In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
.......
> >... What I'm getting at with all this rambling is this: Knowing your target
> >architecture is a GOOD THING if you're after optimal performance. Knowing
> >your processor's assembly language - and how to optimize it - will make you
> >a better user of your "high-level" language (FORTRAN, C, etc.). It will also
> >frustrate you, as it has me, and as I fear it has frustrated Herman.
>
> Sure. But maybe you can make it `good enough' without sacrificing
> portability; or perhaps you will have to throw portability to the
> wind---but maybe not entirely!
.......
> >(Could a language be designed that would preserve some measure of portability
> > across some set of supercomputer systems while allowing maximal control and
> > performance? ...)
>
> Answer: Yes. It will not be easy; it will certainly not be C; and
> it may have a small value for `maximal control'. But there is no
> theoretical bar, just lots of practical ones.

Of course, one should be able to describe the problem in straightforward
terms and have the compiler produce optimal code :-). That we cannot now
do this is merely a practical problem. So your sorting problem is
translated by the compiler into the O(n^2) worst-case code. Now we all
know that that is the wrong way. But suppose I know that it is extremely
unlikely that more than a very small number of items will have to be moved
up more than a small amount? Then it is the right way.

There are algorithms which are highly portable. There are algorithms which
are moderately portable. There are algorithms which look like they are
portable, but the portable code crawls. There are situations where the aim
of the algorithm is portable, but a portable algorithm is a total waste of
machine time.

These are not theoretical bars, just practical ones. The quantity and
diversity of the practical bars are such as to cause me to suggest that we
do not try for perfection, but for versatility.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
noise@eneevax.UUCP (Johnson Noise) (03/29/88)
In article <1226@PT.CS.CMU.EDU> edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
>This really doesn't belong in comp.lang.c but.....
>take on the order of 30 Megawatts of power to run. Does anyone know
>how much money that is per hour?

~ $3000.00/hour, depending on where you live, time of day, etc.

There's a 500 MW (that's megawatt) modulator running next door...
*TINK* There it goes again...

Oh, by the way... 500 MW pulsed power (psych)

30 MW? Are you sure?
hankd@pur-ee.UUCP (Hank Dietz) (03/30/88)
In article <1226@PT.CS.CMU.EDU>, edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
> This really doesn't belong in comp.lang.c but.....
>
> > Compilers can
> > hide lots of annoying differences, but they can't change a nifty linked
> > VAX-optimal data structure into a simple vectorizable Cray-optimal array.
>
> I ask you to prove the above statement.

Better still, a disproof exists in a paper which I'll be presenting at
this year's ICPP... it describes a loop transformation which often can
perform this kind of conversion. As to its optimality, well, who can
really say? I'm willing to send interested folk advance copies of the
paper.

Prof. Hank Dietz
School of EE
Purdue University
West Lafayette, IN 47907