pmk@hall.cray.com (Peter Klausler) (03/25/88)
In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >As long as programmers are taught to think in terms of a language, rather
> >than the machine, it will be difficult to get vendors to allow the forcing
> >of inline.
>
> As long as programmers are taught to think in terms of a machine,
> rather than the language, it will be difficult to get portable code
> that can be moved to that 1 teraflop computer that will come out
> tomorrow, then to the 10 teraflop computer that will come out the day
> after, and then to the 100 teraflop computer that will come out a week
> from Monday.

A routine written for optimal performance on one architecture may be coded
so that it is portable to others, but it'll be optimal on only the one.
Write for optimal performance on a VAX and your code will crawl on a Cray.
Code for maximal portability and you'll be suboptimal everywhere, having
reduced yourself to a lowest common denominator of machine. Compilers can
hide lots of annoying differences, but they can't change a nifty linked
VAX-optimal data structure into a simple vectorizable Cray-optimal array.

This is not to say that performance goals necessarily rule out portability,
just that portability may be restricted to a smaller range of systems.
Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your local
Mimsy Data sales office :-)) are not likely to be much like an 8088 or 6502.
But they could look similar to each other - and their differences in
instruction timings, functional units, and processor counts would be
differences that one can expect a reasonable compiler to handle. It's
reasonable to write code that's portable to both the Cray-1/A and the
Cray-3 (with minimal machine-dependent tweaking) and still give maximal
performance on each, but you'll sacrifice PC/AT portability in the process.
What I'm getting at with all this rambling is this: Knowing your target
architecture is a GOOD THING if you're after optimal performance. Knowing
your processor's assembly language - and how to optimize it - will make you
a better user of your "high-level" language (FORTRAN, C, etc.). It will
also frustrate you, as it has me, and as I fear it has frustrated Herman.

(Could a language be designed that would preserve some measure of
portability across some set of supercomputer systems while allowing maximal
control and performance? C and C++ are great at this amongst scalar
byte-addressable architectures but seem inadequate (to me!) to the task of
extracting the most speed from a multiprocessor vector box with only
large-word addressing. I've been trying (on and off) to devise such a
language. Not easy, and probably not desirable, but an interesting
experiment nonetheless. Mail ideas, please.)

[The above is solely my opinion, and is not to be construed as the
viewpoint of CRI or anyone else, including my other personalities. Have a
nice weekend.]
edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/26/88)
This really doesn't belong in comp.lang.c but.....

> Compilers can
> hide lots of annoying differences, but they can't change a nifty linked
> VAX-optimal data structure into a simple vectorizable Cray-optimal array.

I ask you to prove the above statement.

I think this debate is really over who is responsible for optimizing code:
the compiler or the programmer. It's the compiler's job, in my opinion. I'm
not going to list the many reasons why. However, I have to admit I do hand
tune my code sometimes to the particular machine that I'm working on, but
only because the compiler I use "ccp" is not a very good optimizing
compiler.

> This is not to say that performance goals necessarily rule out portability,
> just that portability may be restricted to a smaller range of systems.
> Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your
> local Mimsy Data sales office :-)) are not likely to be much like
> an 8088 or 6502.

You laugh, but a representative of IBM Yorktown came to CMU and gave a talk
on the TF3, a 3 teraflop machine that they plan to deliver in two years. It
consists of 4096 processing elements interconnected by an omega ring. BTW,
if I remember correctly, the machine will take on the order of 30 Megawatts
of power to run. Does anyone know how much money that is per hour?

> (Could a language be designed that would preserve some measure of portability
> across some set of supercomputer systems while allowing maximal control and
> performance? C and C++ are great at this amongst scalar byte-addressable
> architectures but seem inadequate (to me!) to the task of extracting the most
> speed from a multiprocessor vector box with only large-word addressing.
> I've been trying (on and off) to devise such a language. Not easy, and probably
> not desirable, but an interesting experiment nonetheless. Mail ideas, please.)

How about data-flow languages?
--
Eddie Wyatt
e-mail: edw@ius1.cs.cmu.edu
tada@athena.mit.edu (Michael Zehr) (03/26/88)
In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> As long as programmers are taught to think in terms of a machine,
>> rather than the language, it will be difficult to get portable code
>> that can be moved to that 1 teraflop computer that will come out
>> tomorrow, then to the 10 teraflop computer that will come out the day
>> after, and then to the 100 teraflop computer that will come out a week
>> from Monday.
>
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine.

Other people write:
[other stuff about portability vs. performance]

When we have 1, 10, or 100 TFLOP machines, it's likely that no one will
care about the software for it being portable to 1-5 MIP machines, cuz no
one wants to wait 100,000 times as long for the program to finish. Also,
if we port software from current 1-5 MIP machines, who's going to care
that it's sub-optimal on a 20 TFLOP machine? If it needs to be optimal on
one of those in order to run fast enough, then no one has spent time
trying to make it optimal for a 5 MIP machine, cuz there wouldn't be any
use.

It's much more important that code written for one 5 MIP machine is
portable to (and fairly easily optimized on) another 5 MIP machine, and
likewise for TFLOP machines, but who cares about going between the
different categories?

Hopefully by the time we have machines of that speed, we'll have much
better languages to work with, and software that will convert C code into
a language that is portable-but-has-fantastically-optimizing-compilers-
for-lots-of-different-architectures. At least I hope so!
:-)
-------
michael j zehr
"My opinions are my own ... as is my spelling."
chris@mimsy.UUCP (Chris Torek) (03/26/88)
>>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>>As long as programmers are taught to think in terms of a language, rather
>>>than the machine, it will be difficult to get vendors to allow the forcing
>>>of inline.

In article <10763@mimsy.UUCP> I answered with
>>As long as programmers are taught to think in terms of a machine,
>>rather than the language, it will be difficult to get portable code
>>that can be moved to that 1 teraflop computer that will come out
>>tomorrow, then to the 10 teraflop computer that will come out the day
>>after, and then to the 100 teraflop computer that will come out a week
>>from Monday.

In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine. Compilers can
>hide lots of annoying differences, but they can't change a nifty linked
>VAX-optimal data structure into a simple vectorizable Cray-optimal array.

This is true. That is, it is true now, and it has been true in the past,
and no doubt it will be true in the future. Yet I claim that whenever it
is true, that means only that you are working in too low level a language.

For instance, it is certainly the case that if the language were at the
absolute top level of the problem (`compute the answer to this question'),
and the `compiler' for this produced decent `code' (referenced the program
written specifically for that machine that solves that problem), the
program would be portable (`find the input-value-th digit of e' is
perfectly portable) yet still optimal.

As another example, it has always surprised me that supercomputers do not
come with good APL systems.
APL is one of the few languages where one can write what one means when
dealing with matrices. (It has other problems, but I think it has a place
here.) I would hope the matrix multiply routine for the Cray would be
reasonably close to optimal, even though it would no doubt be quite
different from the one on the VAX. The programmer does not have to deal
with the difference:

	A <- B * +/ C

portably says to multiply B by the (scalar) sum of all the elements of
vector C and put the result in A.

>... What I'm getting at with all this rambling is this: Knowing your target
>architecture is a GOOD THING if you're after optimal performance. Knowing
>your processor's assembly language - and how to optimize it - will make you
>a better user of your "high-level" language (FORTRAN, C, etc.). It will also
>frustrate you, as it has me, and as I fear it has frustrated Herman.

Sure. But maybe you can make it `good enough' without sacrificing
portability; or perhaps you will have to throw portability to the
wind---but maybe not entirely! The VAX assembly bcopy() routine is not
portable, but use of it is. Perhaps you can hide your fast but machine
specific code in a subroutine. More likely you will need a smarter
compiler and/or higher level language.

>(Could a language be designed that would preserve some measure of portability
> across some set of supercomputer systems while allowing maximal control and
> performance? ...)

Answer: Yes. It will not be easy; it will certainly not be C; and it may
have a small value for `maximal control'. But there is no theoretical bar,
just lots of practical ones.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris
cik@l.cc.purdue.edu (Herman Rubin) (03/27/88)
In article <10808@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> >>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
......
> >In article <10763@mimsy.UUCP> I answered with
......
> In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
.......
> >... What I'm getting at with all this rambling is this: Knowing your target
> >architecture is a GOOD THING if you're after optimal performance. Knowing
> >your processor's assembly language - and how to optimize it - will make you
> >a better user of your "high-level" language (FORTRAN, C, etc.). It will also
> >frustrate you, as it has me, and as I fear it has frustrated Herman.
>
> Sure. But maybe you can make it `good enough' without sacrificing
> portability; or perhaps you will have to throw portability to the
> wind---but maybe not entirely!
.......
> >(Could a language be designed that would preserve some measure of portability
> > across some set of supercomputer systems while allowing maximal control and
> > performance? ...)
>
> Answer: Yes. It will not be easy; it will certainly not be C; and
> it may have a small value for `maximal control'. But there is no
> theoretical bar, just lots of practical ones.

Of course, one should be able to describe the problem in straightforward
terms and have the compiler produce optimal code :-). That we cannot now
do this is merely a practical problem. So your sorting problem is
translated by the compiler into the O(n^2) worst-case code. Now we all
know that that is the wrong way. But suppose I know that it is extremely
unlikely that more than a very small number of items will have to be moved
up more than a small amount? Then it is the right way.

There are algorithms which are highly portable. There are algorithms which
are moderately portable. There are algorithms which look like they are
portable, but the portable code crawls. There are situations where the aim
of the algorithm is portable, but a portable algorithm is a total waste of
machine time.

These are not theoretical bars, just practical ones. The quantity and
diversity of the practical bars are such as to cause me to suggest that we
do not try for perfection, but for versatility.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
noise@eneevax.UUCP (Johnson Noise) (03/29/88)
In article <1226@PT.CS.CMU.EDU> edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
>This really doesn't belong in comp.lang.c but.....
>take on the order of 30 Megawatts of power to run. Does anyone know
>how much money that is per hour?

~ $3000.00/hour, depending on where you live, time of day, etc.

There's a 500 MW (that's megawatt) modulator running next door...
*TINK* There it goes again...

Oh, by the way... 500 MW pulsed power (psych)

30 MW? Are you sure?
hankd@pur-ee.UUCP (Hank Dietz) (03/30/88)
In article <1226@PT.CS.CMU.EDU>, edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
> This really doesn't belong in comp.lang.c but.....
>
> > Compilers can
> > hide lots of annoying differences, but they can't change a nifty linked
> > VAX-optimal data structure into a simple vectorizable Cray-optimal array.
>
> I ask you to prove the above statement.

Better still, a disproof exists in a paper which I'll be presenting at
this year's ICPP... it describes a loop transformation which often can
perform this kind of conversion. As to its optimality, well, who can
really say? I'm willing to send interested folk advance copies of the
paper.

Prof. Hank Dietz
School of EE
Purdue University
West Lafayette, IN 47907