[comp.lang.c] Languages vs. machines

pmk@hall.cray.com (Peter Klausler) (03/25/88)

In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >As long as programmers are taught to think in terms of a language, rather
> >than the machine, it will be difficult to get vendors to allow the forcing
> >of inline.
> 
> As long as programmers are taught to think in terms of a machine,
> rather than the language, it will be difficult to get portable code
> that can be moved to that 1 teraflop computer that will come out
> tomorrow, then to the 10 teraflop computer that will come out the day
> after, and then to the 100 teraflop computer that will come out a week
> from Monday.

A routine written for optimal performance on one architecture may be coded
so that it is portable to others, but it'll be optimal on only the one.
Write for optimal performance on a VAX and your code will crawl on a Cray.
Code for maximal portability and you'll be suboptimal everywhere, having
reduced yourself to a lowest common denominator of machine.  Compilers can
hide lots of annoying differences, but they can't change a nifty linked
VAX-optimal data structure into a simple vectorizable Cray-optimal array.
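
To make the contrast concrete, here is a rough C sketch (illustrative
only, not drawn from any real code):

	/* Linked structure: natural on a VAX, but each iteration
	   depends on the pointer loaded by the one before it, so the
	   loop is inherently serial and cannot vectorize. */
	struct node { double val; struct node *next; };

	double sum_list(struct node *p)
	{
		double s = 0.0;
		for (; p != NULL; p = p->next)
			s += p->val;
		return s;
	}

	/* The same data as a flat array: a vectorizing compiler can
	   turn this into a single vector reduction on a Cray. */
	double sum_array(double a[], int n)
	{
		double s = 0.0;
		int i;
		for (i = 0; i < n; i++)
			s += a[i];
		return s;
	}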

This is not to say that performance goals necessarily rule out portability,
just that portability may be restricted to a smaller range of systems.
Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your
local Mimsy Data sales office :-)) are not likely to be much like
an 8088 or 6502.  But they could look similar to each other - and their
differences in instruction timings, functional units, and processor counts
would be differences that one can expect a reasonable compiler to handle.
It's reasonable to write code that's portable to both the Cray-1/A and
the Cray-3 (with minimal machine-dependent tweaking) and still gives
maximal performance on each, but you'll sacrifice PC/AT portability in
the process.

What I'm getting at with all this rambling is this: Knowing your target
architecture is a GOOD THING if you're after optimal performance.  Knowing
your processor's assembly language - and how to optimize it - will make you
a better user of your "high-level" language (FORTRAN, C, etc.).  It will also
frustrate you, as it has me, and as I fear it has frustrated Herman.

(Could a language be designed that would preserve some measure of portability
 across some set of supercomputer systems while allowing maximal control and
 performance? C and C++ are great at this amongst scalar byte-addressable
 architectures but seem inadequate (to me!) to the task of extracting the most
 speed from a multiprocessor vector box with only large-word addressing.
 I've been trying (on and off) to devise such a language.  Not easy, and probably
 not desirable, but an interesting experiment nonetheless.  Mail ideas, please.)

[The above is solely my opinion, and is not to be construed as the viewpoint of
 CRI or anyone else, including my other personalities.  Have a nice weekend.]

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/26/88)

This really doesn't belong in comp.lang.c but.....

> Compilers can
> hide lots of annoying differences, but they can't change a nifty linked
> VAX-optimal data structure into a simple vectorizable Cray-optimal array.

   I ask you to prove the above statement.  

   I think this debate is really over who is responsible for optimizing
code: the compiler or the programmer.  It's the compiler's job, in my
opinion.  I'm not going to list the many reasons why.  However, I have
to admit I do sometimes hand-tune my code for the particular machine
I'm working on, but only because the compiler I use, "ccp", is not a
very good optimizing compiler.

> This is not to say that performance goals necessarily rule out portability,
> just that portability may be restricted to a smaller range of systems.
> Chris' hypothetical 1TFLOP, 10TFLOP, and 100TFLOP machines (see your
> local Mimsy Data sales office :-)) are not likely to be much like
> an 8088 or 6502.

   You laugh, but a representative from IBM Yorktown came to CMU and gave
a talk on the TF3, a 3-teraflop machine that they plan to deliver in
two years.  It consists of 4096 processing elements interconnected
by an omega ring.  BTW, if I remember correctly, the machine will
take on the order of 30 megawatts of power to run.  Does anyone know
how much money that is per hour?

> (Could a language be designed that would preserve some measure of portability
>  across some set of supercomputer systems while allowing maximal control and
>  performance? C and C++ are great at this amongst scalar byte-addressable
>  architectures but seem inadequate (to me!) to the task of extracting the most
>  speed from a multiprocessor vector box with only large-word addressing.
>  I've been trying (on and off) to devise such a language.  Not easy, and probably
>  not desirable, but an interesting experiment nonetheless.  Mail ideas, please.)

  How about data-flow languages?
-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu

tada@athena.mit.edu (Michael Zehr) (03/26/88)

In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>In article <10763@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> As long as programmers are taught to think in terms of a machine,
>> rather than the language, it will be difficult to get portable code
>> that can be moved to that 1 teraflop computer that will come out
>> tomorrow, then to the 10 teraflop computer that will come out the day
>> after, and then to the 100 teraflop computer that will come out a week
>> from Monday.
>
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine.

Other people write:
[other stuff about portability vs. performance]

When we have 1, 10, or 100 TFLOP machines, it's likely that no one will care
whether their software is portable to 1-5 MIP machines, cuz no one
wants to wait 100,000 times as long for the program to finish.  Also, if
we port software from current 1-5 MIP machines, who's going to care that
it's sub-optimal on a 20 TFLOP machine?  If it needs to be optimal on one of
those in order to run fast enough, then no one will have spent time making
it optimal for a 5 MIP machine, cuz there wouldn't be any use.  It's much
more important that code written for one 5 MIP machine be portable to (and
fairly easily optimized on) another 5 MIP machine, and likewise for TFLOP
machines - but who cares about moving between the different categories?

Hopefully by the time we have machines of that speed, we'll have much
better languages to work with, and software that will convert C code
into a language that is portable-but-has-fantastically-optimizing-
compilers-for-lots-of-different-architectures.  At least I hope so! :-)



-------
michael j zehr
"My opinions are my own ... as is my spelling."

chris@mimsy.UUCP (Chris Torek) (03/26/88)

>>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>>As long as programmers are taught to think in terms of a language, rather
>>>than the machine, it will be difficult to get vendors to allow the forcing
>>>of inline.

>In article <10763@mimsy.UUCP> I answered with
>>As long as programmers are taught to think in terms of a machine,
>>rather than the language, it will be difficult to get portable code
>>that can be moved to that 1 teraflop computer that will come out
>>tomorrow, then to the 10 teraflop computer that will come out the day
>>after, and then to the 100 teraflop computer that will come out a week
>>from Monday.

In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>A routine written for optimal performance on one architecture may be coded
>so that it is portable to others, but it'll be optimal on only the one.
>Write for optimal performance on a VAX and your code will crawl on a Cray.
>Code for maximal portability and you'll be suboptimal everywhere, having
>reduced yourself to a lowest common denominator of machine.  Compilers can
>hide lots of annoying differences, but they can't change a nifty linked
>VAX-optimal data structure into a simple vectorizable Cray-optimal array.

This is true.  That is, it is true now, and it has been true in the
past, and no doubt it will be true in the future.  Yet I claim that
whenever it is true, that means only that you are working in too low
level a language.

For instance, it is certainly the case that if the language were at the
absolute top level of the problem (`compute the answer to this
question'), and the `compiler' for this produced decent `code'
(referenced the program written specifically for that machine that
solves that problem), the program would be portable (`find the
input-value-th digit of e' is perfectly portable) yet still optimal.

As another example, it has always surprised me that supercomputers do
not come with good APL systems.  APL is one of the few languages where
one can write what one means when dealing with matrices.  (It has other
problems, but I think it has a place here.)  I would hope the matrix
multiply routine for the Cray would be reasonably close to optimal,
even though it would no doubt be quite different from the one on the
VAX.  The programmer does not have to deal with the difference:

	A <- B * +/ C

portably says to multiply B by the (scalar) sum of all the elements
of vector C and put the result in A.
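
Rendered literally in C (a rough sketch), the same line becomes two
explicit loops, and the compiler must rediscover the vector operations:

	/* C version of  A <- B * +/ C  */
	void apl_line(double a[], double b[], double c[], int n)
	{
		double sum = 0.0;
		int i;
		for (i = 0; i < n; i++)		/* +/ C : sum-reduce C */
			sum += c[i];
		for (i = 0; i < n; i++)		/* B times the scalar sum */
			a[i] = b[i] * sum;
	}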

>... What I'm getting at with all this rambling is this: Knowing your target
>architecture is a GOOD THING if you're after optimal performance.  Knowing
>your processor's assembly language - and how to optimize it - will make you
>a better user of your "high-level" language (FORTRAN, C, etc.).  It will also
>frustrate you, as it has me, and as I fear it has frustrated Herman.

Sure.  But maybe you can make it `good enough' without sacrificing
portability; or perhaps you will have to throw portability to the
wind---but maybe not entirely!  The VAX assembly bcopy() routine is
not portable, but use of it is.  Perhaps you can hide your fast but
machine specific code in a subroutine.  More likely you will need
a smarter compiler and/or higher level language.
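
One common idiom for the subroutine approach (a sketch; the COPY name
is made up for illustration, and note bcopy() takes its source first):

	#include <string.h>

	#ifdef vax
	extern void bcopy();	/* hand-coded VAX assembly (MOVC3) */
	#define COPY(dst, src, len)	bcopy((src), (dst), (len))
	#else
	#define COPY(dst, src, len)	memcpy((dst), (src), (len))
	#endif

	/* Callers write COPY(to, from, n) everywhere; only this
	   header knows which machine-specific routine does the work. */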

>(Could a language be designed that would preserve some measure of portability
> across some set of supercomputer systems while allowing maximal control and
> performance? ...)

Answer:  Yes.  It will not be easy; it will certainly not be C; and
it may have a small value for `maximal control'.  But there is no
theoretical bar, just lots of practical ones.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

cik@l.cc.purdue.edu (Herman Rubin) (03/27/88)

In article <10808@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> >>In article <719@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

    ......

> >In article <10763@mimsy.UUCP> I answered with
 
......

> In article <5308@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:

.......
> 
> >... What I'm getting at with all this rambling is this: Knowing your target
> >architecture is a GOOD THING if you're after optimal performance.  Knowing
> >your processor's assembly language - and how to optimize it - will make you
> >a better user of your "high-level" language (FORTRAN, C, etc.).  It will also
> >frustrate you, as it has me, and as I fear it has frustrated Herman.
> 
> Sure.  But maybe you can make it `good enough' without sacrificing
> portability; or perhaps you will have to throw portability to the
> wind---but maybe not entirely!  .......

> >(Could a language be designed that would preserve some measure of portability
> > across some set of supercomputer systems while allowing maximal control and
> > performance? ...)
> 
> Answer:  Yes.  It will not be easy; it will certainly not be C; and
> it may have a small value for `maximal control'.  But there is no
> theoretical bar, just lots of practical ones.
> -- 
> In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
> Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

Of course, one should be able to describe the problem in straightforward
terms and have the compiler produce optimal code :-).  That we cannot now
do this is merely a practical problem.  So your sorting problem is
translated by the compiler into the O(n^2) worst-case code.  Now we all
know that that is the wrong way.  But suppose I know that it is extremely
unlikely that more than a very small number of items will have to be
moved up more than a small amount?  Then it is the right way.
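
Straight insertion sort is a concrete instance (a minimal C sketch):

	/* O(n^2) in the worst case, but when almost every item is
	   already within a few places of its final position the inner
	   loop rarely moves anything, and the sort runs in essentially
	   linear time. */
	void isort(double a[], int n)
	{
		int i, j;
		double t;

		for (i = 1; i < n; i++) {
			t = a[i];
			for (j = i; j > 0 && a[j-1] > t; j--)
				a[j] = a[j-1];
			a[j] = t;
		}
	}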

There are algorithms which are highly portable.  There are algorithms which
are moderately portable.  There are algorithms which look like they are 
portable, but the portable code crawls.  There are situations where the
aim of the algorithm is portable, but a portable algorithm is a total waste
of machine time.  These are not theoretical bars, just practical ones.

The quantity and diversity of the practical bars are such as to cause me to
suggest that we do not try for perfection, but for versatility.  


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

noise@eneevax.UUCP (Johnson Noise) (03/29/88)

In article <1226@PT.CS.CMU.EDU> edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
>This really doesn't belong in comp.lang.c but.....

>take on the order of 30 Megawatts of power to run.  Does anyone know
>how much money that is per hour?

	~ $3000.00/hour, depending on where you live, time of day, etc.
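(At a typical industrial rate of very roughly 10 cents per
kilowatt-hour: 30 MW = 30,000 kW, so 30,000 kW x $0.10/kWh = $3,000
per hour.)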
There's a 500 MW (that's megawatt) modulator running next door...
*TINK*
There it goes again...
Oh, by the way...500 MW pulsed power (psych)

30 MW ? Are you sure?

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/29/88)

Stats on TF3

> 
> 30 MW ? Are you sure?

   It may not have been exactly 30 MW, but it was some outrageous number
like that.  This thing was supposed to take half the power output of
Yorktown.  The speaker said that if they were to turn this thing on
without a gradual power increase, the differential in voltage would
melt all the power lines back to the power plant.  The same was true
for turning this thing off.  (I can see real meaning given to the
term crash! :-)

  Note that the machine consisted of 4096 processors.  That means each
processor would be roughly a 1-gigaflop processor - pretty damn
impressive if they can build it.

  Another interesting stat that was brought up was the up time of the
machine.  With the expected hardware failure rate of the processors,
they could expect to lose a processor once every three days.

  Candidates for this beast include NASA.  Does anyone at NASA
want to comment?
-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu

koopman@A.GP.CS.CMU.EDU (Philip Koopman) (03/30/88)

In article <1247@PT.CS.CMU.EDU>, edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
>    It may not have been exactly 30 MW, but it was some outrageous number
> like that.  This thing was supposed to take half the power output of
> Yorktown.
> ...
>   Note that the machine consisted of 4096 processors.  That means each
> processor would be roughly a 1-gigaflop processor - pretty damn
> impressive if they can build it.
> ...
> Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu


I believe you folks are talking about the TF-1 processor.
The head architect of that recently gave a talk at CMU, and I think
I can remember some of the details:

About 32000 processors, total floating point computation speed 1 Tflops.
Every processor has 2 computation elements that are checked for
consistency to spot errors.  Any inconsistency takes the element off-line.

Total power about 3.5 MW.  Yes, powering the system up is interesting
(so is cooling).
Implementation is CMOS.  That means that if the system clock dies
at the full operating speed of something like 20 MHz or so, the
dI/dt current change melts the power lines and blows up the
substation (and maybe the East Coast power grid???? *grin*)
They're working on redundant/fail-safe clock distribution.

The thing has got a LOT of packet switching capability (more
than all the telephone switching capability in the world) to
get the processors to communicate.

An interesting and ambitious architecture!


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~  Phil Koopman             5551 Beacon St.             ~
~                           Pittsburgh, PA  15217       ~
~  koopman@faraday.ece.cmu.edu   (preferred address)    ~ 
~  koopman@a.gp.cs.cmu.edu                              ~
~                                                       ~
~  Disclaimer: I'm a PhD student at CMU, and I do some  ~
~              work for WISC Technologies.              ~
~              (No one listens to me anyway!)           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

hankd@pur-ee.UUCP (Hank Dietz) (03/30/88)

In article <1226@PT.CS.CMU.EDU>, edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
> This really doesn't belong in comp.lang.c but.....
> 
> > Compilers can
> > hide lots of annoying differences, but they can't change a nifty linked
> > VAX-optimal data structure into a simple vectorizable Cray-optimal array.
> 
>    I ask you to prove the above statement.  

Better still, a disproof exists in a paper which I'll be presenting at this
year's ICPP...  it describes a loop transformation which often can perform
this kind of conversion.  As to its optimality, well, who can really say?
I'm willing to send interested folk advance copies of the paper.
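
To give the flavor of such a conversion (a hypothetical sketch only,
not necessarily the transformation described in the paper):

	/* A serial prepass gathers the linked list into a contiguous
	   array; the real computation then runs as an ordinary,
	   vectorizable loop over a[0..n-1]. */
	struct node { double val; struct node *next; };

	int gather(struct node *p, double a[])
	{
		int n = 0;

		while (p != NULL) {	/* serial pointer chase, once */
			a[n++] = p->val;
			p = p->next;
		}
		return n;
	}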

					Prof. Hank Dietz
					School of EE
					Purdue University
					West Lafayette, IN  47907

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/30/88)

 > I believe you folks are talking about the TF-1 processor.
 > The head architect of that recently gave a talk at CMU, and I think
 > I can remember some of the details:
 > 
 > About 32000 processors, total floating point computation speed 1 Tflops.
 > Every processor has 2 computation elements that are checked for
 > consistency to spot errors.  Any inconsistency takes the element off-line.
 > 
 > Total power about 3.5 MW.  Yes, powering the system up is interesting
 > (so is cooling).
 > Implementation is CMOS.  That means that if the system clock dies
 > at the full operating speed of something like 20 MHz or so, the
 > dI/dt current change melts the power lines and blows up the
 > substation (and maybe the East Coast power grid???? *grin*)
 > They're working on redundant/fail-safe clock distribution.
 > 
 > The thing has got a LOT of packet switching capability (more
 > than all the telephone switching capability in the world) to
 > get the processors to communicate.
 > 
 > An interesting and ambitious architecture!
 > 
 > 
 > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 > ~  Phil Koopman             5551 Beacon St.             ~

  I wanted to thank you for clearing up some of my misinformation.
This discussion started with someone mocking someone else about the plans
to build tera-machines.  I cross-posted into comp.arch hoping someone
would clear it up, and someone did.

Oh, didn't the speaker say that the speed of the beast was going to be
upped to 3 teraflops?

BTW: 32000 processors sounds about right, considering what
was said about the switching network.
-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu

dfk@duke.cs.duke.edu (David Kotz) (03/30/88)

In article <1252@PT.CS.CMU.EDU>, koopman@A.GP.CS.CMU.EDU (Philip Koopman) writes:
> I believe you folks are talking about the TF-1 processor.
> The head architect of that recently gave a talk at CMU, and I think
> I can remember some of the details:
> 

Many details are available in an article in the recent issue of
"Supercomputing" magazine, supposedly the only thing out there written
about the TF-1. Try some of these specs out for size:

32,768 processors
arranged in a 40' donut
with 3000 miles of wiring in between (butterfly-style packet-switched network)
global 50MHz clock
using 2.5 Mwatts
water cooled
total 3 Tflops single-precision or 1.5 Tflops double-precision

Each processor:
single 300-pin CMOS chip has 
	50 Mips fixed-point unit
	100 Mflop float unit
	128 (32 bit) registers
	interface to switch (50 Mbytes/s)
and two 200 Mbyte/s channels to 
	4M of data RAM		(=> 128 Gbytes total)
	1M of instruction RAM

Processors are packed 8 to a board (actually 16, all are replicated).
Switch nodes are packed 16 8x8 nodes to a board (actually 32, all replicated).
In addition, the whole switch is replicated 8 times to lower contention.
The wiring is in 504 layers of 64 wires each. 
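
(Checking the arithmetic: 32,768 processors x 100 Mflops each = 3.3
Tflops peak single-precision, and 32,768 x 4 Mbytes = 128 Gbytes of
data RAM, consistent with the totals above.)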

This is a BIG machine. Don't look for it under your desk anytime soon...

David Kotz
-- 
Department of Computer Science, Duke University, Durham, NC 27706
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	{ihnp4!decvax}!duke!dfk