[comp.arch] EFLOP architectures: when and for how much?

zed@mdbs.uucp (Bill Smith) (10/03/90)

When did (or when will) machines with the following performance become available?
The main parameter I am interested in is the cost at initial
commercial release for a machine with a given number of available (as
opposed to theoretical) floating point operations per second.

					Date:			Cost:
Mega FLOP  --  10^6  (MFLOP)    
Giga FLOP  --  10^9  (GFLOP)
Tera FLOP  --  10^12 (TFLOP)
Peta FLOP  --  10^15 (PFLOP)
Exa  FLOP  --  10^18 (EFLOP)

When in this history will a programming model be available that is
appropriate for solving problems in any part of this entire 10^12-fold speed
range, and that will let programs port to the next higher level without
serious loss of efficiency?

William W. Smith
sawmill!mdbs!zed
[not necessarily my employer's opinions]

gcwilliams@watdragon.waterloo.edu (Graeme Williams) (10/27/90)

In article <1990Oct2.190020.15214@mdbs.uucp> zed@mdbs.uucp (Bill Smith) writes:
>When did (or when will) machines with the following performance become available?
>
>					Date:			Cost:
>Mega FLOP  --  10^6  (MFLOP)    
>Giga FLOP  --  10^9  (GFLOP)
>Tera FLOP  --  10^12 (TFLOP)
>Peta FLOP  --  10^15 (PFLOP)
>Exa  FLOP  --  10^18 (EFLOP)

The highest two of these are probably unattainable in the foreseeable
future - except perhaps with very large numbers of processors in parallel.

For *SINGLE* processors there is a fundamental restriction on the number
of operations that can be performed.

Namely :
An instruction cannot be executed in a time shorter than the time
it takes for a light beam to traverse the processing device.

Thus if a single processor were to be capable of executing 10^18
instructions per second - the physical size of the processor would
have to be smaller than (3*10^8)/10^18 metres i.e. 0.3 nanometres.

Unfortunately 0.3 nanometres is the same order of size as a single
molecule - so either the processor is *EXTREMELY SIMPLE* :-), or it's
not made of molecules! In short 10^18 operations/sec will probably
be forever impossible for a single processor; indeed 10^15 probably
is too.
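The arithmetic behind this bound can be sketched directly (a hypothetical back-of-envelope illustration; the function name is mine, not from the post):

```python
# Light-traversal bound: a device producing one result at a time,
# serially, can be at most c / rate metres across.
C = 3.0e8  # speed of light in m/s, rounded as in the post

def max_device_size_m(ops_per_sec):
    """Largest device a light-speed signal can cross once per operation."""
    return C / ops_per_sec

for name, rate in [("GFLOP", 1e9), ("TFLOP", 1e12),
                   ("PFLOP", 1e15), ("EFLOP", 1e18)]:
    print(f"{name}: {max_device_size_m(rate):.1e} m")
# EFLOP works out to 3.0e-10 m, i.e. 0.3 nanometres, as above.
```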

Anyone for engineering a CPU out of a tiny piece of neutron star ???

Graeme Williams
gcwilliams@watdragon.waterloo.edu

dhoyt@vx.acs.umn.edu (10/27/90)

In article <1990Oct26.191032.9099@watdragon.waterloo.edu>, gcwilliams@watdragon.waterloo.edu (Graeme Williams) writes...
>In article <1990Oct2.190020.15214@mdbs.uucp> zed@mdbs.uucp (Bill Smith) writes:
>An instruction cannot be executed in a time shorter than the time
>it takes for a light beam to traverse the processing device.
  This may or may not be true.  It is certainly possible for an 'electron'
to pass from one side of a gap to the other side with no elapsed time.
(Actually the electron is on both sides of the gap until it is measured--meow.)
I've not seen anything to make me think that zero-momentum computing devices
are impossible.  Of course I'm not offering any venture capital either.

david | dhoyt@vx.acs.umn.edu | dhoyt@umnacvx.bitnet

peter@ficc.ferranti.com (Peter da Silva) (10/28/90)

In article <1990Oct26.191032.9099@watdragon.waterloo.edu> gcwilliams@watdragon.waterloo.edu (Graeme Williams) writes:
> Anyone for engineering a CPU out of a tiny piece of neutron star ???

Where is Bob Forward when we need him?
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

henry@zoo.toronto.edu (Henry Spencer) (10/28/90)

In article <2581@ux.acs.umn.edu> dhoyt@vx.acs.umn.edu writes:
>>An instruction cannot be executed in a time shorter than the time
>>it takes for a light beam to traverse the processing device.
>  This may or may not be true.  It is certainly possible for an 'electron'
>to pass from one side of a gap to the other side with no elapsed time.

Um, references please.  My recollection is that the transition is *not*
instantaneous.  Contrary to popular belief, it doesn't even jump the gap
without occupying the space between:  the notion that it is forbidden to
occupy said space is an attempt to apply classical dynamics in an area
of severely non-classical behavior.

If you are talking about the collapse of wave functions, that is a
totally different story.  One can argue about whether that phenomenon
is "real", but it is inarguable that you *cannot* use it to transmit
information, which is what we're talking about.  The electron doesn't
"pass from one side to the other side"; it starts out on both sides,
and then decides which side it's on.

If you believe in (a) cause and effect, and (b) special relativity --
both of which are rather solidly rooted in experimental evidence -- then
faster-than-light transmission of information is fundamentally impossible.
(From a suitable viewpoint, the transmission time would appear to be
negative, with cause (transmission) coming after effect (reception).)
-- 
The type syntax for C is essentially   | Henry Spencer at U of Toronto Zoology
unparsable.             --Rob Pike     |  henry@zoo.toronto.edu   utzoo!henry

gcwilliams@watdragon.waterloo.edu (Graeme Williams) (10/28/90)

In article <2581@ux.acs.umn.edu> dhoyt@vx.acs.umn.edu writes:
>In article <1990Oct26.191032.9099@watdragon.waterloo.edu>, gcwilliams@watdragon.waterloo.edu (Graeme Williams) writes...
>>In article <1990Oct2.190020.15214@mdbs.uucp> zed@mdbs.uucp (Bill Smith) writes:
>>An instruction cannot be executed in a time shorter than the time
>>it takes for a light beam to traverse the processing device.
>
>  This may or may not be true.  It is certainly possible for an 'electron'
>to pass from one side of a gap to the other side with no elapsed time.
>(Actually the electron is on both sides of the gap, until it is measured--meow.)

True, the electron has an uncertainty in its position courtesy of quantum
physics; unfortunately Mother Nature prohibits one from making
use of this fact to transmit information faster than light - so my
original statement stands.  (This is closely related to Bell's inequality
and the EPR paradox.)

>I've not seen anything to make me think that zero momentum computing devices
>are impossible.  Of course I'm not offering any venture capital either.

It occurs to me that perhaps one could construct the heart of a
processor, *purely* out of interacting photons - it could be very
tiny and everything would be done at light speed. Elegant.

Still, "more" is never enough, and "much more" only suffices for a
wee while; given a few months with such a device I'd soon want a
faster one :-).

Graeme Williams
gcwilliams@watdragon.waterloo.edu

dhoyt@vx.acs.umn.edu (10/29/90)

In article <1990Oct27.235949.6451@zoo.toronto.edu>, henry@zoo.toronto.edu (Henry Spencer) writes...
>Um, references please.
  I'll try to dig some up.  I seem to remember an article in Scientific
American (or another broad-scope journal) in the past year or two about
research showing two particles in 'communication' even when they were outside
each other's light cone.  I believe they were using the exclusion principle.  If
anyone can refer me to this, it would save me some time in the local library.

>If you believe in (a) cause and effect, and (b) special relativity --

Special relativity does not produce the same answers as quantum dynamics.  As
correct as (general) relativity seems on the macro scale, it fails at the micro
scale.  What often trips people up is thinking of electrons (or photons) as
particles, when in fact they are not.  They are waves.  (Chemists and physicists
tend to disagree with each other here.)  The achiral nature of ammonia
chemistry is a simple proof of this.  (If electrons were truly classical (or
relativistic) one would expect ammonia to be pyramid shaped, and thus show
handed or chiral geometry, which would show up as stereospecificity.  Ammonia
chemistry does not show stereospecificity, thus is achiral, thus not pyramid
shaped, but rather planar.)

In the ammonia ion the 'reactive' electron is a wave form that exists in a
cloud both above the plane and below the plane of the ion at the same time.
The electron will never exist anywhere between the two clouds.  As the wave
form exists both above and below the clouds, other molecules are free to react
with either side of the ion, which will fix the electron wave form to one
side of the ion.  If the 'electron' actually moved, there would have to be
conservation of momentum, which would mean that the new molecule would have to
move away from the reaction site even if the sum of the momenta of the two
ions were zero.  Which would require the total energy in an adiabatic
(closed) system to increase.  Which would break the second law of
thermodynamics.  Which would make things very hot indeed.

Hopefully more later.

david | dhoyt@vx.acs.umn.edu | dhoyt@umnacvx.bitnet

rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (10/29/90)

gcwilliams@watdragon.waterloo.edu (Graeme Williams) said in article
<1990Oct26.191032.9099@watdragon.waterloo.edu> :

> Namely :
> An instruction cannot be executed in a time shorter than the time
> it takes for a light beam to traverse the processing device.

> Thus if a single processor were to be capable of executing 10^18
> instructions per second - the physical size of the processor would
> have to be smaller than (3*10^8)/10^18 metres i.e. 0.3 nanometres.

> Unfortunately 0.3 nanometres is the same order of size as a single
> molecule - so either the processor is *EXTREMELY SIMPLE* :-), or it's
> not made of molecules! In short 10^18 operations/sec will probably
> be forever impossible for a single processor, indeed 10^15 probably
> is too.

While it is true that an instruction cannot be EXECUTED in less time
than light can traverse the computing device, nFLOPS are measures of
THROUGHPUT. We can improve throughput by some amount via pipelining of
the execution so that each section of a computing device performs only
part of the work for the instruction. After an initial delay to fill the
pipeline we could produce a result every 1/mth second where this time
between results < time for light to traverse the whole CPU.

Some people call this parallel computation, and it is a form of parallelism.
Nearly all processors produced today are pipelined.  Vector units are
usually (always?) implemented with pipelining, but they could include
true multiple execution units (e.g. adding 4 vector elements to 4 others
at the same time) to further increase throughput.
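The latency-vs-throughput distinction can be made concrete with a small sketch (hypothetical numbers and names, not from the post):

```python
def pipeline_times(n_results, m_stages, stage_time):
    """Time to first result (latency) and total time for n results
    from an m-stage pipeline producing one result per stage_time."""
    latency = m_stages * stage_time                  # fill the pipeline
    total = latency + (n_results - 1) * stage_time   # then one per tick
    return latency, total

# A 6-stage pipeline: each instruction takes 6 ticks start to finish,
# yet 1000 results emerge in only 1005 ticks - nearly one per tick,
# six times faster than the per-instruction latency would suggest.
latency, total = pipeline_times(1000, 6, 1.0)
```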

I am usually more interested in bandwidth/throughput than the latency
of any individual instruction in a stream.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108

brooks@maddog.llnl.gov (Eugene Brooks) (10/29/90)

In article <2588@ux.acs.umn.edu> dhoyt@vx.acs.umn.edu writes:
>  I'll try to dig some up.  I seem to remember an article in scientific
>american (or another broad scope journal) in the past year or two about a
>research showing two particles in 'communication' even they were outside of
>each other's time cone.  I believe they were using the exclusion principle. If
>anyone can refer me to this, it would save me some time in the local library.
Two particles in "correlation," not "communication."  Just because two
coins come up "even" - that is, one is heads if the other is tails - does
not mean that you can use them to send information.  The electron, or any
other event which could carry information, cannot travel faster than light
even through quantum effects.  Quantum Field Theory is fully consistent with
special relativity.

brooks@maddog.llnl.gov, brooks@maddog.uucp

brooks@maddog.llnl.gov (Eugene Brooks) (10/29/90)

In article <2588@ux.acs.umn.edu> dhoyt@vx.acs.umn.edu writes:
>Special relativity does not produce the same answers as quantum dynamics.  As
>correct as (general) relativity seems on the macro scale, it fails at the micro
>scale.  What often trips people up is thinking of electrons (or photons) as
>particles, when in fact they are not.  They are waves. (Chemists and physicists
>tend to disagree with each other here.)  The achiral nature of ammonia
Electrons and photons are not either (classical) particles or (classical) waves.
They possess features of both, however.  If you want to know what an electron is,
don't ask a Chemist!

brooks@maddog.llnl.gov, brooks@maddog.uucp

djh@hawkeye.osc.edu (David Heisterberg) (10/29/90)

In article <2588@ux.acs.umn.edu> dhoyt@vx.acs.umn.edu writes:
>                                         The achiral nature of ammonia
>chemistry is a simple proof of this.  (If electrons were truly classical (or
>relativistic) one would expect ammonia to be pyramid shaped, and thus show
>handed or chiral geometry, which would show up as stereospecificity.  Ammonia
>chemistry does not show stereospecificity, thus is achiral, thus not pyramid
>shaped, but rather planar.)

This doesn't really have anything to do with computer architecture, but
your ammonia analysis is wholly incorrect.  The ground state of ammonia
is pyramidal.  It has C3v symmetry so of course there's no optical activity.
It also has a fairly low inversion barrier, as do some umbrellas - the
macroscopic analog.  Further, relativity certainly need not be considered
explicitly for ammonia.

Well, one could say it is because of high performance computer architectures
that one is able to do detailed quantum calculations to determine ammonia's
geometric and electronic structure.
--
David J. Heisterberg		djh@osc.edu		And you all know
The Ohio Supercomputer Center	djh@ohstpy.bitnet	security Is mortals'
Columbus, Ohio  43212		ohstpy::djh		chiefest enemy.

dhoyt@vx.acs.umn.edu (10/29/90)

In article <2588@ux.acs.umn.edu>, dhoyt@vx.acs.umn.edu writes...
>                  If the 'electron' actually moved, there would have to be
  I don't know where my head was on this argument, the momentum could be
handled by adding to the vibrational state.  In any case, this really doesn't
have anything to do with comp.arch.  Can we move this to private mail, Henry?

david | dhoyt@vx.acs.umn.edu

omtzigt-theo@cs.yale.edu (Erwinus Theodorus Leonardus Omtzigt (Theo)) (10/30/90)

>In article <1990Oct2.190020.15214@mdbs.uucp> zed@mdbs.uucp (Bill Smith) writes:
>
>Anyone for engineering a CPU out of a tiny piece of neutron star ???
>
>Graeme Williams
>gcwilliams@watdragon.waterloo.edu


I tried, but lost my glasses due to the enormous gravity!

Theo
-- 
InterNet: omtzigt-theo@cs.yale.edu	mail: Becton Center, 15 Prospect Str
Phone:	  (203) 432 - 4261		      Yale University, Dept. of EE

herrickd@iccgcc.decnet.ab.com (10/30/90)

In article <2590@ux.acs.umn.edu>, dhoyt@vx.acs.umn.edu writes:
> In article <2588@ux.acs.umn.edu>, dhoyt@vx.acs.umn.edu writes...
>>                  If the 'electron' actually moved, there would have to be
>   I don't know where my head was on this argument, the momentum could be
> handled by adding to the vibrational state.  In any case, this really doesn't
> have anything to do with comp.arch.  Can we move this to private mail, Henry?
> 
> david | dhoyt@vx.acs.umn.edu

I'm just a spectator, but I was enjoying it.

dan herrick
herrickd@astro.pc.ab.com

gcwilliams@watdragon.waterloo.edu (Graeme Williams) (10/30/90)

In article <1dbs02b=035v01@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:
>gcwilliams@watdragon.waterloo.edu (Graeme Williams) said in article
><1990Oct26.191032.9099@watdragon.waterloo.edu> :
>
>> An instruction cannot be executed in a time shorter than the time
>> it takes for a light beam to traverse the processing device.
>
>While it is true that an instruction cannot be EXECUTED in less time
>than light can traverse the computing device, nFLOPS are measures of
>THROUGHPUT.

I don't follow this at all - for a *given* processor is not THROUGHPUT
determined by how fast the processor executes ??

>We can improve throughput by some amount via pipelining of
>the execution so that each section of a computing device performs only
>part of the work for the instruction. After an initial delay to fill the
>pipeline we could produce a result every 1/mth second where this time
>between results < time for light to traverse the whole CPU.

Sure you can do some preliminary work on an instruction using
pipelining techniques - but the bottom line is that the EXECUTION of
an instruction changes the state of the processor (e.g. a register),
and that particular part of the processor has a finite
physical size and hence a maximum rate at which it can be changed.
No amount of pipelining can change this fact.

Of course one can do parallel execution (if your algorithm permits)
but then you're no longer using a *SINGLE* processor/execution unit
are you?

Graeme Williams
gcwilliams@watdragon.waterloo.edu

rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (11/08/90)

In article <1990Oct29.205812.1641@watdragon.waterloo.edu>  gcwilliams@watdragon.waterloo.edu (Graeme Williams)
said:
> In article <1dbs02b=035v01@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:
> >gcwilliams@watdragon.waterloo.edu (Graeme Williams) said in article
> ><1990Oct26.191032.9099@watdragon.waterloo.edu> :
>
> >> An instruction cannot be executed in a time shorter than the time
> >> it takes for a light beam to traverse the processing device.
>
> >While it is true that an instruction cannot be EXECUTED in less time
> >than light can traverse the computing device, nFLOPS are measures of
> >THROUGHPUT.
>
> I don't follow this at all - for a *given* processor is not THROUGHPUT
> determined by how fast the processor executes ??

Throughput is determined by how fast the processor executes but not by
the time for an instruction to traverse all the pipeline stages. At any
time there are usually several instructions in the process of executing
in different pipeline stages. It's like an auto assembly line. It might
take two days for what will be a car to move from start to finish
(execution time for 1 car) but hundreds of cars are finished each hour
(that's throughput). Instead of autos we assemble instruction results.

> >We can improve throughput by some amount via pipelining of
> >the execution so that each section of a computing device performs only
> >part of the work for the instruction. After an initial delay to fill the
> >pipeline we could produce a result every 1/mth second where this time
> >between results < time for light to traverse the whole CPU.
>
>  Sure you can do some preliminary work on an instruction using
>  pipelining techniques - but the bottom line is that the EXECUTION of
>  an instruction changes the state of the processor (e.g. a register)
>  and that particular part of the processor involved has a finite
>  physical size and hence a maximum rate at which it can be changed.
>  No amount of pipelining can change this fact.
>
>  Of course one can do parallel execution (if your algorithm permits)
>  but then you're no longer using a *SINGLE* processor/execution unit
>  are you?
>
I'm not always too sure what is meant by a *single* processor.
In some pipelined machines the *state* of the machine does not have
a simple intuitive definition. *A*register* may have several different
values in several different stages of the pipeline. In fact, one of the
more interesting aspects of pipelining is that more than one iteration
of a loop might be 'in execution' in different pipeline stages at the
same time. So the value of register GR3 being loaded in iteration
2 of a loop might be different than the one that was loaded in iteration 1
of the loop if the loop is small enough for more than one iteration to
fit into the pipeline. Things become a bit tricky when you try to provide
'precise' interrupts/traps. If GR3 is used as a divisor and is 0 in one
iteration of a loop then we will have to present back to the software an
image that is consistent with a single machine state despite the fact that
while the machine was running it never had a single state.

Pipelines are today more common than not (e.g. 80486, most SPARC, MIPS,
S/390, etc.).

Things are even more complex when designers begin using multiple instruction
issue strategies as did the IBM RS6000 designers, where, as often as possible,
two instructions are issued in each clock cycle (I understand that, in that
implementation, one must be an integer instruction and the other floating point
in order to exploit the double issue). I doubt that two is the upper limit on
this technique. So, while the two instructions that began at address 44000
are executing, what is the program counter (PC)?  PC was usually thought of
as part of the machine state AND IT WAS SINGLE VALUED.

Back to the speed of light as a gating factor to computing throughput. If
we can complete an instruction on each cycle and there are 6 stages in our
pipeline then no stage can complete its work faster than light can traverse
that stage. If this IS a restriction then the throughput will be as if the
whole computer were squeezed into the size of one stage, 1/6 of its actual
path. We will get a new result EVERY clock and 1 clock will also be the signal
traversal time through one stage.
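As a hypothetical numeric check of this point (the sizes below are illustrative, not from the post): if results appear once per stage-traversal time, the light-speed throughput bound scales with the stage length rather than the whole-CPU length.

```python
C = 3.0e8  # speed of light, m/s

def max_result_rate(length_m):
    """At most one result per light-traversal of the given length."""
    return C / length_m

# A 3 cm CPU crossed whole: at most 1e10 results/sec.
# The same CPU as six 0.5 cm stages: at most 6e10 results/sec,
# i.e. throughput as if the machine were 1/6 its actual size.
whole_cpu = max_result_rate(0.03)
one_stage = max_result_rate(0.005)
```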

This STAGE business is only for the designers' convenience. It is somewhat
artificial and may not be a limiting factor. Consider a systolic array that
we might use for sorting. The array is composed of a number of comparators,
each of which will put the lower of its 2 input values out on its left
output and the other out on its right output:

       I1            I2            I3          I4

         \         /                  \       /
          \      /                      \    /
             C                            C
             |  \                        / |
             |    \                    /   |
             |      \                /     |
             |        \            /       |
             |          \        /         |
             |            \    /           |
             |              \/             |
             |             /  \            |
             |           /      \          |
             |         /          \        |
             |       /              \      |
             |     /                  \    |
             |   /                      \  |

              C                           C

            /   \                        /   \
          /       \                    /       \
         O1        O2                 O3       O4

We may be able to clock the values being sorted between levels of comparators
so that at each comparator at each clock period the two inputs become outputs.
This can easily be extended from the 2 levels shown [I hope] above to N levels.
We will then have N results each cycle with another N results 'in the mill'
right behind. One might call each level of comparators a 'stage' and we are
clearly operating the comparators in parallel.
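The drawn network can be simulated directly (a minimal sketch; the function and variable names are mine). Note that this particular two-level network guarantees the global minimum at O1 and the global maximum at O4; fully sorting four values would take one more comparator on the middle pair.

```python
def comparator(a, b):
    """Lower value out the left output, higher out the right."""
    return (a, b) if a <= b else (b, a)

def two_level_network(i1, i2, i3, i4):
    """The 4-input, 2-level network drawn above.  Level 1 compares
    (I1,I2) and (I3,I4); the crossed wires feed both low outputs to
    one second-level comparator and both high outputs to the other."""
    lo_a, hi_a = comparator(i1, i2)
    lo_b, hi_b = comparator(i3, i4)
    o1, o2 = comparator(lo_a, lo_b)   # O1 is the global minimum
    o3, o4 = comparator(hi_a, hi_b)   # O4 is the global maximum
    return o1, o2, o3, o4
```

In a systolic implementation each level would be clocked, so a new set of four inputs could enter every clock while earlier sets are still mid-network.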

But maybe the 'stages' above do not have to be a limiting factor.  Instead
of a single master clock for all comparators, we could include some latching
so that a comparator which has finished its prior pair can accept the next pair
as soon as they are ready, as signaled by latches set by the upstream comparator.
Such self-clocking is reputed to be more difficult than deterministic
clocking.  It would, however, be quite applicable to comparators which can
often finish early (earlier than the worst case, which must be allowed for in
a lock-step clock).

The sorting array was only meant as an easy-to-visualize example.  In some
far-future designs we might be able to succeed at more complex, less regular
designs which could retain some of the wave-of-solution advantages of the
systolic arrays but solve less regular, more general problems as found in
the usual computer instructions.  In such a machine the TIME TO EXECUTE THE
FIRST INSTRUCTION MIGHT BE RELATIVELY LONG, but thereafter we might get the
results of subsequent instructions in a cascade with very little separation,
or perhaps in parallel, and maybe even out of order where it doesn't matter.

So I challenge the posting of the speed of light as a THROUGHPUT limit even
though it does limit the execution of a single instruction from start to
finish. More passengers instead of faster vehicles.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108