[comp.sys.transputer] i860s and the like

ARCR1@biology.cambridge.ac.uk (Andy Raine) (08/08/90)

Dear netters,

Following the exhibition at TA90 at Southampton, and following previous 
discussion on the net, the following has occurred to me, and I offer it as 
a topic for debate:

Meiko, Transtech, Microway etc. etc. sell boards that have an Intel i860 
processor interfaced to one or two t800's.  INMOS have a TRAM which has a 
vector coprocessor attached to a t800.

All these boards offer what seems to be impressive cpu performance (albeit 
for calculations that can be expressed in terms of vector processing 
libraries, at the moment), but one thing bothers me: 

Consider:  Meiko claim that their board can do a 1024x32bit complex FFT in 
1.3 ms.  INMOS reckon their board will do the same in < 2.0 ms.

The data required for this calculation is 1024x4x2 = 8kBytes, which would 
take (Using Meiko's CStools communications) about 8ms to transfer from one 
transputer to another (using occam on a bare link would be faster, but 
certainly no less than 4 ms).  So if all 8 links of the Meiko board (it has 
2 t800s) were saturated, and if communications can be overlapped with 
calculations completely, then the vector processor would just about be busy all 
the time.  For the boards with only 4 links, only 50% utilisation of the 
coprocessor would be expected as a maximum.  In realistic cases, data just 
won't be available to the processor at these maximum rates.
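
A minimal sketch of the arithmetic above, for anyone who wants to vary the numbers. It assumes each link delivers one 8 kByte data set every 8 ms (the CStools figure quoted above), that the links work independently, and that communication overlaps computation completely. The function and variable names are illustrative only and do not come from any vendor library.

```python
# Rough model of coprocessor utilisation when input data arrives over
# transputer links. The figures are the ones quoted in the post; the
# model itself (independent links, perfect overlap) is an assumption.

POINTS = 1024            # complex FFT length
BYTES_PER_WORD = 4       # 32-bit values
WORDS_PER_POINT = 2      # real and imaginary parts

FFT_TIME_MS = 1.3        # Meiko's claimed FFT time
LINK_TIME_MS = 8.0       # time to move one data set over one link (CStools)


def dataset_bytes():
    """Size of one FFT input data set in bytes."""
    return POINTS * BYTES_PER_WORD * WORDS_PER_POINT


def utilisation(links, fft_time_ms=FFT_TIME_MS, link_time_ms=LINK_TIME_MS):
    """Upper bound on the fraction of time the coprocessor is busy.

    With `links` links each delivering one data set every `link_time_ms`,
    a new data set arrives every link_time_ms / links. The coprocessor
    cannot be busier than fft_time / arrival_interval, capped at 1.
    """
    arrival_interval_ms = link_time_ms / links
    return min(1.0, fft_time_ms / arrival_interval_ms)


if __name__ == "__main__":
    print(dataset_bytes(), "bytes per data set")
    for n in (8, 4):
        print(f"{n} links: at most {utilisation(n):.0%} busy")
```

With these inputs the model gives full utilisation for 8 links and about 65% for 4 links, in the same range as the rough 50% estimate; the exact figure depends on the per-link transfer time assumed.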

So what should we conclude?  Well, many problems are parallelisable, but 
efficient algorithms depend on minimising the time spent in communicating 
data.  The t800 has been referred to as a 'medium grain' processor, meaning 
that often a few tens of processors can be brought to bear efficiently on a 
particular problem.  The new vector/i860 boards are then 'coarse grain' 
processors, and for the same problem, the maximum number that can be used
efficiently will be smaller.

I suggest that what is needed for a large number of scientific calculations 
is a 'fine grain' processor.  In other words, if the ratio of communication 
speed to compute speed of the t800 is taken to be 1:1, then what is needed 
is a processor where the ratio is 10:1.  The i860 & vector boards achieve a 
ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800 
ratio at 1:1.

If a manufacturer produced boards with a t800 coupled with link driver 
hardware that ran at 10 times the speed of the t800's links, then I would 
be able to use ten times as many processors, and get 10 times the 
performance.  What about it?

OK, that's all.  I seem to have written quite a lot, but I don't want to 
appear to be trying to force my point of view down other people's throats, 
just to start a discussion.  So what do people think?

Andy Raine

braner@THEORY.TN.CORNELL.EDU (Moshe Braner) (08/08/90)

You are absolutely right, Andy.  I have the same misgivings about
the higher-powered CPUs still using T800 links.  The 1:1 ratio is
good enough for many cases, but barely.  Of course there are
SOME applications where the link bandwidth is no problem.

- Moshe

bruce@seismo.gps.caltech.edu (Bruce Worden) (08/09/90)

In article <8439.9008081138@prg.oxford.ac.uk> ARCR1@biology.cambridge.ac.uk (Andy Raine) writes:
>...
>Meiko, Transtech, Microway etc. etc. sell boards that have an Intel i860 
>processor interfaced to one or two t800's.
>...
>but one thing bothers me: 
>Consider:  Meiko claim that their board can do a 1024x32bit complex FFT in 
>1.3 ms.  

This FFT (Meiko's) is hand-coded and the data is just the right size (8k)
to be wired down into the cache.  (I talked to the guy who was working 
on it.)  There is no way a compiled, general-purpose code will ever 
reach this kind of performance level.

>...
>I suggest that what is needed for a large number of scientific calculations 
>is a 'fine grain' processor.  In other words, if the ratio of communication 
>speed to compute speed of the t800 is taken to be 1:1, then what is needed 
>is a processor where the ratio is 10:1.  The i860 & vector boards achieve a 
>ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800 
>ratio at 1:1.

Different applications need different ratios.  Also, the ratio isn't as
important as the absolute values.  E.g. if the t800 gives 1:1 then
a t800 running at 2 MHz would give the 10:1 you want, but I don't think
that is what you have in mind.  In any given application, one thing or the
other will be the bottleneck.  What is needed is an i860 with proportionally 
fast message passing.
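
A small sketch of the point about ratios versus absolute values. It assumes a 20 MHz T800 as the 1:1 baseline (implied by the 2 MHz figure above, not stated), treats compute speed as proportional to clock rate, and uses arbitrary units; all names are hypothetical.

```python
# Ratio of communication speed to compute speed versus absolute speed.
# Assumption: a 20 MHz T800 is the 1:1 baseline, and compute speed scales
# linearly with clock rate. Units are arbitrary.

BASELINE_CLOCK_MHZ = 20.0   # assumed T800 clock for the 1:1 case
LINK_SPEED = 1.0            # baseline communication speed


def comm_to_compute_ratio(clock_mhz, link_speed=LINK_SPEED):
    """Communication speed divided by compute speed (baseline = 1.0)."""
    compute_speed = clock_mhz / BASELINE_CLOCK_MHZ
    return link_speed / compute_speed


def step_time(work, data, clock_mhz, link_speed=LINK_SPEED):
    """Time for one step with communication and computation overlapped:
    whichever is slower sets the pace."""
    compute_speed = clock_mhz / BASELINE_CLOCK_MHZ
    return max(work / compute_speed, data / link_speed)


if __name__ == "__main__":
    # (clock in MHz, link speed relative to a T800 link)
    for clock, link in ((20.0, 1.0), (2.0, 1.0), (20.0, 10.0)):
        r = comm_to_compute_ratio(clock, link)
        t = step_time(work=1.0, data=1.0, clock_mhz=clock, link_speed=link)
        print(f"clock {clock:4.0f} MHz, link x{link:4.1f}: "
              f"ratio {r:4.1f}:1, step time {t:.1f}")
```

In this toy model the 2 MHz part and the part with ten-times-faster links both reach a 10:1 ratio, but the slow-clock part takes ten times longer per step, since computation has become the bottleneck.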

>If a manufacturer produced boards with a t800 coupled with link driver 
>hardware that ran at 10 times the speed of the t800's links, then I would 
>be able to use ten times as many processors, and get 10 times the 
>performance.  What about it?

Why not just scrap the t800 completely?  Take the i860, hook it to
some memory with a hardware message router (like Ametek had), and let
it rip.  It would blow the transputer networks out of the water.  Look
at what Intel is doing with the iPSC.

A couple of points:
1. With a faster processor, you don't need as many processors, hence there
is less message passing to begin with.  That is exactly what happened to 
us--the T800's were not very fast and just adding more of them would 
have created a communication nightmare.  The i860's with the extra links 
solved two problems for us.  We could do the problem with just a few
(from 8 to 32) processors, and the multi-hop overhead and link
contention went away.  If you continue the trend, you can eliminate
the need for parallel/multi-computers in the first place.  

2. In general, I agree with you.  I harassed every multicomputer
vendor I talked to about speeding up their communications, but all of them
just said "we're working on it."  More bandwidth means more applications
become parallelizable.  You'd think they would work a little harder; then 
maybe they would sell more machines. 

Bruce Worden bruce@seismo.gps.caltech.edu

cca04@keele.ac.uk (P.J. Mitchell) (08/09/90)

From article <8439.9008081138@prg.oxford.ac.uk>, by ARCR1@biology.cambridge.ac.uk (Andy Raine):
> 2 t800s) were saturated, and if communications can be overlapped with 
> calculations completely, then the vector processor would just about be busy all 
> the time.  For the boards with only 4 links, only 50% utilisation of the 
> coprocessor would be expected as a maximum.  In realistic cases, data just 
> won't be available to the processor at these maximum rates.

It's possibly true for an iterative problem where all the code/data can live
in the cache memory, but as you say the general case is an underused i860.
This has been the experience, I believe, at Daresbury where they have been
testing out some i860 machines and find that data cannot be brought to the
i860 fast enough to satisfy it.
 
> I suggest that what is needed for a large number of scientific calculations 
> is a 'fine grain' processor.  In other words, if the ratio of communication 
> speed to compute speed of the t800 is taken to be 1:1, then what is needed 
> is a processor where the ratio is 10:1.  The i860 & vector boards achieve a 
> ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800 
> ratio at 1:1.

Presumably then an i860 with H1's instead of t800's would get near to the
1:1 again. Of course this is not what you want, but it could be worse.
Mind you, by the time we get H1's (or whatever they're to be called) we will
also probably have i960's and i870's too...

> If a manufacturer produced boards with a t800 coupled with link driver 
> hardware that ran at 10 times the speed of the t800's links, then I would 
> be able to use ten times as many processors, and get 10 times the 
> performance.  What about it?

I think that this is more or less the case, it's always the communications
overhead that reduces the efficiency of parallelisation and so the better
(faster) the communications the better your parallelisations.

I still find it baffling that transputer links are a) one bit wide, and b)
use one-byte packets!  I would imagine simply going to 8-bit buses and using
(say) 512-byte packets would increase performance dramatically.
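
For a sense of scale, here is a back-of-envelope sketch of the framing overhead. It assumes the standard transputer link protocol, in which each data byte travels as an 11-bit packet (a start bit, a flag bit, eight data bits and a stop bit), and a 20 Mbit/s link. Acknowledge packets and software overheads are ignored, so the results are upper bounds. The 512-byte packet and the 8-bit-wide link are hypothetical variants that keep the same three framing bits.

```python
# Upper-bound payload bandwidth of a link for different framings.
# Assumptions: 20 Mbit/s signalling rate per wire; 3 framing bits per
# packet; acknowledges and software overhead ignored.

LINK_BITS_PER_SEC = 20_000_000
FRAMING_BITS = 3   # start bit, flag bit, stop bit


def payload_bytes_per_sec(payload_bytes, width_bits=1,
                          line_rate=LINK_BITS_PER_SEC):
    """Payload bandwidth for packets carrying `payload_bytes` bytes over a
    link `width_bits` wide, each wire clocked at `line_rate` bits/s."""
    packet_bits = 8 * payload_bytes + FRAMING_BITS
    packets_per_sec = line_rate * width_bits / packet_bits
    return packets_per_sec * payload_bytes


def transfer_ms(nbytes, payload_bytes, width_bits=1):
    """Time in milliseconds to move `nbytes` at the bandwidth above."""
    return 1000.0 * nbytes / payload_bytes_per_sec(payload_bytes, width_bits)


if __name__ == "__main__":
    cases = (("1-bit link, 1-byte packets", 1, 1),
             ("1-bit link, 512-byte packets", 512, 1),
             ("8-bit link, 512-byte packets", 512, 8))
    for label, payload, width in cases:
        mb = payload_bytes_per_sec(payload, width) / 1e6
        ms = transfer_ms(8192, payload, width)
        print(f"{label}: {mb:.2f} MB/s, 8 kBytes in {ms:.2f} ms")
```

Under these assumptions, larger packets alone gain less than 40% (the framing factor is 11/8), and the eightfold width accounts for most of the improvement. The roughly 4.5 ms for 8 kBytes on the existing link is consistent with the "no less than 4 ms" bare-link estimate earlier in the thread.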

P.S. Andy, I tried to mail you the other day and got bounced. Is your mail host
     down?
--
--Paul Mitchell (CMA N.Cheshire, DoD#0145)      | Computer Centre,
JANET:  cca04@uk.ac.keele.seq1                  | University of Keele, Keele,
USENET: cca04@seq1.keele.ac.uk@nss.cs.ucl.ac.uk | Staffordshire, ST5 5BG, U.K.
BITNET: cca04%seq1.keele.ac.uk@ukacrl           | 0782 - 621111 ext 3302

br@cam-orl.UUCP (Brian Robertson) (08/10/90)

>>If a manufacturer produced boards with a t800 coupled with link driver 
>>hardware that ran at 10 times the speed of the t800's links, then I would 
>>be able to use ten times as many processors, and get 10 times the 
>>performance.  What about it?

>Why not just scrap the t800 completely?  Take the i860, hook it to
>some memory with a hardware message router (like Ametek had), and let
>it rip.  It would blow the transputer networks out of the water.  Look
>at what Intel is doing with the iPSC.

Inmos is effectively doing just that by bringing out the H1. However, it's never
called being scrapped. The H1 will be code compatible with the T8 and will 
have all the things in it to do routing for free, e.g. virtual cut-through. 
The T8 will then be made cheaper. When demand for the T8 falls too low then 
the T8 will be scrapped just like the M2.

The transputer, I believe, is forward looking in two respects. First, it
recognises that processor speed for a given technology will eventually
saturate and that to go faster we will need to use parallelism. Second, with
the increase in integration, it allows memory and glue logic to be put on chip
and to run a lot faster than if the memory and glue logic were off chip.

>A couple of points:
>1. With a faster processor, you don't need as many processors, hence there
>is less message passing to begin with.  That is exactly what happened to 
>us--the T800's were not very fast and just adding more of them would 
>have created a communication nightmare.  The i860's with the extra links 
>solved two problems for us.  We could do the problem with just a few
>(from 8 to 32) processors, and the multi-hop overhead and link
>contention went away.  If you continue the trend, you can eliminate
>the need for parallel/multi-computers in the first place.  

Another advantage of the transputer is being able to implement what I call
object-oriented HW. If a design is divided up into logical HW modules then
the HW and SW become very simple to write. Only the interface to the module
needs to be specified. The HW interface should be standard (ideally between
manufacturers), so only the SW interface needs to be specified. By making things
simple they become more reliable and easy to implement. The board dimensions 
of the module are kept small, which is useful as clock speeds go up. It 
becomes much easier to build systems, as all that is necessary is to
configure standard modules together. This has advantages for the all-important
time-to-market parameter. This is why I believe the TRAM concept is so good. 
What is true is that the prices of these off-the-shelf TRAM modules are too 
high. 

As clock speeds increase, 'kitchen sink' boards will become increasingly
difficult to build. Capacitance and inductance will mean that crosstalk
and ringing will increase. Parallel connections will have more skew problems.
This alone will mean that serial interfaces will start to have advantages in
providing transport even over quite short distances. Monomode fibre with huge
BW possibilities is obviously attractive but still expensive.

What is a shame is that Inmos may decide to keep the new H1 HW interface 
protected and so prevent other manufacturers from being compatible. I would
like to see not only faster interfaces but 'standard' interfaces between
manufacturers' devices. 

Even now, conventional 'kitchen sink' boards are not in practice 
uniprocessor devices. The Ethernet, RS232 or whatever, are all 
controlled by dedicated processors. How they communicate is completely
non-standard: some have DMA built in, some don't, etc. Often on these boards the
shared data buses have severe disadvantages, especially in real-time
applications. For example, it is not possible for too many Ethernet controllers
to share the same bus, as they may all fight for the bus at the same time.
Local processing/filtering/buffering may often be required to solve these
problems. In addition physical separation may also be required so that control
can be provided where control is needed e.g. in robot applications. 

There are also applications where a uniprocessor is perfect.

Brian Robertson
Olivetti Research
Cambridge. UK.