ARCR1@biology.cambridge.ac.uk (Andy Raine) (08/08/90)
Dear netters,

Following the exhibition at TA90 at Southampton, and following previous discussion on the net, the following has occurred to me, and I offer it as a topic for debate.

Meiko, Transtech, Microway etc. sell boards that have an Intel i860 processor interfaced to one or two t800's. INMOS have a TRAM which has a vector coprocessor attached to a t800. All these boards offer what seems to be impressive CPU performance (albeit, at the moment, only for calculations that can be expressed in terms of vector processing libraries), but one thing bothers me.

Consider: Meiko claim that their board can do a 1024 x 32-bit complex FFT in 1.3 ms. INMOS reckon their board will do the same in < 2.0 ms. The data required for this calculation is 1024 x 4 x 2 = 8 kBytes, which would take (using Meiko's CStools communications) about 8 ms to transfer from one transputer to another (using occam on a bare link would be faster, but certainly no less than 4 ms). So if all 8 links of the Meiko board (it has 2 t800s) were saturated, and if communications could be overlapped with calculations completely, then the vector processor would just about be busy all the time. For the boards with only 4 links, then only 50% utilisation of the coprocessor would be expected as a maximum. In realistic cases, data just won't be available to the processor at these maximum rates.

So what should we conclude? Well, many problems are parallelisable, but efficient algorithms depend on minimising the time spent in communicating data. The t800 has been referred to as a 'medium grain' processor, meaning that often a few tens of processors can be brought to bear efficiently on a particular problem. The new vector/i860 boards are then 'coarse grain' processors, and for the same problem, the maximum number that can be used efficiently will be smaller.

I suggest that what is needed for a large number of scientific calculations is a 'fine grain' processor. In other words, if the ratio of communication speed to compute speed of the t800 is taken to be 1:1, then what is needed is a processor where the ratio is 10:1. The i860 & vector boards achieve a ratio of 1:10 (the wrong way!), and the H1 transputer maintains the t800 ratio at 1:1.

If a manufacturer produced boards with a t800 coupled with link driver hardware that ran at 10 times the speed of the t800's links, then I would be able to use ten times as many processors, and get 10 times the performance. What about it?

OK, that's all. I seem to have written quite a lot, but I don't want to appear to be trying to force my point of view down other people's throats, just to start a discussion. So what do people think?

Andy Raine
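The arithmetic in the post above can be checked with a small sketch. It is only a rough model, assuming that every link delivers a complete 8 kByte data set independently, that all links run in parallel, and that communication overlaps computation perfectly; the function names are made up for illustration and do not come from any vendor library.

```python
def data_set_bytes(points=1024, bytes_per_value=4, components=2):
    """Size of one complex FFT input: points x bytes per value x (re, im)."""
    return points * bytes_per_value * components


def coprocessor_utilisation(compute_ms, transfer_ms, links):
    """Upper bound on the fraction of time the coprocessor is busy.

    Assumes each of `links` links delivers one full data set every
    `transfer_ms`, all links run in parallel, and communication is
    completely overlapped with computation (an idealisation).
    """
    supply_interval_ms = transfer_ms / links  # average gap between arriving data sets
    return min(1.0, compute_ms / supply_interval_ms)


if __name__ == "__main__":
    print("bytes per FFT:", data_set_bytes())
    for links in (8, 4):
        for transfer_ms in (8.0, 4.0):  # CStools figure and bare-link lower bound
            u = coprocessor_utilisation(1.3, transfer_ms, links)
            print(f"links={links} transfer={transfer_ms} ms utilisation<={u:.2f}")
```

With the 1.3 ms FFT and the 8 ms CStools transfer time, this model gives a bound of 1.0 for 8 links and 0.65 for 4 links, which is in the same range as the "just about busy" and "50%" figures quoted in the post.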
braner@THEORY.TN.CORNELL.EDU (Moshe Braner) (08/08/90)
You are absolutely right, Andy. I have the same misgivings about the higher-powered CPUs still using T800 links. The 1:1 ratio is good enough for many cases, but barely. Of course there are SOME applications where the link bandwidth is no problem.

- Moshe
bruce@seismo.gps.caltech.edu (Bruce Worden) (08/09/90)
In article <8439.9008081138@prg.oxford.ac.uk> ARCR1@biology.cambridge.ac.uk (Andy Raine) writes:
>...
>Meiko, Transtech, Microway etc. etc. sell boards that have an intel i860
>processor interfaced to one or two t800's.
>...
>but one thing bothers me:
>Consider: Meiko claim that their board can do a 1024x32bit complex FFT in
>1.3 ms.

This FFT (Meiko's) is hand-coded and the data is just the right size (8k) to be wired down into the cache. (I talked to the guy who was working on it.) There is no way a compiled, general-purpose code will ever reach this kind of performance level.

>...
>I suggest that what is needed for a large number of scientific calculations
>is a 'fine grain' processor. In other words, if the ratio of communication
>speed to compute speed of the t800 is taken to be 1:1, then what is needed
>is a processor where the ratio is 10:1. The i860 & vector boards achieve a
>ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800
>ratio at 1:1.

Different applications need different ratios. Also, the ratio isn't as important as the absolute values. E.g. if the t800 gives 1:1 then a t800 running at 2 MHz would give the 10:1 you want, but I don't think that is what you have in mind. In any given application, one thing or the other will be the bottleneck. What is needed is an i860 with proportionally fast message passing.

>If a manufacturer produced boards with a t800 coupled with link driver
>hardware that ran at 10 times the speed of the t800's links, then I would
>be able to use ten times as many processors, and get 10 times the
>performance. What about it?

Why not just scrap the t800 completely? Take the i860, hook it to some memory with a hardware message router (like Ametek had) and let it rip. It would blow the transputer networks out of the water. Look at what Intel is doing with the iPSC.

A couple of points:

1. With a faster processor, you don't need as many processors, hence there is less message passing to begin with. That is exactly what happened to us--the T800's were not very fast and just adding more of them would have created a communication nightmare. The i860's with the extra links solved two problems for us. We could do the problem with just a few (from 8 to 32) processors, and the multi-hop overhead and link contention went away. If you continue the trend, you can eliminate the need for parallel/multi-computers in the first place.

2. In general, I agree with you. I harassed every multicomputer vendor I talked to about speeding up their communications, but all of them just said "we're working on it." More bandwidth means more applications become parallelizable. You'd think they would work a little harder; then maybe they would sell more machines.

Bruce Worden
bruce@seismo.gps.caltech.edu
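The remark above that the ratio matters less than the absolute values can be illustrated with a toy model. This is a sketch under stated assumptions: times are in arbitrary units normalised so that a standard t800 takes 1 unit to compute and 1 unit to communicate per step, communication is fully overlapped with computation, and the "10x" factors are the round numbers used in the thread rather than measurements.

```python
def speed_ratio(compute_time, comm_time):
    """Ratio of communication speed to compute speed (speed = 1 / time)."""
    return compute_time / comm_time


def step_time(compute_time, comm_time):
    """Time per step when communication fully overlaps computation:
    the slower of the two activities sets the pace."""
    return max(compute_time, comm_time)


# (compute_time, comm_time) per step, in units of the standard t800 figures.
configs = {
    "t800 (baseline)":           (1.0, 1.0),
    "t800 clocked 10x slower":   (10.0, 1.0),  # ratio 10:1, yet slower overall
    "i860 with t800 links":      (0.1, 1.0),   # ratio 1:10, link-bound
    "t800 with 10x faster links": (1.0, 0.1),  # ratio 10:1, compute-bound
    "i860 with 10x faster links": (0.1, 0.1),  # ratio 1:1, 10x faster overall
}

if __name__ == "__main__":
    for name, (compute, comm) in configs.items():
        print(f"{name:28s} ratio={speed_ratio(compute, comm):5.1f} "
              f"step_time={step_time(compute, comm):5.1f}")
```

In this model the slowed-down t800 reaches the requested 10:1 ratio while taking ten times longer per step, and only the configuration that speeds up both computation and links reduces the step time by a factor of ten.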
cca04@keele.ac.uk (P.J. Mitchell) (08/09/90)
From article <8439.9008081138@prg.oxford.ac.uk>, by ARCR1@biology.cambridge.ac.uk (Andy Raine):
> 2 t800s) were saturated, and if communications can be overlapped with
> calculations completely, then the vector processor would just about be busy all
> the time. For the boards with only 4 links, then only 50% utilisation of the
> coprocessor would be expected as a maximum. In realistic cases, data just
> won't be available to the processor at these maximum rates.

It's possibly true for an iterative problem where all the code/data can live in the cache memory, but as you say the general case is an underused i860. This has been the experience, I believe, at Daresbury, where they have been testing out some i860 machines and find that data cannot be brought to the i860 fast enough to satisfy it.

> I suggest that what is needed for a large number of scientific calculations
> is a 'fine grain' processor. In other words, if the ratio of communication
> speed to compute speed of the t800 is taken to be 1:1, then what is needed
> is a processor where the ratio is 10:1. The i860 & vector boards achieve a
> ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800
> ratio at 1:1.

Presumably then an i860 with H1's instead of t800's would get near to the 1:1 again. Of course this is not what you want, but it could be worse. Mind you, by the time we get H1's (or whatever they're to be called) we will also probably have i960's and i870's too...

> If a manufacturer produced boards with a t800 coupled with link driver
> hardware that ran at 10 times the speed of the t800's links, then I would
> be able to use ten times as many processors, and get 10 times the
> performance. What about it?

I think that this is more or less the case: it's always the communications overhead that reduces the efficiency of parallelisation, and so the better (faster) the communications, the better your parallelisations.

I still find it baffling that transputer links are a) one bit wide, and b) use one-byte packets! I would imagine simply going to 8-bit buses and using (say) 512-byte packets would increase performance dramatically.

P.S. Andy, I tried to mail you the other day and got bounced. Is your mail host down?

--
--Paul Mitchell (CMA N.Cheshire, DoD#0145)        | Computer Centre,
JANET: cca04@uk.ac.keele.seq1                    | University of Keele, Keele,
USENET: cca04@seq1.keele.ac.uk@nss.cs.ucl.ac.uk  | Staffordshire, ST5 5BG, U.K.
BITNET: cca04%seq1.keele.ac.uk@ukacrl            | 0782 - 621111 ext 3302
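The point about one-byte packets can be made roughly quantitative. The sketch below assumes the commonly described INMOS link framing of 11 bits on the wire per data byte (start bit, flag bit, 8 data bits, stop bit) and a 20 Mbit/s link rate; both figures are assumptions that do not appear in the thread, acknowledge traffic and software overhead are ignored, and the multi-byte packet format is purely hypothetical.

```python
FRAMING_BITS = 3           # assumed: start bit, flag bit, stop bit per packet
LINK_BITS_PER_S = 20e6     # assumed link rate; not stated in the thread


def payload_fraction(packet_bytes, framing_bits=FRAMING_BITS):
    """Fraction of wire bits that carry data for a given packet size."""
    data_bits = 8 * packet_bytes
    return data_bits / (data_bits + framing_bits)


def peak_bytes_per_s(packet_bytes, link_bits_per_s=LINK_BITS_PER_S):
    """Upper bound on one-way data rate, ignoring acknowledges and software."""
    return link_bits_per_s * payload_fraction(packet_bytes) / 8


if __name__ == "__main__":
    for size in (1, 512):  # 512-byte packets are the hypothetical case from the post
        print(f"{size:4d}-byte packets: payload {payload_fraction(size):.3f}, "
              f"peak {peak_bytes_per_s(size) / 1e6:.2f} MB/s")
```

Under these assumptions, larger packets alone would recover only the framing overhead on a one-bit-wide link; the larger gain suggested in the post would have to come from the wider data path.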
br@cam-orl.UUCP (Brian Robertson) (08/10/90)
>>If a manufacturer produced boards with a t800 coupled with link driver
>>hardware that ran at 10 times the speed of the t800's links, then I would
>>be able to use ten times as many processors, and get 10 times the
>>performance. What about it?

>Why not just scrap the t800 completely? Take the i860, hook it to
>some memory with a hardware message router (like Ametek had) and let
>it rip. It would blow the transputer networks out of the water. Look
>at what Intel is doing with the iPSC.

Inmos is effectively doing just that by bringing out the H1. However, it's never called being scrapped. The H1 will be code compatible with the T8 and will have all the things in it to do routing for free, e.g. virtual cut-through. The T8 will then be made cheaper. When demand for the T8 falls too low, the T8 will be scrapped just like the M2.

The transputer, I believe, is forward looking in two respects. First, it recognises that processor speed for a given technology will eventually saturate and that to go faster we will need to use parallelism. Second, with the increase in integration, it allows memory and glue logic to be put on chip and to run a lot faster than if the memory and glue logic were off chip.

>A couple of points:
>1. With a faster processor, you don't need as many processors, hence there
>is less message passing to begin with. That is exactly what happened to
>us--the T800's were not very fast and just adding more of them would
>have created a communication nightmare. The i860's with the extra links
>solved two problems for us. We could do the problem with just a few
>(from 8 to 32) processors, and the multi-hop overhead and link
>contention went away. If you continue the trend, you can eliminate
>the need for parallel/multi-computers in the first place.

Another advantage of the transputer is being able to implement what I call object oriented HW. If a design is divided up into logical HW modules then the HW and SW become very simple to write. Only the interface to the module needs to be specified. The HW interface should be standard (ideally between manufacturers), so only the SW interface needs to be specified. By making things simple they become more reliable and easy to implement. The board dimensions of the module are kept small, which is useful as clock speeds go up. It becomes much easier to build systems, as all that is necessary is to configure standard modules together. This has advantages for the all-important time-to-market parameter. This is why I believe the TRAM concept is so good. What is true is that the prices of these off-the-shelf TRAM modules are too high.

As clock speeds increase, 'kitchen sink' boards will become increasingly difficult to build. Capacitance and inductance will mean that crosstalk and ringing will increase. Parallel connections will have more skew problems. This alone will mean that serial interfaces will start to have advantages in providing transport even over quite short distances. Monomode fibre with huge BW possibilities is obviously attractive but still expensive.

What is a shame is that Inmos may decide to keep the new H1 HW interface protected and so prevent other manufacturers from being compatible. I would like to see not only faster interfaces but 'standard' interfaces between manufacturers' devices.

Even now, conventional 'kitchen sink' boards are not in practice uniprocessor devices. The ethernet, rs232 or whatever are all controlled by dedicated processors. How they communicate is completely non-standard: some have DMA built in, some don't, etc. Often on these boards the shared data buses have severe disadvantages, especially in real-time applications. For example, it is not possible for too many ethernet controllers to share the same bus, as they may all fight for the bus at the same time. Local processing/filtering/buffering may often be required to solve these problems. In addition, physical separation may also be required so that control can be provided where control is needed, e.g. in robot applications.

There are also applications where a uniprocessor is perfect.

Brian Robertson
Olivetti Research
Cambridge, UK.