wxh@lanl.gov (William x Harvey) (05/31/90)
The following information is extracted from the April 1990 BYTE (European edition), "Virtual Channels: The Next Generation of Transputers," by Dick Pountain.

Current transputer hardware and software only partially embody the Communicating Sequential Processes (CSP) model. Distributing processes onto multiple transputers is limited because the transputer has only four links, which cramps the way programs are written and loses some of the advantages of the CSP model. To reduce the number of channels used, you must either rewrite the single-processor program, which defeats one objective of Occam, its "scheduling invariance", or force the processes to share links by multiplexing. An example of this limitation is the need to use additional transputers at each node of a hypercube just to get link connectivity.

David May of Inmos, Bristol, gave out some information on the H1, which has a solution to the interconnectivity problem. Inmos felt that adding more on-chip links would not help much, and instead added multiplexing hardware that works with a high-speed routing chip to implement packet switching. Links now become virtual channels, with each program having as many as it needs. The routing chip delivers messages to a named destination. Two processes anywhere in a network communicate using a named channel, without caring about routing, which makes programs more portable.

The H1 still has four links on board, but it divides messages into packets and interleaves these onto a single link. The links operate at 100 megabits per second, full duplex. Messages are split into 32-byte packets and sent with a one- or two-byte header and an end-of-packet or end-of-message trailer. The receiver acknowledges every packet with an empty packet. Each virtual link has two virtual channels, one to send and one to receive. A sending process waits until its last sent packet is acknowledged, which happens as soon as the receiver sees the packet's first byte, allowing unbroken transmission if the receiver is ready. If it is not ready, a one-packet buffer for each link is provided in memory, permitting any number of virtual channels.

To use the H1 as anything other than an improved T800 requires the C104 routing chip, which contains 32 transputer links and a 32 by 32 crossbar switch. Connecting multiple C104 chips allows a complex switching network. The C104 contains only simple logic to route messages. It uses wormhole routing, in which the header instructs the C104 which link to connect, and that connection is closed at the end of the message. A node in use will stall a competing message; however, with a good routing strategy, low latency is observed.

Inmos chose interval routing, which is complete (every packet eventually arrives), deadlock-free, inexpensive, fast, scalable (good for any size of network), versatile (good for any topology), and can be near optimal (shortest routes are followed). A number assigned to each node is used as its address. At every routing switch, each link knows which range of header values it should recognize. The intervals are non-overlapping and numbered so that every header falls into exactly one interval. Each interval in effect points to the subtree containing all the processor numbers in that range. The attraction of this scheme is that the routing decision can be made with a single comparison. The C104 allows one- or two-byte headers, giving up to 65536 transputers in a single network. Algorithms exist to label a network so that no cyclic paths arise, making the routing deadlock-free.
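The article does not describe the C104's internal data structures, but the interval idea itself is easy to sketch. The C fragment below is a minimal illustration under my own assumptions: a single switch with four output links and node addresses 0 to 15; the names route_link and bound are mine, not Inmos's. In the real chip each link's comparator would test its interval in parallel; a loop stands in for that here.

    /* Sketch of interval-labelling routing (illustrative only).
     * Each output link i serves destinations bound[i] .. bound[i+1]-1;
     * the intervals are non-overlapping and cover the whole address range. */
    #include <stdio.h>

    #define NLINKS 4
    #define NNODES 16

    static const int bound[NLINKS + 1] = { 0, 4, 8, 12, 16 };

    /* Return the output link for a packet whose header names 'dest'. */
    static int route_link(int dest)
    {
        int i;
        for (i = 0; i < NLINKS; i++)
            if (dest >= bound[i] && dest < bound[i + 1])
                return i;       /* exactly one interval matches */
        return -1;              /* malformed header */
    }

    int main(void)
    {
        int d;
        for (d = 0; d < NNODES; d++)
            printf("destination %2d -> link %d\n", d, route_link(d));
        return 0;
    }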
An optimal labeling scheme for an arbitrary network is not known, but schemes for the more common topologies such as rings, trees, and two-dimensional arrays are. Grid and hypercube schemes can be constructed from these. When too many messages go through a single node, a hot spot develops, stalling a large number of packets. This can be avoided by distributing the flow evenly. The C104 can do this with universal routing: each packet is first sent to a randomly chosen node, from which it is forwarded to its destination (a small sketch of this idea appears at the end of this post). Latency increases and throughput decreases, but worst-case performance is near optimal. Extra headers can also join additional networks together, to whatever level is needed.

Packet switching removes certain restrictions in Occam. The present version is a static language in which resources must be determined at compile time, so new processes cannot be added at run time on remote processors. Also, each channel must be explicitly allocated to a physical link at compile time. The H1 and the C104 allow Full Occam to be implemented. It looks the same, but PLACE need not be used to allocate channels to links; the routing system handles this. The compiler will be able to PLACE processes on processors itself. Programs become portable to any routed array of transputers, though efficiency will still be determined by the topology. The next step is Dynamic Full Occam, in which the compile-time allocation restrictions are removed, allowing recursive processes, replicators with run-time indexes, and reallocation of processors on the fly.

The H1 will also have an on-chip floating-point unit and four communication links. The core processor will use on-chip caching and a DRAM controller supporting static-column addressing. Peak performance should be 100 MIPS or 20 MFLOPS. A memory-protection scheme covering code, data, stack, and heap will help in implementing operating systems such as Unix. A many-to-one channel instruction will allow many processes to share a common resource, such as a file server or graphics engine, more efficiently than ALT does. To ease the implementation of semaphores, the instructions SIGNAL and WAIT will be introduced. Hardware support for exception handling in accordance with the IEEE floating-point standard is also added. Simulations suggest the H1/C104 pair will perform close to the theoretical maximum: the average message delay goes from 12 to 27 microseconds in moving from a 64-processor to a 16384-processor hypercube.

The author, Dick Pountain, lives in London. He can be contacted on BIX as "dickp."

-----
Billy Harvey                              wxh@lanl.gov
Phone: +49 6565 4342 (W. Germany)
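As promised above, here is a small sketch of the two-phase "universal routing" idea: send each packet to a random intermediate node first, then on to its real destination, so that traffic aimed at one place does not all follow the same path. This illustrates only the principle, not the C104's implementation; the node count and the names make_packet and via are my own.

    /* Sketch of two-phase (universal) routing, illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NNODES 64

    struct packet {
        int via;    /* random intermediate node (phase 1 target) */
        int dest;   /* final destination (phase 2 target) */
    };

    static struct packet make_packet(int dest)
    {
        struct packet p;
        p.via = rand() % NNODES;   /* spread traffic to avoid hot spots */
        p.dest = dest;
        return p;
    }

    int main(void)
    {
        int i;
        srand((unsigned)time(NULL));
        /* Every source sends to node 0: without randomisation this would
         * concentrate traffic; with it, phase-1 hops are spread uniformly. */
        for (i = 0; i < 8; i++) {
            struct packet p = make_packet(0);
            printf("packet %d: first to node %2d, then to node %d\n",
                   i, p.via, p.dest);
        }
        return 0;
    }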