mackeown@CompSci.Bristol.AC.UK (William Mackeown) (10/26/90)
The following article from New Electronics, Sept 1990, was sent to the transputer mailing list and posted in comp.sys.transputer <13001...> by rabagliatia@com.inmos.isnet.

William M.

---------------------------------------------------------------------

Inmos - H1 - architecture revealed

Since the Transputer's launch five years ago there have been no major changes to the microprocessor. Now, on the eve of the device's fifth birthday, Dr Clive Dyson outlines the architecture of the H1, the next generation Transputer.

The revolutionary architecture of the Inmos Transputer created a storm when it was first revealed in the mid-1980s. A microprocessor with on-chip ram and four serial communication links was dramatically different to other 32-bit processors then appearing on the market. But the Transputer has matured into a credible force in the microprocessor market and, according to Dataquest figures, is now the third most popular device behind the products of Intel and Motorola.

However, the complexion of the 32-bit microprocessor market is changing as embedded systems continue to account for a growing share. Dataquest predicts the market will grow to 28 million units in 1993 from 4.3 million units in 1988. Over this period the share taken by system cpus is predicted to fall to 28% from 64% while that for embedded system processors is forecast to rise to 61% from 30%.

The Transputer is well placed to take advantage of this shift away from 'conventional' processors; indeed the majority of the 200,000 Transputers Inmos shipped worldwide in 1989 were built into embedded systems. This is no idle coincidence; the architecture of the Transputer was originally designed to suit it to the demands of real-time computing.

A key requirement in most programmed systems, especially embedded systems, is the ability of the processor to switch context efficiently at an interrupt or timeslice between processes or tasks. These processes also have to communicate with each other.
The Transputer is unique in providing hardware support for process scheduling and specific instructions for inter-process communication. Processes can be written in C or Fortran supported by a kernel or operating system. Alternatively, programs can be written in parallel versions of C, Fortran, Ada, or in Occam, in which case the scheduling capabilities of the Transputer are used directly.

Furthermore, as the computational loads on embedded controllers increase, the ability of the Transputer to produce scalable multiprocessor systems is crucial. Message passing over dedicated point-to-point links will always prove more deterministic than communication over a shared backplane.

Finally, real-time embedded systems have to be compact both in terms of a low component count and in terms of the amount of code they require. The Transputer meets both these requirements.

However, the existing Transputer can be improved further to enhance its suitability for embedded systems and a team at Inmos' Bristol design centre have been working for the past two years to do just that. The device, codenamed H1, will be launched early next year.

The design goal for the H1 was to establish a new standard in single processor performance while enhancing the Transputer's position as the premier multiprocessing microprocessor. This had to be achieved while maintaining compatibility with existing Transputer products. To meet these goals a new micro-architecture has been developed which implements the same instruction set as the existing T805 Transputer.

The H1 provides an order of magnitude increase in performance combined with enhanced capabilities to support the emerging software standards in the embedded systems market. Inmos is also designing a range of network communication products to complement the H1. These products are based on a new 100Mbits/s link protocol which supports the dynamic routing of messages between processors.
The key features of the H1 architecture are a pipelined superscalar processor alu combined with on-chip cache ram and improved communications which make multiprocessor programming easier.

The major design goal of achieving a significant performance increase, while maintaining instruction set compatibility with the T805 Transputer, produced a design which gives a peak performance in excess of 150mips and 20Mflops with a sustained performance exceeding 60mips and 10Mflops.

A number of design features have contributed to this performance. The processor itself uses a pipelined superscalar architecture which is able to execute up to eight instructions on each clock cycle and operates at a clock speed of 50MHz (a consequence of a sub-micron cmos process). The number of cycles required to execute many of the instructions, such as integer and floating point multiply and logical shift, has been reduced dramatically. The T805, for example, requires 38 clock cycles for an integer multiply operation; the H1 will need a small fraction of that number.

Unlike other superscalar machines the H1 architecture does not require an advanced compiler to schedule the different functional units in the processor. The flow of multiple instructions through the pipeline is controlled by hardware. It is not necessary for existing compilers to be modified or for source code to be recompiled to obtain the full claimed performance.

The triple metal layer, sub-micron cmos process has enabled 16Kbytes of on-chip cache memory to be provided. The move to a cached architecture is a radical change from the simple on-chip memory provided on earlier Transputers. It is achievable because 16Kbytes is a sufficiently large cache to result in high hit rates for most applications. However, it is possible to run the cache as on-chip ram for applications which only require small amounts of memory, or which cannot tolerate the indeterminate behaviour caused by cache line misses.
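To illustrate the trade-off between cache mode and on-chip ram mode described above, here is a minimal Python sketch of the standard average-access-time calculation (hit time plus miss rate times miss penalty). The article gives no cycle counts for the H1 cache, so the function name and every figure used with it are hypothetical, chosen only to show the shape of the argument.

```python
def average_access_cycles(hit_rate: float,
                          hit_cycles: float,
                          miss_penalty_cycles: float) -> float:
    """Average cycles per memory access for a cache.

    All parameters are hypothetical illustration values; none of them
    come from the article or from Inmos documentation.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie between 0 and 1")
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


def worst_case_access_cycles(hit_cycles: float,
                             miss_penalty_cycles: float) -> float:
    """Cycles for a single access that misses the cache."""
    return hit_cycles + miss_penalty_cycles


if __name__ == "__main__":
    # Assumed figures: 1-cycle hit, 10-cycle line refill penalty.
    for rate in (0.80, 0.95, 0.99):
        print(rate, average_access_cycles(rate, 1.0, 10.0))
    print("worst case:", worst_case_access_cycles(1.0, 10.0))
```

With a high hit rate the average stays close to the hit time, but any individual access may still pay the full miss penalty. That gap between average and worst case is the indeterminacy that running the memory as plain on-chip ram avoids.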
Great care has been taken to ensure that the H1 Transputer will provide this high performance in low component count systems. For example, there is a 64-bit data bus, which can sustain high data transfer rates for cache line refill, with a programmable memory interface. The interface supports four independent banks of external memory and the timing for each bank can be configured independently. System designers could choose, for example, to fill two banks with dram, one bank with video ram, and the other with peripherals. Such a system would require no external support logic.

The H1 supports, in hardware, the same scheduling algorithms as current Transputers. However, on the H1 each process can be augmented with a trap handler process. If an error such as integer overflow occurs then the trap handler copes with the error in software before returning control to the process.

A separate user mode is also supported by the H1. In this mode, privileged instructions (which include communications and scheduling instructions) cannot be executed. All memory accesses are checked and translated from the logical to the physical address space. The memory protection and address translation mechanisms are designed specifically to support secure programming and debugging in embedded systems. For dedicated (single user) systems the protection aids the detection of programming errors. For multiuser general purpose computer systems, it allows users and the operating system to be protected from erroneous or malicious programs. By concentrating on the requirements of embedded systems, the protection and translation mechanisms allow the processor to execute code in protected (user) mode as efficiently as in normal processes but without the overhead associated with page-based virtual memory.
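The article does not say how the H1 checks and translates user-mode addresses, only that it does so without page-based virtual memory. As one possible illustration, the Python sketch below uses a small table of base-and-size regions; the Region fields, the translate function and the ProtectionError exception are all hypothetical and should not be read as the H1's actual mechanism.

```python
from dataclasses import dataclass
from typing import List


class ProtectionError(Exception):
    """Raised when a user-mode access falls outside every permitted region."""


@dataclass(frozen=True)
class Region:
    # Hypothetical descriptor; the article does not define the real format.
    logical_base: int
    size: int
    physical_base: int
    writable: bool


def translate(regions: List[Region], logical_addr: int, write: bool = False) -> int:
    """Check a logical address against the region table and translate it."""
    for region in regions:
        offset = logical_addr - region.logical_base
        if 0 <= offset < region.size:
            if write and not region.writable:
                raise ProtectionError(f"write to read-only address {logical_addr:#x}")
            return region.physical_base + offset
    raise ProtectionError(f"address {logical_addr:#x} is not mapped")


if __name__ == "__main__":
    table = [
        Region(0x0000, 0x1000, 0x80000, writable=False),  # code (assumed layout)
        Region(0x1000, 0x0800, 0x90000, writable=True),   # data (assumed layout)
    ]
    print(hex(translate(table, 0x0010)))
    print(hex(translate(table, 0x1004, write=True)))
```

A handful of region comparisons can be done without walking page tables, which is the kind of saving the article alludes to. An erroneous access is reported rather than silently corrupting another process.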
Additionally, enhancements on the H1 allow programmers to write more efficient real-time kernels; the state of the machine, the process and timer queues, timeslicing and interruptability can be accessed and controlled.

One limitation of existing Transputer networks is the need to match the algorithms to the interconnectivity in a specific machine. This means the software is not readily ported to other machines with different link topologies. The H1 eliminates this problem by providing hardware which allows Transputers to be connected via a quick communications network. Communication channels may be established between processes on any two Transputers in the network. This simplifies programming because processes can be allocated to Transputers after the program has been written. Different allocations can be made for different machines and the allocation can be changed to optimise performance. It is possible, in principle, for the allocation to be made by the compiler effectively removing all configuration details from the program.

The H1 Transputer itself contains a separate communications processor which multiplexes a large number of logical communications channels (virtual channels) onto each of its four physical links. Messages are transmitted along virtual channels as a sequence of packets all of which, except the last, contain 32 bytes of data. Each packet starts with a header which is used to route the packet through the network and to identify the destination virtual channel on the remote Transputer.

The communication network is constructed using a separate routing device, the C104. Small numbers of H1s can be connected using a single C104. In larger systems a number of C104 devices can be connected together to form a hypercube, a multi-dimensional grid, or a tree network. Each C104 has 32 bidirectional links.
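The splitting of a message into packets on a virtual channel, as described above, can be sketched in a few lines of Python. The 32-byte data size comes from the article; the Packet fields, the integer header and the explicit end-of-message flag are assumptions made for illustration, since the article does not give the packet layout.

```python
from dataclasses import dataclass
from typing import List

PACKET_DATA_BYTES = 32  # from the article: every packet except the last carries 32 bytes


@dataclass
class Packet:
    header: int   # hypothetical: stands in for the routing / virtual channel header
    data: bytes
    last: bool    # hypothetical end-of-message marker


def packetise(header: int, message: bytes) -> List[Packet]:
    """Split one message into packets for a single virtual channel."""
    chunks = [message[i:i + PACKET_DATA_BYTES]
              for i in range(0, len(message), PACKET_DATA_BYTES)] or [b""]
    return [Packet(header, chunk, index == len(chunks) - 1)
            for index, chunk in enumerate(chunks)]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message at the receiving end, assuming in-order delivery."""
    return b"".join(packet.data for packet in packets)


if __name__ == "__main__":
    packets = packetise(header=7, message=bytes(range(70)))
    print([len(p.data) for p in packets], [p.last for p in packets])
```

Because every packet carries the same header, packets from many virtual channels can be interleaved on one physical link and still be sorted out at the destination.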
The header of each packet arriving on a link input is used to determine the output link for that packet, which is then transmitted when the output link becomes free. An algorithm called 'interval labelling' decides which link should be the output connection for each packet. A continuous set of header values, an interval, is allocated to each output link. The header of an incoming packet will lie within only one range and the packet will be directed out of the associated link. Using this algorithm it is possible to devise the optimal labelling scheme, which is free of deadlock, for all the common network topologies.

The C104 provides additional circuitry to allow networks to be connected together and to reduce the impact of message congestion on worst case latency and bandwidth in heavily loaded networks.

The balance of processing power, multiprocessing capabilities and support for standard software will allow the H1 to take up the challenges of the 1990s.

Dr Clive Dyson is the Transputer development manager at Inmos, Bristol.

New Electronics, September 1990
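As an illustration of the interval labelling scheme described in the article, the Python sketch below maps a packet header to an output link by finding the interval that contains it. The IntervalRouter class, the half-open [low, high) convention and the example table are assumptions; the article does not specify how the C104 stores or compares its intervals.

```python
from bisect import bisect_right
from typing import List, Tuple


class IntervalRouter:
    """Toy model of interval-labelled routing (not the C104's actual design)."""

    def __init__(self, intervals: List[Tuple[int, int, int]]):
        # Each entry is (low, high_exclusive, output_link); the half-open
        # convention is an assumption made for this sketch.
        ordered = sorted(intervals)
        for low, high, _ in ordered:
            if low >= high:
                raise ValueError("empty interval")
        for (_, prev_high, _), (next_low, _, _) in zip(ordered, ordered[1:]):
            if next_low < prev_high:
                raise ValueError("intervals overlap")
        self._intervals = ordered
        self._lows = [low for low, _, _ in ordered]

    def output_link(self, header: int) -> int:
        """Return the output link whose interval contains the header."""
        index = bisect_right(self._lows, header) - 1
        if index >= 0:
            low, high, link = self._intervals[index]
            if low <= header < high:
                return link
        raise LookupError(f"no interval contains header {header}")


if __name__ == "__main__":
    # Hypothetical labelling for a router with three output links in use.
    router = IntervalRouter([(0, 4, 0), (4, 10, 1), (10, 16, 2)])
    for header in (0, 3, 4, 15):
        print(header, "->", router.output_link(header))
```

Since the intervals do not overlap, each header selects at most one output link, matching the article's statement that a header lies within only one range. Whether a particular labelling is deadlock-free depends on the topology and is not checked by this sketch.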