[comp.sys.transputer] New Electronics Article, Sept 1990

RABAGLIATIA@isnet.inmos.com (10/24/90)

The following is an article that appeared in New Electronics, Sept. 1990.

It is copied without permission.  Any spelling mistakes are those of the
OCR software I used.

Cheers,  Andy Rabagliati    EMAIL:- rabagliatia@isnet.inmos.COM

----------------------------------------------------

Inmos - H1 - architecture revealed

Since the Transputer's launch five years ago there have been no major
changes to the microprocessor. Now, on the eve of the device's fifth
birthday, Dr Clive Dyson outlines the architecture of the H1, the next
generation Transputer.

The revolutionary architecture of the Inmos Transputer created a storm
when it was first revealed in the mid 1980s. A microprocessor with on chip
ram and four serial communication links was dramatically different to
other 32bit processors then appearing on the market.

But the Transputer has matured into a credible force in the microprocessor
market and, according to Dataquest figures, is now the third most popular
device behind the products of Intel and Motorola.

However, the complexion of the 32bit microprocessor market is changing
with embedded systems continuing to account for a growing share. Dataquest
predicts the market will grow to 28 million units in 1993 from 4.3 million
units in 1988. Over this period the share taken by system cpus is
predicted to fall to 28% from 64% while that for embedded system
processors is forecast to rise to 61% from 30%.

The Transputer is well placed to take advantage of this shift away from
'conventional' processors; indeed the majority of the 200,000 Transputers
Inmos shipped worldwide in 1989 were built into embedded systems. This is
no idle coincidence; the architecture of the Transputer was originally
designed to suit it to the demands of real time computing.

A key requirement in most programmed systems, especially embedded systems,
is the ability of the processor to switch context efficiently at an
interrupt or timeslice between processes or tasks. These processes also
have to communicate with each other. The Transputer is unique in providing
hardware support for process scheduling and specific instructions for
inter-process communication.

Processes can be written in C or Fortran supported by a kernel or
operating system.  Alternatively, programs can be written in parallel
versions of C, Fortran, Ada or in Occam, in which case the scheduling
capabilities of the Transputer are used directly.  Furthermore, as the
computational loads on embedded controllers increase, the ability of the
Transputer to produce scalable multiprocessor systems is crucial. Message
passing over dedicated point to point links will always prove more
deterministic than communication over a shared backplane. Finally, real
time embedded systems have to be compact both in terms of a low component
count and in the amount of code they require. The Transputer meets both
these requirements.

However, the existing Transputer can be improved further to enhance its
suitability for embedded systems and a team at Inmos Bristol design centre
have been working for the past two years to do just that. The device,
codenamed H1, will be launched early next year.

The design goal for the H1 was to establish a new standard in single
processor performance while enhancing the Transputer's position as the
premier multiprocessing microprocessor. This had to be achieved while
maintaining compatibility with existing Transputer products.

To meet these goals a new micro-architecture has been developed which
implements the same instruction set as the existing T805 Transputer. The
H1 provides an order of magnitude increase in performance combined with
enhanced capabilities to support the emerging software standards in the
embedded systems market.

Inmos is also designing a range of network communication products to
complement the H1. These products are based on a new 100Mbits/s link
protocol which supports the dynamic routing of messages between
processors.

The key features of the H1 architecture are a pipelined superscalar
processor alu combined with on-chip cache ram and improved communications
which make multiprocessor programming easier.

The major design goal of achieving a significant performance increase,
while maintaining instruction set compatibility with the T805 Transputer,
produced a design which gives a peak performance in excess of 150mips and
20Mflops with a sustained performance exceeding 60mips and 10Mflops.

A number of design features have contributed to this performance. The
processor itself uses a pipelined superscalar architecture which is able to
execute up to eight instructions on each clock cycle and operates at a clock
speed of 50MHz (a consequence of a sub-micron cmos process).

The number of cycles required to execute many of the instructions, such as
integer and floating point multiply and logical shift, has been reduced
dramatically. The T805, for example, requires 38 clock cycles for an
integer multiply operation; the H1 will need a small fraction of that
number.

Unlike other superscalar machines the H1 architecture does not require an
advanced compiler to schedule the different functional units in the
processor. The flow of multiple instructions through the pipeline is
controlled by hardware. It is not necessary for existing compilers to be
modified, or for source code to be recompiled, to obtain the full claimed
performance.

The triple metal layer, sub-micron cmos process has enabled 16kbytes of
on-chip cache memory to be provided. The move to a cached architecture is
a radical change from the simple on-chip memory provided on earlier
Transputers. It is achievable because 16kbytes is a sufficiently large
cache to result in high hit rates for most applications. However, it is
possible to run the cache as on-chip ram for applications which only
require small amounts of memory, or which cannot tolerate the
indeterminate behaviour caused by cache line misses.

Great care has been taken to ensure that the H1 Transputer will provide
this high performance in low component count systems.  For example, there
is a 64bit data bus, which can sustain high data transfer rates for cache
line refill, with a programmable memory interface. The interface supports
four independent banks of external memory and the timing for each bank can
be configured independently. System designers could choose, for example,
to fill two banks with dram, one bank with video ram and the other with
peripherals. Such a system would require no external support logic.

The H1 supports, in hardware, the same scheduling algorithms as current
Transputers. However, on the H1 each process can be augmented with a trap
handler process. If an error such as integer overflow occurs then the trap
handler copes with the error in software before returning control to the
process.

A separate user mode is also supported by the H1. In this mode privileged
instructions (which include communications and scheduling instructions)
cannot be executed. All memory accesses are checked and translated from the
logical to the physical address space.
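
The article does not say how the checking and translation are carried
out. As a rough illustration only, the Python sketch below assumes a
small table of regions, each with a logical base, a size and a physical
base. The Region fields, the writable flag and the ProtectionError name
are all hypothetical and are not taken from the H1 design.

```python
from dataclasses import dataclass
from typing import List


class ProtectionError(Exception):
    """Raised when a user-mode access falls outside every permitted region."""


@dataclass(frozen=True)
class Region:
    logical_base: int   # first logical address covered (hypothetical field)
    size: int           # number of bytes covered
    physical_base: int  # physical address that logical_base maps to
    writable: bool      # assumption: regions can be marked read-only


def translate(regions: List[Region], logical_address: int, write: bool = False) -> int:
    """Check a logical address against the region table and return the physical one."""
    for region in regions:
        offset = logical_address - region.logical_base
        if 0 <= offset < region.size:
            if write and not region.writable:
                raise ProtectionError("write to read-only region")
            return region.physical_base + offset
    raise ProtectionError("address outside all regions")


regions = [
    Region(logical_base=0x0000, size=0x1000, physical_base=0x80000, writable=False),
    Region(logical_base=0x1000, size=0x0800, physical_base=0x40000, writable=True),
]
code_address = translate(regions, 0x0010)              # read from the first region
data_address = translate(regions, 0x1004, write=True)  # write to the second region
```

With a table this small the check is a handful of comparisons per
access, which is one way a scheme of this kind could stay cheap.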

The memory protection and address translation mechanisms are designed
specifically to support secure programming and debugging in embedded
systems. For dedicated (single user) systems the protection aids the
detection of programming errors. For multiuser general purpose compute
systems it allows users and the operating system to be protected from
erroneous or malicious programs.

By concentrating on the requirements of embedded systems the protection
and translation mechanisms allow the processor to execute code in
protected (user) mode as efficiently as in normal processes but without
the overhead associated with page based virtual memory. Additionally,
enhancements on the H1 allow programmers to write more efficient real time
kernels; the state of the machine, the process and timer queues,
timeslicing and interruptability can be accessed and controlled.

One limitation of existing Transputer networks is the need to match the
algorithms to the interconnectivity in a specific machine. This means the
software is not readily ported to other machines with different link
topologies. The H1 eliminates this problem by providing hardware which
allows Transputers to be connected via a quick communications network.
Communication channels may be established between processes on any two
Transputers in the network.

This simplifies programming because processes can be allocated to
Transputers after the program has been written. Different allocations can
be made for different machines and the allocation can be changed to
optimise performance. It is possible, in principle, for the allocation to
be made by the compiler, effectively removing all configuration details
from the program.
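
The separation of program and allocation described above can be pictured
with a small Python sketch. The process names, the Transputer names and
the classify_channels helper are hypothetical. The point is only that
the channel list never mentions a processor, so the placement table can
be changed without touching the program.

```python
from typing import Dict, List, Tuple

# The program: processes joined by channels. No processor appears here.
CHANNELS: List[Tuple[str, str]] = [
    ("producer", "filter"),
    ("filter", "consumer"),
]


def classify_channels(
    channels: List[Tuple[str, str]], placement: Dict[str, str]
) -> Dict[Tuple[str, str], str]:
    """Mark each channel 'local' if both ends share a Transputer, else 'network'."""
    return {
        (src, dst): "local" if placement[src] == placement[dst] else "network"
        for src, dst in channels
    }


# Two alternative allocations of the same, unmodified program.
single_machine = {"producer": "T0", "filter": "T0", "consumer": "T0"}
two_machines = {"producer": "T0", "filter": "T0", "consumer": "T1"}

on_one = classify_channels(CHANNELS, single_machine)
on_two = classify_channels(CHANNELS, two_machines)
```

On one Transputer every channel is local; on two, the channel from
filter to consumer is carried by the network.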

The H1 Transputer itself contains a separate communications processor
which multiplexes a large number of logical communications channels
(virtual channels) onto each of its four physical links. Messages are
transmitted along virtual channels as a sequence of packets all of which,
except the last, contain 32bytes of data.  Each packet starts with a
header which is used to route the packet through the network and to
identify the destination virtual channel on the remote Transputer.
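
The splitting of a message into packets can be sketched in Python as
follows. Only the 32 byte data size comes from the article. The Packet
layout, the integer header and the last flag are assumptions made for
illustration, as is the choice to send one empty packet for an empty
message.

```python
from dataclasses import dataclass
from typing import List

PACKET_DATA_BYTES = 32  # from the article: every packet but the last carries 32 bytes


@dataclass
class Packet:
    header: int  # hypothetical: stands in for route and destination virtual channel
    data: bytes
    last: bool   # hypothetical marker for the final packet of a message


def packetize(header: int, message: bytes) -> List[Packet]:
    """Split a message into 32 byte packets; the final packet may be shorter."""
    chunks = [
        message[i:i + PACKET_DATA_BYTES]
        for i in range(0, len(message), PACKET_DATA_BYTES)
    ]
    if not chunks:
        chunks = [b""]  # assumption: an empty message still sends one packet
    final = len(chunks) - 1
    return [Packet(header, chunk, last=(n == final)) for n, chunk in enumerate(chunks)]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message at the receiving virtual channel."""
    return b"".join(packet.data for packet in packets)


message = bytes(range(70))
packets = packetize(header=7, message=message)
```

A 70 byte message yields three packets carrying 32, 32 and 6 bytes, all
with the same header.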

The communication network is constructed using a separate routing device,
the C104. Small numbers of H1s can be connected using a single C104. In
larger systems a number of C104 devices can be connected together to
form a hypercube, a multi-dimensional grid or tree network.

Each C104 has 32 bidirectional links. The header of each packet arriving
on a link input is used to determine the output link for that packet,
which is then transmitted when the output link becomes free.

An algorithm called 'interval labelling' decides which link should be the
output connection for each packet. A continuous set of header values, an
interval, is allocated to each output link. The header of an incoming
packet will lie within only one range and the packet will be directed out
of the associated link. Using this algorithm it is possible to devise the
optimum labelling scheme, which is free of deadlock, for all the common
network topologies.
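
The lookup step of interval labelling can be sketched in Python as
below. This shows only how a header selects an output link from a table
of intervals. It does not show how a deadlock-free labelling is chosen
for a topology, and the IntervalRouter class, the half-open interval
convention and the example header ranges are assumptions, not the C104
register layout.

```python
from bisect import bisect_right
from typing import List, Tuple


class IntervalRouter:
    """Selects an output link for a header from non-overlapping intervals."""

    def __init__(self, intervals: List[Tuple[int, int, int]]):
        # Each entry is (low, high, link): headers with low <= header < high
        # leave on that link. Entries are sorted so a binary search can be used.
        self.intervals = sorted(intervals)
        for low, high, _ in self.intervals:
            if low >= high:
                raise ValueError("empty interval")
        for (_, high, _), (next_low, _, _) in zip(self.intervals, self.intervals[1:]):
            if high > next_low:
                raise ValueError("intervals overlap")
        self.lows = [low for low, _, _ in self.intervals]

    def output_link(self, header: int) -> int:
        """Return the link whose interval contains the header."""
        index = bisect_right(self.lows, header) - 1
        if index >= 0:
            _, high, link = self.intervals[index]
            if header < high:
                return link
        raise LookupError(f"no interval contains header {header}")


# Hypothetical router with four links, each owning eight header values.
router = IntervalRouter([(0, 8, 0), (8, 16, 1), (16, 24, 2), (24, 32, 3)])
link_for_10 = router.output_link(10)
```

Because the intervals do not overlap, a header matches at most one of
them, so the routing decision is a single table search.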

The C104 provides additional circuitry to allow networks to be connected
together and to reduce the impact of message congestion on worst case
latency and bandwidth in heavily loaded networks.

The balance of processing power, multiprocessing capabilities and support
for standard software will allow the H1 to take up the challenges of the
1990s.

Dr Clive Dyson is the Transputer development manager at Inmos, Bristol

New Electronics September 1990