[comp.sys.transputer] Experiences with Transputer VME Boards

koontz@aplvax.jhuapl.edu (Ken Koontz) (02/28/90)

>We are looking at putting together a system consisting of
>Transputers on VME cards linking together VME chassis, which
>contain other processor (68020) cards.  ...

>I would be interested in anyone's experience with these or
>similar development systems, or any recommendations.  Thanks.

>     Phillip L. Shaffer                        shaffer@crd.ge.com
>     GE Corporate Research & Development       uunet!crd.ge.com!shaffer
>     Building KW, Room D211
>     P.O. Box 8, Schenectady NY 12301

_________________________________________________________________________

To     : Phillip Shaffer, GE Corp R&D and TMAIL at large
>From   : Ken Koontz, JHU/APL
Subject: Experience with Transputer VME Boards

Dear Phillip,

     I've been experimenting and developing with transputers on 
the VMEbus for over a year now.  We have several Navy prototypes 
that are based on VMEbus and multiple 68020s or 030s that we are 
infusing transputer technology into.  The most recent project
uses a transputer array between a special-purpose processor and 
the general purpose 68020s to do some signal processing.  Because 
of the input rate of the data, a special-purpose processor is 
required to sample and presort the data for the transputers.

     To implement the array, we're using both Dual-Ported (DP) RAM
type boards and non-DP RAM boards.  The non-DP RAM boards are
used to make a processor farm while the DP RAM boards interface
with the 68020s. Since you're only interested in the DP boards,
I'll just talk about my experiences with them.  We can save the
non-DP boards for another time if you're interested. 

     At the time we selected the hardware (around January 1989), 
there were only 2 clear alternatives: the Inmos B011 (developed 
by Tadpole and sold by them for a while as the Tadpole TPSC) and 
the Paracom (Parsytec) BBK-V2.  Since then, some other players 
have appeared: the Inmos B016 (brand new, just out) and the 
Archipel Voltex-1/V (a new French firm; I found their ad in 
Parallelogram).


Inmos B011:

     We bought one of these from Tadpole when it was just out (it 
came with a T414 and was socketed for a C004 but you couldn't get 
one yet!).  I upgraded it to a 20MHz T800 with a little help from 
Inmos.  The B011 has some good points and some bad points.

     On the good side, the DP memory is fairly fast for the T800
(approx 2-ws).  Computationally, it outperforms my BBK-V2 due to
the faster off-chip RAM.  It has a simple bus arbiter so you can
place it in VMEbus slot 0 as your system controller.  It has 2
TRAM slots (though I don't think many people use them).  It also
has a pair of RS-232 ports, a reset switch on the front panel,
and several status LEDs like Run and Error.  It can do
A32/A24/D32/D8 transfers on the bus.  The memory has a parity bit. 
The DP RAM is 2MB; it also has sockets for 256KB of EPROM. A link
adapter is also provided that is mapped to the VMEbus; it was
mainly put on the board to allow a Sun or other VMEbus host to
communicate with the T800 in a PC-like way (a la B004, B008) to
allow some software compatibility, but it is very low speed
(150-300KB/sec). 

     Now for the bad side.  VME interrupts are very limited; it
can handle any of the 7 levels (jumper selected) but can't
request any!  VMEbus transfers are limited to programmed I/O (no
block transfer or BLT mode).  The literature says it can do D16
but actually it can only do D32/D8.  There is no byte shifter. 
If you do transfers with a 68020, you'll find you have big/little
endian headaches. The T is a little endian machine (least
significant byte in a word is byte 0) while the 680x0 class is a
big endian machine (most significant byte in a word is byte 0). 
If you transfer mainly 32-bit data (integers or reals), you'll
have to correct the order of bytes in a word in software.  For
our application, this was a major concern since we needed to
transfer and process 1MB (256K 32-bit words) per second; with
the byte order corrected in software, it was impossible to reach
these speeds. 
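
     As a rough illustration of the software correction described
above, here is a minimal sketch in C.  The names swap32 and
swap32_buffer are made up for this example, and the <stdint.h>
types are a modern convenience, not something from the compilers
discussed in this post.  It only shows what "correcting the order
of bytes in a word" means; it says nothing about how fast any
particular board can do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word
   (big endian <-> little endian). */
static uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) <<  8) |
           ((w & 0x00FF0000u) >>  8) |
           ((w & 0xFF000000u) >> 24);
}

/* Correct a whole buffer in place.  This per-word loop is the
   software cost that a hardware byte shifter would remove. */
static void swap32_buffer(uint32_t *buf, size_t nwords)
{
    size_t i;

    assert(buf != NULL || nwords == 0);
    for (i = 0; i < nwords; i++)
        buf[i] = swap32(buf[i]);
}
```

     For the 1MB/sec case in the text, swap32_buffer would have to
process 256K words every second on top of everything else the
processor is doing.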


BBK-V2:

     We decided on the BBK-V2 and bought two boards for initial 
development work.  I've been working with them since September 
1989 to determine what they can and cannot do.  The BBK-V2 has 
some advantages over the B011 but also has its own set of 
problems.

     On the good side, the BBK-V2 has much improved interrupts.
It can handle or request any of the 7 levels of interrupts
(jumper selected).  It includes one memory mapped (mailbox)
interrupt.  The VMEbus interface is a little faster than the
B011's.  It has a byte shifter in hardware that can be enabled or
disabled (through jumpers). It also has 2MB DP RAM and 256KB
EPROM space.  Another feature which may be useful is the set of
RS422 drivers/receivers to drive the links differentially. 

     The bad side is as follows.  The DP RAM is 3-ws (!).  I 
ordered a T800-25 on mine (you can get a -17, -20, or -25) with 
80nsec access time DRAM, but the DP interface between the T800 
and RAM slows things down considerably, even when no VMEbus 
activity is present (the B011 can easily beat it on off-chip 
memory intensive applications).  There is no parity on the DP RAM 
(I don't use the parity on the B011 anyway).  There are no status 
indicators on the front panel; only 4 large Lemo connectors for 
the links.  There is no reset button.  (It might sound funny, but 
I like boards with LEDs.  They really help in system testing to 
tell if anything is going on or not.  At least the B011 had some 
indicators but the BBK-V2 leaves you with a blank stare.)  There 
are no RS232 ports (usually not a problem).  VMEbus transfers 
still use programmed I/O, but you can do D16.  

     The link interfaces are non-Inmos standard which makes 
it difficult to interface to Inmos-style boards (e.g. Inmos, 
Transtech, CSA, Microway, others).  Paracom has a reset input/output 
associated with each link (great for fault-tolerant 
investigations but it poses problems for general use).  Normally, 
I only use one reset input from a motherboard.  Paracom decided 
that analyze didn't do anything so they hardwired it to ground.  
Because reset reinitializes the external memory interface (and 
stops refresh to your DRAM), they also included off-chip refresh 
to handle data retention.  However, analyze does save the state 
of the processor (some register values); some post-mortem 
debuggers will let you display these values.  I'm not sure if 
these register values mean anything if the T was reset instead of 
analyzed; it makes it difficult to tell if your debugger is 
lying to you or not.


Basic Problems with Both Boards:

     The basic problem with both boards tends to be the VMEbus
interface (IF) and the implementation of dual-ported RAM.  The IF
does not support BLT mode, only programmed I/O.  In programmed I/O,
every word requires an address (e.g. A-D-A-D-A-D-...); in BLT
mode, you send the address once which gets latched on the target
board, then each move of data causes counters on the other board
to increment (e.g. A-D-D-D-D-D-...). Don't get confused with the
Paracom literature saying they support fast block transfers with
the transputer; you can use the transputer's move instruction to
do block transfers but you get programmed I/O behavior on the
VMEbus, not BLT. I did a number of I/O transfer tests to see how
fast I could transfer 1MB of raw data out the four links to four
neighboring transputers.  The program had five processes on the
VMEbus T: one that transferred the data over the IF and four that
sent the data out the links (1 for each link).  I used a rotating
pool of 5 buffers and pointers to the buffers so that the data
didn't get copied between process buffers (as it would under
strict occam conventions).  This helped to reduce contention on
the local bus and increase overall transfer rates. 
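
     The rotating pool of 5 buffers can be sketched roughly as
follows.  This is a single-threaded C model, not the actual test
program: the names (pool_t, pool_claim_for_fill, etc.) and the
buffer length BUF_WORDS are invented for illustration, and on a
real transputer the bus process and the four link processes would
synchronize over channels rather than by polling a state array.
The point is only that the data stays put and pointers rotate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUF      5     /* pool size from the text */
#define BUF_WORDS 256   /* hypothetical length in 32-bit words */

typedef enum { BUF_FREE, BUF_FULL, BUF_SENDING } buf_state_t;

typedef struct {
    uint32_t    data[NBUF][BUF_WORDS]; /* data never moves from here */
    buf_state_t state[NBUF];
    unsigned    next_fill;             /* rotates 0..NBUF-1 */
    unsigned    next_send;             /* rotates 0..NBUF-1 */
} pool_t;

static void pool_init(pool_t *p)
{
    unsigned i;

    for (i = 0; i < NBUF; i++)
        p->state[i] = BUF_FREE;
    p->next_fill = 0;
    p->next_send = 0;
}

/* Bus side: next buffer to fill, or NULL if it is still in use. */
static uint32_t *pool_claim_for_fill(pool_t *p)
{
    if (p->state[p->next_fill] != BUF_FREE)
        return NULL;
    return p->data[p->next_fill];
}

/* Bus side: mark the claimed buffer as filled and rotate. */
static void pool_commit_fill(pool_t *p)
{
    assert(p->state[p->next_fill] == BUF_FREE);
    p->state[p->next_fill] = BUF_FULL;
    p->next_fill = (p->next_fill + 1) % NBUF;
}

/* Link side: take a pointer to the oldest filled buffer. */
static uint32_t *pool_take_for_send(pool_t *p)
{
    unsigned i = p->next_send;

    if (p->state[i] != BUF_FULL)
        return NULL;
    p->state[i] = BUF_SENDING;
    p->next_send = (i + 1) % NBUF;
    return p->data[i];
}

/* Link side: hand the buffer back once the link output is done. */
static void pool_release(pool_t *p, const uint32_t *buf)
{
    unsigned i;

    for (i = 0; i < NBUF; i++) {
        if (p->data[i] == buf) {
            assert(p->state[i] == BUF_SENDING);
            p->state[i] = BUF_FREE;
            return;
        }
    }
    assert(!"buffer does not belong to this pool");
}
```

     With 5 buffers, one can be filling over the bus while up to
four are being sent, one per link, and no word is copied twice
within the local memory.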

     I tried pulling the raw data over the bus with a T
move instruction, pushing the data with an external 68020 and 
using interrupts to the T when done, and having the T move the 
data out the links and over the bus with the links' DMA units.
The fastest time was achieved with a 68020 moving the data into 
the T (VMEbus writes are faster than reads); however, moving the 
data with the DMA units was only a little worse.  I also did the 
tests at 10Mbps and 20Mbps (except that on the BBK-V2, 20Mbps at 
TTL levels is too noisy and I didn't have any differential 
interfaces to my TRAMs on hand).  In general, VMEbus activity 
peaked at 4MB/sec since I could only transfer 1 32-bit LWORD in 1 
microsecond for a sustained period.  This is a far cry from 
VMEbus' 40MB/sec advertised transfer rate.  Also, the amount of 
pipelined link activity you create is definitely a function of 
your software AND the link speeds AND the speed of the VMEbus -- 
not a simple test.

     This brings up a little problem.  Since a T800 has four 
links (1.75MB/sec max speed each, unidirectional), I should be 
able to transfer up to 7MB/sec from one node to four others.  But 
my VMEbus IF is limited to 4MB/sec.  Bitch!  Now you know why I 
wish they had implemented BLT.  But if I had it, could my 
software keep the links busy all the time?  Probably not, so I 
would need to wait for the H1(!).
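
     The arithmetic behind those two numbers is simple enough to
put in a few lines of C.  The function names are made up for this
sketch, and 1MB is taken as 10^6 bytes so that bytes per
microsecond and MB/sec are the same number.

```c
#include <assert.h>

/* Sustained bus rate in MB/sec when one 32-bit LWORD (4 bytes)
   moves every cycle_ns nanoseconds. */
static double lword_rate_mb_per_sec(double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return 4.0 * 1000.0 / cycle_ns;
}

/* Aggregate outbound demand of nlinks links, each running at
   per_link_mb MB/sec in one direction. */
static double link_demand_mb_per_sec(int nlinks, double per_link_mb)
{
    assert(nlinks >= 0);
    return nlinks * per_link_mb;
}
```

     lword_rate_mb_per_sec(1000.0) gives the 4MB/sec measured on
the bus, while link_demand_mb_per_sec(4, 1.75) gives the 7MB/sec
the four links could take, so the bus interface can feed at most
about 4/7 of what the links could carry.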

     Another problem.  The T800 doesn't implement a test-and-set
instruction like the TAS on a 680x0.  The 680x0 TAS is mainly
used for semaphore mechanisms on shared bus architectures to
communicate with other processors (you don't need this on a T,
right?).  It maps to the Read-Modify-Write (RMW) cycle on the
VMEbus.  The B011 allows RMW cycles into its DP RAM but can't
produce them.  The BBK-V2 doesn't allow them at all.  Without
them, synchronization with multiple 680x0s on a shared bus is
difficult to damn near impossible.  Therefore, we came up with a
mechanism for data transfer using semaphores and interrupts, but
limited to the T800 board and one 680x0 board.  For each
semaphore, one processor may only set it while the other may only
clear it; two sets are used for two-way communications.
Interrupts are also used in certain modes.  All messages must be
coordinated with the interface software on the 680x0.  This may
lead to multiple transfers of data on the VMEbus, though an
indirect mechanism using an address pointer can reduce this
problem.  Not an elegant solution... 
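
     A minimal sketch of the set-by-one, clear-by-the-other idea
follows.  It is not our actual interface software: the structure
layout and the names (mailbox_t, mbox_post, mbox_take) are
assumptions made for illustration, and the interrupt side is left
out.  In a real system the mailbox would live in the DP RAM and
each side would be compiled for its own processor.

```c
#include <assert.h>
#include <stdint.h>

/* One direction of the handshake.  'volatile' because the other
   processor changes these words behind the compiler's back. */
typedef struct {
    volatile uint32_t full;    /* set ONLY by the sender,
                                  cleared ONLY by the receiver */
    volatile uint32_t addr;    /* indirect pointer to the message */
    volatile uint32_t length;  /* message length in bytes */
} mailbox_t;

/* Two-way communication uses one mailbox per direction. */
typedef struct {
    mailbox_t t_to_68k;
    mailbox_t m68k_to_t;
} mailbox_pair_t;

/* Sender: post a message descriptor.  Returns 1 on success,
   0 if the previous message has not been taken yet. */
static int mbox_post(mailbox_t *m, uint32_t addr, uint32_t length)
{
    if (m->full)
        return 0;
    m->addr   = addr;
    m->length = length;
    m->full   = 1;   /* written last, after the descriptor */
    return 1;
}

/* Receiver: take a message descriptor.  Returns 1 on success,
   0 if nothing is waiting. */
static int mbox_take(mailbox_t *m, uint32_t *addr, uint32_t *length)
{
    assert(addr != 0 && length != 0);
    if (!m->full)
        return 0;
    *addr   = m->addr;
    *length = m->length;
    m->full = 0;     /* hands the slot back to the sender */
    return 1;
}
```

     Since each flag has exactly one setter and one clearer, the
two processors never race to make the same transition, which is
why no RMW cycle is needed.  It also shows why the scheme does not
extend to several 680x0s sharing one mailbox.  Passing addr rather
than the data itself is the indirect mechanism mentioned above.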

     Yet another problem.  The dual-ported memory architecture of 
both boards is not real dual-ported memory but "shared" memory.
Both ports cannot be active to separate address locations 
concurrently.  If the VMEbus side has the memory and the T800 
tries to access it, additional wait states are inserted on the 
T800.  Ditto for the other way around.  Some of our 
VMEbus/680x0/Hardware Grunts had a cow over this; my I/O 
benchmarks confirmed that for our throughput rates, it had little 
effect.  Things such as the program's concurrency, who moved the 
data over the bus, and the organization of data in the offboard 
memory were of greater concern.

     Yet yet another problem.  The DP RAM can be mapped to almost 
any address on the VMEbus.  Accessing the VMEbus from the T is 
another problem.  Both boards use a windowing scheme that maps a 
portion of the VMEbus within a smaller window of addresses on the 
T.  There are several different windows that map onto the VMEbus 
but with different transfer methods (some are D32, some are 
D8(E), some are D8(O)).  It's a hokey way of doing things and can 
get real confusing.  Once you figure out a configuration, don't 
change it!
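
     To give a feel for what a windowing scheme involves, here is
a generic sketch in C.  Everything specific in it is hypothetical:
the 1MB window size, the idea that the window is selected by a
page base, and the names win_map_t and vme_to_window are invented
for illustration and are not taken from either board's manual.

```c
#include <assert.h>
#include <stdint.h>

#define WIN_SIZE 0x00100000u   /* hypothetical 1MB window */

typedef struct {
    uint32_t page;   /* VMEbus base the window must be pointed at */
    uint32_t local;  /* T-side address to use once it is set */
} win_map_t;

/* Split a VMEbus address into the window page it falls in and
   the T-side address that reaches it through a window that
   starts at win_base in the T's address space. */
static win_map_t vme_to_window(uint32_t win_base, uint32_t vme_addr)
{
    win_map_t m;

    assert((win_base & (WIN_SIZE - 1u)) == 0u);  /* aligned base */
    m.page  = vme_addr & ~(WIN_SIZE - 1u);
    m.local = win_base + (vme_addr & (WIN_SIZE - 1u));
    return m;
}
```

     With several windows (one per transfer method), the same
VMEbus location shows up at a different T-side address in each,
which is where the confusion tends to come from.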


B016:

     I don't have a B016 but I did have some say in its design.
Dave Boreham from Inmos put a mail message on the OUGBB asking 
for comments on VMEbus boards around last March.  I got into an 
intense conversation on the problems with the BBK-V2 and B011 and 
what should appear on a VMEbus Master board with DP RAM.  It 
looks like he solved a lot of the problems.  I want one but our 
funding just got cut so I can't have one until it reappears.

     The B016 has 4MB of RAM (hopefully really dual ported)
expandable to 16MB when the denser RAMs are in.  Byte shifter
included.  It also has 128KB of private static memory not
accessible from the VMEbus. This is great since the other boards
have to have program and data either on-chip (precious) or off-chip
(in the shared memory). Thus, even processor instruction fetches
with the other boards can be influenced by VMEbus activity.  Not
so with private memory.  It also has 256KB (size right?) of Flash
EEPROM for program storage and two RS232 ports on a 2671. 

     Interrupts are similar to the BBK-V2 but with several 
mailboxes (how many I don't know).  BLT is supported (YES!).
RMW is not, nor are unaligned transfers (UAT; not on any other 
board known to man either).  The board uses a T801-25 with VERY 
fast dynamic RAM; they talk LWORD cycle times of 200nsec on the 
bus (at least).  That's good for 20MB/sec transfer rates!  It 
sounds like a super board.


Voltex-1/V:

     I have very little information on this one.  I think it
looks like a BBK-V2 but it has up to 4 link adapters on it.
Supports standard Inmos link/system services specs. Similar
memory size (2MB?), EPROM, etc.  They advertise the fastest
transfer rate over the VMEbus of 1.3M LWORDs per second (that's
because the B016 isn't really official yet).  That comes to
5.2MB/sec or 1 LWORD xfer in 0.77 microseconds compared to 1.00
microseconds on average for the others (though I've seen
0.82-0.86 for some words with a VMEtro bus analyzer on the
BBK-V2).  It must not use BLT mode either. 


Software:

     We've been using Logical Systems C (LSC) for over a year. I
really enjoy it.  I spent 3 years working with Occam (from the
old Occam1 VAX compiler through D700D on the PC).  I enjoy an
environment for transputers which has its roots in the basic
foundation of software development and which does not require you
to relearn a new foundation.  I had a love-hate relationship with
occam from the start: I loved the PAR/SEQ/ALT/PAR i/ALT i
constructs and the folding editor, and hated the libraries (or
lack thereof), the crude data types, and the strange TDS
environment (aside from the folds).  I tried 3L
C before it was 3L Parallel C and was not impressed.  I've read
about 3L Parallel C and have no desire to move away from LSC. 
I've also read up on Parsec C (good article in BYTE Jan 1990 from
Dick Pountain). Again, no desire to move.  I guess I like using
my own editor, having a real cross-compiler that can run on ANY
host, and then executing my software on a transputer target at
the very end. It's also easy for other software types familiar
with traditional software development tools to pick up and use
("the transputer's just really another processor but with those
links!"). 

     Maybe I'm too biased about LSC, though everyone I talk to
who uses it finds it to be very nice.  I've recently helped
with Beta testing of 89.1, found some problems, and offered to
fix some problems or make some enhancements.  The product has a sort
of cult following of C and transputer enthusiasts that have
contributed towards its development.  I don't know if the other
language products have this kind of following or not. 

     I looked briefly at Helios and Trollius.  I was very 
concerned with the new environment, the use of RAM, the transfer 
rates with message routing, etc.  We really needed a fast, tailor 
made transport system for an embedded application that had to 
handle a good volume of data with few unknowns (otherwise you 
would always question whether the operating system was at fault).

     I also looked at Express, bought an early version, and came 
away with a bad taste in my mouth.  Lots of software errors, 
seemed to be marketed for scientific computing vs. real-time 
embedded stuff, was limited to mesh and hypercube architectures, 
etc.

     Debugging for LSC may improve dramatically REAL SOON NOW.


Other Options:

     From your brief explanation of your project, it sounds like 
you're trying to link VMEbus crates with transputer links.  Have 
you looked at any of the more traditional VMEbus crates (e.g. 
from Ironics or CES)?  We had picked CES's system to interlink 
3-4 crates.  I don't know all of the details but it looked like a 
nice system.  Supported up to 8MB/s xfer rate, had 
memory-management units to allow boards and busses to be mapped 
into a system that looked like 1 big VMEbus (but was implemented 
with several physically separate busses).  If you're not using 
transputers for application processing (e.g. no processor farm or 
pipeline of 50 or more for compute intensive algorithms), you 
wouldn't need to invest in transputers, hit the learning curve, 
and get used to all sorts of new things (don't tell Inmos this).

     Another possibility might be CSA's PART.8.  It's an 
interesting little beast.  It's a VMEbus Slave interface (no 
processor) which contains 6 link adapter interfaces: 2 are dumb 
polling types but the other 4 have FIFOs and interrupt support.
You can do 1MB/sec+ transfers over links with them.  Links are 
differential so you can separate the crates (probably up to 30m
at 10Mbps).  Add one of these to a 680x0 board and with suitable 
software, it can behave just like a transputer (2 board solution 
vs. 1 chip solution !?!?!?!).  You can program in your standard 
680x0 environment and not need to program a T (again, don't tell 
Inmos this but it is another engineering solution).

     I hope this has been of some help.  I better get to work on 
some real work today.  Keep in touch.


Ken Koontz
Johns Hopkins University
Applied Physics Laboratory
Johns Hopkins Rd.  MS 6-41
Laurel, MD 20723

Tel: (301)953-6328
FAX: (301)953-1093
email: koontz@aplvax.jhuapl.edu  OR
       koontz@capvax.jhuapl.edu