koontz@aplvax.jhuapl.edu (Ken Koontz) (02/28/90)
>We are looking at putting together a system consisting of
>Transputers on VME cards linking together VME chassis, which
>contain other processor (68020) cards. ...
>I would be interested in anyone's experience with these or
>similar development systems, or any recommendations. Thanks.
> Phillip L. Shaffer shaffer@crd.ge.com
> GE Corporate Research & Development uunet!crd.ge.com!shaffer
> Building KW, Room D211
> P.O. Box 8, Schenectady NY 12301
_________________________________________________________________________

To : Phillip Shaffer, GE Corp R&D and TMAIL at large
From : Ken Koontz, JHU/APL
Subject: Experience with Transputer VME Boards

Dear Phillip,

I've been experimenting and developing with transputers on the VMEbus for over a year now. We have several Navy prototypes that are based on VMEbus and multiple 68020s or 030s that we are infusing transputer technology into. The most recent project uses a transputer array between a special-purpose processor and the general purpose 68020s to do some signal processing. Because of the input rate of the data, a special-purpose processor is required to sample and presort the data for the transputers.

To implement the array, we're using both Dual-Ported (DP) RAM type boards and non-DP RAM boards. The non-DP RAM boards are used to make a processor farm while the DP RAM boards interface with the 68020s. Since you're only interested in the DP boards, I'll just talk about my experiences with them. We can save the non-DP boards for another time if you're interested.

At the time we selected the hardware (around January 1989), there were only 2 clear alternatives: the Inmos B011 (developed by Tadpole and sold by them for a while as the Tadpole TPSC) and the Paracom (Parsytec) BBK-V2. Since then, some other players have appeared: the Inmos B016 (brand new, just out) and the Archipel Voltex-1/V (a new French firm, found their ad in Parallelogram).

Inmos B011:

We bought one of these from Tadpole when it was just out (it came with a T414 and was socketed for a C004 but you couldn't get one yet!). I upgraded it to a 20MHz T800 with a little help from Inmos. The B011 has some good points and some bad points.

On the good side, the DP memory is fairly fast for the T800 (approx 2-ws). Computationally, it outperforms my BBK-V2 due to the faster off-chip RAM. It has a simple bus arbiter so you can place it in VMEbus slot 0 as your system controller. It has 2 TRAM slots (though I don't think many people use them). It also has a pair of RS-232 ports, a reset switch on the front panel, and several status LEDs like Run and Error. It can do A32/A24/D32/D8 transfers on the bus. The memory has a parity bit. The DP RAM is 2MB; it also has sockets for 256KB of EPROM. A link adapter is also provided that is mapped to the VMEbus; it was mainly put on the board to allow a Sun or other VMEbus host to communicate with the T800 in a PC-like way (a la B004, B008) to allow some software compatibility, but it is very low speed (150-300KB/sec).

Now for the bad side. VME interrupts are very limited; it can handle any of the 7 levels (jumper selected) but can't request any! VMEbus transfers are limited to programmed I/O (no block transfer or BLT mode). The literature says it can do D16 but actually it can only do D32/D8.

There is no byte shifter. If you do transfers with a 68020, you'll find you have big/little endian headaches. The T is a little endian machine (least significant byte in a word is byte 0) while the 680x0 class is a big endian machine (most significant byte in a word is byte 0). If you transfer mainly 32-bit data (integers or reals), you'll have to correct the order of bytes in a word in software. For our application, this was a major concern since we needed to transfer and process 1MB (256K 32-bit words) per second; correcting the byte order in software made it impossible to reach these speeds.
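
For anyone who hasn't had to do it, here is a minimal sketch of what "correcting the order of bytes in software" amounts to. It is written in present-day C (the fixed-width types from <stdint.h> are an assumption; a 1990 compiler would not have them), and the names swap32 and swap32_buffer are made up for illustration, not taken from any board support library.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of one 32-bit word, converting between the
   big endian (680x0) and little endian (transputer) layouts. */
uint32_t swap32(uint32_t w)
{
    return (w >> 24)
         | ((w >> 8) & 0x0000FF00u)
         | ((w << 8) & 0x00FF0000u)
         | (w << 24);
}

/* Correct the byte order of a whole buffer in place, one word at a time.
   Every word costs a load, several shifts/masks, and a store. */
void swap32_buffer(uint32_t *buf, size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        buf[i] = swap32(buf[i]);
}
```

Calling swap32_buffer(buf, 262144) would cover one second's worth of data at the 256K words per second rate mentioned above. That per-word work comes on top of the transfer itself, which is why a hardware byte shifter matters.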

BBK-V2:

We decided on the BBK-V2 and bought two boards for initial development work. I've been working with them since September 1989 to determine what they can and cannot do. The BBK-V2 has some advantages over the B011 but also has its own set of problems.

On the good side, the BBK-V2 has much improved interrupts. It can handle or request any of the 7 levels of interrupts (jumper selected). It includes one memory mapped (mailbox) interrupt. The VMEbus interface is a little faster than the B011. It has a byte shifter in hardware that can be enabled or disabled (through jumpers). It also has 2MB DP RAM and 256KB EPROM space. Another feature which may be useful is the set of RS422 drivers/receivers to drive the links differentially.

The bad side is as follows. The DP RAM is 3-ws (!). I ordered a T800-25 on mine (you can get a -17, -20, or -25) with 80nsec access time DRAM, but the DP interface between the T800 and RAM slows things down considerably, even when no VMEbus activity is present (the B011 can easily beat it on off-chip memory intensive applications). There is no parity on the DP RAM (I don't use the parity on the B011 anyway). There are no RS232 ports (usually not a problem). There are no status indicators on the front panel; only 4 large Lemo connectors for the links. There is no reset button. (It might sound funny, but I like boards with LEDs. They really help in system testing to tell if anything is going on or not. At least the B011 had some indicators but the BBK-V2 leaves you with a blank stare.) VMEbus transfers still use programmed I/O, but you can do D16.

The link interfaces are non-Inmos standard, which makes it difficult to interface to Inmos-style boards (e.g. Inmos, Transtech, CSA, Microway, others). Paracom has a reset input/output associated with each link (great for fault-tolerant investigations but it poses problems for general use). Normally, I only use one reset input from a motherboard.
Paracom decided that analyze didn't do anything so they hardwired it to ground. Because reset reinitializes the external memory interface (and stops refresh to your DRAM), they also included off-chip refresh to handle data retention. However, analyze will save the state of the processor (some register values); some post-mortem debuggers will let you display these values. I'm not sure if these register values mean anything if the T was reset instead of analyzed; it makes it difficult to tell if your debugger is lying to you or not.

Basic Problems with Both Boards:

The basic problem with both boards tends to be the VMEbus interface and the implementation of dual-ported RAM. The IF does not support BLT mode, only programmed I/O. In programmed I/O, every word requires an address (e.g. A-D-A-D-A-D-...); in BLT mode, you send the address once, which gets latched on the target board, then each move of data causes counters on the other board to increment (e.g. A-D-D-D-D-D-...). Don't get confused by the Paracom literature saying they support fast block transfers with the transputer; you can use the transputer's move instruction to do block transfers but you get programmed I/O behavior on the VMEbus, not BLT.

I did a number of I/O transfer tests to see how fast I could transfer 1MB of raw data out the four links to four neighboring transputers. The program had five processes on the VMEbus T: one that transferred the data over the IF and four that sent the data out the links (1 for each link). I used a rotating pool of 5 buffers and pointers to the buffers so that the data didn't get copied between process buffers (a la strict occam conventions). This helped to reduce contention on the local bus and increase overall transfer rates. I tried three approaches: pulling the raw data over the bus with a T move instruction, pushing the data with an external 68020 and interrupting the T when done, and having the T move the data out the links and over the bus with the links' DMA units.
The fastest time was achieved with a 68020 moving the data into the T (VMEbus writes are faster than reads); however, moving the data with the DMA units was only a little worse. I also did the tests at 10Mbps and 20Mbps (except on the BBK-V2, where 20Mbps at TTL levels is too noisy and I didn't have any differential interfaces to my TRAMs on hand). In general, VMEbus activity peaked at 4MB/sec since I could only transfer one 32-bit LWORD per microsecond for a sustained period. This is a far cry from VMEbus' 40MB/sec advertised transfer rate. Also, the amount of pipelined link activity you create is definitely a function of your software AND the link speeds AND the speed of the VMEbus -- not a simple test.

This brings up a little problem. Since a T800 has four links (1.75MB/sec unidirectional max speed each), I should be able to transfer up to 7MB/sec from one node to four others. But my VMEbus IF is limited to 4MB/sec. Bitch! Now you know why I wish they had implemented BLT. But if I had it, could my software keep the links busy all the time? Probably not, so I would need to wait for the H1(!).

Another problem. The T800 doesn't implement a test-and-set instruction like the TAS on a 680x0. The 680x0 TAS is mainly used for semaphore mechanisms on shared bus architectures to communicate with other processors (you don't need this on a T, right?). It maps to the Read-Modify-Write (RMW) cycle on the VMEbus. The B011 allows RMW cycles into its DP RAM but can't generate them. The BBK-V2 doesn't allow them at all. Without them, synchronization with multiple 680x0s on a shared bus is difficult to damn near impossible. Therefore, we came up with a mechanism for data transfer using semaphores and interrupts, but limited to the T800 board and one 680x0 board. A particular processor may only set a semaphore while the other can only clear it; two sets are used for two-way communications. Interrupts are also used in certain modes. All messages must be coordinated with the interface software on the 680x0.
This may lead to multiple transfers of data on the VMEbus, though an indirect mechanism using an address pointer can reduce this problem. Not an elegant solution...

Yet another problem. The dual-ported memory architecture of both boards is not real dual-ported memory but "shared" memory. Both ports cannot be active to separate address locations concurrently. If the VMEbus side has the memory and the T800 tries to access it, additional wait states are inserted on the T800. Ditto for the other way around. Some of our VMEbus/680x0/Hardware Grunts had a cow over this; my I/O benchmarks confirmed that for our throughput rates, it had little effect. Things such as the program's concurrency, who moved the data over the bus, and the organization of data in the offboard memory were of greater concern.

Yet yet another problem. The DP RAM can be mapped to almost any address on the VMEbus. Accessing the VMEbus from the T is another matter. Both boards use a windowing scheme that maps a portion of the VMEbus within a smaller window of addresses on the T. There are several different windows that map onto the VMEbus but with different transfer methods (some are D32, some are D8(E), some are D8(O)). It's a hokey way of doing things and can get real confusing. Once you figure out a configuration, don't change it!

B016:

I don't have a B016 but I did have some say in its design. Dave Boreham from Inmos put a mail message on the OUGBB asking for comments on VMEbus boards around last March. I got into an intense conversation on the problems with the BBK-V2 and B011 and what should appear on a VMEbus Master board with DP RAM. It looks like he solved a lot of the problems. I want one but our funding just got cut so I can't have one until it reappears.

The B016 has 4MB of RAM (hopefully really dual ported), expandable to 16MB when the denser RAMs are in. Byte shifter included. It also has 128KB of private static memory not accessible from the VMEbus.
This is great since the other boards have to have program and data either on-chip (precious) or off-chip (in the shared memory). Thus, even processor instruction fetches with the other boards can be influenced by VMEbus activity. Not so with private memory. It also has 256KB (size right?) of Flash EEPROM for program storage and two RS232 ports on a 2671. Interrupts are similar to the BBK-V2 but with several mailboxes (how many I don't know). BLT is supported (YES!). RMW is not, nor are unaligned transfers (UAT, not on any other board known to man either). The board uses a T801-25 with VERY fast dynamic RAM; they talk LWORD cycle times of 200nsec on the bus (at least). That's good for 20MB/sec transfer rates! It sounds like a super board.

Voltex-1/V:

I have very little information on this one. I think it looks like a BBK-V2 but it has up to 4 link adapters on it. Supports standard Inmos link/system services specs. Similar memory size (2MB?), EPROM, etc. They advertise the fastest transfer rate over the VMEbus of 1.3M LWORDs per second (that's because the B016 isn't really official yet). That comes to 5.2MB/sec or 1 LWORD xfer in 0.77 microseconds, compared to 1.00 microseconds on average for the others (though I've seen 0.82-0.86 for some words with a VMEtro bus analyzer on the BBK-V2). It must not use BLT mode either.

Software:

We've been using Logical Systems C (LSC) for over a year. I really enjoy it. I spent 3 years working with Occam (from the old Occam1 VAX compiler through D700D on the PC). I enjoy an environment for transputers which has its roots in the basic foundation of software development and which does not require you to relearn a new foundation. I had a love-hate relationship with occam from the start; loved the PAR/SEQ/ALT/PAR i/ALT i constructs and the folding editor, hated the libraries (or lack thereof)/crude data types/strange TDS environment (aside from the folds). I tried 3L C before it was 3L Parallel C and was not impressed.
I've read about 3L Parallel C and have no desire to move away from LSC. I've also read up on Parsec C (good article in BYTE Jan 1990 from Dick Pountain). Again, no desire to move. I guess I like using my own editor, having a real cross-compiler that can run on ANY host, and then executing my software on a transputer target at the very end. It's also easy for other software types familiar with traditional software development tools to pick up and use ("the transputer's just really another processor but with those links!").

Maybe I'm too biased about LSC, though everyone that I talk to that uses it finds it to be very nice. I've recently helped with Beta testing of 89.1, found some problems, and offered to fix some problems or make some enhancements. The product has a sort of cult following of C and transputer enthusiasts that have contributed towards its development. I don't know if the other language products have this kind of following or not.

I looked briefly at Helios and Trollius. I was very concerned with the new environment, the use of RAM, the transfer rates with message routing, etc. We really needed a fast, tailor-made transport system for an embedded application that had to handle a good volume of data with few unknowns (otherwise you would always question whether the operating system was at fault). I also looked at Express, bought an early version, and came away with a bad taste in my mouth. Lots of software errors, seemed to be marketed for scientific computing vs. real-time embedded stuff, was limited to mesh and hypercube architectures, etc.

Debugging for LSC may improve dramatically REAL SOON NOW.

Other Options:

From your brief explanation of your project, it sounds like you're trying to link VMEbus crates with transputer links. Have you looked at any of the more traditional VMEbus crates (e.g. from Ironics or CES)? We had picked CES's system to interlink 3-4 crates. I don't know all of the details but it looked like a nice system.
It supported up to 8MB/s transfer rates and had memory-management units to allow boards and busses to be mapped into a system that looked like 1 big VMEbus (but was implemented with several physically separate busses). If you're not using transputers for application processing (e.g. no processor farm or pipeline of 50 or more for compute intensive algorithms), you wouldn't need to invest in transputers, hit the learning curve, and get used to all sorts of new things (don't tell Inmos this).

Another possibility might be CSA's PART.8. It's an interesting little beast. It's a VMEbus Slave interface (no processor) which contains 6 link adapter interfaces: 2 are dumb polling types but the other 4 have FIFOs and interrupt support. You can do 1MB/sec+ transfers over links with them. Links are differential so you can separate the crates (probably up to 30m at 10Mbps). Add one of these to a 680x0 board and, with suitable software, it can behave just like a transputer (2 board solution vs. 1 chip solution !?!?!?!). You can program in your standard 680x0 environment and not need to program a T (again, don't tell Inmos this but it is another engineering solution).

I hope this has been of some help. I better get to work on some real work today. Keep in touch.

Ken Koontz
Johns Hopkins University Applied Physics Laboratory
Johns Hopkins Rd. MS 6-41
Laurel, MD 20723
Tel: (301)953-6328
FAX: (301)953-1093
email: koontz@aplvax.jhuapl.edu OR koontz@capvax.jhuapl.edu