tjt@cbnewsh.att.com (timothy.j.thompson) (05/25/89)
Does anyone have any experience with running multiple Sun CPUs in a single VME backplane? We're trying to do that, and have had a little success, but there is a critical problem which we haven't been able to solve. If anyone can provide any information at all about running multiple Sun CPUs in a single VME backplane, we would be very appreciative. We'd also be interested to know if newer Suns (e.g. the 4/110) would work better in this type of arrangement. Enclosed is a detailed description of our experience so far.

Tim Thompson
AT&T Bell Labs/Holmdel/NJ
tjt@twitch.att.com

======================================================================

1. General Description of the System

Our application requires that we install several SUN processors running SunOS 4.0 in one VME backplane. One of these processors (a 3/280) is used for overall system control, and it operates the various disk controllers and most of the other VMEbus peripherals. The other SUN processors (3/110 at the moment) are diskless clients running dedicated real-time application programs (under SunOS); they occasionally need to communicate over the VMEbus with various custom boards of our own design.

2. VMEbus Configuration

We modified a standard SUN 3/280S in the following ways:

a) the 3/280 board was moved from slot 1 of its card cage to slot 2.

b) the 3/280 board was jumpered to be a bus requester only.

c) a Mizar system controller was installed in slot 1 of the 3/280 card cage.

d) a Performance Technologies, Inc. PT-VME902A VMEbus repeater was installed in slot 3 of the 3/280 card cage.

e) the "non-host" side of the bus repeater was installed in slot 1 of a second 21-slot VME backplane.

f) a SUN 3/110 processor was installed in slot 2 of the second VME backplane. It was jumpered to be a VME requester only, and never to field VMEbus interrupts. It boots over Ethernet, and the SunOS it runs does not enable DVMA, so there is no DVMA collision with the 3/280.

3. The First Problem

In the system as configured above, application programs on each processor must be able to access the VMEbus without interfering with each other. When initially tested, however, this was not the case. A scenario causing failure is as follows: assume the 3/280 is busy accessing the VMEbus in order to operate the system disk (i.e., writing the disk controller registers, etc.). Simultaneously, on the 3/110 a program mmaps in a portion of the VMEbus and performs an access which causes a bus error. In most cases the SunOS ON THE 3/280 will crash with a "panic: bus error". (The 3/110 SunOS won't crash, since the bus error was caused by a user process.)

To explain this behavior, I hypothesize the following: the VMEbus master logic in each SUN has a bus timeout period associated with it (737 microseconds in the case of the 3/110, according to the "User's Guide to the SUN-3/100 VMEbus" publication). Thus, when the 3/110 causes a bus error it locks the VMEbus for at least 737 microseconds. Ordinarily, such a timeout period would apply to bus cycles which have been initiated (i.e., address strobe and one or both data strobes asserted) but not completed (i.e., no DTACK received). In the case of the SUN, though, I believe this timeout period begins when the SUN tries to become VMEbus master, not when it initiates a VMEbus cycle. If the 3/280 wants the VMEbus at the same time, it will also start its bus timeout timer (I don't know the exact value of the 3/280's timeout period). If the 3/280's timeout period is equal to or shorter than that of the 3/110, it is likely that the 3/280 will get a bus error (and SunOS will crash if the access was from the kernel) simply because it can't get the VMEbus in time.
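The timing argument in this hypothesis can be written out as a small model. This is only an illustrative sketch, not anything taken from the Sun hardware: the function name and the candidate 3/280 timeout values are made up for the example (the real 3/280 value is unknown), and it assumes, as the hypothesis does, that a master's timer starts at the moment it requests the bus.

```python
# Sketch of the hypothesis: each master's timeout starts when it
# requests the bus, not when its cycle begins. Times are in microseconds.

T_110 = 737  # 3/110 bus timeout, from the Sun-3/100 VMEbus guide cited above


def waiting_master_gets_berr(bus_held_us, waiter_timeout_us):
    """Return True if a master that requests the bus at t=0 times out
    before the current holder releases the bus at t=bus_held_us.

    Assumption (from the hypothesis): the waiter's timer runs from the
    bus request, so time spent waiting for a grant counts against it.
    """
    return waiter_timeout_us <= bus_held_us


# The 3/110 touches a nonexistent address and holds the bus until its
# own timeout expires. The 3/280 timeouts tried here are hypothetical.
for t_280 in (500, 737, 1000):
    print(t_280, waiting_master_gets_berr(T_110, t_280))
```

Under this model the 3/280 is safe only if its timeout is strictly longer than the time the 3/110 can hold the bus, which is the condition stated in the paragraph above.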

To test this hypothesis, I built a simple bus timer which measures VMEbus accesses, and issues BERR (bus error) if the access takes longer than 128 microseconds, the notion being that if a VMEbus slave takes longer than 128 microseconds to respond, it probably isn't there. With this board in the expansion backplane, I was no longer able to crash SunOS on one processor by causing a VMEbus timeout error with another SUN processor.

4. The Current Problem

The next problem I've run into is apparently an interaction between the SUN 3/280's on-board Ethernet interface and the 3/280's VMEbus interface. Here's the test I ran: on the 3/110, a program is run which continuously reads a VMEbus memory board (a standard commercial product). On the 3/280 I run two programs in parallel: one reads the VMEbus continuously, and the other 'tar's a big remotely-mounted NFS filesystem to /dev/null. One 3/280 program causes lots of VMEbus activity (along with the program running on the 3/110), and the other 3/280 program causes lots of received Ethernet packets. The result is that the 3/110 program will get random bus errors, at a rate of perhaps 5 per minute. The errors cease when the 3/280 stops VMEbus accesses or Ethernet activity subsides. In no case are any of the programs on the 3/280 affected.

One seemingly reasonable hypothesis is that the VMEbus control logic and the Ethernet logic on the 3/280 are constructed in such a manner that, occasionally, the 3/280 acquires the VMEbus to do a cycle and the on-board Ethernet logic delays the start of the cycle for a considerable period of time (>737 microseconds, since the 3/110 gets a bus error). Note that I say it delays the start of the cycle; it can't delay the cycle once it has started, or the add-on board I built (the 128 microsecond bus timer) would have issued BERR and aborted the 3/280's bus cycle (crashing UNIX, in all probability).
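The distinction between delaying the start of a cycle and delaying the cycle itself can be made concrete with a toy model of the two timers. This is a simplified sketch under stated assumptions: the function name and event model are invented for illustration, the 3/110's timer is assumed to run from its bus request, and the add-on timer is assumed to see only the strobe-to-DTACK portion of a cycle.

```python
# Sketch: which master sees BERR when the 3/280 holds the bus idle
# before starting its cycle. Times are in microseconds.

CYCLE_WATCHDOG = 128   # add-on bus timer: strobes asserted -> DTACK
WAITER_TIMEOUT = 737   # 3/110 timeout, assumed to run from its bus request


def who_gets_berr(grant_to_strobe_us, strobe_to_dtack_us):
    """The 3/280 is granted the bus at t=0 while the 3/110 is waiting.

    grant_to_strobe_us: time the 3/280 owns the bus before starting a cycle
    strobe_to_dtack_us: duration of the cycle itself
    Returns the set of masters that see a bus error.
    """
    victims = set()
    # The add-on timer only watches the cycle proper and aborts it if too long.
    if strobe_to_dtack_us > CYCLE_WATCHDOG:
        victims.add("3/280")
        strobe_to_dtack_us = CYCLE_WATCHDOG  # cycle is cut off at 128 us
    # The waiting 3/110 sees the whole time the bus is unavailable to it.
    if grant_to_strobe_us + strobe_to_dtack_us > WAITER_TIMEOUT:
        victims.add("3/110")
    return victims


print(who_gets_berr(800, 1))   # long delay before the cycle starts
print(who_gets_berr(0, 900))   # long delay inside the cycle
print(who_gets_berr(0, 1))     # normal cycle
```

In this model only a delay before the strobes are asserted produces the observed symptom (errors on the 3/110, none on the 3/280); a delay inside the cycle would instead hit the 3/280 via the 128 microsecond timer.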

To fix this I tried building yet another piece of add-on VMEbus logic which, when installed just ahead of the 3/110, measures the elapsed time between the 3/110's bus request (BR3) and the receipt of a bus grant (BG3). If this time becomes excessive (e.g., >128 microseconds), it issues BCLR (bus clear), the hope being that this signal would cause the 3/280 to let go of the bus. A VMEbus master has the option of ignoring this signal, however, and the 3/280 did not seem to respond to it.

So this is a problem we haven't been able to fix. Any ideas?
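For anyone trying to reason about the request/grant monitor, its intended behavior can be sketched as follows. This is a rough model only: the function name is hypothetical, a master that honors BCLR is idealized as releasing the bus the instant BCLR is asserted, and the 3/110's timeout is assumed to run from its bus request.

```python
# Sketch of the BR3 -> BG3 monitor and why it fails when the bus holder
# ignores BCLR. Times are in microseconds.

GRANT_LIMIT = 128      # monitor threshold between BR3 and BG3
WAITER_TIMEOUT = 737   # 3/110 timeout, assumed to run from its bus request


def request_outcome(holder_keeps_bus_us, holder_honors_bclr):
    """The 3/110 asserts BR3 at t=0 while another master holds the bus.

    Returns (bclr_issued, grant_time_us, berr_on_110).
    """
    # The monitor asserts BCLR once the wait exceeds its threshold.
    bclr_issued = holder_keeps_bus_us > GRANT_LIMIT
    if bclr_issued and holder_honors_bclr:
        grant_time = GRANT_LIMIT          # idealized: immediate release on BCLR
    else:
        grant_time = holder_keeps_bus_us  # BCLR ignored or never needed
    return bclr_issued, grant_time, grant_time > WAITER_TIMEOUT


print(request_outcome(800, True))   # cooperative holder
print(request_outcome(800, False))  # holder ignores BCLR, as the 3/280 seems to
```

The second case matches what was observed: BCLR is asserted, the holder keeps the bus anyway, and the 3/110 still times out.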