[comp.sys.sun] Multiple Sun CPU's on a single VME backplane?

tjt@cbnewsh.att.com (timothy.j.thompson) (05/25/89)
Does anyone have any experience with running multiple Sun CPU's in a
single VME backplane?  We're trying to do that, and have had a little
success, but there is a critical problem which we haven't been able to
solve.  If anyone can provide any information at all about running
multiple Sun CPU's in a single VME backplane, we would be very
appreciative.   We'd also be interested to know if newer Suns (e.g. the
4/110) would work better in this type of arrangement.  Enclosed is a
detailed description of our experience so far.

        Tim Thompson   AT&T Bell Labs/Holmdel/NJ   tjt@twitch.att.com

======================================================================

1. General Description of the System

Our application requires that we install several SUN processors running
SunOS 4.0 in one VME backplane.  One of these processors (a 3/280) is used
for overall system control, and it operates the various disk controllers
and most of the other VMEbus peripherals.  The other SUN processors (3/110
at the moment) are diskless clients running dedicated real-time
application programs (under SunOS); they occasionally need to communicate
over the VMEbus with various custom boards of our own design.

2. VMEbus Configuration

    We modified a standard SUN 3/280S in the following ways:

      a) the 3/280 board was moved from slot 1 of its card cage
	 to slot 2.
      b) the 3/280 board was jumpered to be a bus requester only.
      c) a Mizar system controller was installed in slot 1 of the
	 3/280 card cage.
      d) a Performance Technologies, Inc. PT-VME902A VMEbus
	 repeater was installed in slot 3 of the 3/280 card cage.
      e) the "non-host" side of the bus repeater was installed in
	 slot 1 of a second 21-slot VME backplane.
      f) a SUN 3/110 processor was installed in slot 2 of the
	 second VME backplane.  It was jumpered to be a VME
	 requester only, and never to field VMEbus interrupts.
	 It boots over Ethernet, and the SunOS it runs does not
	 enable DVMA, so there is no DVMA collision with the 3/280.

3. The First Problem

In the system as configured above, application programs on each processor
must be able to access the VMEbus without interfering with each other.
When initially tested, however, this was not the case.  A scenario causing
failure is as follows: assume the 3/280 is busy accessing the VMEbus in
order to operate the system disk (i.e., writing the disk controller
registers, etc.).  Simultaneously, on the 3/110 a program mmaps in a
portion of the VMEbus and performs an access which causes a bus error.  In
most cases the SunOS ON THE 3/280 will crash with a "panic: bus error".
(The 3/110 SunOS won't crash since the bus error was caused by a user
process).

To explain this behavior, I hypothesize the following: the VMEbus master
logic in each SUN has a bus timeout period associated with it (737
microseconds in the case of the 3/110, according to the "User's Guide to
the SUN-3/100 VMEbus" publication).  Thus, when the 3/110 causes a bus
error it locks the VMEbus for at least 737 microseconds.  Ordinarily, such
a timeout period would apply to bus cycles which have been initiated
(i.e., address strobe and one or both data strobes asserted) but not
completed (i.e., no DTACK received).  In the case of the SUN, though, I
believe this timeout period begins when the SUN tries to become VMEbus
master, not when it initiates a VMEbus cycle.  If the 3/280 wants the
VMEbus at the same time, it will also start its bus timeout timer (I don't
know the exact value of the 3/280's timeout period).  If the 3/280's
timeout period is equal to or shorter than that of the 3/110, it is likely
that the 3/280 will get a bus error (and SunOS will crash if the access
was from the kernel) simply because it can't get the VMEbus in time.
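
The hypothesis above can be written down as a small timing model.  This is
only a sketch in Python, not a description of the actual Sun hardware: the
function name and the 500-microsecond timeout used for the 3/280 are made
up for illustration (the real 3/280 value is not known).  Only the
737-microsecond figure comes from the 3/110 documentation cited above.

```python
def waiter_gets_berr(hold_us, request_at_us, waiter_timeout_us,
                     timer_starts_at_request=True):
    """Return True if a master waiting for the bus would time out.

    hold_us:            how long the other master keeps the bus (from t=0)
    request_at_us:      when the waiting master asks for the bus
    waiter_timeout_us:  the waiting master's bus timeout
    """
    # Time the waiting master spends waiting for a bus grant.
    wait_us = max(0, hold_us - request_at_us)
    if timer_starts_at_request:
        # Hypothesized Sun behavior: the wait counts against the timeout.
        return wait_us >= waiter_timeout_us
    # Conventional behavior: the timer runs only once the cycle has
    # started, so waiting for the grant cannot cause a timeout by itself.
    return False


T_3110 = 737   # microseconds, from the 3/110 VMEbus guide
T_3280 = 500   # microseconds, ASSUMED; the real value is unknown

# The 3/280 asks for the bus 100 us into the 3/110's failing access,
# so it waits 637 us for the grant.
print(waiter_gets_berr(T_3110, 100, T_3280))
print(waiter_gets_berr(T_3110, 100, T_3280, timer_starts_at_request=False))
```

Under these assumed numbers the first call reports a timeout and the second
does not, which is the difference between the two timer behaviors
described above.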

To test this hypothesis, I built a simple bus timer which measures VMEbus
accesses, and issues BERR (bus error) if the access takes longer than 128
microseconds, the notion being that if a VMEbus slave takes longer than
128 microseconds to respond, it probably isn't there.  With this board in
the expansion backplane, I was no longer able to crash SunOS on one
processor by causing a VMEbus timeout error with another SUN processor.
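
The effect of the add-on timer can be illustrated with a rough model of how
long a single bus cycle ties up the bus.  This is a Python sketch under
simplifying assumptions: the function name is hypothetical, and a cycle is
assumed to end at whichever comes first of the slave's response, the
master's own timeout, or the add-on timer's limit.

```python
WATCHDOG_US = 128   # limit enforced by the add-on bus timer


def cycle_length_us(slave_response_us, master_timeout_us, watchdog_us=None):
    """Length of one bus cycle in microseconds.

    slave_response_us is None when no slave answers at that address.
    """
    limit = master_timeout_us
    if watchdog_us is not None:
        limit = min(limit, watchdog_us)
    if slave_response_us is not None and slave_response_us < limit:
        return slave_response_us   # normal cycle, ended by DTACK
    return limit                   # cycle ended by BERR


# A 3/110 access to a nonexistent address (737 us is its documented timeout).
print(cycle_length_us(None, 737))               # without the timer board
print(cycle_length_us(None, 737, WATCHDOG_US))  # with the timer board
```

With the board installed, a failing access holds the bus for 128
microseconds instead of 737, so another master waiting for the bus is
spared provided its own timeout is longer than 128 microseconds (an
assumption in the case of the 3/280).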

4. The Current Problem

The next problem I've run into is apparently an interaction between the
SUN 3/280's on-board Ethernet interface and the 3/280's VMEbus interface.
Here's the test I ran: on the 3/110, a program is run which continuously
reads a VMEbus memory board (a standard commercial product).  On the 3/280
I run two programs in parallel: one reads the VMEbus continuously, and the
other 'tar's a big remotely-mounted NFS filesystem to /dev/null.  One
3/280 program causes lots of VMEbus activity (along with the program
running on the 3/110), and the other 3/280 program causes lots of received
Ethernet packets.  The result is that the 3/110 program will get random
bus errors, at a rate of perhaps 5 per minute.  The errors cease when the
3/280 stops its VMEbus accesses or its Ethernet activity subsides.  In no
case are any of the programs on the 3/280 affected.

One seemingly reasonable hypothesis is that the VMEbus control logic and
the Ethernet logic on the 3/280 are constructed in such a manner that,
occasionally, the 3/280 acquires the VMEbus to do a cycle and the on-board
Ethernet logic delays the start of the cycle for a considerable period of
time (>737 microseconds, since the 3/110 gets a bus error).  Note that I
say it delays the start of the cycle; it can't delay the cycle once it has
started, or the add-on board I built (the 128 microsecond bus timer) would
have issued BERR and aborted the 3/280's bus cycle (crashing UNIX, in all
probability).
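
The argument about where the delay must occur can be checked with a small
model.  This is a Python sketch only: the function name and the sample
delays are hypothetical, the waiting master's timer is assumed to start at
its bus request (taken as t=0), and the add-on timer is assumed to measure
only from strobe assertion onward.

```python
def who_times_out(pre_cycle_delay_us, cycle_us,
                  watchdog_us=128, waiter_timeout_us=737):
    """Report which timers fire while one master owns the bus.

    pre_cycle_delay_us: time the bus owner holds the bus before asserting
                        its strobes (the hypothesized Ethernet stall)
    cycle_us:           how long the cycle would last once it has started
    """
    fired = []
    # The add-on timer sees only the cycle itself and cuts it short.
    if cycle_us > watchdog_us:
        fired.append("watchdog BERR on bus owner")
    effective_cycle_us = min(cycle_us, watchdog_us)
    # The waiting master sees the stall plus the (possibly shortened) cycle.
    if pre_cycle_delay_us + effective_cycle_us > waiter_timeout_us:
        fired.append("timeout BERR on waiting master")
    return fired


# Stall before the cycle starts: only the waiting 3/110 is hurt.
print(who_times_out(pre_cycle_delay_us=800, cycle_us=1))
# Stall during the cycle: the bus owner (the 3/280) would get the BERR.
print(who_times_out(pre_cycle_delay_us=0, cycle_us=800))
```

Only the first case matches what was observed (bus errors on the 3/110 and
none on the 3/280), which is why the delay is placed before the start of
the cycle.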

To fix this I tried building yet another piece of add-on VMEbus logic
which, when installed just ahead of the 3/110, measures the elapsed time
between the 3/110's bus request (BR3) and the receipt of a bus grant (BG3).
If this time becomes excessive (e.g., >128 microseconds), it issues BCLR
(bus clear), the hope being that this signal would cause the 3/280 to let
go of the bus.  A VMEbus master has the option of ignoring this signal,
however, and the 3/280 did not seem to respond to it.
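
The behavior of this second add-on board can be sketched as a check over a
trace of bus events.  This is a Python illustration only, not the board's
actual logic: the function name, the trace format and the sample times are
all hypothetical.

```python
def bclr_times(events, limit_us=128):
    """Return the times at which the monitor would assert BCLR.

    events: list of (time_us, signal) tuples in time order, where signal
            is "BR3" (request asserted) or "BG3" (grant received).
    A request still pending at the end of the trace is not reported,
    since the trace does not say how long it went on waiting.
    """
    asserted = []
    pending_since = None
    for t, sig in events:
        if sig == "BR3" and pending_since is None:
            pending_since = t
        elif sig == "BG3" and pending_since is not None:
            if t - pending_since > limit_us:
                # BCLR goes out when the limit is reached, not at the grant.
                asserted.append(pending_since + limit_us)
            pending_since = None
    return asserted


# One quick grant (40 us) and one slow grant (900 us).
trace = [(0, "BR3"), (40, "BG3"), (1000, "BR3"), (1900, "BG3")]
print(bclr_times(trace))
```

The sketch shows only when BCLR would be raised.  It says nothing about
whether the current bus master honors it, and as noted above the 3/280
did not appear to.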

So far, we haven't been able to fix this problem.  Any ideas?