[comp.arch] Summary of responses to VME broadcast question

jim_d@cimcorMN.ORG (Jim Dahlberg) (03/18/89)
Here is a summary of the responses to my VME question regarding using
VME to broadcast to other processors.

As an aside, I was surprised that I received most of my replies via email.
I thought that this was supposed to be more of a forum for everyone, so
that everyone could benefit by the discussions.

Here is a summary of the original question:
|  I am working on a multiprocessor system on the VME bus.  [How can I]
|  *broadcast* data to the other processors' local memory?

===========================================================================

FROM markw@hpsal2.HP.COM

Try the following:

Dedicate some region of VME address space or a "User defined" address
modifier code (10-1F hex) to the broadcast address space.  To do a broadcast,
issue a write to the broadcast space.  Your VME processors will
accept the broadcast and no others will be affected.

Depending on what you want to use broadcasts for, you may want to
self-acknowledge (assert DTACK) the VME broadcast at the master
(if, for example, the broadcast is a reset, which must be reliable).  
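
As a rough illustration of the decode described above, here is a minimal
C sketch.  It assumes that one user-defined AM code (BCAST_AM, arbitrarily
0x10) is reserved for broadcast writes and that the master always
terminates its own broadcast cycles.  The function and type names are
made up for this example and do not come from any VME library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one user-defined AM code (range 0x10-0x1F) is set aside
   for broadcast writes.  The value 0x10 is an arbitrary pick. */
#define BCAST_AM 0x10u
#define AM_MASK  0x3Fu   /* VME address modifiers are 6 bits wide */

/* Slave-side decode: latch the data only for a write cycle that
   carries the broadcast address modifier. */
static bool slave_accepts_broadcast(uint8_t am, bool is_write)
{
    return is_write && (am & AM_MASK) == BCAST_AM;
}

typedef enum { DTACK_FROM_SLAVE, DTACK_FROM_MASTER } dtack_source;

/* For a broadcast cycle the master acknowledges itself; for any other
   cycle the addressed slave drives DTACK as usual. */
static dtack_source dtack_driver(uint8_t am)
{
    return ((am & AM_MASK) == BCAST_AM) ? DTACK_FROM_MASTER
                                        : DTACK_FROM_SLAVE;
}
```

With this decode, a write issued with AM 0x10 is latched by every
participating board, while an ordinary cycle such as standard
non-privileged data access (AM 0x39) is handled as before.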

===========================================================================

>From mcc.com!shamash!@MCC.COM:rfg

In article <646@cimcor.mn.org> you write:
>
>    I am working on a multiprocessor system which will have
>multiprocessors on the VME bus.  It will be necessary for the
>application software to broadcast data to the other processors' local
>memory.

I am particularly interested in this approach, i.e. what I would call
"firm" coupling, wherein each processor does in fact have local memory,
(so that the system in general has features of a loosely coupled system),
but where there is also a hardware-supported broadcast capability which
can be used to effectively simulate shared memory.

Unfortunately, I have yet to be able to convince anybody that this is
a good or useful idea.  (All of the people of this project are loose-
coupling bigots).

The particular case in which this approach would be a huge win would
be Ada applications.  As you may know, Ada's model of parallel
computation pretty much requires hardware which at least simulates
tight-coupling because of the language semantics which allow variables
to be directly shared between multiple (otherwise independent) tasks.

I suggested a multiprocessor system design including a unique broadcast
capability in my 1987 CS Master's thesis.  (Of course, because I went
to a small State college in California, nobody ever read it).

I'll try not to bore you too much, but, in a nutshell, here is the
idea I had.

Basically, on each "node" you would use a stock 68K MMU (68451?).  I
recall that in the specs for this part it said that there was a kind-of
spare bit in each page descriptor entry (either in memory or cached
in the MMU).  This extra bit was called the "shared" bit or the "don't
cache" bit or something.  Anyway, it was always driven out on one
of the MMU pins during each bus cycle.

I figured that you could use this bit (that is to say the MMU output
signal it produced) to tell the hardware whether or not the current
memory location being accessed was *replicated* on other nodes.  If
it was not, or if the operation was a read, then the whole transaction
is satisfied locally.  If however the replicated bit is set for the
current page, and if the operation is a write, then the memory interface
hardware detects this fact, and *only* then causes the write to be
done *both* locally, and also broadcast onto the global bus.

This approach assures that only those pages which must cause broadcasts
on writes will in fact do so.  Also, the particular set of pages which
cause broadcasting is, at all times, directly under control of the
operating system, and can be easily changed by the OS.
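
The sender-side rule above can be sketched in C roughly as follows.
This is only a software model of what would be hardware: the page size,
the table size, and all of the names (sender_map, route_access) are
assumptions chosen for the example, not details of any real MMU.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u   /* assumed 4 KB pages */
#define NUM_PAGES  16u   /* toy address space for illustration */

/* One "replicated" bit per page, maintained by the operating system. */
typedef struct {
    bool replicated[NUM_PAGES];
} sender_map;

typedef enum { LOCAL_ONLY, LOCAL_AND_BROADCAST } access_path;

/* Reads, and writes to non-replicated pages, stay local.  Only a
   write to a replicated page also goes out on the global bus. */
static access_path route_access(const sender_map *m, uint32_t addr,
                                bool is_write)
{
    uint32_t page = (addr >> PAGE_SHIFT) % NUM_PAGES;

    if (is_write && m->replicated[page])
        return LOCAL_AND_BROADCAST;
    return LOCAL_ONLY;
}
```

Because the decision depends only on the per-page bit, the OS can change
which pages generate bus traffic simply by rewriting the table.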

Now regarding the listening side.  I also envisioned that on each "node"
you would have a second MMU, just like the first, except that its
address input lines are hooked to the global bus.  It thus acts like
a global bus "snoop", waiting for global bus transactions which it
cares about before doing anything.

As with the "sender" MMU, this "receiver" MMU would have its own set
of mapping tables, which could be maintained by the operating system.
Also, just like for the sender, you could use the extra bit in each
of the mapping table entries to tell this receiver MMU that it needs
to do something special whenever it sees input addresses which fall
into this particular page.

The "something special" it would do for global bus transactions which
have addresses which fall onto the receiver MMU's "special pages"
would be to go ahead and actually "accept" the broadcast data,
and actually do the write locally (at the receiver-MMU mapped local
physical address).
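
A minimal C model of the receiver side might look like this.  It assumes
4 KB pages and a small table indexed by global-bus page number.  The
names (receiver_map, snoop_write) are hypothetical, and the caller is
assumed to make local_mem large enough for every mapped local page.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                 /* assumed 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16u                 /* toy global address space */

/* Receiver-side mapping table, maintained by the operating system. */
typedef struct {
    bool     accept[NUM_PAGES];      /* the "special page" bit */
    uint32_t local_page[NUM_PAGES];  /* local physical page number */
} receiver_map;

/* Snoop one global-bus write.  If the page is marked, perform the
   write at the mapped local address and return true; otherwise
   ignore the cycle and return false. */
static bool snoop_write(const receiver_map *m, uint8_t *local_mem,
                        uint32_t bus_addr, uint8_t data)
{
    uint32_t page   = (bus_addr >> PAGE_SHIFT) % NUM_PAGES;
    uint32_t offset = bus_addr & (PAGE_SIZE - 1u);

    if (!m->accept[page])
        return false;

    local_mem[m->local_page[page] * PAGE_SIZE + offset] = data;
    return true;
}
```

Note that the local page number need not match the bus page number, so
each node is free to place its copy of a replicated page wherever it
likes in local memory.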

This whole scheme allows you to simulate a shared memory machine on
what is mostly a loosely coupled machine, while minimizing global
bus traffic as much as possible.  Note that only writes (never reads)
go onto the global bus.  Also note that even for writes, the operating
system can set up the mapping tables so that only the writes which
go to a particular (OS-determined) set of pages ever cause global
bus traffic.  This reduces global bus traffic (and contention) even
further... to near the absolute minimum needed to implement simulated
shared memory via broadcast-based distributed replication of shared
variables.

This scheme in effect uses local main memory banks in much the same
way as caches are normally used in tightly coupled multiprocessor systems.
So what is the advantage of this scheme over typical caching?  Significantly
reduced contention for global resources (i.e. busses and/or memory)
because of the fine-grained control (i.e. page by page) of broadcasting.

Also, note that there is no reason that you could not use this scheme
and also use traditional caching hardware at the same time.  This would
give you a double win.

Well, I've said more than enough.  I'd like to know what you think of the
idea, and I'd be interested in finding out more about the system you
are planning to build.  If it is to have a broadcast capability, then
perhaps you will be looking for somebody to port an Ada compiler to
it someday. :-)
=========================================================================

>From shamash!wheaties.ai.mit.edu!sundar

This is in reference to your question regarding VME bus broadcasting. We 
looked at this problem over a year ago and there was no way you could do this
without bending the specs a whole lot. I also don't know of anyone
commercially who does this sort of thing.

I'd appreciate it if you could send me a note if you hear otherwise.

-Sundar

=========================================================================

>From shamash!uunet.UU.NET!mcvax!memex.co.uk!peter

It is feasible, but only if you know the details of ALL the boards
you are broadcasting to. VME was not designed for this, but if your
slowest board takes the data before the fastest one produces DTACK you
should be OK.
If your boards have CPUs on them which might hold up VME access to the local
memory for unpredictable amounts of time you may have problems.
A colleague here and I argue about this from time to time. He says, rightly,
that it is not strictly illegal to do this; I say it is against the spirit
of the VME spec.
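
The timing condition above (the slowest board must take the data before
the fastest board produces DTACK) can be written down as a simple check.
This is only a sketch: the board_timing structure and the idea of
describing each board by two worst-case figures in nanoseconds are
assumptions for illustration, and real numbers would have to come from
each board's data sheet.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-board figures, measured from the start of the cycle. */
typedef struct {
    unsigned latch_ns;  /* worst-case time for the board to take the data */
    unsigned dtack_ns;  /* earliest time the board could assert DTACK */
} board_timing;

/* Broadcast is safe only if every board has latched the data before
   any board can end the cycle by asserting DTACK. */
static bool broadcast_is_safe(const board_timing *b, size_t n)
{
    unsigned slowest_latch = 0;
    unsigned fastest_dtack = UINT_MAX;

    for (size_t i = 0; i < n; i++) {
        if (b[i].latch_ns > slowest_latch)
            slowest_latch = b[i].latch_ns;
        if (b[i].dtack_ns < fastest_dtack)
            fastest_dtack = b[i].dtack_ns;
    }
    return slowest_latch < fastest_dtack;
}
```

A board whose CPU can hold off VME access to local memory for an
unpredictable time has no usable worst-case latch_ns, which is exactly
the case where this check cannot be satisfied.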

What you really want is P896 FutureBus, which has broadcast write (and
broadcall read) in the spec.

>    If necessary, we can 'bend' the VME spec to allow this, since we
>are already planning to use a non-standard VME connector.  Also all the
>VME cards will be unique, so they don't have to conform to the standard.
>But we would like to stay with the standard as much as possible.

What do you mean by "unique"? I am curious why you want VME if you don't
use the standard connector and your boards are also "unique". This seems
to prevent you using any standard VME boards in your system.

	Peter Ilieve		peter@memex.co.uk

===========================================================================

    Jim Dahlberg
    Internet: jim_d@cimcor.mn.org
    UUCP:     uunet!rosevax!cimcor!jim_d