[comp.arch] Microcomputer Bus Multiprocessing

daveh@cbmvax.commodore.com (Dave Haynie) (03/22/91)

I'm looking for references, discussion, flames, whatever on various solutions
to multiprocessing, especially in relation to support for said on microcomputer
buses.

It looks like the typical answer for "who caches what" on a microcomputer is
somewhere between "the host processor is the only thing that can cache" and
"nobody caches any shared memory".  What kind of cache support protocols are
being used, if any?  The only bus I have any reference to that supports any
sort of cache coherency scheme, at the moment, is FutureBus+.  Which would
imply that at the moment, zero microcomputer systems solve this problem.

Along with cache problems, interrupts are another multiprocessor question.  If
a device issues an interrupt, which processor does it go to?  Most systems seem
to be saying "only the host processor".  Pre-Apple NuBus did have a solution
to this problem, but the solution was to simply eliminate normal
level-sensitive, easy-to-use interrupts.  An I/O device would then generate an
interrupt by mastering the bus and banging a magic location in the memory map
of the processor of choice.  Sure, it would work, but requiring bus master
capability isn't an easy way to build a $50 serial port card that supports
multiple masters.  Anyone doing it better (again, in the context of a micro)?
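
To make the "bang a magic location" scheme concrete, here is a minimal sketch
in C.  It is not the NuBus register layout: the doorbell array, the sizes, and
the names post_interrupt and take_interrupts are all hypothetical, and an
ordinary array stands in for the memory-mapped locations so the code can run
anywhere.  Each device gets its own byte per processor, so posting an
interrupt is a single write rather than a read-modify-write.

```c
#include <stdint.h>

#define NUM_CPUS    4   /* assumed number of processors */
#define NUM_DEVICES 8   /* assumed number of interrupting devices */

/* Hypothetical stand-in for the magic locations: one byte per
   (processor, device) pair.  On real hardware these would be
   memory-mapped addresses in each processor's slot space. */
static volatile uint8_t doorbell[NUM_CPUS][NUM_DEVICES];

/* Device side: master the bus and write to the chosen CPU's doorbell.
   Returns 0 on success, -1 if the target does not exist. */
int post_interrupt(unsigned cpu, unsigned device)
{
    if (cpu >= NUM_CPUS || device >= NUM_DEVICES)
        return -1;
    doorbell[cpu][device] = 1;      /* one plain bus write */
    return 0;
}

/* CPU side: collect and clear pending requests.
   Bit d of the result is set if device d interrupted this CPU. */
uint32_t take_interrupts(unsigned cpu)
{
    uint32_t pending = 0;

    if (cpu >= NUM_CPUS)
        return 0;
    for (unsigned d = 0; d < NUM_DEVICES; d++) {
        if (doorbell[cpu][d]) {
            doorbell[cpu][d] = 0;
            pending |= (uint32_t)1 << d;
        }
    }
    return pending;
}
```

The cost complained about above shows up in post_interrupt: the device must be
able to drive an address and data onto the bus, which is exactly the bus
master logic a cheap card would rather not carry.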


-- 
Dave Haynie Commodore-Amiga (Amiga 3000) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
	"What works for me might work for you"	-Jimmy Buffett

jerry@TALOS.UUCP (Jerry Gitomer) (03/22/91)

daveh@cbmvax.commodore.com (Dave Haynie) writes:

|I'm looking for references, discussion, flames, whatever on various solutions
|to multiprocessing, especially in relation to support for said on microcomputer
|buses.

	Dave, do you want to limit this discussion to production chips
	such as the 80x86 and the 680x0 and buses or do you want to
	include custom chips, systems that never made it to market, and
	whatever?

			Jerry
--
Jerry Gitomer at National Political Resources Inc, Alexandria, VA USA
I am apolitical, have no resources, and speak only for myself.
Ma Bell (703)683-9090      (UUCP:  ...{uupsi,vrdxhq}!pbs!npri6!jerry

koll@NECAM.tdd.sj.nec.com (Michael Goldman) (03/23/91)

The only experience I've had with multiprocessing and bus issues was in a
real-time system using 680x0 & Ready Systems' VRTX kernel and the related MPV
(MultiProcessor for VME).  MPV uses a table in global memory to keep
track of where a CPU is and how to access it.  System calls (posts to queues,
semaphores, etc.) can be made across the bus by interrupting the addressed
CPU, (VME interrupt or mailbox interrupt) and passing a structure to the MPV
code on the 2nd CPU that contains the parameters necessary to make the system
call locally.  In this situation, even though all memory is global, it is shared
through semaphore-like arrangements, using the Motorola 68K TAS (test and set)
instruction.  One member of our team discovered an interesting problem with the
TAS instruction using cycle-by-cycle-interleaved memory access.  Although the
VME spec requires that the bus be locked during the multi-cycle TAS to insure
that TAS is an atomic operation, the local CPU bus did not.  Thus, we
occasionally had CPU-1's TAS interleave with CPU-2's TAS.
This worked as follows:

  CPU-1    TEST (thinks lock open)   SET (thinks it owns memory)

  CPU-2          TEST (thinks lock open)   SET (thinks it owns memory)

  Memory   0     0                   1     1   (trash)
  
The solution for us was to put the MPV tables on a separate memory-only
board.  I believe Software Components' pSOS+ MPV-equivalent approaches the
problem differently.
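
The interleaving in the diagram can be replayed deterministically in C.  This
is only a sketch: the split read and write helpers are hypothetical stand-ins
for the two bus cycles of a TAS on a bus that does not lock, and the C11
atomic_flag is used as the model of a properly indivisible test-and-set.  It
is not the MPV code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The two bus cycles of a TAS, modelled as separate steps. */
static unsigned char tas_read_cycle(const unsigned char *lock)
{
    return *lock;                   /* TEST: 0 means the lock looks open */
}

static void tas_write_cycle(unsigned char *lock)
{
    *lock |= 0x80;                  /* SET: the 68K TAS sets bit 7 */
}

/* Replays the bad interleaving.  Returns how many CPUs believe
   they own the lock afterwards. */
int interleaved_tas_owners(void)
{
    unsigned char lock = 0;

    unsigned char cpu1_saw = tas_read_cycle(&lock);  /* CPU-1 TEST */
    unsigned char cpu2_saw = tas_read_cycle(&lock);  /* CPU-2 TEST slips in */
    tas_write_cycle(&lock);                          /* CPU-1 SET */
    tas_write_cycle(&lock);                          /* CPU-2 SET */

    return (cpu1_saw == 0) + (cpu2_saw == 0);
}

/* Same two attempts with an indivisible test-and-set. */
int atomic_tas_owners(void)
{
    atomic_flag lock = ATOMIC_FLAG_INIT;

    bool cpu1_found_busy = atomic_flag_test_and_set(&lock);
    bool cpu2_found_busy = atomic_flag_test_and_set(&lock);

    return (!cpu1_found_busy) + (!cpu2_found_busy);
}
```

With the split cycles both callers see a zero and both claim the lock; with
the indivisible version only the first one does.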

Radstone sells a VME board that has a 68040 and a 68020 on the same board.
They consider the 68020 as an I/O processor that fields all the interrupts
for Ethernet, SCSI, VMEbus, etc. with separate memory for each CPU.  One of the
proprietary Real-Time Unix companies did something similar to minimize interrupt
latency.

There was a lot of concern about bus-snooping in the Futurebus+ discussions
and I'm curious as to how it was resolved.

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (03/23/91)

It appears that there are many people who don't realize that snooping
caches on bus based multiprocessors are a reality in the marketplace.

Encore and Sequent have both been offering products based on this
technology for some time.  The Encore Multimax uses NS 32000 family
microprocessors, while Sequent builds machines around the M 68000 and
Intel 80x86 families.

In addition, the DEC Firefly is built around MicroVAX processors, but
I don't know if they built it commercially or just for in-house research.

We have 2 Encore Multimax machines at Iowa, both with around 18
processors.  One is used as a workhorse supporting undergraduate
computer science instruction, the other supports only research users.
Parallel processing on the Multimax is supported at the lowest level
with shared memory and spin-locks.  There is a threads package sitting
on top of this low level.

The Encore operating system (MULTIMAX) is a version of UNIX with memory
sharing between processes allowed on a page by page basis (mark the page
as shared, then fork); there's a suite of library routines to manage a
shared heap using a shared version of malloc -- that's how I usually
do shared memory applications on this machine.
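
For readers without a Multimax, the same "share a page, then fork, then
spin-lock" pattern can be sketched with present-day POSIX calls.  This is an
assumption-laden stand-in, not the Encore library: mmap with MAP_SHARED plays
the role of marking the page shared, C11 atomic_flag plays the role of the
spin-lock, and shared_counter_demo and struct shared_area are names made up
for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Everything both processes touch lives in one shared mapping. */
struct shared_area {
    atomic_flag lock;   /* spin-lock word */
    long counter;       /* data protected by the lock */
};

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;               /* busy-wait until the holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Parent and child each add 1 to the shared counter `iters` times.
   Returns the final count as seen by the parent, or -1 on error. */
long shared_counter_demo(long iters)
{
    struct shared_area *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;
    atomic_flag_clear(&s->lock);
    s->counter = 0;

    pid_t pid = fork();             /* the mapping stays shared across fork */
    if (pid < 0) {
        munmap(s, sizeof *s);
        return -1;
    }

    for (long i = 0; i < iters; i++) {
        spin_lock(&s->lock);
        s->counter++;
        spin_unlock(&s->lock);
    }

    if (pid == 0)
        _exit(0);                   /* child is done */

    long total = -1;
    if (waitpid(pid, NULL, 0) == pid)
        total = s->counter;
    munmap(s, sizeof *s);
    return total;
}
```

If the lock does its job, shared_counter_demo(n) returns 2*n; drop the lock
calls and updates can be lost whenever the two processes run at the same time.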

					Doug Jones
					jones@cs.uiowa.edu

glew@pdx007.intel.com (Andy Glew) (03/23/91)

    One member of our team discovered an interesting problem with the TAS
    instruction using cycle-by-cycle-interleaved memory access.  Although
    the VME spec requires that the bus be locked during the multi-cycle
    TAS to insure that TAS is an atomic operation, the local CPU bus did
    not.  Thus, we occasionally had CPU-1's TAS interleave with CPU-2's
    TAS.

This is exactly the sort of cautionary tale I put in my "Survey of
Synchronization Primitive Implementations".   

Michael - can you give me more details (like company name) or does
that violate non-disclosure agreements?

Other readers - if you have any other real-life examples of problems
with synchronization primitive implementations, please send them to
me.


Reminiscing - this is the sort of thing that got me interested in computer
systems architecture in the first place.


--
---

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

bobg@.UUCP (Bob Greiner) (03/23/91)

In article <20037@cbmvax.commodore.com> daveh@cbmvax.commodore.com (Dave Haynie) writes:
>I'm looking for references, discussion, flames, whatever on various solutions
>to multiprocessing, especially in relation to support for said on microcomputer
>buses.

The Motorola Computer Group ships multiprocessors on VMEbus.  The VME141 
is a 68030 board with a write-through cache that supports queued invalidates.  
The VME188 is a 4 processor 88100 board with copy-back cache that supports
cache coherence through retries.  Other boards interact through uncached
memory regions.  
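
As a rough illustration of what a write-through cache with invalidates buys,
here is a toy model in C.  It is not the VME141 design: the sizes, the
one-word direct-mapped lines, and every name are assumptions, and invalidates
are applied immediately instead of being queued.  Turning the snoop flag off
shows the stale read that coherence hardware exists to prevent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS   64  /* assumed size of shared memory */
#define CACHE_LINES 8   /* assumed lines per cache, one word each */
#define NUM_CPUS    2

typedef struct {
    bool     valid;
    unsigned addr;
    uint32_t data;
} line_t;

typedef struct {
    uint32_t mem[MEM_WORDS];
    line_t   cache[NUM_CPUS][CACHE_LINES];
    bool     snoop;     /* true: writes invalidate other caches */
} system_t;

void sys_init(system_t *s, bool snoop)
{
    memset(s, 0, sizeof *s);
    s->snoop = snoop;
}

/* Read through the CPU's cache, filling the line on a miss. */
uint32_t cpu_read(system_t *s, int cpu, unsigned addr)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    line_t *l = &s->cache[cpu][addr % CACHE_LINES];

    if (!(l->valid && l->addr == addr)) {
        l->valid = true;
        l->addr  = addr;
        l->data  = s->mem[addr];
    }
    return l->data;
}

/* Write-through: update memory and the writer's own line, then
   invalidate matching lines in the other caches if snooping. */
void cpu_write(system_t *s, int cpu, unsigned addr, uint32_t value)
{
    assert(cpu >= 0 && cpu < NUM_CPUS && addr < MEM_WORDS);
    line_t *own = &s->cache[cpu][addr % CACHE_LINES];

    s->mem[addr] = value;
    own->valid = true;
    own->addr  = addr;
    own->data  = value;

    if (!s->snoop)
        return;
    for (int other = 0; other < NUM_CPUS; other++) {
        line_t *l = &s->cache[other][addr % CACHE_LINES];
        if (other != cpu && l->valid && l->addr == addr)
            l->valid = false;
    }
}
```

With snooping on, a second CPU that has already cached a word re-fetches it
after the first CPU writes; with snooping off it keeps returning the old
value.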

The '188 is not cache coherent versus other cached masters on VMEbus.  It
is internally cache coherent over its proprietary bus.  VME is used as an
IO bus; IO masters that access main memory on the '188 see a hardware-
enforced cache coherent image.  

This technique is used by many other companies: use VME to get to IO, have
a proprietary bus (or switch) with cached processors and main memory.  

This limitation, no cache coherence over the standard backplane bus, is why
I became the author of the cache coherence chapter of Futurebus+.

Bob Greiner, bobg@phx.mcd.mot.com
Not necessarily the opinion of Motorola.

ddr@cs.edinburgh.ac.uk (Doug Rogers) (03/26/91)

In article <20037@cbmvax.commodore.com>, daveh@cbmvax.commodore.com (Dave Haynie) writes:
> I'm looking for references, discussion, flames, whatever on various solutions
> to multiprocessing, especially in relation to support for said on microcomputer
> buses.
> 

> .............  The only bus I have any reference to that supports any
> sort of cache coherency scheme, at the moment, is FutureBus+.  Which would
> imply that at the moment, zero microcomputer systems solve this problem.

Apart from specialist in-house buses like those used by Sequent.
> 
> Along with cache problems, interrupts are another multiprocessor question.  If
> a device issues an interrupt, which processor does it go to?  Most systems seem
> to be saying "only the host processor". 

Futurebus supports interrupts through its message-passing scheme.  The
distributed arbitration mechanism goes to two passes: the first shows that a
message is being sent, which wins over modules arbitrating for the bus, and the
second pass of the protocol places the message on the bus.  It is up to the
other arbiters to recognise messages that are intended as interrupts locally.
Some messages are reserved for powerdown etc.

There are 3 agreed profiles for Futurebus, which all might be a bit heavy for
small personal machines.  New profiles are being put forward for such machines;
if you are interested, why not contact someone on the IEEE 896 working
group?

-- 
Douglas Rogers                     JANET: ddr@uk.ac.ed.lfcs
Department of Computer Science     UUCP:  ..!mcvax!ukc!lfcs!ddr
University of Edinburgh            ARPA:  ddr%lfcs.ed.ac.uk@nsfnet-relay.ac.uk
Edinburgh EH9 3JZ, UK.             Tel:   031-650 5172 (direct line)

koll@NECAM.tdd.sj.nec.com (Michael Goldman) (03/27/91)

In reply to Andy Glew's question about the name of the manufacturer that
made the board allowing cycle-by-cycle interleaving, I can give you the
name but it wouldn't be fair to virtually every other VME manufacturer that
does the same.  The only exception I can think of is Matrix (Raleigh, NC)
which claims that it is slower to do the arbitration cycle-by-cycle than
operation-by-operation.

markw@hpcuhe.cup.hp.com (mark williams) (03/27/91)

The base note asks for references to multiprocessors using microcomputers,
especially in reference to cache coherence and interrupt handling.  
Judging from the poster's address, I think the writer might want a
focused response (limited to MP micros using CISCy or RISCy chips).  I could
respond from that perspective, but I won't.  This subject is much broader.
The amusing thing for me is that the issues are the same whether the processors
are Am386s(tm) :+) or supercomputers,  so the problem has been studied for
a while.

These topics are very well covered in the literature.  If I were to recommend
just one reference, it would be the excellent tutorial in Computer a few years
ago:

   "Synchronization, Coherence and Event Ordering in Multiprocessors", by
   Dubois, Scheurich and Briggs, IEEE Computer, Feb 1988.

There are literally hundreds of other references to pursue, so many that
an annotated bibliography has been compiled by Eugene Miya.  It was posted to
the net recently.  Check your archive.

As for systems, currently dozens of MP systems are shipping, ranging from
PCs (Compaq SystemPro, a dual 486) to Supercomputers (Crays).  They use a wide
range of architectures, from loosely coupled MP (the highly successful Tandem
Computers) to shared memory MP.  

Within the shared memory camp, which has the most implementations, all the
vendors I know of use a proprietary processor/memory bus and most couple to
some standard bus like VME or EISA to leverage low-cost I/O controllers.  
The processor/memory bus maintains cache-coherence, which the standard I/O
buses cannot do (without proprietary extensions).  

This leads to some tricky problems extending locks to I/O
space with acceptable performance.  Cache-coherence over I/O space can be 
implemented in the bus converter between the processor/memory bus and the
I/O bus.  This too is tricky.

In an earlier post, Futurebus+ was mentioned as a potential solution, since
it is a "standard bus" with a defined cache coherence scheme.  Whenever the
"standard" gets approved (and frozen), it's worth taking a look at.  At least
one major vendor has plans to use Futurebus+ as a standard I/O bus, because
it's faster than other current "standard" I/O buses.

Disclaimer:  Just one man's opinion.

Mark Williams