[net.arch] How Many ...

gottlieb@cmcl2.UUCP (Allan Gottlieb) (04/30/86)

In article <2120@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>On the other hand, you can eliminate this problem by putting the translation
>hardware out at the memory (which I believe is what was done by
>Gottlieb et. al. in their supercomputer project, along with also putting
>some adders and so on out there), but then you only have one of them,
>which means it has to be very fast to avoid a bottleneck.  I recall reading
>a comment by Gottlieb about that fairly recently, where he was saying he
>wished he'd put his memory management at the processors instead.
>
	This is not quite right.  It is certainly correct that the NYU
	Ultracomputer architecture specifies adders at the MMs (memory
	modules).  This is done to provide an atomic implementation of
	our fetch-and-add operation (which we believe is an important
	coordination primitive).  We are very concerned about
	bottlenecks and thus specify that the number of MMs grows with
	the number of PEs (processing elements).  In the nominal
	design an omega network is used to connect 2^D PEs to 2^D MMs.
	The network is enhanced with VLSI switches we have designed
	that combine simultaneous references (including fetch-and-adds)
	directed at the same memory location.  We have implemented
        bottleneck-free algorithms for task management, memory
        management, and parts of the I/O system calls.
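
        To make the fetch-and-add idea concrete, here is a rough sketch
        (not actual Ultracomputer code; all names are made up) of one
        standard use: many PEs claiming distinct slots in a shared queue
        with no lock and no serial section.  Because the switches combine
        simultaneous fetch-and-adds aimed at the same word, the MM holding
        "tail" sees far fewer than P requests even when all P PEs enqueue
        at once.  The hardware operation is simulated below with a C11
        atomic so the fragment runs on a conventional machine.

#include <stdatomic.h>
#include <stdio.h>

#define QSIZE 1024

static int queue[QSIZE];
static atomic_int tail;            /* index of next free slot, shared by all PEs */

/* Each PE claims a distinct slot without locking: even if every PE
 * issues its fetch-and-add at the same instant, each receives a
 * different old value of tail, so no two PEs write the same slot.   */
static void enqueue(int item)
{
    int slot = atomic_fetch_add(&tail, 1);     /* returns the OLD value */
    queue[slot % QSIZE] = item;
}

int main(void)
{
    enqueue(42);
    enqueue(7);
    printf("tail is now %d\n", atomic_load(&tail));    /* prints 2 */
    return 0;
}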

        Since the specified network is buffered, circuit-switched, and
        pipelined, the bandwidth grows linearly with the number of PEs
        and thus will not prove to be limiting.  However, the latency
        grows as D (i.e., log #PE), and it is not trivial to do enough
	prefetching to mask the latency.  For this reason it is
	essential to put the cache on the PE side of the network.
	Moreover, it is important to minimize network traffic as
	otherwise queues in the switches become nonempty, further
	increasing the latency.
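
        To put rough numbers on that (an illustrative sketch; the cycle
        counts are assumptions, not project figures): with P = 2^D PEs a
        round trip crosses 2*D switch stages, and by Little's law a PE
        must keep about latency/issue-interval references outstanding to
        hide the latency, which is why sufficient prefetching becomes
        hard as D grows.

#include <stdio.h>

int main(void)
{
    const double switch_delay   = 1.0;   /* cycles per switch stage (assumed)   */
    const double issue_interval = 4.0;   /* cycles between memory refs (assumed) */

    for (int d = 3; d <= 9; d += 3) {
        int    pes       = 1 << d;                    /* P = 2^D PEs            */
        double latency   = 2.0 * d * switch_delay;    /* round trip: 2*D stages */
        double in_flight = latency / issue_interval;  /* Little's law           */
        printf("P = %3d PEs: latency ~ %4.1f cycles, "
               "~%.1f refs must be in flight per PE\n", pes, latency, in_flight);
    }
    return 0;
}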

        The memory management of our current prototype (8 PEs, bus-based)
        is on the PE side of the network.  Note that we do not support
        demand paging.  A soon-to-be-completed thesis by Pat Teller
        studies the demand-paging issue, especially the problem of evicting
        shared pages.  Since we consider it important to have the TLB
        on the PE side of the network, the design "locks" TLB-resident
        pages in the MMs.
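
        The "locking" can be pictured as a per-frame count of PE-side TLB
        entries mapping that frame: the pager never evicts a frame whose
        count is nonzero, presumably so that eviction never requires
        invalidating TLB entries across the network.  The fragment below
        is only a sketch of that idea (names and fields are hypothetical),
        not the actual design.

#include <stdbool.h>
#include <stdio.h>

#define NFRAMES 4096

struct frame {
    int tlb_refs;    /* number of PE-side TLB entries mapping this frame */
};

static struct frame frames[NFRAMES];

/* A PE loads a translation for this frame into its TLB: pin the frame. */
void tlb_load(int frame_no)  { frames[frame_no].tlb_refs++; }

/* That TLB entry is flushed or replaced: unpin the frame. */
void tlb_flush(int frame_no) { frames[frame_no].tlb_refs--; }

/* The pager may evict only frames that no TLB still maps. */
bool evictable(int frame_no) { return frames[frame_no].tlb_refs == 0; }

int main(void)
{
    tlb_load(7);
    printf("frame 7 evictable? %d\n", evictable(7));   /* 0: locked in its MM */
    tlb_flush(7);
    printf("frame 7 evictable? %d\n", evictable(7));   /* 1: may be paged out */
    return 0;
}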

        Finally, let me add that IBM Research plans to build a 512-PE
        RP3 (Research Parallel Processor Prototype) whose architecture
        includes all of the (older) Ultracomputer architecture as well as
        significant IBM enhancements, especially in memory management.
        In the RP3, memory management is also done on the PE side for
        those requests that traverse the network (their memory-management
        enhancement permits certain cache misses to go directly to the MM
        physically associated with the issuing PE without using the
        network).  RP3 does not specify demand paging either.  The real
        problem here is that neither the RP3 nor the Ultracomputer project
        has produced the I/O systems needed.  That is, RP3 delivers 1000
        MIPS for good (but reasonable) cache hit ratios, but (the Vax-780
        being roughly a 1-MIPS machine) it does not have anything close to
        1000 times the I/O of a Vax-780.
-- 
Allan Gottlieb
GOTTLIEB@NYU
{floyd,ihnp4}!cmcl2!gottlieb   <---the character before the 2 is an el

jer@peora.UUCP (J. Eric Roskos) (05/05/86)

I wrote:

>On the other hand, you can eliminate this problem by putting the translation
>hardware out at the memory (which I believe is what was done by
>Gottlieb et. al. in their supercomputer project, along with also putting
>some adders and so on out there) ...

Allan Gottlieb replied:

>       This is not quite right. ...  The latency grows as log
>       #PE and it is not trivial to do enough prefetching to mask the
>       latency.  For this reason it is essential to put the cache on the
>       PE side of the network.  The memory management of our current
>       prototype (8PEs, bus-based) is on the PE side of the network ... we
>       consider it important to have the TLB on the PE side of the
>       network; the design "locks" TLB resident pages in the MMs.

Oops... I apologise about that... in your paper you wrote:

	To prevent the cluster controller from becoming a bottleneck,
	each MM has a directory with entries for the pages it currently
	contains.... When a page fault is detected by an MM, an
	appropriate message is sent to the cluster controller,
	inducing it to perform a page swap and update the
	directory of each MM in its cluster. ... Performing address
	translation at the MM is not without cost: virtual addresses
	augmented with address space identifiers, which may be longer
	than physical addresses, must be transmitted across the
	network, increasing traffic.  Also, the optimal cluster
	size for paging may be bad for general I/O.  We are, therefore,
	investigating the second option of locating the translation
	mechanism at the PEs.

Since the description of the hypothetical case was in the present tense,
with no subjunctives, I misread the last sentence to mean
"we attempted the former and are now investigating the possibility of
using the latter approach for the next prototype."  I apologise for any
confusion that may have resulted from my misinterpretation.

There is another consideration involved in the placement of the address
translation hardware, however, aside from the above factors: the
situation in which heterogeneous processors serve as the PEs.  It is not
a problem commonly discussed in the research literature, but it is a
problem in the real world.

The problem is that with differing processor types, the same MMU may not
be available for all of them.  This might be the case, for example, in a
machine using both 8086 and 68000 CPUs sharing a common memory.

Two problems come up: (1) the 8086's MMU and the 68000's MMU require
substantially different translation/protection tables, and keeping them
consistent is a problem (one possible approach is sketched below);
(2) proving that the protection is correct is much harder.

#2 is a problem even with homogeneous processors.  It would be much easier,
especially for the proofs required by things such as CSC-STD-001-83
and similar existing security standards, if the memory were self-protecting:
that is, if, assuming the integrity of the memory system itself were not
violated, illegal accesses were simply impossible.  Unfortunately, practical
considerations presently prevent this.
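
Returning to problem (1): one way to keep the two table formats consistent
is to hold a single machine-independent description of each mapped region
and regenerate each processor family's MMU tables from it whenever it
changes, so an update is applied in exactly one place.  The sketch below is
only illustrative; the descriptor layouts are deliberately simplified and
hypothetical, not the real 8086-family or 68000-family formats.

#include <stdint.h>
#include <stdio.h>

/* Canonical, MMU-independent description of a mapped region. */
struct region {
    uint32_t phys_base;
    uint32_t length;
    int      writable;
};

/* Simplified, made-up descriptor formats for the two MMU families. */
struct seg_desc  { uint32_t base; uint16_t limit; uint8_t access; };
struct page_desc { uint32_t frame; uint8_t flags; };

/* Regenerate a segment-style entry for the 8086-side MMU. */
static struct seg_desc to_seg(const struct region *r)
{
    struct seg_desc d;
    d.base   = r->phys_base;
    d.limit  = (uint16_t)(r->length - 1);
    d.access = r->writable ? 0x2 : 0x0;
    return d;
}

/* Regenerate a page-style entry for the 68000-side MMU; a real table
 * would hold one entry per 4K page, only the first is derived here.  */
static struct page_desc to_page(const struct region *r)
{
    struct page_desc d;
    d.frame = r->phys_base >> 12;
    d.flags = r->writable ? 0x2 : 0x0;
    return d;
}

int main(void)
{
    struct region r = { 0x40000, 0x2000, 1 };   /* one shared 8K region */
    struct seg_desc  s = to_seg(&r);
    struct page_desc p = to_page(&r);

    /* Both MMU views are derived from r, so a change to r is made in
     * one place and both tables are regenerated consistently.        */
    printf("seg: base=%x limit=%x   page: frame=%x\n",
           (unsigned)s.base, (unsigned)s.limit, (unsigned)p.frame);
    return 0;
}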

------
Disclaimer: the above comments reflect my own ideas, and do not necessarily
reflect any work presently being done at Concurrent.
-- 
E. Roskos