[net.arch] Shared vs. Paged memory

rb@cci632.UUCP (Rex Ballard) (09/18/86)

In article <210@ima.UUCP> johnl@ima.UUCP (John R. Levine) writes:
>In article <3608@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>>Just a few comments on what the large 370 world does about some of the
>>problems that have been raised with very large memories.  First of all,
>>large memories in this world are always multi-ported and accessed
>>through caches. ...
>
>The latest issue of the IBM Systems Journal (known affectionately as "Pravda")
>is about the new 3090 series.  I was amazed to learn that in the four-processor
>version, the processors are set up in pairs, with half of the memory attached to
>each pair.  If a processor wants to get to memory attached to the other pair,
>it has to ask the other side to get it.  They have large write-behind caches,
>so that the number of memory references is very low compared to the number of
>instructions executed, but it still seems like a peculiar way to build a memory.

It does seem peculiar until you look at the actual constraints of arbitrating
among N processors versus using DMA or some other "paging" or "buffering"
scheme.

In the arbitration scheme, you have two choices: you can either allow
arbitration only at specific times, such as upon completion of instruction
execution, or you can allow arbitration prior to each memory access.  In the
latter case, even when there is no contention, the test adds extra states
to every access.  In the former, the time required for arbitration is much
longer.

For example, if there is contention between two 8085 processors running at
6 MHz, the minimum arbitration time is 1 microsecond, for about 1
megabyte/second, but it is also possible for this to stretch to 6
microseconds per processor.  Put 3 or 4 CPUs on the same backplane, and
delays can run to 24 microseconds.  Bandwidth has just dropped to roughly
50 Kbytes per second, with comparable delays in wait states.

The 6809 allowed two processors to share the same memory without contention
(by running them on opposite clock phases), but there is a hard limit to the
number of processors allowed.  The same is true for dual-ported memory.

In addition, the memory access speed must be at least N times faster than
for a single CPU.

On the other hand, DMA interfaces such as IPI and SCSI, among others, settle
arbitration very early; once the circuit is set, the transfer proceeds in
burst or cycle-stealing mode without further arbitration.  With either
cycle-stealing or burst-mode DMA, the CPU is only slowed by the time
required for the memory write.  In interleaved schemes the CPU will not see
the difference; in burst schemes, each CPU can run faster while no DMA
transfer is occurring.

In addition, the CPU can limit its use of DMA-target memory through the
use of smaller segments, context switching, and mapping.

In data-flow models, where caching, pre-fetching, and pipelining can be used,
it is conceivable that effective access could be even faster than shared
memory, in spite of the transfer overhead.

This issue of large memory can be tied to the issue of many processors.
If many processors, each containing a moderate amount of memory, are
connected by high-speed data coupling, it becomes more practical to use
less "main" memory and more "mass storage".

The common complaint with VM, segments, paging, etc., is that managing all
the required tables and buffers takes too much "CPU time".  In the case of
a single- or dual-CPU system, this is true, but when hierarchies of CPUs
and the associated inter-processor "calls" are used, overall system speed
goes up.

Again, the issues of single-application vs. multi-tasking, CPU-only vs.
"system" benchmarks, etc. come into play.

The simple start of a "system level" memory scheme of this sort is to ask
a few simple questions.

What do I need when?
How do I get it?
How long will it take to get here?
What can I do while I'm waiting?

At some levels this can be hard to do, but at others, it can be quite
simple.

It is theoretically possible, given the right management scheme, to make it
appear to certain levels of processor that they have access to 1 gigabyte
of data with no wait states.  For example:

A multi-tasking master can access its local cache at, say, 150 ns; on a
"miss", it requests the resource from a level-1 slave, then switches to
another task whose resources are already available.  The level-1 slave
knows what code is being executed, hands the code to the master, then
requests more code from a level-2 slave.  The level-2 slave has an index
table into a "ramdisk cache" and "virtual directories"; new tasks, data
from open files, and so on can be requested from a level-3 slave.  The
level-3 slave knows about all directories in the path, all open files, and
the most commonly used resources.  A level-4 slave knows how to translate
virtual device addresses to sectors and tracks, and caches the most
commonly used sectors or tracks.  A level-5 slave manages disk requests,
head stepping, and so on.

As you can see, the "master" or level-0 processor will usually find loops,
sequential streams, and the like already in its slaves' core before it
requests them.  The master may have less than a meg, even 128K or less,
but when 8 different slaves, each with their own 128K, can be selected, we
now have 1 Meg available within, say, 2 milliseconds: time enough to do
something else, like updating a screen, handling terminal I/O, or looking
ahead to see what else it might need.

Extending the same *8 rule for each level: 64 meg in 4 milliseconds, 512
meg in 6 milliseconds, possibly even 4 gigabytes in 8 milliseconds.  In
the worst case, however, things could slow down to a painfully slow 100
milliseconds.  To manage that, you would have to write some special
application, like "random read any byte in a 32 gigabyte space" :-).
Remember, each level of processor can run tasks that do special work, like
fetching information based on recently accessed data base keys.

The next trick, of course, is coming up with 32 gigabytes of mass storage,
like WORM drives with cartridge indices, video tape players, and so on.

Now, unless you want the slaves to be sleeping all the time, you will need
lots of masters as well; to be conservative, say 4 masters.  Does this seem
a little bit powerful?

This, by the way, assumes only a SCSI interface with a 1 megabyte/second
transfer rate and an average arbitration time of 500 microseconds.  With
IPI, at 10 times the transfer rate and 1/10th the arbitration time, it
could be possible to reduce the times by some amount (using faster memory,
of course).

This whole scheme was covered in a Byte article about 8 years ago or so,
probably in an April issue under the subject of "Content addressable memory".

Kismet - "The Olive Tree".
	Rex Ballard.