[comp.arch] Data General 1200 and the single/multiple processor debate

darylm@illian.UUCP (Daryl V. McDaniel) (08/12/88)

I would like to respond to two of the discussions that have been taking
place in this group.

	misc Comments about Data General 1200
	GP vs Special Purpose (Dedicated) processors

DG 1200:

First, I will cover the DG 1200 since what I have to say is short.  I
worked for seven years as a Field Engineer servicing, among other things,
Data General Nova and 1200 computers.  The 1200 interested me because its
architecture was novel for its time, the late 70's.

The 1200 was advertised, and programmed, as a 16-bit machine.  The actual
hardware only had a 4-bit ALU and internal data paths.  I guess you could
say that it was nibble-serial.  The DG 1200s I worked on were used as
dedicated controllers for Centronics 101A printers.  (Talk about dinosaurs...)
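
To make "nibble-serial" concrete, here is a minimal sketch of a 16-bit add
performed as four passes through a 4-bit adder, low nibble first, with the
carry handed from one pass to the next.  It illustrates the general idea
only; it is not taken from the 1200's microcode, and the function name is
made up for this example.

```python
def nibble_serial_add(a, b):
    """Add two 16-bit words using only a 4-bit adder, low nibble first."""
    result = 0
    carry = 0
    for step in range(4):                  # four passes through the 4-bit ALU
        shift = 4 * step
        na = (a >> shift) & 0xF            # current nibble of each operand
        nb = (b >> shift) & 0xF
        total = na + nb + carry            # 4-bit add with carry-in
        result |= (total & 0xF) << shift   # keep the 4-bit sum
        carry = total >> 4                 # carry-out feeds the next pass
    return result, carry                   # 16-bit sum and final carry-out


print(nibble_serial_add(0x1234, 0x0FCD))
```

The programmer sees one 16-bit operation; the hardware spends four ALU
passes on it.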


GP vs SP processors:

I believe that, if exercised with restraint, designing with multiple
"processing elements" can provide measurable performance increases over
single "do it all" processor implementations.  I will show, though, that
unless careful consideration is given, multiple PEs can seriously harm
performance.

I am going to use the term "processing element", or PE, instead of CPU in
this article.  This is because I want a term that will include DMA
controllers, blitters, GP CPUs, and other similar devices.

The controversy over whether to do it all with one CPU or spread it over
multiple "processing elements" has been going on for quite a while.  I
believe that the deciding consideration is what will meet our performance
requirements at the minimum cost.

The biggest win from the multiple PE approach is an increase in
parallelism.  This will only pay off if the task can be partitioned into
multiple threads that can operate independently and in parallel.  Examples
of this are I/O operations and Graphics.

Major detriments to the multiple PE approach include increases in Hardware
and Software complexity.  There are now multiple contenders for bus and
memory bandwidth with the corresponding increase in hardware to support
arbitration and isolate the bus from devices pending service.  The software
effort goes up for each additional PE.  The DMA controller type PE usually
requires the least additional effort while multiple GP processors require
the most.  The software complexity increases even further if there are
multiple, dissimilar, programmable PEs.

An architecture that has a single bus, a reasonably fast main CPU, and
multiple I/O devices will rapidly become log-jammed due to overallocated bus
bandwidth.  An example (based upon real hardware):

    Machine A:

	0.9 MIPS CPU, 3.125 million 16-bit bus cycles/second max bandwidth
	2 DMA serial ports, 8 thousand bus requests/second max
		each cycle is 1.2us
	1 802.3 DMA LAN port, can grab bus for up to 300us
	1 DMA SCSI (disk) interface, 1 million bus requests/second max
		each cycle is 1.2us

    Machine B:

	0.75 MIPS main CPU, 2.5 million 16-bit bus cycles/second max

	I/O processor with local memory:
		0.5 MIPS CPU, 2 million 16-bit bus cycles/second
		4 DMA serial ports, 16 thousand bus requests/second max
			each cycle is 1.2us
		1 802.3 DMA LAN port, same as machine A
		DMA ST506 disk interface
		DMA SCSI interface

The sample machines had the fastest DMA controllers available at the time
they were designed.  The two processors in machine B were able to operate
in parallel as long as they weren't accessing memory local to the other
processor.  For the purposes of comparative testing, both machines had the
same amount of main memory, 2MB, and machine B's SCSI port and two of its
serial ports were left unused.  Both machines were running a variant of
BSD4.2.
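
A rough bus budget for machine A can be worked out from the listed figures.
The sketch below assumes that the quoted request rates are peak rates, that
the 8 thousand requests/second covers both serial ports together, and that
every DMA request holds the bus for one 1.2us cycle.  The names are invented
for this example.

```python
BUS_CYCLES_PER_SEC = 3.125e6      # machine A bus maximum, from the listing
DMA_CYCLE_SEC = 1.2e-6            # duration of one DMA bus cycle
LAN_MAX_HOLD_SEC = 300e-6         # longest time the LAN port can hold the bus


def bus_share(requests_per_sec, cycle_sec=DMA_CYCLE_SEC):
    """Fraction of each second the bus is held by a DMA source at peak rate."""
    return requests_per_sec * cycle_sec


serial_share = bus_share(8_000)        # assumption: both serial ports together
scsi_share = bus_share(1_000_000)      # SCSI interface at its peak request rate
lan_cycles_lost = LAN_MAX_HOLD_SEC * BUS_CYCLES_PER_SEC

print(f"serial: {serial_share:.2%} of bus time")
print(f"SCSI:   {scsi_share:.0%} of bus time")
print(f"LAN:    up to {lan_cycles_lost:.1f} CPU-rate bus cycles per burst")
```

Under these assumptions the serial ports are nearly free (about 1% of bus
time), a single LAN burst can displace over 900 CPU-rate bus cycles, and the
SCSI peak rate works out to more than the whole bus, so the disk interface
alone can saturate it.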

The processors used, in order of presentation were:

	NS32016 12.5MHz		Machine A, only CPU
	NS32016 10MHz		Machine B, main CPU
	NS32016 8MHz		Machine B, I/O CPU
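
The bus figures quoted for the two machines line up with these clock rates
if each bus cycle takes four clocks.  The divisor is inferred from the
numbers in this article, not stated anywhere in it, so treat it as an
assumption:

```python
CLOCKS_PER_BUS_CYCLE = 4   # assumption: inferred from the quoted figures


def bus_cycles_per_sec(clock_hz, clocks_per_cycle=CLOCKS_PER_BUS_CYCLE):
    """Maximum bus cycles per second for a given CPU clock rate."""
    return clock_hz / clocks_per_cycle


processors = [
    ("Machine A, only CPU", 12.5e6),
    ("Machine B, main CPU", 10e6),
    ("Machine B, I/O CPU", 8e6),
]
for name, clock_hz in processors:
    rate = bus_cycles_per_sec(clock_hz) / 1e6
    print(f"{name}: {rate:.3f} million bus cycles/second")
```

This gives 3.125, 2.5, and 2 million bus cycles/second, matching the
listings for machine A's CPU and machine B's main and I/O CPUs.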

Comparative performance tests were done using the NCR systems performance
benchmarks.

With NO simulated I/O load, machine A was 15% faster than machine B.  As
the simulated disk and serial I/O load was increased, the performance of
the two machines converged until they were equal.  At the point of maximum
simulated load, machine B had 20% better performance than machine A.  The
addition of LAN activity to the simulated load significantly degraded the
performance of machine A while having only a minor effect on machine B.

By analyzing bus activity on machine A, it was determined that a
significant amount of time was spent in bus arbitration and servicing
"slow" DMA bus cycles.  Machine A was then modified to allow the block move
instructions of the CPU to be used to transfer data from the SCSI
controller to memory.  The modification resulted in a general flattening of
the performance degradation curve.  It also had the side effect of reducing
the manufactured cost of machine A.

Even with the improved performance of machine A, machine B was still able
to provide better performance under high simulated loads.

NOTE:

	Under "normal" 2-user operation, machine A always provided better
	performance than machine B.  It was not until 4-user, or more,
	comparisons were made that machine B excelled, under "normal"
	operation.  This does not reflect negatively on the NCR SPT suite.
	The suite was designed to cover the range from "no" load to "max"
	load.  The trick comes in deciding what simulated load reflects a
	normal load.  This will vary as the intended application of the
	machine varies.

TODAY:

The high performance machines we design today (we had nothing to do with
the above two machines) use the multiple PE approach.  For two-port serial
interfaces we use a Hitachi 64180 (Z80 with 2 serial ports on chip), which
handles a major portion of the TTY driver.  Our 4- and 8-port serial
interfaces use the SMC Quad and Octal UARTs with an NS32CG16 GP CPU to
handle the protocol.

The main CPU(s) still handle disk and LAN.  The main reason is that we
haven't been able to partition the software so that data copying is
minimized and a significant load is removed from the central compute
resource.

Graphics devices, both input and output, are a separate sub-system that
contains multiple PEs.  Communication with the graphics sub-system is done at a
high level, such as that specified by the X-protocol.

Borrowing an idea from Tandem, we use two identical busses that reach each
card in the system.  Arbitration routes the requester to whichever bus
becomes free first.  This allows a relatively simple design with a
significant, better than 2X, performance increase over a single bus
system.
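
The "whichever bus becomes free first" rule can be sketched as a small
queueing model.  This is an illustration of the arbitration policy only,
not our actual arbiter: the function name and the workload are hypothetical,
and arbitration overhead is ignored.

```python
import heapq


def finish_time(requests, n_buses):
    """Return when the last transfer completes.

    requests: list of (arrival_time, hold_time) pairs, in microseconds.
    Each request is granted whichever bus becomes free first.
    """
    free_at = [0.0] * n_buses              # time at which each bus is next free
    heapq.heapify(free_at)
    last_done = 0.0
    for arrival, hold in sorted(requests):
        bus_free = heapq.heappop(free_at)  # the bus that frees up earliest
        start = max(arrival, bus_free)     # wait for the bus if it is busy
        done = start + hold
        heapq.heappush(free_at, done)
        last_done = max(last_done, done)
    return last_done


# Hypothetical workload: 8 requesters ready at t=0, each holding a bus 1.2us.
burst = [(0.0, 1.2)] * 8
print(finish_time(burst, 1), finish_time(burst, 2))
```

For this burst the single bus finishes at roughly 9.6us and the dual bus at
roughly 4.8us.  A pure queueing model like this tops out at 2X, so it does
not account for the better than 2X figure quoted above.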

I have tried to present some of my views and experiences with multiple and
single PE architectures.  I hope that there is something herein that you
find useful or that stimulates a useful train of thought.  There are many
issues and techniques that I haven't touched on.

My main thrust is that you shouldn't discard the multiple or single PE
approach without really looking at both.  In almost every application
there is an area that could be enhanced by the incorporation of a dedicated
PE for that function.  The question is whether it is worth the cost.  Also,
the gratuitous inclusion of PEs under the assumption that if one is good,
many must be better is dangerous.  You can end up in the situation where one
is good but many are horrible.

I do agree that a single GP CPU with a zillion MIPS would make my life much
easier.  But then someone would want something faster.  Wouldn't you, Uncle
Sugar? (Uncle Sam, for those of you lacking US military experience)

I purposely haven't provided many details or the reasoning behind some of
the things I have described.  The main reason was to keep this article
short.  (I know, 160 lines isn't really that short)  The other reason is
that I can't remember some of the specifics of the machine A vs machine B
comparison because that happened three years ago.

DISCLAIMER:
  None of the above is in any way an advertisement.  We are a
non-commercial, independent, non-manufacturing, R&D firm.  Anyway, these
are my own views, not the company's.
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Daryl V. McDaniel
Micronetics			USENET:	...tektronix!nosun!illian!darylm
4730 S.W. 182nd Ave.		 TELEX:	WUI 6972206
Aloha, OR   97007		 PHONE:	(503) 224-7056