darylm@illian.UUCP (Daryl V. McDaniel) (08/12/88)
I would like to respond to two of the discussions that have been taking place in this group:

    Misc comments about the Data General 1200
    GP vs Special Purpose (Dedicated) processors

DG 1200:

First, I will cover the DG 1200 since what I have to say is short. I worked for seven years as a Field Engineer servicing, among other things, Data General Nova and 1200 computers. The 1200 interested me because its architecture was novel at that time, in the late 70's. The 1200 was advertised, and programmed, as a 16-bit machine, but the actual hardware had only a 4-bit ALU and 4-bit internal data paths. I guess you could say that it was nibble-serial. The DG 1200s I worked on were used as dedicated controllers for Centronics 101A printers. (Talk about dinosaurs...)

GP vs SP processors:

I believe that, if exercised with restraint, designing with multiple "processing elements" can provide measurable performance increases over single "do it all" processor implementations. I will show, though, that unless careful consideration is given, multiple PEs can seriously harm performance.

I am going to use the term "processing element", or PE, instead of CPU in this article. This is because I want a term that will include DMA controllers, blitters, GP CPUs, and other similar devices.

The controversy over whether to do it all with one CPU or spread it over multiple "processing elements" has been going on for quite a while. I believe that the final consideration is what can be done to meet our performance requirements at the minimum cost.

The biggest win from the multiple PE approach is an increase in parallelism. This will only pay off if the task can be partitioned into multiple threads that can operate independently and in parallel. Examples of this are I/O operations and graphics.

Major detriments to the multiple PE approach include increases in hardware and software complexity.
There are now multiple contenders for bus and memory bandwidth, with a corresponding increase in the hardware needed to support arbitration and to isolate the bus from devices pending service.

The software effort goes up for each additional PE. The DMA controller type of PE usually requires the least additional effort, while multiple GP processors require the most. The software complexity increases even further if there are multiple, dissimilar, programmable PEs.

An architecture that has a single bus, a reasonably fast main CPU, and multiple I/O devices will rapidly become log-jammed due to overallocated bus bandwidth.

An example (based upon real hardware):

Machine A:
    0.9 MIPS CPU, 3.125 million 16-bit bus cycles/second max bandwidth
    2 DMA serial ports, 8 thousand bus requests/second max,
        each cycle is 1.2us
    1 802.3 DMA LAN port, can grab bus for up to 300us
    1 DMA SCSI (disk) interface, 1 million bus requests/second max,
        each cycle is 1.2us

Machine B:
    0.75 MIPS main CPU, 2.5 million 16-bit bus cycles/second max
    I/O processor with local memory:
        0.5 MIPS CPU, 2 million 16-bit bus cycles/second
    4 DMA serial ports, 1.6 bus requests/second max,
        each cycle is 1.2us
    1 802.3 DMA LAN port, same as machine A
    DMA ST506 disk interface
    DMA SCSI interface

The sample machines had the fastest DMA controllers available at the time they were designed. The two processors in machine B were able to operate in parallel as long as neither was accessing memory local to the other processor.

For the purposes of comparative testing, both machines had the same amount of main memory, 2MB, and two of the serial ports and the SCSI port of machine B were unused. Both machines were running a variant of BSD4.2.

The processors used, in order of presentation, were:

    NS32016 12.5MHz    Machine A, only CPU
    NS32016 10MHz      Machine B, main CPU
    NS32016 8MHz       Machine B, I/O CPU

Comparative performance tests were done using the NCR systems performance benchmarks. With NO simulated I/O load, machine A was 15% faster than machine B.
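
As a rough check on the "overallocated bus bandwidth" point, the machine A figures listed above can be turned into a worst-case bus budget. This is only a sketch, not a measurement: it takes the quoted maxima at face value, assumes the 8 thousand requests/second figure applies to each serial port, and the names used (bus_fraction, BUS_CYCLE_S, and so on) are made up for illustration.

```python
# Worst-case bus time demanded per second by machine A's DMA devices.
# All figures are the ones quoted in the article. Treating the serial
# figure as per-port is an assumption, as are all the names used here.

BUS_CYCLE_S = 1.2e-6        # one DMA bus cycle lasts 1.2 us
CPU_BUS_RATE = 3.125e6      # max 16-bit bus cycles/second on machine A
LAN_HOLD_S = 300e-6         # the LAN port can hold the bus for up to 300 us


def bus_fraction(requests_per_second, cycle_s=BUS_CYCLE_S):
    """Fraction of each second the bus is held by one DMA device."""
    return requests_per_second * cycle_s


serial = 2 * bus_fraction(8_000)     # two DMA serial ports at their maximum
scsi = bus_fraction(1_000_000)       # SCSI interface at its maximum
total = serial + scsi

# Bus cycles the CPU cannot use while one maximum-length LAN burst runs.
cpu_cycles_lost = LAN_HOLD_S * CPU_BUS_RATE

print(f"serial ports: {serial:.2%} of bus time")
print(f"SCSI:         {scsi:.2%} of bus time")
print(f"total:        {total:.2%} of bus time")
print(f"CPU bus cycles lost per LAN burst: {cpu_cycles_lost:.1f}")
```

Under these assumptions the SCSI interface alone asks for about 120% of the available bus time at its peak rate, and a single 300us LAN burst costs the CPU on the order of 900 bus cycles. Both are consistent with the log-jam described above.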
As the simulated disk and serial I/O load was increased, the performance of the two machines converged until they were equal. At the point of maximum simulated load, machine B had 20% better performance than machine A. The addition of LAN activity to the simulated load significantly degraded the performance of machine A while having only a minor effect on machine B.

By analyzing bus activity on machine A, it was determined that a significant amount of time was spent in bus arbitration and in servicing "slow" DMA bus cycles. Machine A was then modified to allow the block move instructions of the CPU to be used to transfer data from the SCSI controller to memory. The modification resulted in a general flattening of the performance degradation curve. It also had the side effect of reducing the manufactured cost of machine A. Even with the improved performance of machine A, machine B was still able to provide better performance under high simulated loads.

NOTE: Under "normal" 2-user operation, machine A always provided better performance than machine B. It was not until comparisons with 4 or more users were made that machine B excelled under "normal" operation. This does not reflect negatively on the NCR SPT suite. The suite was designed to cover the range from "no" load to "max" load. The trick comes in deciding what simulated load reflects a normal load. This will vary as the intended application of the machine varies.

TODAY:

The high performance machines we design today (we had nothing to do with the above two machines) use the multiple PE approach. For two-port serial interfaces we use a Hitachi 64180 (Z80 with 2 serial ports on chip), which handles a major portion of the TTY driver. Our 4 and 8-port serial interfaces use the SMC Quad and Octal UARTs with an NS32CG16 GP CPU to handle the protocol. The main CPU(s) still handle disk and LAN.
The main reason is that we haven't been able to partition the software so that data copying is minimized and a significant load is removed from the central compute resource.

Graphics devices, both input and output, are a separate sub-system that contains multiple PEs. Communication with the graphics sub-system is done at a high level, such as that specified by the X-protocol.

Borrowing an idea from Tandem, we use two identical busses that reach each card in the system. Arbitration routes the requester to whichever bus becomes free first. This allows a relatively simple design with a significant, better than 2X, performance increase over a single bus system.

I have tried to present some of my views and experiences with multiple and single PE architectures. I hope that there is something herein that you find useful or that stimulates a useful train of thought. There are many issues and techniques that I haven't touched on. My main thrust is that you shouldn't discard either the multiple or the single PE approach without really looking at both.

In almost every application there is an area that could be enhanced by the incorporation of a dedicated PE for that function. The question is, is it worth the cost? Also, the gratuitous inclusion of PEs, under the assumption that if one is good many must be better, is dangerous. You can end up in the situation where one is good but many is horrible.

I do agree that a single GP CPU with a zillion MIPS would make my life much easier. But then someone would want something faster. Wouldn't you, Uncle Sugar? (Uncle Sam, for those of you lacking US military experience)

I purposely haven't provided many details or the reasoning behind some of the things I have described. The main reason was to keep this article short. (I know, 160 lines isn't really that short) The other reason is that I can't remember some of the specifics of the machine A vs machine B comparison because that happened three years ago.
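
For anyone who wants to play with the dual-bus idea mentioned earlier (arbitration routes the requester to whichever bus becomes free first), here is a toy model of that rule. It is a sketch under assumptions, not a description of the real arbiter: the function names, the request format, and the sample timings are all invented for illustration, and it makes no attempt to reproduce the better-than-2X figure.

```python
# Toy model of "route the requester to whichever bus becomes free first".
# Everything here (names, request format, timings) is hypothetical.

def route(arrival, bus_free_at):
    """Pick the bus that becomes free first.

    Returns (bus_index, start_time), where start_time is when the
    requester actually gets the bus.
    """
    bus = min(range(len(bus_free_at)), key=lambda i: bus_free_at[i])
    start = max(arrival, bus_free_at[bus])
    return bus, start


def total_wait(requests, n_buses):
    """Sum of the time requesters spend waiting for a bus.

    requests: list of (arrival_time, hold_time), sorted by arrival_time.
    """
    bus_free_at = [0.0] * n_buses
    waited = 0.0
    for arrival, hold in requests:
        bus, start = route(arrival, bus_free_at)
        waited += start - arrival
        bus_free_at[bus] = start + hold
    return waited


# Four requesters arriving 1 time unit apart, each holding a bus for 3 units.
sample = [(0, 3), (1, 3), (2, 3), (3, 3)]

single = total_wait(sample, n_buses=1)
dual = total_wait(sample, n_buses=2)
print("total wait, one bus:  ", single)
print("total wait, two buses:", dual)
```

How much the second bus helps in a model like this depends entirely on the arrival pattern and hold times you feed it, so treat the numbers as an illustration of the routing rule and nothing more.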

DISCLAIMER: None of the above is in any way an advertisement. We are a non-commercial, independent, non-manufacturing, R&D firm. Anyway, these are my own views, not the company's.

-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Daryl V. McDaniel                            Micronetics
USENET: ...tektronix!nosun!illian!darylm     4730 S.W. 182nd Ave.
TELEX:  WUI 6972206                          Aloha, OR 97007
PHONE:  (503) 224-7056