[comp.arch] DMA on RISC-based systems

sandrock@uxe.cso.uiuc.edu (05/25/89)

I am interested in the pros and cons of DMA transfers in RISC systems.
In particular I am interested in the notion that the DECsystem 3100 has
no DMA to its main memory, but instead relies upon the CPU to copy i/o
buffers to/from an auxiliary memory. First, is this statement accurate?
And second, if true, is this a reasonable tradeoff to make on a RISC system?
We are interested in the DECsystem 3100 (allegedly same h/w as DECstation)
versus the MIPS M/120 as far as multiuser (32 simultaneous, say) performance.
The load would likely be a mix of compute-bound and i/o-bound applications,
including possibly Ingres and NFS-serving, along with various chemistry codes.
Also, wrt RISC systems, would we be better off segregating the interactive
usage, i.e., editing, email, etc. from the compute-bound batch-mode jobs, by
running them on two separate systems?  My thinking is that a high rate of
context switching caused by interactive processes would tend to counteract
the advantage of having large instruction and data caches on the machine.
Any pertinent advice is welcome, and I will summarize if need be.

Mark Sandrock

=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
BITNET:   sandrock@uiucscs	    University of Illinois at Urbana-Champaign
Internet: sandrock@b.scs.uiuc.edu   School of Chemical Sciences Computing Serv.
Voice:    217-244-0561		    505 S. Mathews Ave., Urbana, IL  61801  USA
Home of the Fighting Illini of 'Battle to Seattle' fame. NCAA Final Four, 1989.
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=

henry@utzoo.uucp (Henry Spencer) (05/27/89)

In article <46500067@uxe.cso.uiuc.edu> sandrock@uxe.cso.uiuc.edu writes:
>In particular I am interested in the notion that the DECsystem 3100 has
>no DMA to its main memory, but instead relies upon the CPU to copy i/o
>buffers to/from an auxiliary memory. First, is this statement accurate?
>And second, if true, is this a reasonable tradeoff to make on a RISC system?

Not infrequently, a fast, well-designed CPU can copy data faster than all
but the very best DMA peripherals.  The DMA device may still be a net win
if it can use the memory while the CPU is busy elsewhere, giving worthwhile
parallelism, but this depends on how hard the CPU works the memory.  The
bottleneck nowadays is usually memory bandwidth rather than CPU crunch, and
caches aren't a complete solution, so DMA may end up stalling the CPU.
If that happens, it's not clear that DMA is worth the trouble, especially
since it's easier to design memory to serve only one master.

Having the CPU do the copying is not an obviously *un*reasonable idea.
Much depends on the details.

DMA historically was more popular than auxiliary memory because memory was
expensive.  This is no longer true.
-- 
Van Allen, adj: pertaining to  |     Henry Spencer at U of Toronto Zoology
deadly hazards to spaceflight. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

chris@softway.oz (Chris Maltby) (05/29/89)

In article <1989May26.170247.1165@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
> Having the CPU do the copying is not an obviously *un*reasonable idea.
> Much depends on the details.

> DMA historically was more popular than auxiliary memory because memory was
> expensive.  This is no longer true.

Of course, there are many benefits that can be gained by having controllers
with their own buffers. Disk drivers can stop worrying about rotational
placement if the disk controller is providing whole tracks or cylinders
at a time for no extra bus overhead. LAN drivers can avoid copying stuff
like protocol headers etc into and out of main memory.

Generally, the CPU can be a lot smarter about I/O than any brain-damaged
microprocessor controlled device interface.
-- 
Chris Maltby - Softway Pty Ltd	(chris@softway.sw.oz)

PHONE:	+61-2-698-2322		UUCP:		uunet!softway.sw.oz.au!chris
FAX:	+61-2-699-9174		INTERNET:	chris@softway.sw.oz.au

rec@dg.dg.com (Robert Cousins) (05/30/89)

In article <46500067@uxe.cso.uiuc.edu> sandrock@uxe.cso.uiuc.edu writes:
>
>I am interested in the pros and cons of DMA transfers in RISC systems.
>In particular I am interested in the notion that the DECsystem 3100 has
>no DMA to its main memory, but instead relies upon the CPU to copy i/o
>buffers to/from an auxiliary memory. First, is this statement accurate?

Yes, there is no DMA in the traditional form on the PMAX.

>And second, if true, is this a reasonable tradeoff to make on a RISC system?

IMHO, no.  When designing the AViiON 88K-based workstations, we found
that it is possible to provide the increased performance of DMA
at low cost (lower than the 3100's).

There are some basic requirements, IMHO, which must be met to be considered
state-of-the-art:  I/O which does not involve the CPU for moving every
byte, graphics which does not require total CPU dedication for normal
operations such as line drawing or bit blitting, dedicated LAN controllers
to handle the low levels of the LAN protocol and a number of similar 
minimums which most new machines have.  It is interesting, however, to
notice the number of machines which do not meet up with even the most
basic criteria.  

I make this point to begin discussion.  What are some of the minimum
standards which should be applied to these classes of machines and which
machines fail to meet them?

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

rec@dg.dg.com (Robert Cousins) (05/31/89)

In article <1552@softway.oz> chris@softway.oz (Chris Maltby) writes:
>In article <1989May26.170247.1165@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>> Having the CPU do the copying is not an obviously *un*reasonable idea.
>> Much depends on the details.
>> DMA historically was more popular than auxiliary memory because memory was
>> expensive.  This is no longer true.
>
>Of course, there are many benefits that can be gained by having controllers
>with their own buffers. Disk drivers can stop worrying about rotational
>placement if the disk controller is providing whole tracks or cylinders
>at a time for no extra bus overhead. LAN drivers can avoid copying stuff
>like protocol headers etc into and out of main memory.

The same or similar tricks can generally be played using DMA.  However,
there are certain penalties paid for using buffers:

	
1.	Additional latency -- effectively, disk or LAN devices perform
	DMA operations into their own buffers.  After this, the CPU must
	perform a copy into main memory.  Since these peripheral buffers
	are not cached (or if they are, then there is no excuse for not
	putting the data in main memory to begin with), the copy will
	be more expensive.  There are already several versions of Unix
	which directly page programs from disk to user code space.  The
	use of a dedicated buffer will substantially slow this down.
	Future versions of Unix may choose to take advantage of these
	features in greater ways for performance enhancements.  The bottom
	line is that this approach requires an additional copy, which
	can make CPU latency a problem (see the sketch after this list).

2.	Buffer size -- provision of a private buffer for a peripheral
	implies that the driver must now manage the buffer memory.
	Since certain classes of peripherals such as Ethernet can have
	semi-continuous traffic, this management must be timely and
	efficient.  The CPU must be able to drain the buffer in a
	short period of time (which can be a problem under standard
	Unix due to the design of the dispatcher).  The easiest way
	to handle this is to provide a LARGE buffer to store the data.
	So, at this point in time, one must ask oneself:  "Would I rather
	have 4 megabytes of dedicated LAN buffer or 4 megabytes of additional
	main memory?"  Most people would rather have the main memory.

3.	Architectural generality -- There are a variety of cases where
	having the data "beamed down" into main memory is useful though
	not strictly required.  In tightly coupled multiprocessors (TCMPs)
	it is convenient to avoid excessive data movement and to simplify
	the driver to minimize the time in which a particular device's
	code is single-threaded.
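
To make point 1 concrete, here is a minimal C sketch (hypothetical names,
not any real driver's API) of the extra copy that an on-board adapter
buffer forces on the CPU:

#include <stddef.h>

/* Drain an uncached on-board adapter buffer into main memory.  Every
 * byte costs an uncached read over the bus plus a store into the
 * destination page. */
void
drain_adapter_buffer(const volatile unsigned char *adapter_buf,
                     unsigned char *main_mem, size_t len)
{
	while (len-- > 0)
		*main_mem++ = *adapter_buf++;
}

/* With DMA straight into main memory, this loop -- and the latency it
 * adds before the data reaches the page the file cache or the user
 * wanted it in -- simply disappears. */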

The real reason why some machines avoid DMA is CPU braindamage.
Many CPUs are either poorly cached (causing them to demand so much
bus bandwidth that they suffer a major performance loss when
minor peripherals begin to take bus cycles) or have defective architectures
which do not support cache coherency (or at least do not support it
effectively).

Some examples of the first include some of the low end microprocessors
which can take 100% of the CPU bandwidth for extended periods of time.

Some examples of the second include some of the higher end microprocessors
with on-chip caches or cache controllers.  

A number of DMA buffer workarounds have been used over the years.  One
favorite hack is to provide a hole in the cache coverage so that some
areas of memory are not cached.  In one form or another almost every
system provides for this.  Sometimes it is on a page-by-page basis (the
88K, for example).  Others create a dedicated uncached area of memory (MIPS).

>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>microprocessor controlled device interface.

However, just remember that you are throwing MIPS away doing the copying.
I would rather have a $5 DMA controller spending the time than my high
powered CPU.  Sure, it works to use the CPU to do the copying, but when
you realize the amount of time the CPU may be forced to spend because of 
the copy (including extra interrupt service, context switches, polling
loops, cache flushes, etc.), it often turns out that a DMA controller
can provide the user with VERY CHEAP MIPS by freeing up the CPU.  It
is this logic which allows people to avoid using graphics processors
in workstations by saying "the CPU is fast therefore I don't need one."

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

henry@utzoo.uucp (Henry Spencer) (05/31/89)

In article <181@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>There are some basic requirements, IMHO, which must be met to be considered
>state-of-the-art:  I/O which does not involve the CPU for moving every
>byte, graphics which does not require total CPU dedication for normal
>operations such as line drawing or bit blitting, dedicated LAN controllers
>to handle the low levels of the LAN protocol...
>
>I make this point to begin discussion.  What are some of the minimum
>standards which should be applied to these classes of machines and which
>machines fail to meet them?

The three obvious ones are:

1. A serious assessment of what performance in each of these areas is
	necessary to meet the machine's objectives, and what fraction of
	the CPU would be necessary to do so with "dumb" hardware.  It is
	not likely to be cost-effective to add hardware to save 1% of
	the CPU.  10% might be a different story.  50% definitely is.

2. A serious assessment of the overheads of adding smart hardware, like
	the extra memory bandwidth it eats and the software hassles that
	all too often are necessary.

3. A serious assessment of whether the added performance can be had in a
	more versatile and cost-effective way by just souping up the CPU.

Simply saying "we've got to have smart i/o, and smart graphics, and smart
networks" without justifying this with numbers is marketingspeak, not
a sound technical argument.  As RISC processors have shown us, taking things
*out* of the hardware can result in better systems.
-- 
You *can* understand sendmail, |     Henry Spencer at U of Toronto Zoology
but it's not worth it. -Collyer| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

aglew@mcdurb.Urbana.Gould.COM (06/01/89)

>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>microprocessor controlled device interface.

Smarter and faster. One of the big problems with smart I/O is that it is done
using slow microprocessors 1 or 2 generations old. Now, if your smart I/O
cards (1) run the latest, greatest, processors (which requires a big
development commitment) and (2) share software with the "standard" UNIX,
so that you don't throw out software investment when you upgrade I/O cards
- or even move the functionality back to the CPU for some models,
then you may have got something...   Of course, some people prefer symmetric
multiprocessing...

rec@dg.dg.com (Robert Cousins) (06/01/89)

In article <1989May31.163057.543@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <181@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>>There are some basic requirements, IMHO, which must be met to be considered
>>state-of-the-art:  I/O which does not involve the CPU for moving every
>>byte, graphics which does not require total CPU dedication for normal
>>operations such as line drawing or bit blitting, dedicated LAN controllers
>>to handle the low levels of the LAN protocol...
>>
>>I make this point to begin discussion.  What are some of the minimum
>>standards which should be applied to these classes of machines and which
>>machines fail to meet them?
>
>The three obvious ones are:
>
>1. A serious assessment of what performance in each of these areas is
>	necessary to meet the machine's objectives, and what fraction of
>	the CPU would be necessary to do so with "dumb" hardware.  It is
>	not likely to be cost-effective to add hardware to save 1% of
>	the CPU.  10% might be a different story.  50% definitely is.

Your point is well taken.  However, it is clear from reading about some
products on the market that the only penalty considered in the
calculations is the number of bus cycles required to do the copy.  In fact,
many of these copies will take place with interrupts disabled or at some
spl() level which blocks certain interrupts.  Secondly, some copies will
require additional interrupt service, which adds context-switch overhead.
Thirdly, some of the management will require task activation, which can
take a long time under traditional Unix.  When these are considered, the
cost function should be computed as

	       Cost of smarter peripherals
   cost/MIPS = -----------------------------------
               MIPS freed up by better peripherals

If this number comes out lower than your CPU's cost per MIPS, the odds are
that you would be better off using the smarter peripherals.
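
To make the formula concrete (all of these numbers are invented purely
for illustration):  suppose the smarter peripherals add $150 to the
system and free up 0.5 MIPS that the CPU would otherwise burn on copies,
interrupt service and the like.  Then cost/MIPS = 150 / 0.5 = $300 per
MIPS recovered.  If the CPU delivers 4 MIPS for $2000 ($500/MIPS), the
smart peripherals are the cheaper way to buy that half MIPS; if it
delivers 4 MIPS for $1000 ($250/MIPS), they are not.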

>2. A serious assessment of the overheads of adding smart hardware, like
>	the extra memory bandwidth it eats and the software hassles that
>	all too often are necessary.

As a rule, smarter peripherals require less memory bandwidth than dumb ones.
For example, copying data from a dedicated disk buffer to main memory using
software entails not only the instruction bandwidth but also a read and
a write for each word, or two bus cycles per word.  DMA, on the other hand,
should be able to perform the same operation with a single cycle per word.

As for the software hassles, I've written drivers for both smart and dumb
devices.  It is true that some classes of smart devices can be more difficult
to program than their dumb counterparts.  However, in my experience, this is
not the rule but the exception.  In fact, if you graph the software
hassle factor (if one can truly be quantified), my experience shows that
as hardware goes from brain-dead to genius the curve is "U" shaped.  Managing
very stupid hardware can be as difficult as managing the most sophisticated.

>3. A serious assessment of whether the added performance can be had in a
>	more versatile and cost-effective way by just souping up the CPU.

Agreed.  As was pointed out in the equation above, the real issue is 
getting the end user the most bang for the buck.

>Simply saying "we've got to have smart i/o, and smart graphics, and smart
>networks" without justifying this with numbers is marketingspeak, not
>a sound technical argument.  As RISC processors have shown us, taking things
>*out* of the hardware can result in better systems.

I disagree.  There are certain requirements for a product to be considered
useful.  It is possible to design 500 MHz Z80s.  There are also a number
of users who would find this an attractive product, though many people
would mutter that this is a waste of technology.  There are certain things
which users have a right to demand:  quality software, state-of-the-art
hardware, reasonable performance for the price, dependable support.
I would suggest that provision of non-braindead peripherals is in this
class almost (but not quite) by default.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

rcd@ico.ISC.COM (Dick Dunn) (06/02/89)

> There are some basic requirements, IMHO, which must be met to be considered
> state-of-the-art:  I/O which does not involve the CPU for moving every
> byte, graphics which does not require total CPU dedication for normal
> operations such as line drawing or bit blitting, dedicated LAN controllers
> to handle the low levels of the LAN protocol and a number of similar 
> minimums which most new machines have...

Why are these requirements?  At some point in the past, for some set of
architectural constraints, smart DMA was an improvement over having the CPU
move the data.  *However*, remember that smart DMA was not the goal!  The
goal was to speed up I/O, and you can only substitute the goal of smart DMA
if the numbers work right--that is, if you get enough performance gain to
justify the cost of the DMA controller, dual-porting the memory (or putting
it on the bus and making the bus fast enough), etc.

Similar arguments go for the other two putative requirements.  For example,
if it's going to make sense to have a separate bit-blitter, you have to be
able to set it up quickly.  If the setup time is longer than the time it
would take the CPU to draw the typical line or blt the typical bits, you
haven't gained anything.  Even if the setup is fast, you have to be able to
do something useful with the CPU--which is likely to mean that you have to
be able to do a blazingly fast context switch to another process while the
drawing is going on.

Folks working on networking here have found that it tends to be easier and
faster to run "host-based" TCP than to deal with "smart" boards.

>...It is interesting, however, to
> notice the number of machines which do not meet up with even the most
> basic criteria.  

But the criteria you've given are artificial...they come not from the
direct goals (such as performance in a particular application) but from
derived goals based on certain assumptions of how to increase performance.
I suggest that the problem is *not* that these machines are deficient, but
that the assumptions are wrong.

> I make this point to begin discussion.  What are some of the minimum
> standards which should be applied to these classes of machines and which
> machines fail to meet them?

I suggest that we proceed with the discussion by looking at the machines as
black boxen for a bit--establish the standards based on WHAT you want the
machine to do, not HOW it gets it done.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...CAUTION:  I get mean when my blood-capsaicin level gets low.

peter@ficc.uu.net (Peter da Silva) (06/02/89)

In article <15809@vail.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
> if it's going to make sense to have a separate bit-blitter, you have to be
> able to set it up quickly. ... Even if the setup is fast, you have to be
> able to do something useful with the CPU--which is likely to mean that you
> have to be able to do a blazingly fast context switch...

Which is not out of the question. There is plenty of room for improvement
in this department in most operating systems (understatement of the century).
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

andrew@frip.WV.TEK.COM (Andrew Klossner) (06/03/89)

[]

	"basic requirements ... which must be met to be considered
	state-of-the-art ... [include] dedicated LAN controllers to
	handle the low levels of the LAN protocol ..."

This can backfire on you.  I've seen more than one example of a very
smart LAN interface board which actually slowed down system throughput,
because its Chevy on-board processor couldn't do nearly as fast a job
as the Formula 1 dragcar that was the main CPU, and the single active
process was blocked waiting for LAN completion.

Ranging a bit, I've also dealt with systems with a fast main CPU, a
SCSI channel, and a wimpy Z8 in the on-disk SCSI controller.  Yep, the
Z8 was the overall system bottleneck -- lots of time wasted while it
slooooowly processed all the messages that SCSI bus master and slave
must exchange.

If you're going to buy into off-CPU agents to move I/O around, make
sure that those agents will improve as fast as the CPU, or your future
generation machines will be crippled.

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

seanf@sco.COM (Sean Fagan) (06/03/89)

In article <28200325@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>>microprocessor controlled device interface.
>Smarter and faster. One of the big problems with smart I/O is that it is done
>using slow microprocessors 1 or 2 generations old. Now, if your smart I/O
>cards (1) run the latest, greatest, processors (which requires a big
>development commitment) and (2) share software with the "standard" UNIX,
>so that you don't throw out software investment when you upgrade I/O cards
>- or even move the functionality back to the CPU for some models,
>then you may have got something...   Of course, some people prefer symmetric
>multiprocessing...

Well, more than 20 years ago, a machine was built which had smart I/O
processors.  Just for the sake of fun, let's call the central processor a
"CP," and the I/O processors "PP"'s.  Fun, huh?  Now, the "CP" was 60-bits,
had something like 70-odd instructions, and was a load-store/3-address
design.  The "PP"'s were 12-bit, accumulator based machines, also with a
small instruction set.  With each "CP" you got at least 10 "PP"'s.
Incidentally, the "PP"'s were a barrel processor:  each set of 10 had only 1
ALU.

This machine *screamed*.  It had, for the time, an incredibly fast processor
(the "CP"), which, even today, will outperform things like Elxsi's.  With
the "PP"'s, it even had I/O that causes it to outperform most of today's
mainframes, at a fraction of the price.  True, it didn't run UNIX(tm)
(although I did do a paper design of what it would take), but, if you need
the speed, it doesn't always matter, does it?

For those of you who haven't guessed, the machine was the CDC Cyber,
designed (chiefly) by Seymour Cray (God).  The machine I played on mostly
was a Cyber 170/760, which was estimated at about 10 MIPS or so, and could
support hundreds of people, all doing "real" work (database, editing,
compiling, etc.).  As I, and others, keep trying to say, MIPS are fine, but
can it do I/O?

-- 
Sean Eric Fagan  |    "[Space] is not for the timid."
seanf@sco.UUCP   |             -- Q (John deLancie), "Star Trek: TNG"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

sauer@dell.dell.com (Charlie Sauer) (06/03/89)

In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>Well, more than 20 years ago, a machine was built which had smart I/O ...
>...
>For those of you who haven't guessed, the machine was the CDC Cyber,
>designed (chiefly) by Seymour Cray (God).  The machine I played on mostly
>was a Cyber 170/760, ...

Until I got to the punch line, I was sure you were going to say "CDC 6600,"
which was the first of that series of machines.  I'm at home and can't lay
my hands on Thornton's book ("Design of the CDC 6600" I think was the title),
and I didn't actually use a 6600 until 1970, but it seems like you would have
to say 6600, or at least 7600, to make the "more than 20 years" accurate.
-- 
Charlie Sauer  Dell Computer Corp.     !'s:cs.utexas.edu!dell!sauer
               9505 Arboretum Blvd     @'s:sauer@dell.com
               Austin, TX 78759-7299   
               (512) 343-3310

brooks@vette.llnl.gov (Eugene Brooks) (06/04/89)

In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>because its Chevy on-board processor couldn't do nearly as fast a job
>as the Formula 1 dragcar that was the main CPU, and the single active
For the record, a CHEVY won INDY this year!
That meager '71 Vette of mine, with truly OBSOLETE CHEVY power, is rarely
challenged on the roads of California either.


brooks@maddog.llnl.gov, brooks@maddog.uucp

seanf@sco.COM (Sean Fagan) (06/05/89)

In article <1429@dell.dell.com> sauer@dell.UUCP (Charlie Sauer, ) writes:
>In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>>Well, more than 20 years ago, a machine was built which had smart I/O ...
>>...
>>For those of you who haven't guessed, the machine was the CDC Cyber,
>>designed (chiefly) by Seymour Cray (God).  The machine I played on mostly
>>was a Cyber 170/760, ...
>
>Until I got to the punch line, I was sure you were going to say "CDC 6600,"
>which was the first of that series of machines.
>but it seems like you would have
>to say 6600, or at least 7600, to make the "more than 20 years" accurate.

For all intents and purposes, they're the same machine.  However, the 760 is
*much* faster than the 6600 (the 760 is the second fastest 170 machine; the 
fastest being one that has 2 processors and an extra 3 bits of addressing 
[for the OS, not user]).

They have the same architecture, but I never played on a 6600, so I had to
use what I knew.  However, what I wrote is still true for the 6600.

Also, note that I said "the machine was the CDC Cyber," but that the model 
"I played on mostly" was the 760...

-- 
Sean Eric Fagan  |    "[Space] is not for the timid."
seanf@sco.UUCP   |             -- Q (John deLancie), "Star Trek: TNG"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

snoopy@sopwith.UUCP (Snoopy) (06/05/89)

In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:

|For those of you who haven't guessed, the machine was the CDC Cyber,
|designed (chiefly) by Seymour Cray (God).  The machine I played on mostly
|was a Cyber 170/760, which was estimated at about 10 MIPS or so, and could
|support hundreds of people, all doing "real" work (database, editing,
|compiling, etc.).  As I, and others, keep trying to say, MIPS are fine, but
|can it do I/O?

I remember an 11/70 doing 1/3 the number of jobs of a pair of CDC 6500s.
It supported ~50 users with *very* fast response, faster baud rate terminals
(19.2k vs 300/1200), upper/lower case vs. the 6500's upper-case only, and
of course Unix.  To this day, I haven't used a machine with multi-user
response that's even close.

It's not what you have, it's what you do with it.

    _____     						  .-----.
   /_____\    Snoopy					./  RIP	 \.
  /_______\   qiclab!sopwith!snoopy			|  	  |
    |___|     parsely!sopwith!snoopy			| tekecs  |
    |___|     sun!nosun!illian!sopwith!snoopy		|_________|

		"I *was* the next man!"  -Indy

rec@dg.dg.com (Robert Cousins) (06/05/89)

In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>[]

>	"basic requirements ... which must be met to be considered
>	state-of-the-art ... [include] dedicated LAN controllers to
>	handle the low levels of the LAN protocol ..."

>If you're going to buy into off-CPU agents to move I/O around, make
>sure that those agents will improve as fast as the CPU, or your future
>generation machines will be crippled.

I agree; however, I fear that some people misunderstood my point concerning
the "RDA of hardware support."  There are a number of ways to produce
brain-damaged hardware.  For example, Seeq makes an Ethernet controller chip
which requires external DMA support.  If a dumb DMA channel (or no DMA) is
used, the lowest levels of software will end up being exceptionally complex,
since all of the buffer management and scatter/gather will be in software.
There is also the danger of dropping packets on the floor, which has nasty
implications for performance. :-)    If, however, some slightly more
reasonable DMA is supplied (similar to the LANCE's, or the Intel chip's),
the software complexity drops substantially.

While I never intended my comments to imply INTELLIGENT control, it is
worthwhile to add it to the discussion.  At DG, our experience is that
it is possible to provide DMA services at prices below competitive non-DMA
products.  Does this mean that the DMA products run faster than the non-DMA
ones?  Often the peripherals are the limiting factor.  However, the following
analysis may be enlightening:

Scenario one:  Programmed I/O.

Given that a disk channel will be averaging 200K bytes/second in 
4K byte bursts 20 milliseconds apart using a 1 megabyte/second SCSI
channel, the time required to transfer the data will be SCSI limited 
(given a CPU of > ~3 MIPS).  However, since each byte takes 1 microsecond,
the CPU will be forced to be dedicated to the SCSI channel for 4 milliseconds
each tenure, 50 times per second for a total of 200 milliseconds each
second.  This has cost the user 20% of the available computing power.

Scenario two:  Small dedicated buffer.

The buffer is 4K bytes long so the processor is no longer required
to make as timely response as above.  The real issue is now the 
copy time, of which there is two components:  transfer time and
context overhead.  The transfer time will be limited by the memory/
cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
half of the transfer will involve a bus cycle in all cases.  Given a minor
penalty of 4 or 5 instruction periods for this half and assuming a cache
hit on the other side always, the code will look something like this:

		ld	r1,4096
		ld	r2,bufferaddress
		ld	r3,destaddress
	loop:	ldb	r4,(r2)		/	byte load = 4 clocks for miss
		ldb	(r3),r4		/	byte store = 1 for cache hit
		addi	r2,1		/	1
		addi	r3,1		/	1
		addi	r1,-1		/	1
		bnz	loop		/	1 (code could be reorg'd)

	Total clocks required: 9*4096=36864 per block
	* 50 blocks/ second = 1843200 clocks

Given a CPU speed of 20 MHz, this translates into 9% of the CPU time.

If the CPU is required to perform the copy during an interrupt service,
there is the danger that lower priority interrupts may be lost.  If the
copy takes place in the top half of the driver, then task latency becomes
an issue.  The buffer will not be drained until after the task wakes up
and completes the copy.  On some Unix implementations, the task wakeup
time can be quite long -- enough to impact total throughput.


Scenario three:  Stupid DMA.

Here, the CPU just sets up the DMA and awaits completion.  The overhead
is approximately 0 compared to the above examples.  

Where does the DMA pay off given that all three examples have approximately
identical throughput?  

DMA is preferable to the first choice whenever the cost of DMA is less
than 20% of the cost of the CPU or less than the cost of speeding up
the CPU by 20%.

DMA is preferable to the second choice whenever the cost of DMA is
less than 9% of the cost of the CPU or less than the cost of speeding
up the CPU by 9%.

I am the first to admit that these models are simplistic, but they
do represent valid considerations and reasonable approximations
to the actual solutions.
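
For anyone who wants to twiddle the assumptions, here is a throwaway C
sketch of the arithmetic above.  The constants are just the numbers
assumed in the scenarios; change them to match your own hardware:

#include <stdio.h>

int
main(void)
{
	double bytes_per_sec = 4096.0 * 50.0;	/* 200K bytes/s in 4K bursts */
	double pio_us_per_byte = 1.0;		/* 1 MB/s SCSI, a byte at a time */
	double copy_clocks_per_byte = 9.0;	/* the loop in Scenario two */
	double cpu_hz = 20e6;			/* 20 MHz CPU */

	printf("programmed I/O: %.1f%% of the CPU\n",
	       bytes_per_sec * pio_us_per_byte / 1e6 * 100.0);
	printf("buffer copy:    %.1f%% of the CPU\n",
	       bytes_per_sec * copy_clocks_per_byte / cpu_hz * 100.0);
	printf("dumb DMA:       ~0%% (setup cost ignored)\n");
	return 0;
}

It prints roughly 20%, 9% and 0%, matching the three scenarios.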

Comments?

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.

sauer@dell.dell.com (Charlie Sauer) (06/05/89)

In article <2822@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>For all intents and purposes, they're the same machine.  However, the 760 is
>*much* faster than the 6600 (the 760 is the second fastest 170 machine; the 
>fastest being one that has 2 processors and an extra 3 bits of addressing 
>[for the OS, not user]).

I know I was being picky, but since I'm stuck in that vein, let me offer a
slightly more substantive quibble, slightly more relevant to the original
topic reflected in the subject line:  Isn't it true that the 6600 and the
7600 differed in that the PP's were all peers in the 6600 but the 7600 had
one PP that had authority over the others?
-- 
Charlie Sauer  Dell Computer Corp.     !'s:cs.utexas.edu!dell!sauer
               9505 Arboretum Blvd     @'s:sauer@dell.com
               Austin, TX 78759-7299   
               (512) 343-3310

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/06/89)

In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>In article <28200325@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:

>>Smarter and faster. One of the big problems with smart I/O is that it is done
>>using slow microprocessors 1 or 2 generations old. Now, if your smart I/O

>Well, more than 20 years ago, a machine was built which had smart I/O
>processors.  Just for the sake of fun, let's call the central processor a
>"CP," and the I/O processors "PP"'s.  Fun, huh?  Now, the "CP" was 60-bits,
>had something like 70-odd instructions, and was a load-store/3-address
>This machine *screamed*.  It had, for the time, an incredibly fast processor


>(the "CP"), which, even today, will outperform things like Elxsi's.  With

>For those of you who haven't guessed, the machine was the CDC Cyber,

There were two rather distinct flavors of Cybers.  The main distinguishing
feature was the kind of peripheral processors the machine had. (A
hybrid, the Cyber 176, could accept both kinds.)

The PP's in the 7600 ("upper Cyber") could *not* write to arbitrary
memory, but only to dedicated memory locations, just like, in
effect, the on-board buffers previously referred to in some postings.  These
PP's would interrupt the CPU *every few hundred words* of I/O to *copy* the
data from one group of memory locations to another.  Strangely enough, this
gave the ~20 VAX MIPS (please, let's argue about the exact rating off-line)
7600 (which was about 2X the similar "lower Cyber" 760) the fastest I/O
around of any commercial machine for many years.  So, is DMA a "good idea"? -
it depends.  Overall, you may get more performance for your dollar that way.  

The PP's on the lower Cybers (6600-Cyber 760) could write to any memory
location, and could make your coffee for you in the morning too.  The NOS
operating system had major pieces in the PP's, and PP saturation was the
usual bottleneck, rather than CPU saturation.  You could keep about twice as
many people happy per CPU "MIP" (whatever that is) on the Cybers, because of
all the hidden MIPS in the PP's, compared with, say, a VAX-11/7xx or similar
machines. 

So, within the Cyber line, you had two examples of both extremes:

A machine where the CPU had to copy data (7600), and, machines
where not only smart I/O, but other major pieces of the system ran in the PP's.
Both architectures worked successfully, especially with respect to performance,
and only came to grief after many years over the weird word size
and lack of virtual memory and memory address space.



  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

martyi@sun.Eng.Sun.COM (Marty Itzkowitz) (06/06/89)

The 6600 and 7600 differed in a number of respects, one of which was
the PPU architecture.  On the 6600 all of the PPs were equivalent,
although the OS ( at least the version developed at LBL) treated
them differently.  Every PPU could access every channel, and could
read and write anywhere in central memory, and could exchange-jump
(context switch) the CPU.  On later, 20 PPU versions, each
set of 10 PPUs shared a set of channels, and I don't believe
one could talk to the other set's channels.  The CPU on the 6600
could NOT do its own context switches, and system calls were handled
by placing the request in a known location relative
to the process (job, task) address space (word 1, actually).
The monitor PPU, so designated by software, checked these words,
and then assigned one of the other PPUs to process a request.
Later versions of the machine did have a central processor exchange
jump instruction.  A two CRT display, with refresh done entirely
in SW, was managed by one of the PPUs.

On the 7600, there were several types of PPUs.  PPU zero,
also known as the MCU, or maintenance and control unit, could
read and write anywhere in central (small core) memory, and could
send stuff on channels to the other PPUs.  It could also do
(force) an exchange jump in the CPU.  The other PPUs came in
either high or low-speed versions.  The high-speed ones worked in pairs,
and shared a common external channel to a disk, for example,
and a single channel to central memory.  Each pair's channel went to
a specific hard-wired buffer in central memory, and generated a CPU
interrupt (exchange jump) whenever the buffer was half full/empty,
or when it executed a specific instruction to do so.  The CPU
managed copying the data out of the hard-wired buffer, typically
into large core memory, since that was the fastest path, and telling
the PPUs when it was OK to send more data.  The CPU could reset its
buffer to a pair of PPUs and generate an interrupt to them.
A high speed buffer was 400 (octal) words, and a disk sector was
1000 (octal) words.  The CPU got 4 interrupts for each disk sector.
High-speed PPUs also had channels connecting the pair, so that, with
much cleverness, one could actually stream data at close to 40 Mb/s
from disk, with one PPU reading the disk, and the other dumping a
previously read sector to CM.  On the 819 disks, one had about
8 microseconds between sectors to avoid missing revs, requiring
a hand-off between the PPUs of the pair, and the CPU.  On LBL's
system, we could do it in time.

Slow PPUs, such as needed for a hyperchannel, worked as individuals,
and had a half-size buffer, again with handshaking between PPU and
CPU at the half-way mark.  The CPU on the 7600 did have an exchange
jump instruction.  Non-privleged tasks could only exchnage to an
address given in its XJ package (some 16 60-bit words);  on error,
the exchange went to a second address.  (Normal Exch. Addr and
Error Exch. Addr, respectively).  Privileged tasks, i.e., the OS,
could exchange to anywhere.  A context switch took 28 clocks
of 27.5 nanosec. each, counting from the time the XJ instruction
was the next to issue to the time the first instruction from the
new context could issue.  For scalar arithmetic, the 7600 was
the fastest machine in the world until the Cray-2.


	Marty Itzkowitz,
		Sun Microsystems

chris@softway.oz (Chris Maltby) (06/06/89)

In article <182@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
> However, just remember that you are throwing MIPS away doing the copying.
> I would rather have a $5 DMA controller spending the time than my high
> powered CPU.  Sure, it works to use the CPU to do the copying, but when
> you realize the amount of time the CPU may be forced to spend because of 
> the copy (including extra interrupt service, context switches, polling
> loops, cache flushes, etc.), it often turns out that a DMA controller
> can provide the user with VERY CHEAP MIPS by freeing up the CPU.  It
> is this logic which allows people to avoid using graphics processors
> in workstations by saying "the CPU is fast therefore I don't need one."

Without rejecting anything you said, let me point out that the opposite
logic can also apply though. Why install special purpose I/O intelligence
if you can only use it for I/O?  A general-purpose (extra, perhaps) CPU
can do all that I/O nonsense as well as other good things.  I guess it
all depends on what you want the machine to do best. Select the
criteria - then design the machine.

At this point we should adopt Mr Mashey's approach... measure, then
draw conclusions from actual data.
-- 
Chris Maltby - Softway Pty Ltd	(chris@softway.sw.oz)

PHONE:	+61-2-698-2322		UUCP:		uunet!softway.sw.oz.au!chris
FAX:	+61-2-699-9174		INTERNET:	chris@softway.sw.oz.au

chris@mimsy.UUCP (Chris Torek) (06/07/89)

In article <185@dg.dg.com> rec@dg.dg.com (Robert Cousins) writes:
>Scenario two:  Small dedicated buffer.

>The buffer is 4K bytes long so the processor is no longer required
>to make as timely response as above.  The real issue is now the 
>copy time, of which there are two components:  transfer time and
>context overhead.  The transfer time will be limited by the memory/
>cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
>half of the transfer will involve a bus cycle in all cases.  Given a minor
>penalty of 4 or 5 instruction periods for this half and assuming a cache
>hit on the other side always, the code will look something like this:
>
>		ld	r1,4096
>		ld	r2,bufferaddress
>		ld	r3,destaddress
>	loop:	ldb	r4,(r2)		/	byte load = 4 clocks for miss
>		ldb	(r3),r4		/	byte store = 1 for cache hit
>		addi	r2,1		/	1
>		addi	r3,1		/	1
>		addi	r1,-1		/	1
>		bnz	loop		/	1 (code could be reorg'd)
>
>	Total clocks required: 9*4096=36864 per block
>	* 50 blocks/ second = 1843200 clocks

This is not an unreasonable approach to analysing the time required for
the copies, but the code itself *is* unreasonable---it is more likely to
be something like

		ld	r1,4096/4
		lea	r2,dual_port_mem_addr
		lea	r3,dest_addr
	loop:	ld	r4,(r2)		/	4-byte load ...
		ld	(r3),r4		/	4-byte store
		addi	r2,4
		addi	r3,4
		addi	r1,-1
		bnz	loop

which is four times faster than your version.  Still, 50 blocks/second is
much too slow, especially if the blocks are only 4 KB; modern cheap SCSI
disks deliver between 600 KB/s and 1 MB/s.  With 8 KB blocks, we should
expect to see between 75 and 125 blocks per second.
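
For the record, the word-at-a-time drain is just the obvious loop in C as
well.  This is only a sketch -- it assumes 32-bit words, aligned buffers
whose length is a multiple of 4, and made-up names:

#include <stddef.h>
#include <stdint.h>

/* Copy out of a dual-ported adapter buffer one 32-bit word at a time:
 * one uncached read and one (probably cached) write per word instead
 * of per byte. */
void
drain32(const volatile uint32_t *dual_port_mem, uint32_t *dest,
        size_t nbytes)
{
	size_t nwords = nbytes / sizeof(uint32_t);

	while (nwords-- > 0)
		*dest++ = *dual_port_mem++;
}

Unrolling the loop, or using a library bcopy(), can squeeze out a bit more.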

So we might change your 9% estimate to 4.5% (copy four times as fast,
but twice as often).  Nevertheless:

>Scenario three:  Stupid DMA.
>
>Here, the CPU just sets up the DMA and awaits completion.  The overhead
>is approximately 0 compared to the above examples.  

The overhead here is not zero.  It has been hidden.  The overhead lies in
the fact that dual ported main memory is expensive, so either the DMA
steals cycles that might be used by the CPU (and it can easily take about
half the cycles needed to do the copy in Scenario two), or the main
memory costs more and/or is slower.
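
(Roughly, and using the per-word reasoning from earlier in the thread:  a
software copy costs a read plus a write per word -- two memory cycles --
where DMA needs only one, so the cycles the DMA steals come to about half
of what the copy itself would have spent on the memory bus.)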

>Where does the DMA pay off given that all three examples have approximately
>identical throughput? ...
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.

You have converted `% of available cycles' to `% of cost' (in the first
half of the latter statement) and assumed a continuous range of price/
performance in both halves, neither of which is true.

(I happen to like DMA myself, actually.  But it does take more parts,
and those do cost....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

jhood@biar.UUCP (John Hood) (06/07/89)

In article <185@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>Where does the DMA pay off given that all three examples have approximately
>identical throughput?  
>
>DMA is preferable to the first choice whenever the cost of DMA is less
>than 20% of the cost of the CPU or less than the cost of speeding up
>the CPU by 20%.
>
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.
>
>I am the first to admit that these models are simplistic, but they
>do represent valid considerations and reasonable approximations
>to the actual solutions.
>
>Comments?

Robert has also ignored the cost of setting up the DMA controller,
which can be significant, especially for scatter/gather type
operation.

Also note that with modern operating systems that do buffering or disk
caching, there is going to be a bcopy or its moral equivalent in there
somewhere.  This doesn't reduce CPU time used during programmed I/O,
but it does change the trade off from, say, 3 vs 10% to 13 vs 23% of
CPU availability used for disk I/O.  This makes the nature of the
trade off different.

My other thought is that regardless of the CPU cost, programmed I/O is
often acceptable on single-user machines anyway.  The situation often
arises where the user is only concerned about the speed of the one
process he's using interactively.  That process will usually have to
wait till its data arrives anyway, at least under current programming
models.  If the CPU has to sit and wait, it might as well do the data
movement too.

On the other hand, effective multi-channel DMA can be used to have
several things going at once-- a bunch of disk drives, or as in the
Macintosh and NeXT machines, sound in parallel with other stuff.

I'm not about to make any pontifications about what I think is the
best solution for the future, because I don't know myself ;-)

  --jh-- 
John Hood, Biar Games snail: 10 Spruce Lane, Ithaca NY 14850 BBS: 607 257 3423
domain: jhood@biar.uu.net bang: anywhere!uunet!biar!jhood
"Insanity is a word people use to describe other people's lifestyles.
There ain't no such thing as sanity."-- Mike McQuay, _Nexus_

slackey@bbn.com (Stan Lackey) (06/07/89)

In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>Also note that with modern operating systems that do buffering or disk
>caching, there is going to be a bcopy or its moral equivalent in there
>somewhere.  

1) Is it possible, if not now but possibly in the future, for programmed
   I/O to _eliminate_ some of the 'bcopy's?

2) This discussion brings to mind one that went around some time ago, 
   which was, is it better to supply a bunch of specialized processors
   (then bitblt's, now including DMA controllers), or a bunch of identical
   processors connected together?  Theory was, when the bitblt and DMA are
   done, the other processor(s) can be applied to a compute bound task.
   It seems to me this might make an interesting product; price/perf
   range is varied by the number of [identical] processors, and all I/O
   hardware is very very dumb.

-Stan

stein@pixelpump.osc.edu (Rick 'Transputer' Stein) (06/07/89)

In article <41042@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>2) This discussion brings to mind one that went around some time ago, 
>   which was, is it better to supply a bunch of specialized processors
>   (then bitblt's, now including DMA controllers), or a bunch of identical
>   processors connected together?  Theory was, when the bitblt and DMA are
>   done, the other processor(s) can be applied to a compute bound task.
>   It seems to me this might make an interesting product; price/perf
>   range is varied by the number of [identical] processors, and all I/O
>   hardware is very very dumb.
>
>-Stan

This sure sounds like a physically objective parallel i/o mechanism.
A processor controlling some portion of the i/o stream as mapped to
a specific device.  Sounds like a job for Transputer Man! :-).
-=-
Richard M. Stein (aka Rick 'Transputer' Stein)
Concurrent Software Specialist @ The Ohio Supercomputer Center
Ghettoblaster vacuum cleaner architect and Trollius semi-guru
Internet: stein@pixelpump.osc.edu, Ma Bell Net: 614-292-4122

rik@june.cs.washington.edu (Rik Littlefield) (06/08/89)

In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
< Given that a disk channel will be averaging 200K bytes/second ...

< [comparison of programmed I/O vs small dedicated buffer vs stupid DMA,
<  evaluated against cpu cost and speed]

< I am the first to admit that these models are simplistic, but they
< do represent valid considerations and reasonable approximations
< to the actual solutions.
< 
< Comments?

The methodology seems sound, but I question the numbers.  Just guessing,
but I suspect that workstation class systems have an *average* disk
throughput that is at least 10X lower than this number, even when
they are working full out.  (Remember that 200K bytes/second is
720 Mbytes/hour.)  If so, then the value of DMA is also 10X lower.  

Would someone with real utilization numbers care to fill us in?

--Rik

jonasn@ttds.UUCP (Jonas Nygren) (06/08/89)

In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< Given that a disk channel will be averaging 200K bytes/second ...
<deleted>
>< Comments?
>
>The methodology seems sound, but I question the numbers.  Just guessing,
>but I suspect that workstation class systems have an *average* disk
>throughput that is at least 10X lower than this number, even when
>they are working full out.  (Remember that 200K bytes/second is
>720 Mbytes/hour.)  If so, then the value of DMA is also 10X lower.  
>
>Would someone with real utilization numbers care to fill us in?
>
>--Rik

I have performed a small test on a DECstation 3100 with an RZ55 (230 MB) disk.
The test used 15 processes reading/writing 2 MB files each, with the following
results:

Write 15x2 MB: 113 s, 265 KB/s
Read 15x2 MB:  117 s, 256 KB/s
Read 15x2 + write 15x2 MB (new files, in parallel): 281 s, 213 KB/s

Mean value: 234 KB/s
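
For anyone who wants to repeat this, a rough C sketch of such a test --
not the program actually used above; the file names, block size and
timing method are my own guesses -- might look like:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <sys/time.h>

#define NPROC	15
#define FILE_MB	2
#define BUFSZ	8192

/* Each child writes, reads back and removes one FILE_MB megabyte file.
 * Error checking is omitted for brevity. */
static void
worker(int id)
{
	char name[64], buf[BUFSZ];
	int fd, i, blocks = FILE_MB * 1024 * 1024 / BUFSZ;

	sprintf(name, "testfile.%d", id);
	memset(buf, 'x', sizeof(buf));

	fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	for (i = 0; i < blocks; i++)
		write(fd, buf, sizeof(buf));
	close(fd);

	fd = open(name, O_RDONLY);
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
	unlink(name);
	_exit(0);
}

int
main(void)
{
	struct timeval t0, t1;
	double secs, mbytes = 2.0 * NPROC * FILE_MB;	/* written + read */
	int i;

	gettimeofday(&t0, NULL);
	for (i = 0; i < NPROC; i++)
		if (fork() == 0)
			worker(i);
	while (wait(NULL) > 0)
		;
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%d procs, %.0f MB total: %.0f s, %.0f KB/s\n",
	       NPROC, mbytes, secs, mbytes * 1024.0 / secs);
	return 0;
}

Note that a test like this goes through the buffer cache, so it measures
the whole I/O path (including the copies this thread is arguing about),
not just the raw disk.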

/jonas

rec@dg.dg.com (Robert Cousins) (06/08/89)

In article <17925@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <185@dg.dg.com> I write:
>>Scenario two:  Small dedicated buffer.
>
>>The buffer is 4K bytes long so the processor is no longer required
>>to make as timely response as above.  The real issue is now the 
>>copy time, of which there are two components:  transfer time and
>>context overhead.  The transfer time will be limited by the memory/
>>cache/CPU bottleneck.  Since the buffer is not cacheable (by implication),
>>half of the transfer will involve a bus cycle in all cases.  
>> 	[ code fragment using byte loads and stores excerpted ]
>>	Total clocks required: 9*4096=36864 per block
>>	* 50 blocks/ second = 1843200 clocks
>
>This is not an unreasonable approach to analysing the time required for
>the copies, but the code itself *is* unreasonable---it is more likely to
>be something like
>	[ code excerpted -- uses 4 byte loads and stores ]
>
>which is four times faster than your version.  

Actually, you have created a scenario 2.5.  I was making the assumption
that cost was a driving factor here, which rules out the use of true
two-ported RAMs and 32-bit-wide data paths.  The increase in peripheral
complexity is substantial (there aren't many 32-bit peripherals yet, but
there will be soon! :-)), along with the cost of RAM.

However, this scenario should be treated as reasonably as the rest.  
The equation of reference is:

	CPU Cost + IO Scheme cost
	------------------------- = $/deliverable compute unit
	CPU speed - IO overhead

I use percentages simply to avoid arguments about what reasonable 
units are.  For your suggestion to be true, the following inequality 
must hold:

	CPU Cost + 32-bit Buffer Cost	   CPU Cost + DMA Cost
	----------------------------- < ------------------------
	CPU Speed - Buffer Overhead     CPU Speed - DMA Overhead

Which is approximately equal to (when converting speed to percent):

	CPU Cost + 32-bit Buffer Cost	   CPU Cost + DMA Cost
	----------------------------- < ------------------------
		 95.5%     		  	~100%

or
	1 * (CPU cost + 32-bit Buffer Cost) < .955 * (CPU Cost + DMA Cost)
or
	.045 * CPU cost + 32-bit buffer cost < .955 DMA cost

which is clearly dominated by the CPU cost.  If the CPU cost is
simply $100, the CPU term on the left is only $4.50, so DMA wins whenever
it costs less than about (4.50 + buffer cost)/.955 -- that is, less than
roughly $5 more than the buffer.

>Still, 50 blocks/second is
>much too slow, especially if the blocks are only 4 KB; modern cheap SCSI
>disks deliver between 600 KB/s and 1 MB/s.  With 8 KB blocks, we should
>expect to see between 75 and 125 blocks per second.

The purpose of the 50 blocks assumption was to estimate average CPU demand
for support of I/O, not for peak situations.  Relatively few machines of
the low-end class will be used at 1 MB/s continuously.

>So we might change your 9% estimate to 4.5% (copy four times as fast,
>but twice as often).  Nevertheless:

>>Scenario three:  Stupid DMA.

>>Here, the CPU just sets up the DMA and awaits completion.  The overhead
>>is approximately 0 compared to the above examples.  

>The overhead here is not zero.  It has been hidden.  The overhead lies in
>the fact that dual ported main memory is expensive, so either the DMA
>steals cycles that might be used by the CPU (and it can easily take about
>half the cycles needed to do the copy in Scenario two), or the main
>memory costs more and/or is slower.

Almost any product we are talking about will have a Cache (or two) with
a reasonable hit rate which will allow DMA activity to take place with
little or no performance impact.  In fact, the major reason for speeding
up RAM is to improve processor performance for cache line loads, not for
improved peripheral performance.  

Anyway, few busses in the machines of this class have useable memory
bandwidths less than 25 megabytes/second sustainable indefinitely.  If
the CPU is hogging 90% of this, there is still 2.5 megabytes per second
available for I/O.  This adds up to a continuously active ethernet (1.25
MB/s) along with healthy disk bandwidth (1.25 megabytes/second).  Since
both of these are bursty, in reality, there is a greater amount of
instantaneously available bandwidth.

In an earlier life, designing a 64 processor 80386 machine (there is a 
working prototype somewhere but the company is no more :-(), I hit upon the
idea of predicting when a CPU will need bus cycles and using cycles
which were predicted not to be needed so that they could be used for I/O.  
On an 80386, it is possible to 100% predict bus cycle requirements with 
a small amount of logic by cheating.  My calculations showed that a 
16 Mhz 80386 would leave almost 10 megabytes per second of bandwidth 
unused which this method could tap for non-time critical I/O operations 
such as SCSI.  Time critical peripherals would have to take CPU cycles
if "free" cycles were not available within their time frame which would
not be very often.

>>Where does the DMA pay off given that all three examples have approximately
>>identical throughput? ...
>>DMA is preferable to the second choice whenever the cost of DMA is
>>less than 9% of the cost of the CPU or less than the cost of speeding
>>up the CPU by 9%.

>You have converted `% of available cycles' to `% of cost' (in the first
>half of the latter statement) and assumed a continuous range of price/
>performance in both halves, neither of which is true.

Actually, the true measure of a machine is the amount of work that it 
can do for an end user divided by the cost.  The user must define the
measure of work.  Since I'm not able to define what the user will use to
measure the machine, I must substitute a rough approximation -- deliverable
CPU power in the form of MIPS, Dhrystones, or whatever.  This value is
directly tailorable by a number of factors in the system.  Slowing down
RAM can drop cost and performance.  Sometimes it improves the ratio, 
sometimes it doesn't.  While there is not a "continuous" or even
"twice differentiable" curve here, there are so many points on it that
for the purposes of this discussion it can be assumed to be a line.
For each price point, there is an associated performance level.  Obviously,
plotting each price point vs each performance point does not yield a line,
but a cloud of points.  However, these points are easily reduceable into
a family of general lines based upon CPU clock speed, DRAM speed, 
peripherals and data path size among others.

>(I happen to like DMA myself, actually.  But it does take more parts,
>and those do cost....)
>-- 
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

I happen to like low cost myself and have been surprised when certain
solutions turned out to be cheaper than others in counterintuitive ways.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.

rec@dg.dg.com (Robert Cousins) (06/08/89)

In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>In article <185@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>>Where does the DMA pay off given that all three examples have approximately
>>identical throughput?  
>>
>>DMA is preferable to the first choice whenever the cost of DMA is less
>>than 20% of the cost of the CPU or less than the cost of speeding up
>>the CPU by 20%.

>>DMA is preferable to the second choice whenever the cost of DMA is
>>less than 9% of the cost of the CPU or less than the cost of speeding
>>up the CPU by 9%.

>>I am the first to admit that these models are simplistic, but they
>>do represent valid considerations and reasonable approximations to
>>the actual solutions.

>>Comments?

>Robert has also ignored the cost of setting up the DMA controller,
>which can be significant, especially for scatter/gather type
>operation.

True.  However, I was assuming dumb DMA, which does not have these features.
Clearly, scatter/gather has its associated costs and benefits.  I was
remiss in not including it as an additional scenario.

>Also note that with modern operating systems that do buffering or disk
>caching, there is going to be a bcopy or its moral equivalent in there
>somewhere.  This doesn't reduce CPU time used during programmed I/O,
>but it does change the trade off from, say, 3 vs 10% to 13 vs 23% of
>CPU availability used for disk I/O.  This makes the nature of the
>trade off different.

True; however, many operating systems perform program loads and
paging directly to user space, bypassing the buffer cache.  Since these
make up a substantial portion of the disk operations actually performed,
I don't think I was totally out of line, but your point is not only
valid but important.  I didn't include it because I was searching for
a simple approximation.

>My other thought is that regardless of the CPU cost, programmed I/O is
>often acceptable on single-user machines anyway.  The situation often
>arises where the user is only concerned about the speed of the one
>process he's using interactively.  That process will usually have to
>wait till its data arrives anyway, at least under current programming
>models.  If the CPU has to sit and wait, it might as well do the data
>movement too.

I strongly disagree here.  In the modern Unix world, multitasking is
of critical importance for simple system survival.  Take the classic
single-user application:  workstations.  Go look at the number of
processes running on a Unix workstation some time.  Then use a single
task heavily while keeping track of the time consumed by the other
tasks.  You will notice that the LAN-related tasks continue to use some
amount of time.  If you are running X Windows, you are really using an
application plus the X server -- two tasks.  Unix depends upon being
able to run more than one task to handle a variety of jobs, up to and
including gettys for each user's terminal.  Programmed I/O is the enemy
of multitasking, since it can keep the CPU from servicing other tasks
for a potentially long interval, killing their ability to service
requests.

Also, multitasking can provide a performance improvement even when only
a single task is running.  Clearly the task must wait while disk reads
are being performed, but writes can be posted and flushed by a daemon
in the background, giving the program the effect of zero-wait writes.
For some classes of applications this can be a substantial win.
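
To make the posted-write effect concrete, here is a minimal write-behind
sketch in C with POSIX threads.  It is an illustration of the idea, not
the code of any real buffer cache: error handling is omitted, the queue
is drained in no particular order, and post_write()/flusher() are names
invented for the example.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct posted {
    int            fd;
    size_t         len;
    char          *data;
    struct posted *next;
};

static struct posted   *head;
static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   more = PTHREAD_COND_INITIALIZER;

/* Returns as soon as the data is copied -- the "zero-wait" write. */
void post_write(int fd, const void *buf, size_t len)
{
    struct posted *p = malloc(sizeof *p);

    p->fd = fd;
    p->len = len;
    p->data = malloc(len);
    memcpy(p->data, buf, len);

    pthread_mutex_lock(&lock);
    p->next = head;
    head = p;
    pthread_cond_signal(&more);
    pthread_mutex_unlock(&lock);
}

/* The background daemon: does the real write(2) calls later. */
void *flusher(void *arg)
{
    for (;;) {
        struct posted *p;

        pthread_mutex_lock(&lock);
        while (head == NULL)
            pthread_cond_wait(&more, &lock);
        p = head;
        head = p->next;
        pthread_mutex_unlock(&lock);

        write(p->fd, p->data, p->len);
        free(p->data);
        free(p);
    }
    return arg;   /* not reached */
}

Start the daemon once with pthread_create(&tid, NULL, flusher, NULL);
after that, each post_write() costs the caller only a memory-to-memory
copy, and the disk writes happen whenever the flusher gets the CPU.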

>I'm not about to make any pontifications about what I think is the
>best solution for the future, because I don't know myself ;-)
>
>  --jh-- 
>John Hood, Biar Games snail: 10 Spruce Lane, Ithaca NY 14850 BBS: 607 257 3423
>domain: jhood@biar.uu.net bang: anywhere!uunet!biar!jhood
>"Insanity is a word people use to describe other people's lifestyles.
>There ain't no such thing as sanity."-- Mike McQuay, _Nexus_

The truth is that all alternatives need to be considered.  Sometimes
the answers will fool you.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.

rec@dg.dg.com (Robert Cousins) (06/08/89)

In article <41042@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>>Also note that with modern operating systems that do buffering or disk
>>caching, there is going to be a bcopy or its moral equivalent in there
>>somewhere.  
>
>1) Is it possible, if not now but possibly in the future, for programmed
>   I/O to _eliminate_ some of the 'bcopy's?

Some already do, for paging and program loads.  Some Unix DBMS products
also bypass the file system and talk through the raw character drivers
straight to the disks for performance reasons (bypassing the sector
cache).  While I don't know the implementation details, I do know that it
has been known to do substantial good for standard DBMS jobs.  This is
because many character drivers for disks DMA directly into user space.
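
For what it's worth, a hedged sketch of what that looks like from the
DBMS side follows.  The device name is made up, and real raw drivers
have their own alignment and transfer-size rules; the point is only that
read(2) on the character-special file moves data straight into a buffer
in user space.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define XFER (64 * 1024)

int main(void)
{
    char *buf;
    int fd = open("/dev/rdsk/c0d0s2", O_RDONLY);  /* hypothetical raw device */

    if (fd < 0)
        return 1;

    /* Many raw disk drivers DMA (or copy) straight into this buffer;
     * nothing goes through the kernel's sector cache.               */
    if (posix_memalign((void **)&buf, 8192, XFER) != 0)
        return 1;

    while (read(fd, buf, XFER) == (ssize_t)XFER)
        ;   /* crunch on buf here */

    free(buf);
    close(fd);
    return 0;
}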

>2) This discussion brings to mind one that went around some time ago, 
>   which was, is it better to supply a bunch of specialized processors
>   (then bitblt's, now including DMA controllers), or a bunch of identical
>   processors connected together?  Theory was, when the bitblt and DMA are
>   done, the other processor(s) can be applied to a compute bound task.
>   It seems to me this might make an interesting product; price/perf
>   range is varied by the number of [identical] processors, and all I/O
>   hardware is very very dumb.

In an earlier life, I headed up the development of just such a machine,
the CSI-150, which supported up to 32 V30 CPUs, each of which could be
connected to a private SCSI channel and was capable of doing about 1.25
MB/s on it.  It didn't catch on, but boy could it handle some classes of
I/O-bound jobs!  Each CPU ran in its own private memory and sent
messages to other CPUs.  The operating system was designed so that the
file systems were locally managed and cached in each CPU, so the messages
were higher-level requests, similar to NFS or RFS today.
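
For illustration only, a request message in that style might look
something like the struct below; the field names and sizes are invented,
not taken from the CSI-150.

#include <stdint.h>

enum fs_op { FS_LOOKUP, FS_READ, FS_WRITE, FS_CREATE, FS_REMOVE };

/* A "higher level" request: closer to an NFS call than to a raw block
 * number, so the CPU that owns the file system can do its own caching
 * and block mapping. */
struct fs_request {
    uint16_t sender_cpu;     /* which of the 32 CPUs is asking       */
    uint16_t op;             /* one of fs_op                         */
    uint32_t file_handle;    /* handed out by the owning CPU         */
    uint32_t offset;
    uint32_t length;
    char     name[64];       /* used by LOOKUP/CREATE/REMOVE only    */
};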

We did have one additional problem:  the system supported exactly one user
per CPU.  This meant that the CRTs could be driven at 38.4 Kbps all day
long, since each effectively had a dedicated CPU to drive it.  There
were very few CRTs which could keep up with 19.2 Kbps, much less 38.4.
We found that at 38.4, most CRTs couldn't even manage to send ^S out to
shut off transmission!
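
The arithmetic behind that, assuming the usual 10 bits per character on
the wire (start + 8 data + stop -- an assumption, not a measured figure),
is short:

#include <stdio.h>

int main(void)
{
    double bps           = 38400.0;
    double bits_per_char = 10.0;    /* start + 8 data + stop (assumed) */
    double cps           = bps / bits_per_char;

    printf("%.0f chars/sec -- about %.0f microseconds per character,\n",
           cps, 1e6 / cps);                        /* 3840, 260 */
    printf("so a terminal falling behind has only a few character\n");
    printf("times in which to get an XOFF onto the wire.\n");
    return 0;
}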

>-Stan

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/08/89)

In article <1213@ttds.UUCP> jonasn@ttds.UUCP (Jonas Nygren) writes:
>In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>>In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>>< Given that a disk channel will be averaging 200K bytes/second ...

>I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.

>Write 15x2 Mb: 113 s, 265 kb/s
>Read 15x2 Mb:  117 s, 256 kb/s
>Read 15x2 + write 15x2 Mb (new and in parallell): 281 s, 213 kb/s
>Mean value: 234 kb/s

I have performed some single-user tests.  A 200 KBytes/sec reading rate
is typical for small workstations with SCSI or similar disks, etc.
With SMD on one of the newer controllers you can do ~600 KBytes/sec.  On
mainframes, I have seen single applications which *averaged* 3 MB/sec on
4.5 MB/sec channels on 8 simultaneous data streams.

So, the ratios quoted seem reasonable to me.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

sandrock@uxe.cso.uiuc.edu (06/08/89)

Yes, it would be fantastic to have *real* numbers which represented the
overall *system* performance, in particular, expressing *multi-user*
cpu & i/o throughput, similar perhaps to the transaction processing
benchmarks used on certain types of systems. I have recently seen some
multi-user benchmarks called MUSBUS 5.2 and AIM III for various SGI
machines, but these tests do not yet appear to be in wide use, nor do
I have any feeling for how valid a measurement they would provide.
Anyone care to comment about this?

Mark Sandrock
UIUC Chemical Sciences

rik@june.cs.washington.edu (Rik Littlefield) (06/09/89)

In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
< < < Given that a disk channel will be averaging 200K bytes/second ...
< <
< < I suspect that workstation class systems have an *average* disk
< < throughput that is at least 10X lower than this number, even when
< < they are working full out.
< <
< < Would someone with real utilization numbers care to fill us in?
< 
< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
< The test used 15 processes reading/writing 2Mb files each, with the following
< results:
<  <stuff deleted> 
< Mean value: 234 kb/s

Sure, but how much of the time does your workstation run 15 processes reading
and writing the disk as fast as it can?  Program loads and file copies run at
200 Kb/sec, program builds do maybe 10X less I/O, SPICE just crunches.
Whether DMA (or any other feature) is worthwhile depends on what the machine
spends its time doing.

Apparently my question was not clear, so I will restate it.  Does anybody have
numbers that reflect actual usage over an extended period?  If so, please tell
us what sort of work was being done, and how much I/O was required to do it.

--Rik

hammondr@sunroof.crd.ge.com (richard a hammond) (06/09/89)

One might want to look at "An Analysis of TCP Processing Overhead"
by Clark, Jacobson, Romkey, and Salwen in the June 1989 issue
of IEEE Communications Magazine (Vol. 27, No. 6).  They point out
that the CPU has to do a checksum of the data during processing and
propose (as one alternative) removing the DMA controller from the
network interface and letting the CPU do the checksum and the byte copy
together in a single loop.
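
Their point is easy to see in code.  A hedged sketch of the combined
loop (simplified: it assumes the length is even, the buffers are 16-bit
aligned, and it glosses over byte-order details):

#include <stddef.h>
#include <stdint.h>

uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t   i, words = len / 2;

    for (i = 0; i < words; i++) {
        uint16_t w = src[i];
        dst[i] = w;            /* the copy the OS had to do anyway   */
        sum   += w;            /* the checksum comes along for free  */
    }
    while (sum >> 16)          /* fold the carries back in           */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

Since the data has to pass through the CPU anyway, the checksum costs
one add per word on top of the copy, which is the argument for taking
this particular copy away from a DMA engine.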

Also, the disk throughput numbers of interest are not peak numbers
but average use over applications.  In previous jobs I've helped
collect the processing and disk I/O requirements of months of jobs on
Convex computers (i.e. big enough to do lots of number crunching
and I/O) and there were very few jobs which were both CPU intensive
and I/O intensive at the same time.

So DMA on RISC-based workstations might have a different cost
function than simply the CPU cycles lost; one has to consider
whether the cycles spent doing the data movement would have been
of use to another process.  I estimate that the answer is yes
only some fraction of the time, and so the numbers given are
upper bounds on the costs.

Note that I'm talking about workstations and not large mainframes
or number crunchers, where the load or applications may be different.

Definite numbers for applications which do both
CPU crunching and some sort of I/O AT THE SAME TIME
would be interesting, since the Convex wasn't a personal workstation.

Rich Hammond

mlord@bnr-rsc.UUCP (Mark Lord) (06/09/89)

In article <8499@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
>< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
>< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< < < Given that a disk channel will be averaging 200K bytes/second ...
>< < ...
>< < Would someone with real utilization numbers care to fill us in?
>< 
>< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>< The test used 15 processes reading/writing 2Mb files each, with the following
>< results:
><  <stuff deleted> 
>< Mean value: 234 kb/s
>
>Sure, but how much of the time does your workstation run 15 processes reading
>and writing the disk as fast as it can?  Program loads and file copies run at
>200 Kb/sec, program builds do maybe 10X less I/O, SPICE just crunches.
> ...

Er uhm.. excuse me.. but I think there may be two issues here.  One is quantity
of I/O, and the other is rate of I/O.  This experiment with 15 processes doing
lots of I/O probably (IMHO) comes close to determining the rate of transfer
which is maintained when the system is actually reading from disk.  Thus, for
brief intervals, the system is doing transfers at the rate of 234kb/s, and it
is this rate which the CPU/DMA_device must keep up with, IN ADDITION to keeping
up with all other events/interrupts at the same time.  Sure, so it's only busy
for a second once every 30 seconds, but it still ought to be able to handle the
load when it comes.  Now imagine a system with several users running BIG 
simulations, with the associated paging going on as their tasks (and the 30
or so daemons which are always running) get swapped.  Personally, I'd like the
I/O to be fast, and I'd also like not to have to type slowly as I am doing
right now (about a two-second pause between hitting a key and seeing the result
at times).  DMA might be appropriate for such a system, especially since the
CPU could easily be running something else from its huge caches, leaving lots
of idle bus cycles for the DMA.  This does not require dual-ported memory, but
does require some sort of snooping (h/w or s/w) to maintain cache consistency.
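
If the hardware does not snoop, the driver has to do the equivalent by
hand.  Everything in this sketch is a stand-in: cache_writeback(),
cache_invalidate() and the dma_* routines are hypothetical names (every
such kernel has its own spelling for them), and the stub bodies exist
only so the fragment stands alone.

#include <stddef.h>

enum dma_dir { DMA_TO_MEMORY, DMA_FROM_MEMORY };

/* Placeholder bodies; a real port would issue cache-control and
 * controller register operations here. */
static void cache_writeback(void *buf, size_t len)  { (void)buf; (void)len; }
static void cache_invalidate(void *buf, size_t len) { (void)buf; (void)len; }
static void dma_start(enum dma_dir d, void *buf, size_t len)
                                                     { (void)d; (void)buf; (void)len; }
static void dma_wait(void) { }

/* device -> memory (e.g. a disk read) */
void dma_read_block(void *buf, size_t len)
{
    cache_writeback(buf, len);    /* push out dirty lines covering buf  */
    dma_start(DMA_TO_MEMORY, buf, len);
    dma_wait();
    cache_invalidate(buf, len);   /* don't let the CPU see stale lines  */
}

/* memory -> device (e.g. a disk write) */
void dma_write_block(void *buf, size_t len)
{
    cache_writeback(buf, len);    /* make sure RAM holds the real data  */
    dma_start(DMA_FROM_MEMORY, buf, len);
    dma_wait();
}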

As processors continue to become much faster than the memory bus, DMA begins
to look better and better for bulk data transfers to and from slower I/O devices.

-Mark

ken@hugi.ACA.MCC.COM (Ken Zink) (06/10/89)

Without getting totally mired in ancient history, I think it is germane
to the discussion to point out that the "I/O processors" (called "peripheral
processors" by CDC) had sufficient intellegence to do more than I/O.
In fact, for the first ten years or so of the 6000 architecture (6600, 6400,
Cyber 70, Cyber 170, lower), the ENTIRE operating system resided in the
domain of the "lowly" PP's.  It seems that very few operating system
functions really require floating point capability or 60 bits of significance.

One of the PP's was dedicated to "system monitor" functions and was
statically allocated; one other PP was statically allocated and assigned to
driving the system console (dual 15 inch CRT's).  The rest of the PP's
were dynamically assigned, by PP Monitor, to perform I/O or some OS
function as necessary. [Note: on the 20-PP systems, any PP could access
any I/O channel; otherwise it would have caused too many problems for the OS
to determine which available PP could handle a given I/O request.]

Since the early systems were limited to 131K words of central (60 bit)
memory, having the OS reside in the PP world preserved the critical
memory resource for user job space.  As larger memory configurations
became available, a few of the most often used OS functions were migrated
to CP code - for performance.

The 7000 architecture (7600, Cyber 170/Model 176), with the PP's hard-wired
to external channels and to central memory buffers, dictated that the OS
be CPU (60-bit) based.

In summary, there is a spectrum of system architectures with "intelligent"
I/O, ranging perhaps from the dumbest of DMA-like devices to clusters of
identical processors, equally capable of performing an I/O function or
executing user-provided code.  An optimal architecture would trade off all of the
options in that spectrum against the available-technologies, cost-to-
manufacture, price-in-the-market, performance-in-the-desired-target-
application(s) and market-acceptability variables to determine which
of the options is "optimal."

In short, as we know, there is no single, best architecture.

Ken Zink            zink@mcc.com
MCC
Austin, TX

elg@killer.DALLAS.TX.US (Eric Green) (06/10/89)

in article <26636@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) says:
>>I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
> 
>>Write 15x2 Mb: 113 s, 265 kb/s
>>Read 15x2 Mb:  117 s, 256 kb/s
>>Read 15x2 + write 15x2 Mb (new and in parallell): 281 s, 213 kb/s
>>Mean value: 234 kb/s

Note that this is probably not an accurate account of disk drive
bandwidth at all.  Unix (at least in older AT&T versions) DMAs the data
into the disk cache, then has the CPU manually copy it into the user's
own buffer.  With a plain-jane ST157N and a non-DMA SCSI controller
pushed by a plain old 8 MHz 68000, I get 550K/second (at least until my
disk gets fragmented), and there are still visible pauses where the
68000 takes a while to digest the data.  Another (DMA) disk controller
gets 650K/second out of the same disk drive (of course, a 68020 or
faster processor wouldn't have run out of steam like my 68000, so this
isn't really an argument that DMA is better than CPU-driven I/O).

Strangely enough, I have never seen anything on preferential caching
schemes for file systems.  You'd want to cache small I/O requests, as
is currently done... but what about the scientific types who want to
stream in a few megabytes of data, crunch on it, then stream it back
out -- as fast as possible?  That'd blow any reasonable cache to
pieces.  You'd want to DMA it straight into the user's memory.  Or even
use CPU-driven I/O straight into the user's memory... you'd still come
out at least as well as the traditional DMA-it-to-cache-then-copy-it.
Thinking on it a bit, it seems you'd want to cache only small I/O
requests that don't overwhelm the amount of cache you have, while
DMA'ing large I/O requests straight into the user's memory ASAP.  That
way crontab, whotab, and other small files that are hit fairly often
would stay cached longer.  An interesting problem...
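
A sketch of that policy in C.  The threshold, the assumed cache size,
and the two helper routines are all invented for the example; the point
is only where the fork in the road goes.

#include <stddef.h>
#include <sys/types.h>

#define CACHE_SIZE   (2 * 1024 * 1024)   /* assumed size of the block cache */
#define BYPASS_LIMIT (CACHE_SIZE / 4)    /* arbitrary cut-off for "big"     */

/* Hypothetical helpers: one path through the block cache, one straight
 * into the caller's memory (by DMA or by a CPU copy loop). */
extern ssize_t cached_read(int fd, void *buf, size_t len, off_t off);
extern ssize_t direct_read(int fd, void *buf, size_t len, off_t off);

ssize_t smart_read(int fd, void *buf, size_t len, off_t off)
{
    if (len >= BYPASS_LIMIT)
        return direct_read(fd, buf, len, off);  /* don't pollute the cache */
    return cached_read(fd, buf, len, off);      /* small, likely reused    */
}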

I suppose it irritates the designers of these disk subsystems that all
their beautiful bandwidth is chewed to shreds by OS overhead. 

> On
> mainframes, I have seen single applications which *averaged* 3 MB/sec on
> 4.5 MB/sec channels on 8 simultaneous data streams.

Which particular mainframes? Sounds like something a Cray could do...
very little overhead there at all (don't have to cope with memory
protection, can DMA straight into the user's data space without
worrying about how "real" memory maps into the user's "virtual"
memory, etc.).

Sounds to me like another speed reason for Crays to not have virtual
memory :-) (for the old veterans of past comp.arch discussions). Have
to consider all aspects of the architecture, including disk subsystem
performance, not just what it looks like from a user or CPU point of
view.

> So, the ratios quoted seem reasonable to me.

Yes, seems reasonable to me too. But somewhat sad, considering the
performance that the hardware is capable of.

--
    Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     
     ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849    
"I have seen or heard 'designer of the 68000' attached to so many names that
 I can only guess that the 68000 was produced by Cecil B. DeMille." -- Bcase

elg@killer.DALLAS.TX.US (Eric Green) (06/10/89)

in article <188@dg.dg.com>, rec@dg.dg.com (Robert Cousins) says:
> In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>>>Where does the DMA pay off given that all three examples have approximately
>>>identical throughput?  
>>My other thought is that regardless of the CPU cost, programmed I/O is
>>often acceptable on single-user machines anyway.  The situation often
> I strongly disagree here.  In the modern Unix world, multitasking is
> of critical importance
> of jobs up to and including gettys for each user's terminal.  Programmed
> I/O is the enemy of multitasking since it effectively keeps the CPU from
> servicing tasks for a potentially long interval killing the ability
> for another task to service a request.

I have a non-DMA hard disk controller on the multitasking machine that
I use (so multitasking that each filesystem and device driver runs as a
separate task, albeit a high-priority one).  I'll agree that
programmed I/O sucks the wind out of any other task (of lower
priority) that is running, but it's not quite as bad as you put it.
The CPU can fetch data from the buffer on the disk controller faster than
it can do anything with that data, so most things, e.g. compiles,
consist of long moments of Deep Thought by the process, punctuated by
occasional disk hits.  This is hardly the "disaster" that you claim,
although I do wish I had been able to get a DMA controller (I couldn't
trust one with the bus expander I'm using, alas).

And for the person who thought that the buffer on the disk controller
would have to be double-ported -- nope.  Just bus-oriented, with access
from either the disk side or the bus. I'm no hardware wiz (beware of
programmers with slaughtering irons!), but I can think of a couple of
ways to do it with standard TTL. E.g., use '273s as the "data port",
and loadable counters to address the RAM buffer sequentially (without
having to do math on it for each additional byte). 
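
Here is a little software model of that arrangement, purely to show why
the host never does any address arithmetic: the address counter advances
itself on every access to the data port.  (This models the behaviour
only; it is not a description of any particular board.)

#include <stdint.h>

#define BUFSIZE 4096

static uint8_t  buffer[BUFSIZE];    /* the controller's RAM buffer   */
static unsigned counter;            /* the loadable address counter  */

void port_load_address(unsigned addr)
{
    counter = addr % BUFSIZE;
}

uint8_t port_read_data(void)        /* read side of the data port    */
{
    uint8_t b = buffer[counter];
    counter = (counter + 1) % BUFSIZE;
    return b;
}

void port_write_data(uint8_t b)     /* write side of the data port   */
{
    buffer[counter] = b;
    counter = (counter + 1) % BUFSIZE;
}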

> The truth is that all alternatives need to be considered.  Sometimes
> the answers will fool you.

Yep. And the "answer", in this case, is that while programmed I/O is
certainly nothing to be proud of (at least with a slow processor like
my 8mhz 68000), it's not the big disaster to multitasking that you
might expect. 

--
    Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     
     ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849    
"I have seen or heard 'designer of the 68000' attached to so many names that
 I can only guess that the 68000 was produced by Cecil B. DeMille." -- Bcase

jonasn@ttds.UUCP (Jonas Nygren) (06/10/89)

In article <8499@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
>< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
>< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< < < Given that a disk channel will be averaging 200K bytes/second ...
>< <
>< < I suspect that workstation class systems have an *average* disk
>< < throughput that is at least 10X lower than this number, even when
>< < they are working full out.
>< <
>< < Would someone with real utilization numbers care to fill us in?
>< 
>< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>< The test used 15 processes reading/writing 2Mb files each, with the following
>< results:
><  <stuff deleted> 
>< Mean value: 234 kb/s
>
>Sure, but how much of the time does your workstation run 15 processes reading
>and writing the disk as fast as it can?  Program loads and file copies run at
>200 Kb/sec, program builds do maybe 10X less I/O, SPICE just crunches.
>Whether DMA (or any other feature) is worthwhile depends on what the machine
>spends its time doing.
>
>Apparently my question was not clear, so I will restate it.  Does anybody have
>numbers that reflect actual usage over an extended period?  If so, please tell
>us what sort of work was being done, and how much I/O was required to do it.
>
>--Rik

My figures were intended to show the upper limit of throughput on a
commercially available workstation.  I have been informed by people inside
Digital that it would be possible to achieve figures at least twice what I
stated if you care to look for disk-drive sources outside Digital.

/jonas

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/12/89)

In article <8327@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:

>in article <26636@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) says:

>> mainframes, I have seen single applications which *averaged* 3 MB/sec on
>> 4.5 MB/sec channels on 8 simultaneous data streams.

>Which particular mainframes? Sounds like something a Cray could do...

This exact performance figure is from a Cyber 205, but I have seen similar
performance on Crays (not quite as good *then*, but it should be better now
because of faster disks -- newer disks run at ~100 Mbits/sec transfer rate
as opposed to the older 36 Mbits/sec disks).

Also, I expect large IBM mainframes to do almost as well.  Although the disk
transfer rate is not as high, the disk controller to channel connection runs
at 4.5 MBytes/sec on some models.

(*Aside*)

These I/O rates are not particularly high by mainframe standards, just by
Mini/Micro standards.  There used to be a rule of thumb that for balance
a system should have a constant ratio of 1 MIPS/1 MByte/1 Mbyte/sec I/O.
The latter was slightly nebulous, but usually interpreted as channels
capable of it and disks capable of reading at that rate sustained.  
It was also considered a "good idea" if disk and channel utilization was 
less than 5% of raw aggregate capacity in order to guarantee that the
disk subsystem was not the bottleneck.  I actually did a study once and found
that the ratio on one heavily used (i.e. many users) system here actually used
15KB/sec/MIP *average*.   This (mainframe) system was capable of at least 
.5 MB/sec/MIP I/O.  This 3% utilization helped make the CPU the bottleneck.

Disk I/O is the usual bottleneck on mini/micro systems.  This is not
necessarily a "problem"; it is just a system design and configuration tradeoff.
(*end Aside*)

On a Cray, if you have an SSD, your I/O rate can run a *lot* faster than
the above disk rates.


>very little overhead there at all (don't have to cope with memory
>protection, can DMA straight into the user's data space without

Yes, this is part of the reason such rates can be sustained.  These rates
were always achieved with data copied directly into user memory.  I note that
there is a way to do this on some Unix systems:  a facility to map virtual
memory to files.  Then "paging" can potentially move the data directly into
memory without copying.  This is the case where virtual memory actually helps.
Most of the time it doesn't matter one way or the other for this problem.
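
That facility is mmap(2) on the Unix systems that have it.  A minimal
example follows, with the usual caveat that whether the pages arrive by
true zero-copy paging or by a hidden copy depends on the implementation.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct stat st;
    const char *p;
    long i, sum = 0;
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    for (i = 0; i < st.st_size; i++)   /* touching the pages is what   */
        sum += (unsigned char)p[i];    /* actually brings the data in  */

    printf("%ld bytes, byte sum %ld\n", (long)st.st_size, sum);
    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}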

>worrying about how "real" memory maps into the user's "virtual"
>memory, etc.).

Anyway, the Cyber 205 is a virtual memory machine.  VM has nothing to do with it
specifically.  The cost of copying large blocks of data is much less on a
Cray or Cyber 205/ETA machine because block data copies are done at vector
rate, and there is enough memory bandwidth available to sustain such rates.

Crays have memory protection, and the Operating System still has to figure
out what real memory addresses user memory buffers are in.  It takes a few
microseconds to do this either way, virtual or not.  These operations were
actually faster on the Cyber 205 than on the Cray X-MP/48, for various reasons.
The cost of an I/O operation has generally been in
figuring out where the data is on disk and in initiating and sustaining the
transfer.  The Cyber 205 did this quickly because the hardware had *very*
capable controllers which did all the cylinder/track/sector mapping and
presented a simple block-server interface to the operating system.
(The 205 did not have the complicated "channel program" problem that IBM
mainframes have, because this overhead was all done in the controllers.)

>Sounds to me like another speed reason for Crays to not have virtual
>memory :-) (for the old veterans of past comp.arch discussions). Have

It sounds like a reason for systems to support fast I/O to me :-)

1) parallel I/O paths to memory (aka "channels")
2) fast disks
3) low overhead to do a raw disk operation
4) lots of memory bandwidth

5) operating systems which support multiple asynchronous I/O requests

6) operating systems which support transfer of data directly into user
	memory without being buffered elsewhere

**********************************************************************

I have an actual number to present here:

I have seen a significant number of applications which can only do about
20 floating point operations per word of I/O unless the entire problem
can be contained in memory.  The memory required for the entire problem
is in the range of 1 million words for every 1 to 10 MFLOPS.  So, a single
job running at ~100 MFLOPS may need about 800 MBytes, *or* the ability
to do I/O at a rate of 40 MBytes/sec.
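
The 800 MBytes and the 40 MBytes/sec both fall out of the ratios above,
assuming 8-byte words:

#include <stdio.h>

int main(void)
{
    double mflops         = 100.0;   /* target compute rate               */
    double flops_per_word = 20.0;    /* the observed ratio quoted above   */
    double bytes_per_word = 8.0;     /* 64-bit words (assumed)            */

    double words_per_sec = mflops * 1e6 / flops_per_word;    /* 5 Mword/s */

    printf("I/O needed: %.0f MBytes/sec\n",
           words_per_sec * bytes_per_word / 1e6);             /* 40  */
    /* Memory line assumes 1 Mword per MFLOPS -- the memory-hungry end
     * of the 1-to-10 range given above. */
    printf("or memory to hold it all: %.0f MBytes\n",
           mflops * 1.0 * bytes_per_word);                    /* 800 */
    return 0;
}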

The single job referred to earlier was running at about 200 MFLOPS on a
Cyber 205 and needed about 50 MBytes/sec of I/O (it didn't get it -- it
only got ~24 MBytes/sec).  I do not remember exactly how much memory was
needed, but it was significantly more than 32 MW (256 MBytes).

You have to look at the requirements of the entire problem before you
can say what your system requirements are.

**********************************************************************

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

gws@Xylogics.COM (Geoff Steckel) (06/12/89)

One concern that doesn't seem to be addressed on the DMAC vs. CPU data movement
question is transfer latency: the maximum delay between individual data items
moved.

Restrictions on latency (due to small buffers somewhere or a small decision
time) can override the otherwise attractive choice of using the CPU for
data movement.  If the CPU is not interruptable (or restricts interrupts)
during the data transfer other services (communications, screen, mouse, etc.)
may experience an unacceptable latency as well.

Sufficient buffering or other hardware to `smooth over' large latency times
can become more expensive (choose your metric) than putting in real or
simulated multiport memory and a dedicated DMAC; it entirely depends on the
desired performance.

For example, if the CPU must move data between every disk transfer, you may
miss revolutions on the disk... or maybe not.  Read, read, measure, and read
again.
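
One way to "measure" before building: a back-of-the-envelope check with
made-up but typical numbers (3600 RPM, 32 sectors of 512 bytes per track,
and an assumed rate for the CPU copy loop).

#include <stdio.h>

int main(void)
{
    double rpm         = 3600.0;
    double sectors     = 32.0;              /* per track (assumed)     */
    double sector_size = 512.0;             /* bytes                   */
    double copy_mb_s   = 2.0;               /* assumed CPU copy rate   */

    double rev_ms    = 60000.0 / rpm;                   /* 16.7 ms     */
    double sector_ms = rev_ms / sectors;                /* ~0.52 ms    */
    double copy_ms   = sector_size / (copy_mb_s * 1e6) * 1000.0;

    printf("sector time %.2f ms, copy time %.2f ms -> %s\n",
           sector_ms, copy_ms,
           copy_ms < sector_ms ? "keeps up" : "misses a revolution");
    return 0;
}
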
	geoff steckel (steckel@alliant.COM, gws@xylogics.COM)