[comp.arch] Using a DMA chip in strange ways

dupuy@amsterdam.columbia.edu.UUCP (02/18/87)

    While reading Tanenbaum's new OS book (the Minix book) a sort of half baked
idea came to me.  While sophisticated OS design strives to minimize the number
of memory to memory copies, there is some irreducible minimum in a system with
distinct user and kernel spaces.

    Since the DMA chip on your favorite disk/tape controller works by stealing
bus cycles when the CPU is busy with other things (like arithmetic), would
there be any advantage in having a DMA chip which would simply be used for
memory to memory copies (from user to kernel space, or from one user space to
another)?

    At some point the various DMA chips may start bumping into each other, and
it may not be worth the effort (in more complex bus access/arbitration logic)
to add this memory to memory DMA chip.  But given sufficient bus bandwidth, if
the CPU spends a significant amount of time without accessing the bus, there
could be a significant performance boost for current operating systems.

    So what do you hardware types think?  Is there anything to this idea?  Does
this sort of thing already exist, and I just don't know about it?  Or is there
some problem which I have missed?

@alex

----
arpa: dupuy@columbia.edu
uucp: ...!seismo!columbia!dupuy

ron@brl-sem.UUCP (02/18/87)

In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>     Since the DMA chip on your favorite disk/tape controller works by stealing
> bus cycles when the CPU is busy with other things (like arithmetic), would
> there be any advantage in having a DMA chip which would simply be used for
> memory to memory copies (from user to kernel space, or from one user space to
> another)?
It's a good idea.  I'm glad I've used computers whose designers had
thought of it.  This is used most commonly in certain graphics displays
to make the block memory moves on the display for things like windows
happen faster.

The Denelcor HEP super computer had a block transfer hardware device, but
we never got around to making use of it before we scrapped the thing to
make room for the CRAY.   UNIX could probably see a pretty good speed up
from this thing.  Certain performance studies show that UNIX spends a
majority of its kernel time shuffling data between the buffer cache and
user data space.

-Ron

grr@cbmvax.UUCP (02/18/87)

In article <4343@columbia.UUCP> dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>
>    Since the DMA chip on your favorite disk/tape controller works by stealing
>bus cycles when the CPU is busy with other things (like arithmetic), would
>there be any advantage in having a DMA chip which would simply be used for
>memory to memory copies (from user to kernel space, or from one user space to
>another)?

Many general purpose DMA controller chips can already do this sort of thing.
Other, fancier things like BLIT chips can also do high-speed memory to memory
DMA as a degenerate case.
-- 
George Robbins - now working for,	uucp: {ihnp4|seismo|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@seismo.css.GOV
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

farren@hoptoad.UUCP (02/18/87)

In article <4343@columbia.UUCP> dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>
>While reading Tanenbaum's new OS book (the Minix book) a sort of half baked
>idea came to me.

 No, your idea seems fully baked :-)

>    Since the DMA chip on your favorite disk/tape controller works by stealing
>bus cycles when the CPU is busy with other things (like arithmetic), would
>there be any advantage in having a DMA chip which would simply be used for
>memory to memory copies (from user to kernel space, or from one user space to
>another)?

  Many DMA circuits that are not designed for a single purpose (i.e., disk
controller to memory transfer) can be used like this, and it IS a good idea,
as long as the system is not DMA intensive.  In particular, a number of micros
(specifically, the Amiga) have this capability, and use it.  Particularly good
if you are copying large blocks of memory to other memory spaces, a task often
associated with graphics, but otherwise very useful.  DMA can usually do the
move from two to four times faster than even a tightly-coded loop.

-- 
----------------
                 "... if the church put in half the time on covetousness
Mike Farren      that it does on lust, this would be a better world ..."
hoptoad!farren       Garrison Keillor, "Lake Wobegon Days"

elh@vu-vlsi.UUCP (02/18/87)

In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
> ....would
> there be any advantage in having a DMA chip which would simply be used for
> memory to memory copies (from user to kernel space, or from one user space to
> another)?
> 
I believe this is currently a feature on many commercially available
DMA controllers.  In particular, while I was working on the architecture/
partitioning of some of the peripheral chips for the ATT WE32XXX family,
we decided to include this feature in the DMA member of that family.
This has shown increased performance in memory-to-memory copies in the
operating system (as reported in a paper in the 1986 International
Conference on Computer Design... I forget the exact reference).

This part also has a number of other interesting features including
a *separate* byte wide bus which services commercially available
byte wide peripherals (disk, lan, etc. controllers), byte to word
packing, word buffering and burst mode bus transactions....  The 
peripherals on the byte wide bus lie in the address space of the
DMA peripheral which lies somewhere in the address space of the system
(obviously...).

Dr. Ed Hepler,  Adjunct Prof. Villanova University
                Staff Engineer, GE Astro Space, Valley Forge
                (Formerly MTS, Bell Labs, Naperville, Ill.)

kds@mipos3.UUCP (02/19/87)

As has been said before, most DMA chips can be set up to do this, but...
...whether what you are suggesting is effective depends on the system.  I
believe that the 680[12]0 and the [23]86 processors are capable of moving
data across the entire width of the data bus at the maximum bus bandwidth,
so to move your data around as quickly you'd have to have a 32-bit dma
chip around that can also run at the maximum processor bandwidth.  Also, the
setup time at the beginning of the transfer is probably going to be longer,
since it usually takes longer just to set one of these things up, and if
it is sitting on the other side of the memory management, you have to take
that into account.  Also, whether it is going to really be effective is
dependent on whether the processor can really do something useful while
DMA is going on, since if it cannot gain access to the bus during the transfer,
it will just be sitting there anyway if it needs to get something from memory.
Some DMA controllers have a "throttle" which limits their maximum bus 
utilization to take care of this so the processor can get in a transfer
edgewise to take care of problems like this.

And another novel use of DMA controller?  I believe the original IBM pc uses
a DMA controller to do DRAM refresh.
-- 
The above views are personal.

The primary reason innumeracy is so pernicious is the ease with which numbers
are invoked to bludgeon the innumerate into dumb acquiescence.
			- John Allen Paulos

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: kds@mipos3.intel.com

rdt@houxv.UUCP (02/19/87)

	 simple answer. 

	 There is a distinct tradeoff between the size of the
	 block transfer (how large a chunk) and the amount of
	 indirection overhead in:

	 1) making a system call, 
	 2) being authorized to use the DMA in the fashion you want,
	    (permission checks via probes and address translations;
	    recall that the DMA works in physical space whereas the
	    CPU tends to work in virtual spaces) 
	 3) and then setting up the DMA context registers with the
	    necessary pointers, sizes and configuration information.
	
	CONCLUSION:

	 for transfers under 32-64 words, the overhead time to setup
         the dma may swamp the more efficient block move capabilities
	 of the specialized dma hardware. 

	 mileage (breakeven point) may vary with OS system call structure, 
	 and manufacturer's MMU and DMA hardware.
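
The break-even point described above can be sketched as a small cost model. This is only an illustration: the structure, the function name, and every cycle count are hypothetical, and real figures would have to be measured on the OS and hardware in question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model. All figures are in bus cycles and are
 * assumptions, not measurements from any real machine. */
struct copy_costs {
    unsigned long dma_setup;    /* system call + permission checks + register setup */
    unsigned long dma_per_word; /* cycles per word once the DMA is running */
    unsigned long cpu_per_word; /* cycles per word in a CPU copy loop */
};

/* Smallest transfer size (in words) for which the DMA path is strictly
 * cheaper than the CPU loop. Returns 0 if the DMA never wins. */
unsigned long dma_breakeven_words(const struct copy_costs *c)
{
    assert(c != NULL);

    if (c->cpu_per_word <= c->dma_per_word)
        return 0; /* no per-word saving, so setup cost is never repaid */

    unsigned long saving = c->cpu_per_word - c->dma_per_word;

    /* Need n * saving > dma_setup, so n = floor(setup / saving) + 1. */
    return c->dma_setup / saving + 1;
}
```

With made-up costs of 200 cycles of setup, 2 cycles per word for DMA and 6 for the CPU, the function returns 51 words. A longer system call path or a slower MMU probe pushes that number up.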

Richard Trauben
ATTIS, Holmdel, New Jersey
WE32x00 Processor Development

andy@batcomputer.UUCP (02/19/87)

Every board in our FPS T-20 (18 Transputers, 16 vector coprocessors) has
a DMA chip that can be used for memory-to-memory copies.

They also use video RAMs (with a special address mode to get at more than
32 bits / fetch) to pump things into and out of the vector processor.

Nifty.
-- 
Andy Pfiffer					andy@tcgould.tn.cornell.edu
Cornell Theory Center / Cornell U.		cornell!batcomputer!andy
Home of the first usable T-Series		(607) 255-8686
"...that's the way a Transputer works, right?"  Systems Group

mahar@weitek.UUCP (02/19/87)

In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>     Since the DMA chip on your favorite disk/tape controller works by stealing
> bus cycles when the CPU is busy with other things (like arithmetic), would
> there be any advantage in having a DMA chip which would simply be used for
> memory to memory copies (from user to kernel space, or from one user space to
> another)?
> 
The memory management hardware could be used to send messages between the
OS and tasks if messages were multiples of the page size. This could reduce
message transfer time a lot.
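
A toy sketch of that remapping idea, assuming a flat per-task page table. The names (struct task, send_page, NO_FRAME) are invented for illustration and do not correspond to any particular MMU or kernel.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES   8     /* virtual pages per task (toy value) */
#define NO_FRAME (-1)  /* marks an unmapped virtual page */

/* Toy page table: virtual page number -> physical frame number. */
struct task {
    int page_table[NPAGES];
};

/* Pass one page-sized message by moving the mapping, not the data.
 * Returns 0 on success, -1 if the source page is unmapped or the
 * destination page is already in use. */
int send_page(struct task *from, size_t from_vpn,
              struct task *to, size_t to_vpn)
{
    assert(from != NULL && to != NULL);
    assert(from_vpn < NPAGES && to_vpn < NPAGES);

    if (from->page_table[from_vpn] == NO_FRAME ||
        to->page_table[to_vpn] != NO_FRAME)
        return -1;

    to->page_table[to_vpn] = from->page_table[from_vpn];
    from->page_table[from_vpn] = NO_FRAME; /* sender loses access */
    return 0;
}
```

The cost is one table update per page regardless of page size, which is where the saving comes from. A real kernel would also have to invalidate any cached translations for the pages involved.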

-- 

	Mike Mahar
	UUCP: {turtlevax, cae780}!weitek!mahar

	Disclaimer: The above opinions are, in fact, not opinions.
	They are facts.

paul@unisoft.UUCP (02/21/87)

In article <630@vu-vlsi.UUCP> elh@vu-vlsi.UUCP (Edward L. Hepler) writes:
>DMA controllers.  In particular, while I was working on the architecture/
>partitioning of some of the peripheral chips for the ATT WE32XXX family,

....

>This part also has a number of other interesting features including
>a *separate* byte wide bus which services commercially available
>byte wide peripherals (disk, lan, etc. controllers), byte to word
>packing, word buffering and burst mode bus transactions....  The 
>peripherals on the byte wide bus lie in the address space of the
>DMA peripheral which lies somewhere in the address space of the system
>(obviously...).
>
>Dr. Ed Hepler,  Adjunct Prof. Villanova University
>                Staff Engineer, GE Astro Space, Valley Forge
>                (Formerly MTS, Bell Labs, Naperville, Ill.)

	Everyone doing serious peripheral design should look at this chip
(WE32106 I think). It must be the best DMA chip on the market ... only one
catch - the price $250 a piece in 100 quantities. Still, if you want to see
how to design a DMA chip right (esp. if you are going to design one), look
at this one.


		Paul Campbell
		..!ucbvax!unisoft!paul

mac@uvacs.UUCP (02/23/87)

>     Since the DMA chip on your favorite disk/tape controller works by stealing
> bus cycles when the CPU is busy with other things (like arithmetic), would
> there be any advantage in having a DMA chip which would simply be used for
> memory to memory copies (from user to kernel space, or from one user space to
> another)?

Been done.  I believe it was on the Univac 1100s.  Used for plated-wire to
core memory transfers, among other things.

bzs@bu-cs.UUCP (03/04/87)

>    Since the DMA chip on your favorite disk/tape controller works by stealing
>bus cycles when the CPU is busy with other things (like arithmetic), would
>there be any advantage in having a DMA chip which would simply be used for
>memory to memory copies (from user to kernel space, or from one user space to
>another)?

I remember suggesting this on our LSI-11 systems using the DMA buffer
in the RX02 floppy drive as you could write/read it w/o any I/O going
to the disk (load/unload buffer [DMA] and xfer to/from disk were
different operations.) I figured we could buy an extra RX02 controller
if this brilliant idea worked (this was around '78-'79 I guess.)

Just an anecdote, we never tried it because the Mini-Unix was too busy
swapping to the RX02 to be used for this...I think we figured out it
wouldn't be very fast either.

Anyhow, maybe you already have a device to try it with, sure this wouldn't
work on your disk controller or some such (hey, no warranties expressed
or implied!)

	-Barry Shein, Boston University

greg@utcsri.UUCP (Gregory Smith) (03/04/87)

Somebody writes:
>     Since the DMA chip on your favorite disk/tape controller works by stealing
> bus cycles when the CPU is busy with other things (like arithmetic), would
> there be any advantage in having a DMA chip which would simply be used for
> memory to memory copies (from user to kernel space, or from one user space to
> another)?

The Z80 DMA chip does this. It can copy from a moving memory address to
a moving memory address, or from a moving memory address to a constant
i/o address, or the reverse of the latter.

Actually, in most configurations, the DMA chip steals the bus whether
the CPU wants it or not - you still win since a DMA copy operation can
be done with two memory cycles per byte, and most CPU's don't allow
this.

The Z80 chip was the first DMA chip I ran into, and I was surprised to
find out that others were not like that (most others generate moving
memory addresses and control signals, and the I/O device must be wired
to place the data on the bus (or read it) during the memory cycle). With
the Z80 setup, the DMA chip addresses the I/O device, so if this device
is always ready, it needs no special hardware and doesn't 'know' it is
being addressed by the DMA chip rather than the CPU. If it is not always
ready, there is a control signal to throttle the DMA. The disadvantage
is that I/O transfers require two memory cycles per byte, whereas only
one is required in the usual setup.
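
The cycle counts above reduce to a one-line rule, sketched here. The enum and function names are made up; "single address" stands for the usual setup where the device is wired to the bus, and "dual address" for the Z80-style setup where the DMA chip addresses the device itself.

```c
#include <assert.h>

/* Labels for the two styles described above (names are illustrative). */
enum dma_style {
    DMA_SINGLE_ADDRESS, /* device drives or latches the bus during the memory cycle */
    DMA_DUAL_ADDRESS    /* DMA chip reads the source, then writes the destination */
};

/* Memory cycles needed to move nbytes, one byte per transfer.
 * Memory-to-memory copies always need a read and a write cycle;
 * memory-to-I/O transfers need two cycles only in the dual-address style. */
unsigned long dma_memory_cycles(enum dma_style style, int mem_to_mem,
                                unsigned long nbytes)
{
    assert(style == DMA_SINGLE_ADDRESS || style == DMA_DUAL_ADDRESS);

    if (mem_to_mem || style == DMA_DUAL_ADDRESS)
        return 2 * nbytes;
    return nbytes;
}
```

So a 512-byte disk sector costs 512 cycles in the usual setup and 1024 with the Z80-style chip, while a 512-byte memory copy costs 1024 cycles in this model either way.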

-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

adam@gec-mi-at.co.uk (Adam Quantrill) (03/13/87)

In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>> [] would
>> there be any advantage in having a DMA chip which would simply be used for
>> memory to memory copies (from user to kernel space, or from one user space to
>> another)?

Yup. You don't wear out the cpu so much.

-- 
       -Adam.

/* If at first it don't compile, kludge, kludge again.*/

njh@root44.UUCP (03/23/87)

In article <518@gec-mi-at.co.uk> you write:
>In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>>> [] would
>>> there be any advantage in having a DMA chip which would simply be used for
>>> memory to memory copies (from user to kernel space, or from one user space to
>>> another)?
>
>Yup. You don't wear out the cpu so much.

When doing a port of UniPlus+ on a machine with a spare channel on its
68450 I tried a few benchmarks. I'm afraid (with a 68k at least) it was
slower using the 68450 than the 68010 (dbra's, moveml's etc.) for memory
to memory copies. I put this down to overhead of setting it up in C,
the CPU having to wait for the DMAC to finish (as this gives rise
to bus contention - the 68010 has to do *something* while the 68450 is
copying, a wait till finish means it has to sit in a loop reading
instructions, or you run the 68450 in interrupt mode, in which case you
have all that nasty interrupt goo after it's finished) etc.

Sorry about the double subclauses in the above paragraph - I never was any
good at English. Anyhow, moral is, leave the CPU to do memory copies,
it's not worth the hassle.
-- 
--

Nigel Horne, Divisional Director, Root Technical Systems.
<njh@root.co.uk>	G1ITH	Fax:	(01) 726 8158
Phone:	+44 1 606 7799 Telex:	885995 ROOT G	BT Gold: CQQ173

chris@mimsy.UUCP (03/25/87)

In article <246@root44.root.co.uk> njh@root.co.uk (Nigel Horne) writes:
>When doing a port of UniPlus+ on a machine with a spare channel on its
>68450 I tried a few benchmarks. I'm afraid (with a 68k at least) it was
>slower using the 68450 than the 68010 (dbra's, moveml's etc.) for memory
>to memory copies. ...

(I assume you mean `movl's; moveml will not run in loop mode.)

>... leave the CPU to do memory copies, it's not worth the hassle.

Well, now, that depends on the machine architecture.  (Must be why
this is in comp.arch :-).)  We have some Heurikon 68010 based boards
that, when the MMU is enabled, suffer a wait state per CPU memory
access.  The DMA chip does not go through the MMU, and copies over
a certain size (we have not yet calculated or measured just *what*
size) will run faster when done via the DMA chip in spite of the
setup overhead.  But as yet we are not worried about this sort of
(small) performance improvement.  McMob needs first an O/S....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu

elh@vu-vlsi.UUCP (03/26/87)

In article <246@root44.root.co.uk>, njh@root.co.uk (Nigel Horne) writes:
> In article <518@gec-mi-at.co.uk> you write:
> >In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
> >>> [] would
> >>> there be any advantage in having a DMA chip which would simply be used for
> >>> memory to memory copies (from user to kernel space, or from one user space to
> >>> another)?
> >
> >Yup. You don't wear out the cpu so much.
> 
> When doing a port of UniPlus+ on a machine with a spare channel on its
> 68450 I tried a few benchmarks. I'm afraid (with a 68k at least) it was
> slower using the 68450 than the 68010 (dbra's, moveml's etc.) for memory
> to memory copies......
> 
> Nigel Horne, Divisional Director, Root Technical Systems.

This is probably due in part to the fact that the 68450 does not (I believe)
have the capability of doing "burst" transfers (Issue one address followed
by 2,4 words of data without the overhead of the entire bus cycle). This
along with (if I remember correctly) the fact that the part had a multiplexed
address/data bus (at the part) hurt its performance.  Notice that using
moveml instructions to effect a block move (as suggested in the article)
really emulates a "burst" (in a manner). Up to 16 words (every register) 
can be copied in using contiguous bus cycles (non-multiplexed) and then 
copied back out... 

The ATT (WE32xxx) part (which I am familiar with) provides the capability
to perform such burst mode transfers. Of course the memory system must
be capable of servicing such requests.

Ed Hepler
Villanova University

davidsen@steinmetz.UUCP (03/27/87)

In article <246@root44.root.co.uk> njh@root44.UUCP (Nigel Horne) writes:
>In article <518@gec-mi-at.co.uk> you write:
>>In article <4343@columbia.UUCP>, dupuy@amsterdam.columbia.edu (Alexander Dupuy) writes:
>>>> [] would
>>>> there be any advantage in having a DMA chip which would simply be used for

>When doing a port of UniPlus+ on a machine with a spare channel on its
>68450 I tried a few benchmarks. I'm afraid (with a 68k at least) it was
>slower using the 68450 than the 68010 (dbra's, moveml's etc.) for memory
>to memory copies.

What I think you mean is "takes less real time" using the CPU.
If you are doing an operation which requires waiting until the
memory has been moved this is correct. If you can make other
good use of the CPU to run another process, the system will
probably run faster using DMA. For instance, moving a process in
memory to garbage collect, might have enough overhead with
pointer fiddling in tables to make the total real time less
using DMA. The overhead of handling the interrupt is trivial:
set a flag in the interrupt handler and return. The CPU can loop
on the flag when the rest of the tasks are done.
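
The flag scheme described above might look roughly like this. Everything here is a stand-in: dma_start_copy simulates the hardware with memcpy and calls the handler directly, so the copy finishes before other_work even starts, whereas a real DMAC would run concurrently and raise a true interrupt.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Completion flag shared between the handler and the main code. */
volatile sig_atomic_t dma_done;

/* The whole interrupt handler: set a flag and return. */
void dma_interrupt(void)
{
    dma_done = 1;
}

/* Simulated DMA channel (hypothetical). Real hardware would return
 * immediately and signal completion later via an interrupt. */
void dma_start_copy(void *dst, const void *src, size_t n)
{
    dma_done = 0;
    memcpy(dst, src, n);
    dma_interrupt();
}

/* Start the copy, do other useful work, then loop on the flag. */
void copy_with_overlap(void *dst, const void *src, size_t n,
                       void (*other_work)(void))
{
    assert(dst != NULL && src != NULL);

    dma_start_copy(dst, src, n);

    if (other_work != NULL)
        other_work(); /* e.g. pointer fiddling in tables */

    while (!dma_done) {
        /* spin until the handler has run */
    }
}
```

Whether this wins depends on other_work being long enough to hide the transfer and on the CPU getting bus cycles while the DMA runs.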

-- 
bill davidsen			sixhub \
      ihnp4!seismo!rochester!steinmetz ->  crdos1!davidsen
				chinet /
ARPA: davidsen%crdos1.uucp@ge-crd.ARPA (or davidsen@ge-crd.ARPA)

mats@forbrk.UUCP (03/30/87)

In article <668@vu-vlsi.UUCP> elh@vu-vlsi.UUCP (Edward L. Hepler) writes:
>The ATT (WE32xxx) part (which I am familiar with) provides the capability
>to perform such burst mode transfers. Of course the memory system must
>be capable of servicing such requests.

This is a lovely chip, agreed. Now, if only the price would come down
to where one could afford to use it in a moderately priced system....
Our hardware guys had to reject it right away because of its high cost,
and because the AT&T rep didn't see any prospects of it coming down at all.
We were facing having this part be the most expensive chip in the system,
since the 68020 CPU and 68851 MMU are clearly coming down quickly.

Sigh.

Mats Wichmann