[comp.sys.ibm.pc] DMA

bright@dataio.UUCP (Walter Bright) (11/06/86)

I am interested in copying pixel data from one page to another on the
IBM EGA. This involves moving 128k bytes of data. Doing it with a
REP MOVSW takes about 1/2 second (on an AT), which is too slow.
Does anyone know
how the DMA channel could be programmed to do this? It is not clear
from the documentation how to program the DMA chip, or even if
it is capable of memory-to-memory transfers.

Thanks for any help.

spud@oliveb.UUCP (John E. Purser) (11/12/86)

In article <1189@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
>I am interested in copying pixel data from one page to another on the
>IBM EGA. This involves moving 128k bytes of data. Doing it with a
>REP MOVSW takes about 1/2 second (on an AT), which is too slow.
>Does anyone know
>how the DMA channel could be programmed to do this? It is not clear
>from the documentation how to program the DMA chip, or even if
>it is capable of memory-to-memory transfers.

How did you arrive at the time of 1/2 second? The way I figure this
it should only take about .05 seconds. According to the 286 programmer's
reference guide a REP MOVSW takes 5+(4*CX) clocks. In your example that would
be 64k words times 4 plus 5 or a total of 262,149 clocks. The clock speed
of the AT is 6 MHz so dividing the 262,149 by 6,000,000 leaves us with .044
seconds. It may be that the video RAM is slow and requires a wait state
or 2 on each access but that's a memory limitation and it won't help to use
DMA in that case.

Now all this is just numbers and I'm not a hardware type so let me back
this up with some experience. I've done some programming on an AT&T 6300
that uses an 8086 at 8 MHz. I've written routines to move 32k bytes to video
RAM and it happens in the blink of an eye. It runs at a faster clock and
I'm only moving 1/4 the data but according to the programmer's guide it
takes 17 clocks per rep on an 8086 for a MOVSW instruction so it should be
about the same time as on your system.

In summary you may want to investigate further using the CPU to do the
move.

John Purser
Olivetti ATC
Cupertino CA.

bright@dataio.UUCP (Walter Bright) (11/14/86)

In article <197@oliveb.UUCP> spud@oliven.UUCP (John Purser) writes:
>In article <1189@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
>>I am interested in copying pixel data from one page to another on the
>>IBM EGA. This involves moving 128k bytes of data. Doing it with a
>>REP MOVSW takes about 1/2 second (on an AT), which is too slow.
>>Does anyone know
>>how the DMA channel could be programmed to do this? It is not clear
>>from the documentation how to program the DMA chip, or even if
>>it is capable of memory-to-memory transfers.
>
>How did you arrive at the time of 1/2 second? The way I figure this
>it should only take about .05 seconds. According to the 286 programmer's
>reference guide a REP MOVSW takes 5+(4*CX) clocks. In your example that would
>be 64k words times 4 plus 5 or a total of 262,149 clocks. The clock speed
>of the AT is 6 MHz so dividing the 262,149 by 6,000,000 leaves us with .044
>seconds.

I did some checking. First off, the EGA is 8 bit RAM, so the 128kb move
should take .09 seconds. Second, the EGA needs 4 out of 5 memory cycles
to do refresh, so the copy winds up taking about 1/2 second. DMA obviously
wouldn't help much here.

Also, nobody replied with a method of doing memcpy()s with the DMA.

zhahai@gaia.UUCP (Zhahai Stewart) (11/15/86)

In article <197@oliveb.UUCP>, spud@oliveb.UUCP (John E. Purser) writes:
> In article <1189@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
> >I am interested in copying pixel data from one page to another on the
> >IBM EGA. This involves moving 128k bytes of data. Doing it with a
> >REP MOVSW takes about 1/2 second (on an AT), which is too slow.
> 
> How did you arrive at the time of 1/2 second? The way I figure this
> it should only take about .05 seconds. According to the 286 programmer's
> reference guide a REP MOVSW takes 5+(4*CX) clocks. In your example that would
> be 64k words times 4 plus 5 or a total of 262,149 clocks. The clock speed
> of the AT is 6 MHz so dividing the 262,149 by 6,000,000 leaves us with .044
> seconds. It may be that the video RAM is slow and requires a wait state
> or 2 on each access but that's a memory limitation and it won't help to use
> DMA in that case.
> 
First off, if you want to move 128K "pages", I presume that you are using the
EGA in the highest resolution, 640x350x16 colors, mode 10 (hex).  In this mode
the video refresh seems to eat up much of the memory bandwidth; thus the EGA
inserts wait states as needed until a free "access slot" is available to
service the processor - this happens even on a 4.77 MHz 8088 in the PC, not
to mention, for example, my 8 MHz 80286.  Because of this, and the very well
optimized string instructions on the 286, I doubt that DMA could go any faster
than CPU based moves, copying EGA->EGA.  (Even if you could get memory->memory
DMA working, that is).

You have two basic possibilities for EGA->EGA moves: plane by plane or all at
once.  For plane by plane, set the EGA to read a given plane (of 4), and to
write only to the same plane, do the copy (80x350 = 28,000 bytes) using
REP MOVSW with CX = 14000; then switch read and write enables to the next
plane and repeat.  This will be hampered by the fact that each 16 bit read or
write will actually be done as 2 back to back 8 bit accesses (transparent to
the CPU - the EGA is an 8 bit card), each with several wait states, so this
will be considerably slower than the MOVSW calculation above (which assumed
real 16 bit transfers with 0 wait states).

The other way is to set up the EGA to write from its internal latches,
which hold 32 bits retrieved by the last EGA read (8 bits x 4 planes).
Then you do a 1 byte read from the source, and a 1 byte write (contents of
write do not matter, only the write strobe and address), in order to xfer
32 bits (1 byte x 4 planes).  You cannot do this word at a time because the
internal latches are only 8 bits wide per plane.  So in this case you use a
REP MOVSB with CX=28000.  Each rep transfers 32 bits with only two memory cycles
(and the corresponding wait states), as opposed to the first method which
transfers 16 bits/rep with 4 memory cycles and corresponding waits.  This
should be much faster; ironically, it should run at approximately the same
speed on a PC or AT, since the limitation is the EGA cycle stealing and
8 bit wide internal path.  Did that come across - the second technique
should work faster on a PC than the first does on an AT?

Also note that a full screen image in this mode only occupies 112,000 bytes
(4 planes x 28,000), not 128K - so you can save another 15% or so if you only
need to move the visible image.

The exact ways to set up the EGA registers for this can be found in the
IBM manuals, or PC Tech Journal had an article, etc.  Too much to go into
here and now.  Good luck.


-- 
Zhahai Stewart
{hao | nbires}!gaia!zhahai

jallen@netxcom.UUCP (John Allen) (11/19/86)

In article <1196@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
>In article <197@oliveb.UUCP> spud@oliven.UUCP (John Purser) writes:
>>In article <1189@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
>>>I am interested in copying pixel data from one page to another on the
>>>IBM EGA. This involves moving 128k bytes of data. Doing it with a
>>>REP MOVSW takes about 1/2 second (on an AT), which is too slow.
>>>Does anyone know
>>>how the DMA channel could be programmed to do this? It is not clear
>>>from the documentation how to program the DMA chip, or even if
>>>it is capable of memory-to-memory transfers.
>>
>>How did you arrive at the time of 1/2 second? The way I figure this
>>it should only take about .05 seconds. According to the 286 programmer's
>>reference guide a REP MOVSW takes 5+(4*CX) clocks. In your example that would
>>be 64k words times 4 plus 5 or a total of 262,149 clocks. The clock speed
>>of the AT is 6 MHz so dividing the 262,149 by 6,000,000 leaves us with .044
>>seconds.
>
>I did some checking. First off, the EGA is 8 bit RAM, so the 128kb move
>should take .09 seconds. Second, the EGA needs 4 out of 5 memory cycles
>to do refresh, so the copy winds up taking about 1/2 second. DMA obviously
>wouldn't help much here.
>
>Also, nobody replied with a method of doing memcpy()s with the DMA.

As you just pointed out, DMA wouldn't help much.  To address the real
question, the Intel documentation for the 8237 outlines a method for
performing memory to memory block DMA transfers using DMA channels 0
and 1.  I've used block DMA from memory to I/O using one DMA channel
(as does the FD controller, I believe) and can assure you that this
works just fine.  I would be surprised to hear that the memory to memory
DMA didn't work.  If you still want to give it a try, and need more
help, please send email.

John Allen
=========================================================================
NetExpress Communications, Inc.      seismo!{sundc|hadron}!netxcom!jallen
1953 Gallows Road, Suite 300         (703) 749-2238
Vienna, Va., 22180
=========================================================================

brian@umbc3.UMD.EDU (Brian Cuthie) (10/28/88)

[as I slip into my asbestos suit]

I have decided to respond to several postings with this single response 
rather than tie up net bandwidth with followups to followups etc.

First, I humbly apologize for two things:  1) If I came off sounding
omniscient about the AT, I'm sorry.  I have, in past years, designed several
disk controllers for the PC and written suitable drivers that used DMA.
2) for incorrectly extrapolating PC expertise to cover the design of the
AT.  Some of my statements about DMA on the AT were clearly wrong as I made
some bad assumptions about that particular part of the AT design.

However, most of my points about DMA are true in the general case.  It is
true, as I found after pawing through the AT technical reference manual, that
the AT has some serious deficiencies in its DMA design.  This, however, does
not change the fundamental reasons for using DMA in most systems.

The Intel 80286/80386 processors have the unique ability to behave much like
a DMA controller.  That is, they can transfer data in single memory cycles
(please note that a memory cycle is not the same as a clock cycle).  In this
mode, using the string transfer instructions, the 80*86 is capable of
generating address and timing signals without placing data on the bus.  Thus
the peripheral or memory is free to drive the data bus directly to the
recipient of the data (memory for INS instructions, and peripheral for OUTS
instructions).  There are some instances when DMA controllers will buffer
data; however, these are rare.  Data usually flows between the peripheral and
memory in single memory cycles unless, of course, the peripheral's
controller cannot transfer data at memory speeds (unlikely since most
peripheral controllers have some buffer cache).

Normally, however, a processor does not have this ability.  Thus, to transfer
a block from a peripheral to memory requires that the processor read a
byte/word from the peripheral and subsequently write that byte/word to
memory.  This operation, even under the best of caching scenarios, requires
at least two memory accesses.  It can be seen, then, that a processor
lacking this special ability could never be as fast as a well designed 
DMA subsystem.

DMA controllers seize the bus by placing the CPU in a HOLD state.  In this
state, the CPU is not able to perform any external bus accesses.  Instead,
all address and timing information is generated by the DMA controller.  When
the DMA controller has placed the CPU into a HOLD state, and has asserted
the appropriate address onto the address bus, it asserts either MEMREAD (for
a transfer from memory to the peripheral) or MEMWRITE (to transfer from a
peripheral to memory).  The device which has requested DMA recognizes these
signals in conjunction with the DMA ACK signal and data is transferred over
the data bus directly between the peripheral and memory with no intermediate
lay-overs.

It can be seen that during this transfer, the CPU will remain idle, once it
has completed its current instruction, until it can regain control of the
bus.  Therefore, most DMA controllers offer the ability to generate limited
burst DMA transfers.  The Intel 8237 is limited to either single transfers or
complete block transfers.  Other DMA controllers, such as the Motorola 68445
(I believe that is the correct part number), allow the burst length to be
programmed over a wider range.  Limiting the burst length allows some
interleaving of CPU and DMA memory accesses.

Interleaving CPU and DMA access to memory is usually less desirable than
complete block transfers since there is substantial overhead in placing the
CPU into a HOLD state.  This problem can be solved by multiported memory
designs.  However, since processor speeds outstrip memory speeds (that is,
as CPUs get faster, they spend more time waiting for memory) there is little
advantage to this scheme.

In summary, DMA is used primarily because, in a well designed system, it can
almost always be made to be more than twice as fast as the CPU in doing 
peripheral to memory transfers.  However, memory bandwidth is limited and 
thus you must rob Peter to pay Paul, so the idea that DMA allows concurrent
CPU and peripheral access to memory is somewhat misleading.

-brian