[comp.sys.ibm.pc] DMA, Followup on EGA memory moves

lotto@wjh12.HARVARD.EDU (Jerry Lotto) (11/12/86)
I tried to post to comp.sys.ibm.pc, but this is not yet enabled at our
site.

In article <197@oliveb.UUCP> spud@oliven.UUCP (John Purser) writes:
>In article <1189@dataio.UUCP> bright@dataio.UUCP (Walter Bright) writes:
>>IBM EGA. This involves moving 128k bytes of data. Doing it with a
>>REP MOVSW takes about 1/2 second (on an AT), which is too slow.

>How did you arrive at the time of 1/2 second? The way I figure this
>it should only take about .05 seconds. According to the 286 programmer's
>reference guide a REP MOVSW takes 5+(4*CX) clocks. In your example that would
>be 64k words times 4 plus 5 or a total of 262,149 clocks. The clock speed
>of the AT is 6 MHz so dividing the 262,149 by 6,000,000 leaves us with .044
>seconds. It may be that the video RAM is slow and requires a wait state
>or 2 on each access but that's a memory limitation and it won't help to use
>DMA in that case.

I once tried to understand this. The conclusions I came to were:

The DMA controller operates at 3 MHz. If you want this to work on PCs
as well, you have to use an 8 bit DMA channel (0 and 3 are spare and 1
is used for the ever popular SDLC adapter so you probably have this
available as well). This means that you can move 64K bytes max. per
transfer. The data transfer bus cycle is 5 clocks (1.66 usec) so we are
up to .108 sec for 1/2 the total transfer, or about .22 sec total.

16 bit DMA is ~2x faster if I read the specs correctly. Same bus
cycle, but words are transferred. Max is 128K bytes (64K words) per
transfer here, so the whole move fits in one transfer and the lower
limit is .108 sec.
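
For anyone who wants to check the DMA arithmetic above, here is a rough
sketch in Python (my choice for illustration only; the names are made
up). It assumes nothing beyond the figures already given: a 3 MHz DMA
clock and 5 clocks per transfer cycle. It does not model any setup
overhead between 64K blocks, and it carries full precision, so the
results differ slightly from hand figures that round the cycle time to
1.66 usec.

```python
# Back-of-envelope DMA timing from the figures in the post (assumptions).
DMA_CLOCK_HZ = 3_000_000   # DMA controller clock
CLOCKS_PER_CYCLE = 5       # clocks per data-transfer bus cycle

def dma_seconds(total_bytes, bytes_per_transfer):
    """Time to move total_bytes when each bus cycle moves bytes_per_transfer."""
    transfers = total_bytes // bytes_per_transfer
    return transfers * CLOCKS_PER_CYCLE / DMA_CLOCK_HZ

TOTAL_BYTES = 128 * 1024   # the 128k byte move under discussion

t8 = dma_seconds(TOTAL_BYTES, 1)    # 8 bit channel: one byte per cycle
t16 = dma_seconds(TOTAL_BYTES, 2)   # 16 bit channel: one word per cycle
print(f"8-bit: {t8:.3f} s, 16-bit: {t16:.3f} s")
# prints: 8-bit: 0.218 s, 16-bit: 0.109 s
```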

A REP MOVSW takes 5+(4*CX) clocks. This means that there should be a read
and a write operation once every 4 clocks. It also means that there should
be two reads or writes to the same chip every 8 clocks. The factors that might
throttle the processor include 1) Memory access time and 2) Bus bandwidth.
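
As a sanity check on the ideal case, the 5+(4*CX) formula can be worked
through in a few lines of Python (again just for illustration; the
names are mine, and the 6 MHz clock is the figure from the quoted
article). This ignores wait states and refresh entirely, so it is a
best case and not a prediction.

```python
# Ideal REP MOVSW time from the quoted 5 + 4*CX clock count (no wait states).
CPU_CLOCK_HZ = 6_000_000   # standard AT clock, per the quoted article

def rep_movsw_clocks(cx):
    """Clock count for REP MOVSW moving cx words, per the quoted formula."""
    return 5 + 4 * cx

WORDS = 64 * 1024          # 128k bytes as 16 bit words

clocks = rep_movsw_clocks(WORDS)
seconds = clocks / CPU_CLOCK_HZ
print(f"{clocks} clocks, {seconds:.4f} s")
# prints: 262149 clocks, 0.0437 s
```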

At 4 clocks/instruction x 167 ns/clock = 668 ns/instruction, the
processor could do about 1.5 instructions / usec. In this time, we
must do both a read and a write of 16 bits. If we include 1 wait state
for each memory cycle on a standard AT, two memory cycles are
required, each 5 clocks long. So the I/O to memory runs at 4/10 the
speed of the instruction at 100% bus utilization, or 1.5 x .4 = .6
instructions / usec. But we don't have 100% of the bus, since memory
refresh takes 5.3% of it, so we are down to a bit under .6
instructions / usec. At ~ .6 words/usec, 64K words gives .109 sec.
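
The same estimate as a small Python sketch, for anyone who wants to
vary the assumptions. Every constant is taken from the paragraph above
and is an assumption, not a measurement, in particular the 5 clock
memory cycle and the 5.3% refresh share. The names are made up. Because
it does not round to .6 words/usec, it lands a little above .109 sec.

```python
# Throttled REP MOVSW estimate; all constants are the post's assumptions.
CLOCK_NS = 167             # one CPU clock at about 6 MHz
MEM_CYCLE_CLOCKS = 5       # assumed memory cycle length, incl. 1 wait state
MEM_CYCLES_PER_WORD = 2    # one read plus one write per word moved
REFRESH_SHARE = 0.053      # fraction of the bus taken by memory refresh

def words_per_usec(clocks_per_word, refresh_share=0.0):
    """Words moved per microsecond at the given cost per word."""
    ns_per_word = clocks_per_word * CLOCK_NS
    return (1000.0 / ns_per_word) * (1.0 - refresh_share)

ideal = words_per_usec(4)                                  # about 1.5
bus_bound = words_per_usec(MEM_CYCLE_CLOCKS * MEM_CYCLES_PER_WORD)
throttled = words_per_usec(MEM_CYCLE_CLOCKS * MEM_CYCLES_PER_WORD,
                           REFRESH_SHARE)

WORDS = 64 * 1024
seconds = WORDS / throttled / 1e6   # usec -> sec
print(f"ideal {ideal:.2f}, bus bound {bus_bound:.2f}, "
      f"with refresh {throttled:.2f} words/usec, total {seconds:.3f} s")
```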

Actual benchmarks indicate that MOV is more efficient than DMA on
PC's.  I welcome a critique of the analysis above. All of my
understanding of this stuff comes from reading technical ref. manuals.
If there is a flaw in the assumptions above, please point it out so I
can learn more about it.