[comp.arch] Standard DMA architectures

dave@dtg.nsc.com (David Hawley) (06/06/90)

This is a fairly lengthy posting; it contains an abstract and list of
objectives for an IEEE standard on bus-based DMA architectures.  The
chair of this P1212 subcommittee is Mike Wenzel of Hewlett-Packard
(you can contact him at mw@hprnd.hp.com).  The proposed standard needs
wider discussion, and I thought this would be a good forum for the
ideas presented.  If you want copies of the proposed standard, contact
Mike.  If the response is small enough, he can mail them to you directly;
otherwise they will be made available through ftp or by order from a
copy center.

Dave Hawley, National Semiconductor (dave@dtg.nsc.com)

--------------------------------------------------------------------------------

The DMA Framework document is part of the IEEE P1212 CSR Architecture
Specification:  a register set to be used for control, configuration,
identification, and diagnostics between nodes on a system bus interconnect.
The goal of the P1212 CSR Architecture is to provide a scalable, extensible
bus-independent framework for interoperability between nodes capable of
communicating across a system interconnect.

This DMA Framework specifies recommended architectures for providing high-
performance interfaces between I/O controllers and system memory.  Other DMA
architectures are possible in P1212-compliant systems.  But to maximize inter-
vendor compatibility, use of the frameworks as specified in this document is
recommended for initial development.

It must be emphasized that the objective is NOT to arrive at a single DMA model
that will be welded forever to P1212.  DMA specifications should be optional to
allow for newer and better schemes, or ones more suited to a given application.
Yet a point of departure is needed that will guide the initial development of
common interfaces.

The general idea is to carefully craft a very primitive, yet high-performance
means of passing messages across the bus.  This scheme should make minimal
demands on the instruction set and hardware required.  To this would be added
several simple conventions for the structure of the messages passed--to do
common operations in a common way.  The most important convention would be the
common structure used for representing data buffer segments.  These message-
passing and structure concepts would then form the building blocks on which
specific interfaces would be built, for example, for LAN, disc, and printer
interfaces.
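
As a concrete (and purely hypothetical) illustration of these building blocks,
a segment descriptor and message header might be sketched in C roughly as
follows; the names, field widths, and segment count are inventions for this
example, not part of the proposal.

    #include <stdint.h>

    /* One physically contiguous piece of a data buffer. */
    struct dma_segment {
        uint64_t phys_addr;        /* bus/physical address of the segment */
        uint32_t length;           /* length in bytes                     */
        uint32_t flags;            /* e.g. a last-segment marker          */
    };

    /* A message passed between Processor and Unit: a small command plus
     * a scatter/gather list describing the user data, so the data itself
     * need not be copied.
     */
    struct dma_message {
        uint32_t opcode;           /* operation requested                 */
        uint32_t transaction_id;   /* allows out-of-order completion      */
        uint32_t segment_count;    /* entries used in seg[]               */
        uint32_t reserved;
        struct dma_segment seg[8]; /* sized for the typical case          */
    };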

Objectives of the IEEE P1212 DMA Framework (Part III-A):

A. FUNCTIONAL

	 1. Support DMA to disc, tape, and communication devices.
	 2. Allow multiple active I/O transactions to be outstanding
	    simultaneously on a single DMA channel.
	 3. Provide for communication to multiple devices through a single DMA
	    channel (e.g., SCSI).
	 4. Interrupts can be synchronous or asynchronous.  Synchronous
	    interrupts are generated at the completion of a pre-specified data
	    transfer; events not associated with active DMA requests generate
	    asynchronous interrupts.
	 5. Support out-of-order processing of DMA requests.
	 6. Allow I/O Units to support both cache-coherent and mixed-coherent
	    systems with minimal special handling.
	 7. Support both powerful I/O co-processors and low-end micro-processor
	    or custom-IC-based I/O units.
	 8. For static data structures, contiguous blocks of memory larger
	    than a RAM page may be obtained only at system power-up time or
	    during system re-configuration (e.g. when adding a driver), and
	    even then such blocks are discouraged.
	 9. Dynamic data structures, i.e. those which may be allocated and
	    deallocated on a per-message or per-transaction basis, must not
	    require contiguous blocks of memory larger than a RAM page.  To
	    conserve physical RAM, data structures must be capable of being
	    sized for typical transactions and extensible, via multiple blocks
	    of contiguous RAM, to cover worst-case logical transaction sizes.
	    Contiguous blocks of physical RAM must be of a convenient size for
	    system memory managers (e.g. Unix "mbufs").  A sketch of one such
	    chained structure follows this list.
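
To illustrate objectives 8 and 9, here is one hypothetical way (names and
sizes invented for this sketch) to keep each allocation within a RAM page
while still covering a badly fragmented worst-case transfer by chaining
further blocks:

    #include <stdint.h>

    #define SEGS_PER_BLOCK 16      /* chosen so a block fits comfortably
                                      within one RAM page                 */

    struct seg_block {
        uint64_t next_block_phys;  /* 0 = end of chain, else the physical
                                      address of another seg_block        */
        uint32_t seg_count;        /* entries used in this block          */
        uint32_t reserved;
        struct {
            uint64_t phys_addr;    /* physical address of the segment     */
            uint32_t length;       /* length in bytes                     */
            uint32_t flags;
        } seg[SEGS_PER_BLOCK];
    };

A typical transaction fits in one block; a worst-case transaction simply
chains additional page-or-smaller blocks, which suits memory managers built
around small fixed-size buffers.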

B. PERFORMANCE

	 1. Performance equal to or better than that of today's systems.
	 2. Minimize the number of Processor reads to Unit CSRs.  Because of
	    bus converters, arbitration, and device and processing delays,
	    reads through the bus have higher latency than writes and keep the
	    Processor waiting for the response.
	 3. Maximize the amount of data that the Unit reads from System Memory
	    in one block-copy setup.  For similar reasons, the more work that
	    can be done per access to system memory, the better.
	 4. Require one or fewer writes to a Unit CSR per I/O transaction.  If
	    actions can be arranged into blocks in system memory, rather than
	    pokes of individual Unit CSRs, performance should improve (a
	    ring-and-doorbell sketch follows this list).
	 5. Result in one or fewer interrupts per I/O.
	 6. Be efficient for small transfers (20 bytes) and large transfers.
	 7. Avoid the need for additional, sequential copy steps, especially
	    for large user data segments.
	 8. Be efficient for both cached and non-cached Processors.  If only
	    one entity writes into a given cache line of memory, then line
	    ownership need not change, saving the bus traffic of coherence
	    protocols.  For non-coherent Units and agents there is also a
	    robustness consideration: it is easy for the flush of a Processor's
	    cache to unintentionally overwrite adjacent bytes of memory that
	    were written by the Unit.
	 9. Minimize overhead bytes: data moved through the bus that is of no
	    interest to the party on the other side, or that is only an artifact
	    of the DMA model itself.
	10. Be optimal for the normal case of command size and data
	    fragmentation, yet expandable to cover the worst case.
	11. Efficiently vector interrupts when there are many Units in a system.
	    The source of an interrupt (one of 100's) should be easily
	    determined.  To simplify Processor designs, efficient dispatch to
	    the proper interrupt routine should be possible even when large
	    numbers of Units share the same interrupt bit.
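
A minimal driver-side sketch of how objectives 2, 4 and 5 might be met
together, assuming a request ring kept in System Memory and a single
write-only doorbell CSR on the Unit (all names and the ring layout are
inventions for this example):

    #include <stdint.h>

    struct request {               /* one ring slot, e.g. a message of
                                      the kind sketched earlier           */
        uint32_t words[16];
    };

    struct req_ring {
        struct request *slots;     /* ring of requests in System Memory   */
        uint32_t        size;      /* number of slots, a power of two     */
        uint32_t        prod;      /* next slot the Processor will fill   */
    };

    /* The only Unit CSR the Processor touches per transaction: a
     * write-only doorbell that receives the new producer index.
     */
    extern volatile uint32_t *unit_doorbell;   /* mapped by platform code */

    static void post_request(struct req_ring *r, const struct request *m)
    {
        r->slots[r->prod & (r->size - 1)] = *m;  /* build in System Memory */
        /* a write barrier belongs here on weakly ordered systems */
        r->prod++;
        *unit_doorbell = r->prod;                /* one write, zero reads  */
    }

The Unit can then block-read a batch of newly posted slots (objective 3),
write status records into a completion ring in System Memory, and raise at
most one interrupt for the whole batch (objective 5).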

C. COST

	 1. Minimize the required hardware cost (e.g., for Processors, memory
	    controllers and device adaptors).  Yet this objective must be kept
	    in balance with the others, instead of overriding them, especially
	    performance.
	 2. Fit the implementation into a reasonable area of silicon using 1990
	    technology.  Product development projects need backplane DMA chips.
	 3. Do not require the DMA controller to interpret virtual addresses.
	    Virtual addressing is more expensive and is highly vendor-dependent.
	 4. Avoid the need to completely double-buffer data for active
	    transactions in both System Memory and the I/O Unit.

D. SYSTEM COMPATIBILITY

	 1. Avoid the use of interlocked transactions (instructions).  The DMA
	    model should not require the implementation of locked bus
	    transactions, since these are not well supported by many existing
	    bus standards.  This is also a performance item because of
	    processing time, bus cycles and potential contention for a lock
	    (a lock-free indexing sketch follows this list).  However...
	 2. Do NOT prohibit the use of locks, to support large system
	    configurations efficiently.
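
One common way to satisfy both points, sketched here with invented names, is
to give each shared index exactly one writer, so no locked read-modify-write
bus cycles are ever needed, while still leaving room for locks in
implementations that have them:

    #include <stdint.h>

    struct ring_indices {
        volatile uint32_t prod;    /* written only by the Processor */
        volatile uint32_t cons;    /* written only by the Unit      */
    };

    /* Both tests use plain reads and subtraction on free-running
     * indices; neither side ever performs an interlocked bus cycle.
     */
    static int ring_full(const struct ring_indices *ix, uint32_t size)
    {
        return (uint32_t)(ix->prod - ix->cons) == size;
    }

    static int ring_empty(const struct ring_indices *ix)
    {
        return ix->prod == ix->cons;
    }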

E. FLEXIBILITY
	
	 1. Make low-level primitives as simple as possible (for both
	    flexibility and efficiency).
	 2. Efficiently adapt to a wide range of applications.
	 3. Must be scalable to large system configurations.
	 4. Except for processor interrupt CSRs, avoid specifying CSR read or
	    write characteristics that would require custom hardware.  Treat
	    each CSR as though it could be read or written by a microprocessor
	    in the Unit.  This includes avoiding bits that do not simply store
	    the value written, reads that produce side-effects, writes that
	    must take effect instantaneously, etc.  This is also a robustness
	    issue in cases where bus transactions can be retried by low-level
	    agent hardware:  for example, if writing a register causes a count
	    to increment, the count could be thrown off when a bridge retries a
	    bus write transaction.  (A small example follows this list.)
	 5. Leverage: DMA message-passing mechanisms also should be usable for
	    processor-to-processor communication.
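
As a small, purely illustrative example of point 4 (names invented here): a
doorbell register that holds an absolute producer index is safe to retry,
whereas one that increments a count on every write is not.

    #include <stdint.h>

    extern volatile uint32_t *unit_doorbell;   /* mapped by platform code */

    /* Fragile: the register means "add one pending request", so a bus
     * write retried by a bridge makes the Unit count the request twice.
     */
    static void notify_by_increment(void)
    {
        *unit_doorbell = 1;
    }

    /* Robust: the register holds the absolute producer index, so a
     * retried write merely stores the same value again.
     */
    static void notify_by_index(uint32_t producer_index)
    {
        *unit_doorbell = producer_index;
    }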