[net.unix] Raw vs. block device

v.wales%ucla-locus@sri-unix.UUCP (01/06/84)

From:            Rich Wales <v.wales@ucla-locus>

Jonathan --

Here is an attempt on my part to describe "block" and "raw" I/O in as
much detail as reasonably possible.  If I have inadvertently made some
misstatement, or left out some important feature, I trust one of the
other "veterans" on this list will correct me.

UNIX has two kinds of device interfaces:  "block" and "character" (also
called "raw").  I'll discuss the "raw" interface first, since it is the
lower-level of the two, and since virtually all devices with block
interfaces have a raw interface as well.

RAW (CHARACTER) DEVICE INTERFACE

    Generally speaking, the "raw" interface to a device gives you direct
    control over that device.  If you do a "read" system call on a disk
    via the "raw" interface, for example, you will generally invoke a
    single input operation on that disk to read your data.  (There may
    be exceptions here; for example, I once wrote a "raw" device driver
    for an RX02 floppy disk, and since this device can read or write
    only one sector at a time, I implemented long "read" or "write" re-
    quests via multiple I/O commands to the drive.)
    
    Raw I/O is "synchronous":  I/O operations are always done in the
    order requested.  There can never be more than one raw I/O request
    pending per device.  In 4.1BSD, this restriction is generally imple-
    mented by having the driver declare a single "buf" structure per
    device for all raw I/O on that device.  All raw I/O for the device
    goes through a routine called "physio" (in dev/bio.c); "physio" in
    turn checks and manipulates a "busy" status bit in the "buf" struc-
    ture, using the kernel's "sleep"/"wakeup" facility to force requests
    on a busy "buf" structure to wait.
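
    As a rough illustration of the busy-bit handshake just described,
    here is a user-space model (not the actual "physio" source).  The
    flag values, the "rawbuf" structure and both function names are
    made up for this sketch.  Where a real kernel would "sleep" and
    later "wakeup", the model simply reports that fact to the caller.

```c
#include <stdbool.h>

/* Flag names are in the spirit of the kernel's; values are arbitrary. */
#define B_BUSY   0x01   /* a raw transfer currently owns the buf */
#define B_WANTED 0x02   /* somebody is waiting for the buf */

/* Hypothetical stand-in for the single per-device raw "buf". */
struct rawbuf {
    int b_flags;
};

/*
 * Try to claim the per-device raw buf.  Returns true on success.
 * On failure the buf is marked as wanted; a real kernel would
 * sleep() on the buf here and retry once woken.
 */
bool rawbuf_acquire(struct rawbuf *bp)
{
    if (bp->b_flags & B_BUSY) {
        bp->b_flags |= B_WANTED;
        return false;
    }
    bp->b_flags |= B_BUSY;
    return true;
}

/*
 * Release the buf.  Returns true if a waiter was recorded, which is
 * the point at which a real kernel would call wakeup().
 */
bool rawbuf_release(struct rawbuf *bp)
{
    bool wanted = (bp->b_flags & B_WANTED) != 0;

    bp->b_flags &= ~(B_BUSY | B_WANTED);
    return wanted;
}
```

    Because there is only one such structure per device, a second
    request cannot even start until the first has released it, which
    is what keeps raw requests strictly one at a time.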

    Raw I/O is generally subject to any requirements imposed by the
    hardware itself.  For example, if a given disk demands (as most do)
    that all I/O operations start on a sector boundary and comprise an
    integral number of full sectors, then you must observe this restric-
    tion when doing raw I/O on that disk.
    
    If you do try to read or write arbitrary amounts of data at
    arbitrary places on a disk via a raw interface, you are likely to
    get unpredictable results.  (In particular, a misaligned "write" is
    liable to trash innocent data.)  If the driver is well written and
    checks for this situation, you may get an explicit error, but you
    shouldn't in general depend on this.  This, by the way, is why you
    can't use "adb" on a raw device:  "adb" reads and writes a few bytes
    at a time at arbitrary offsets.
    
    In my RX02 driver mentioned above, I chose to implement multi-sector
    "read"s and "write"s as a convenience to the user.  I could have
    forbidden them (because the RX02 hardware doesn't support them) and
    still been perfectly within the philosophy of raw I/O interfaces.
    My driver still required all transfers to start on sector boundaries
    and comprise an integral number of full sectors, though -- and I
    explicitly tested for violations of this constraint before doing the
    I/O.
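
    A check of this kind can be quite small.  The following is only a
    sketch of the idea, not code from any real driver; the function
    name and its arguments are hypothetical.

```c
#include <stdbool.h>

/*
 * Returns true if a raw transfer of `count` bytes starting at byte
 * `offset` begins on a sector boundary and covers whole sectors only.
 */
bool raw_xfer_ok(long offset, long count, long sector_size)
{
    if (sector_size <= 0 || offset < 0 || count <= 0)
        return false;
    return (offset % sector_size) == 0 && (count % sector_size) == 0;
}
```

    A driver would make such a test before starting the transfer and
    return an error to the caller if it fails, rather than passing a
    misaligned request on to the hardware.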

    Raw I/O on terminal lines is somewhat complicated by the use of the
    "clist" mechanism (see sys/prim.c).  Hence, terminal I/O may be to
    some extent asynchronous, even though a "raw" interface is in use.

BLOCK DEVICE INTERFACE

    The block interface (if one exists) to a device goes through a com-
    plicated buffering/caching scheme.  A number of buffers (each one
    1024 bytes long in 4.1BSD, or 512 bytes long in Version 7) are allo-
    cated by the kernel for block I/O.  Each buffer is labelled with the
    device (major/minor) and block numbers, so that repeated references
    to the same block do not result in actual "read" operations if the
    block is already in main memory.
    
    Each buffer has a "dirty" bit, so that the data is not written back
    to disk immediately upon the issuance of a "write" system call.
    Data is written back when the buffer is needed for another block
    (LRU caching strategy); when a "sync" system call is issued by a
    process; or when the block device is closed (or unmounted, if it
    held a mounted file system).
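
    To make the labelling, the dirty bit and the LRU write-back a bit
    more concrete, here is a toy user-space model.  It is not the
    kernel's buffer cache:  the "disk" is an in-memory array, the cache
    has only two buffers, buffers are keyed by block number alone (no
    device number), and every name in it is invented for the sketch.

```c
#include <string.h>

#define BLKSIZE 1024
#define NBLOCKS 8       /* size of the pretend disk, in blocks */
#define NBUF    2       /* deliberately tiny cache */

static unsigned char disk[NBLOCKS][BLKSIZE];
static int disk_reads, disk_writes;     /* count real "device" I/O */

struct cbuf {
    int  valid;         /* buffer currently holds a block */
    int  blkno;         /* which block it holds */
    int  dirty;         /* modified since last written to disk */
    long last_used;     /* timestamp for LRU replacement */
    unsigned char data[BLKSIZE];
};

static struct cbuf cache[NBUF];
static long clock_tick;

/* Write a buffer back to the pretend disk if it is dirty. */
static void flush(struct cbuf *bp)
{
    if (bp->valid && bp->dirty) {
        memcpy(disk[bp->blkno], bp->data, BLKSIZE);
        disk_writes++;
        bp->dirty = 0;
    }
}

/* Return the buffer holding blkno, reading the block in on a miss. */
struct cbuf *cache_get(int blkno)
{
    struct cbuf *victim = &cache[0];
    int i;

    for (i = 0; i < NBUF; i++) {
        if (cache[i].valid && cache[i].blkno == blkno) {
            cache[i].last_used = ++clock_tick;  /* hit: no disk I/O */
            return &cache[i];
        }
    }
    /* Miss: take an unused buffer, else the least recently used one. */
    for (i = 0; i < NBUF; i++) {
        if (!cache[i].valid) {
            victim = &cache[i];
            break;
        }
        if (cache[i].last_used < victim->last_used)
            victim = &cache[i];
    }
    flush(victim);                      /* delayed write happens here */
    memcpy(victim->data, disk[blkno], BLKSIZE);
    disk_reads++;
    victim->valid = 1;
    victim->blkno = blkno;
    victim->dirty = 0;
    victim->last_used = ++clock_tick;
    return victim;
}

/* Modify one byte in the cache; the disk is not touched yet. */
void cache_write_byte(int blkno, int off, unsigned char val)
{
    struct cbuf *bp = cache_get(blkno);

    bp->data[off] = val;
    bp->dirty = 1;
}

/* Analogue of "sync": push every dirty buffer out to disk. */
void cache_sync(void)
{
    int i;

    for (i = 0; i < NBUF; i++)
        flush(&cache[i]);
}
```

    In this model, two writes to the same block cost one disk read and
    no disk writes until the buffer is either reclaimed for another
    block or flushed by cache_sync().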

    A "block" driver interface to a device is free to perform I/O opera-
    tions in any order it sees fit -- not necessarily the order in which
    "read" or "write" system calls were issued.  (Hence, while raw I/O
    is "synchronous", block I/O is "asynchronous".)  Most disk drivers
    use a queue of pending I/O requests for each drive, sorted in order
    by cylinder so as to allow the disk arm to sweep back and forth
    across the surface in "elevator" fashion.  In a "raw" interface, on
    the other hand, there is no need for a queue of pending requests,
    since by definition only one raw I/O request can ever be pending for
    any given device.
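
    The cylinder-ordered queue can be pictured with a simple sorted
    insertion, sketched below.  This is a simplification:  a real
    driver's sorting routine also has to take the current arm position
    and direction of travel into account, which this sketch ignores.
    The structure and function names are hypothetical.

```c
#include <stddef.h>

/* One pending request; a real driver would queue "buf" structures. */
struct ioreq {
    int cylinder;
    struct ioreq *next;
};

/* Insert req into the list at *head, keeping ascending cylinder order. */
void enqueue_sorted(struct ioreq **head, struct ioreq *req)
{
    struct ioreq **pp = head;

    while (*pp != NULL && (*pp)->cylinder <= req->cylinder)
        pp = &(*pp)->next;
    req->next = *pp;
    *pp = req;
}
```

    Requests for cylinders 50, 10 and 30, queued in that order, end up
    on the list as 10, 30, 50, so the arm can service them in a single
    sweep instead of seeking back and forth.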

    The buffering scheme allows you to do I/O with arbitrary byte off-
    sets and byte counts, even if the device itself does not support
    such access.  For example, if you want to write a single byte in the
    middle of a block using the block interface, the kernel will read in
    the entire block and then change the single byte in question.  An
    I/O operation which spans multiple blocks (perhaps starting in the
    middle of one block and ending in the middle of another) is handled
    in a similar fashion.
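
    The read-modify-write idea can be sketched in user space as
    follows.  The "device" here is an in-memory array that can only be
    read or written a whole block at a time; all names are invented for
    the sketch.  Unlike the kernel, this version writes each block back
    immediately instead of leaving it dirty in a cache, and it always
    reads the block first even when the whole block is being replaced.

```c
#include <string.h>

#define BSIZE 1024
#define NBLK  4

/* Pretend device: whole-block access only. */
static unsigned char dev[NBLK][BSIZE];

static void dev_read(int blkno, unsigned char *buf)
{
    memcpy(buf, dev[blkno], BSIZE);
}

static void dev_write(int blkno, const unsigned char *buf)
{
    memcpy(dev[blkno], buf, BSIZE);
}

/*
 * Write `count` bytes from `src` at an arbitrary byte `offset`.
 * Returns the number of blocks touched, or -1 if the request does
 * not fit on the device.
 */
int blk_write(long offset, const unsigned char *src, long count)
{
    unsigned char buf[BSIZE];
    int touched = 0;

    if (offset < 0 || count < 0 || offset + count > (long)NBLK * BSIZE)
        return -1;

    while (count > 0) {
        int  blkno  = (int)(offset / BSIZE);
        long within = offset % BSIZE;   /* starting byte inside block */
        long n      = BSIZE - within;   /* room left in this block */

        if (n > count)
            n = count;

        dev_read(blkno, buf);                   /* fetch whole block */
        memcpy(buf + within, src, (size_t)n);   /* change part of it */
        dev_write(blkno, buf);                  /* put whole block back */

        offset += n;
        src    += n;
        count  -= n;
        touched++;
    }
    return touched;
}
```

    A one-byte write at offset 1500 touches only block 1 (byte 476
    within it); a 1000-byte write at offset 600 touches blocks 0 and 1,
    leaving the bytes on either side of the written range unchanged.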

    Needless to say, the routines which implement regular file I/O are
    built on the block I/O mechanism.

WHICH DEVICES ARE BLOCK?  WHICH DEVICES ARE RAW?

    In general, every device will have a raw interface.  Additionally,
    a device on which it would make sense to put a file system (i.e.,
    disks) will generally have a block interface.  Most tape drivers
    also have a block interface, although I have never had occasion to
    access a tape by anything but the raw interface.

    If you are doing a "dd" (byte-for-byte copy) of a large area of disk
    (say, for example, that you are moving a file system from one part
    of the disk to another), you should probably use the raw interface,
    since it is far more efficient than the block interface.  In partic-
    ular, large block sizes in "dd" can generally be handled by the raw
    disk interfaces, whereas the block interface will cut a large trans-
    fer down into 1K-byte chunks.

    Terminals have only a raw interface.  Also, such "funny" files as
    /dev/null and /dev/kmem are implemented via raw interfaces.  (Of
    course, you can still do I/O on /dev/kmem at arbitrary offsets and
    with arbitrary byte counts, since memory does not have the alignment
    restrictions that a disk does.)

DEVICE SPECIAL FILES AND RAW VS. BLOCK I/O

    There are two kinds of device special files in UNIX:  raw and block.
    The major device number of a device special file is associated with
    a set of device-driver routines via one of two tables in dev/conf.c:
    "cdevsw" ("c" = "character" = "raw") for raw devices, and "bdevsw"
    ("b" = "block") for block devices.  In particular, note that there
    is no necessary relationship whatsoever between "raw" major device
    number N and "block" major device number N.

I hope this covers your question adequately.  If not, let me know and I
will try to supply additional information.

-- Rich <v.wales@UCLA-LOCUS>

v.wales%ucla-locus@sri-unix.UUCP (01/08/84)

From:            Rich Wales <v.wales@ucla-locus>

Steve --

One statement you made in your explanation of raw vs. block I/O may need
a little bit of clarification or expansion.

	Physio() hands to the disk device strategy routine the
	"block number" of the request.  The block number is
	derived quite simply as u.u_offset>>BSHIFT.  u.u_offset
	is the current "lseek" position of the open raw device
	file, BSHIFT is log2(BSIZE).  Thus, all RAW I/O opera-
	tions must occur on a BSIZE boundary.

This is true in most cases (in particular, it is true for the disk driv-
ers that come distributed with most UNIX systems), but it need not be
true in general.

The "physio" routine does set the "b_blkno" variable to the disk block
number, as you described.  However, a raw driver interface does not have
to use the "b_blkno" value proffered by "physio" if it doesn't want to
-- since it still has access to "u.u_offset" in the user structure.

So it is possible to design a raw interface that doesn't do operations
on a BSIZE boundary -- provided that the device in question supports
such activity.  Indeed, I did this very thing in an RX02 driver I wrote
-- since the sector size on an RX02 is either 128 or 256 bytes (depend-
ing on the density of the disk), I ignored the "b_blkno" calculated by
"physio" and did my own computation based on "u.u_offset".
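
To show the difference in arithmetic, here is a small sketch (not the
actual driver code).  BSHIFT is assumed to be 9, i.e. a 512-byte BSIZE;
the function names are hypothetical, and the byte offset is passed in
as a plain argument rather than read from "u.u_offset".

```c
#define BSHIFT 9        /* assumption: log2(BSIZE) with BSIZE = 512 */

/* Block number computed by shifting the byte offset, as described. */
long blkno_from_offset(long offset)
{
    return offset >> BSHIFT;
}

/*
 * Sector number computed from the byte offset and the actual sector
 * size (128 or 256 bytes on an RX02, depending on density).
 * Returns -1 if the offset is not on a sector boundary.
 */
long sector_from_offset(long offset, int sector_size)
{
    if (sector_size <= 0 || offset < 0 || offset % sector_size != 0)
        return -1;
    return offset / sector_size;
}
```

With 128-byte sectors, byte offset 384 is the start of sector 3, yet
the shifted value is block 0; the block number alone cannot tell the
driver which of the four sectors inside that 512-byte span was meant.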

This brings up another interesting facet of UNIX, by the way -- namely,
"When can you safely refer to the user structure in kernel code?"

The basic rule is that you can safely refer to data in the user struc-
ture only in those parts of the kernel code that are executed in direct,
immediate response to a request by the user program -- such as a system
call (CHMK instruction) or a trap.  In the case of a system call or a
trap, the user process context (including, in particular, the process's
user structure) is retained and can safely be referenced.

Kernel code executed asynchronously (e.g., as the result of an inter-
rupt), on the other hand, must not do anything with or to the user
structure, because the process that happened to be running at the time
of the interrupt is, in general, simply an innocent bystander with no
logical connection to the interrupt condition.  This is the reason, by
the way, why you can't put a "uprintf" (kernel-generated write to user's
terminal) in an interrupt routine -- "uprintf" identifies the user's
terminal by looking in the user structure (u.u_ttyp), and a "uprintf" in
an interrupt routine would end up writing to a random terminal.

In dev/hp.c, for example, the only routines where it is safe to refer to
the user structure are "hpread" and "hpwrite" (which are called as a re-
sult of a "read" or "write" on a raw device).  Although "hpstrategy" is
also used for raw I/O, you can't refer to the user structure in it be-
cause it is also used for block I/O (which is asynchronous to the pro-
cess or processes doing the "read"s or "write"s).

In the case of the RX02 driver I mentioned earlier, I put code in my
"rxstrategy" routine to compute a sector number based on "u.u_offset"
and the disk density.  I could safely do this because my driver sup-
ported ONLY a "raw" interface, and thus my "strategy" routine would al-
ways be invoked in the context of the user process that issued the
"read" or "write" system call.  If I had chosen to implement a "block"
interface as well, I would have had to use two different "strategy"
routines -- one for raw I/O (specified as an argument to the "physio"
calls), and one for block I/O (in the "bdevsw" array in dev/conf.c).
The "raw strategy" routine could safely use "u.u_offset"; the "block
strategy" routine, on the other hand, would have to make do with the
"b_blkno" value it finds in the "buf" structure.

-- Rich <v.wales@UCLA-LOCUS>