[comp.unix.sysv386] The performance implications of the ISA bus

pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/11/90)

On 5 Dec 90 14:44:45 GMT, jcburt@ipsun.larc.nasa.gov (John Burton) said:

In article <1990Dec5.144445.18632@abcfd20.larc.nasa.gov>
jcburt@ipsun.larc.nasa.gov (John Burton) writes:

jcburt> In article <PCG.90Dec4160737@odin.cs.aber.ac.uk>
jcburt> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pst> You're comparing CPU performance to I/O performance. [ ... ] Back
pst> when there were REAL(tm) computers like 780, a lot of time and
pst> energy went into designing efficient I/O from the CPU bus to the
pst> electrons going to the disk or tty. [ ... ] Sure OS's and apps have
pst> gotten bloated, but when you put a chip like the MIPS R3000 on a
pst> machine barely more advanced than an IBM-AT you end up with a toy
pst> that can think fast but can't do anything.

pcg> No, no, no, no, no, no, no. The IO bandwidth of a typical 386 is
pcg> equivalent or better than that of any UNIBUS based machine, and, in
pcg> practical terms, equivalent to that of MASSBUS based ones. You can get
pcg> observable raw disc data rates of 600-900KB/s and observable filesystem
pcg> bandwidths of 300-500KB/s under SVR3.2 (with suitable controllers and a
pcg> FFS of some sort). This is way better than a PDP-11.

jcburt> True, a typical 386 machine has good I/O bandwidth, but
jcburt> bandwidth isn't everything. The majority of 386 machines have an
jcburt> ISA bus which is a very simple bus controlled by the cpu. When
jcburt> performing I/O, the cpu blocks itself and turns control of the
jcburt> bus to the I/O device.

This is not quite true. Actually, it is not true at all. You seem to be
describing synchronous programmed IO, which is not used by most ISA
peripherals. Most ISA peripherals are interrupt driven, and some even
use DMA, and the CPU can work between interrupts. Definitely.

jcburt> Machines that were originally designed as a multi-user platform
jcburt> were usually set up so that the I/O could be performed without
jcburt> the direct control (or blocking) of the cpu. The system bus was
jcburt> designed so that multiple operations could occur more or less
jcburt> independent of the cpu (multi-tasking hardware design).

This is entirely true of the ISA bus and any PC system around. Hey, they
even have DMA (well, read on).

However, I can easily see that your misconceptions have their root in
three problems with typical ISA machines: one that is particular to the
design of a PC clone, and two that are particular to the most common
disk controller design for such machines.

For a very ugly reason, the DMA chips that perform DMA under CPU
control are nearly useless for high speed transfers, and on some designs
the braindamage is bad enough that the few slow DMA channels available
cannot even be shared. But there is no such restriction for DMA driven
by a peripheral board itself, not by the CPU, and some (rare) boards
have bus mastering ability and their own DMA onboard.

Since DMA using the CPU controlled DMA channels is so bad, the standard
WD style AT controller does not use DMA. It is interrupt driven, so
while the controller is seeking the disk or transferring data the CPU is
free. When the controller is done seeking and transferring, the CPU gets
an interrupt, and then copies the sector word by word, with a very fast
block move, from the controller's onboard buffer to core. This is
indeed done using programmed IO, synchronously, and the CPU is busy
while doing it, but it takes relatively little time.
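
For the curious, the inner loop of that per-sector copy looks more or
less like the sketch below. This is an illustration only: inw() is
stubbed so it compiles and runs standalone, the port number is just the
conventional AT value, and a real driver would use a single rep insw
block move rather than a C loop.

    /* Minimal sketch of the per-sector programmed-IO copy. */
    #include <stdio.h>

    #define WD_DATA_PORT     0x1F0  /* conventional AT disk data port */
    #define WORDS_PER_SECTOR 256    /* 512 bytes = 256 16-bit words */

    static unsigned short inw(int port)  /* stub, for illustration */
    {
        (void)port;
        return 0xABCD;
    }

    /* Copy one sector from the controller's buffer into memory. */
    static void wd_copy_sector(unsigned short *dst)
    {
        int i;
        /* A real driver does this with one "rep insw" block move;
         * this loop is the C equivalent. */
        for (i = 0; i < WORDS_PER_SECTOR; i++)
            dst[i] = inw(WD_DATA_PORT);
    }

    int main(void)
    {
        unsigned short sector[WORDS_PER_SECTOR];
        wd_copy_sector(sector);
        printf("first word: 0x%04X\n", sector[0]);
        return 0;
    }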

Finally, the common type of ISA disk controller, for other relatively
ugly reasons, is single threaded. This means that it cannot overlap
seeks and transfers to/from multiple disks. It cannot overlap multiple
transfers because of the above mentioned sector buffer; there is only
one sector buffer... In theory it could overlap seeks on two drives, or
a seek on one with a transfer on another, and indeed this can be done
with seek buffering (ST506) devices using a clever (and obscene) hack.

The really big problem for multiuser operation is the lack of overlap;
the authors of the UNIX disk driver sort routine report that, with a
multithreaded controller on a PDP-11, three moving arm disks operating
in parallel give, under typical timesharing loads, the same performance
as if they were a single fixed arm disk with the sum of their
capacities.

This means that with a multithreaded disk controller, three disks, and
typical timesharing load, the ability to move three arms in parallel is
the same as having a single zero seek time arm. A big, big, big win.

Two disks on a multithreaded disk controller are already a very large
improvement over a single disk for timesharing, especially if you spread
the (instantaneous) load across them by careful positioning of your
partitions.

Now back to the ISA bus. As somebody observes elsewhere, the IO
bottlenecks of a timesharing system are the terminal lines and the disk
controllers. If you use intelligent terminal controllers and intelligent
multithreaded disk controllers, your timesharing performance will be
impressive, on a par with that of a VAX of the same class.

Just using FIFO based serial line controllers substantially reduces
terminal IO overhead; just using two ESDI controllers, one per disk,
will give tremendous improvements, because the two controllers will be
able to seek and transfer in parallel.

If you want higher performance use a microprocessor based intelligent
serial line controller, and something like an AHA 154x disk controller,
that is multithreaded, bus mastering, and has its own fast DMA channels.

Ah, a final note: if you really want high performance from your
multiuser ISA machine, DO NOT use the console in any way. Access to
video RAM is so abysmally slow that it could consume a large portion of
your bus bandwidth. If you want to do fast graphics on an ISA machine,
buy an X terminal and a fast Ethernet board and don't use the console,
unless you get a really expensive super intelligent video board with
very fast, truly 16 bit memory; but I think that for timesharing the X
terminal solution is still better, and not much more expensive, because
it allows further overlap between the generation of the graphics and its
rendering on the screen.


In summary: to saturate an ISA bus (5 MB/sec) you need a pretty large
number of peripherals running continuously, such as more than three
disks (say 800KB/sec each) and a network board (say 600KB/sec), which
brings us to about 2/3 of nominal. Things like a QIC tape (90KB/sec), 8
serial ports (20KB/sec for eight ports simultaneously at 19200 baud),
and so on are irrelevant for bandwidth. You then have a problem with the
typically high interrupt processing overheads of 386 UNIX systems, with
their often badly written drivers, but if you use the right controllers
even these are not that important.
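
The arithmetic, spelled out as a trivial check of the figures quoted
above (all the rates are the estimates from the text):

    #include <stdio.h>

    int main(void)
    {
        /* Estimated sustained rates from the text, in KB/sec. */
        double disks   = 3 * 800.0;  /* three fast disks */
        double network = 600.0;      /* one Ethernet board */
        double tape    = 90.0;       /* QIC tape */
        double serial  = 20.0;       /* 8 ports at 19200 baud */
        double bus     = 5 * 1024.0; /* nominal ISA bandwidth */

        double total = disks + network + tape + serial;
        printf("aggregate demand: %.0f KB/sec (%.0f%% of nominal)\n",
               total, 100.0 * total / bus);
        return 0;
    }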

Let's say that a machine with 8 FIFO based serial lines, 2 sub-20msec
seek time discs attached to an AHA154x, a 386/25 noncaching motherboard
(4 MIPS, let's say), and 16 MBytes can comfortably support 8 users doing
fairly heavy development work, even using things like G++ and GNU Emacs.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

rreiner@yunexus.YorkU.CA (Richard Reiner) (12/11/90)

Thanks, Piercarlo Grandi, for your clarifying analysis of ISA bus +
disk issues.  I wonder if I could ask you one or two questions.

>just using two ESDI controllers, one per each disk, will give
>tremendous improvements [because of multi-threaded operation]

What about using SCSI equipment?  Do there exist SCSI host adaptors
for the ISA bus which support multi-threaded operation?

And what about track-buffering ESDI controllers?  Would their
advantages go away if they were used in the setup you suggest (since
you claim that one would get effectively near-zero seek times anyway)?

--richard

dougp@ico.isc.com (Doug Pintar) (12/12/90)

In article <18871@yunexus.YorkU.CA> rreiner@yunexus.YorkU.CA (Richard Reiner) writes:
>
>Thanks, Piercarlo Grandi, for your clarifying analysis of ISA bus +
>disk issues.  I wonder if I could ask you one or two questions.
>
>>just using two ESDI controllers, one per each disk, will give
>>tremendous improvements [because of multi-threaded operation]
>
>What about using SCSI equipment?  Do there exist SCSI host adaptors
>for the ISA bus which support multi-threaded operation?
>
>And what about track-buffering ESDI controllers?  Would their
>advantages go away if they were used in the setup you suggest (since
>you claim that one would get effectively near-zero seek times anyway)?
>
The comments below are intended to relate to ISC Unix, but most will
apply in the general case (HPDD stuff notwithstanding) -- DLP

First, the use of two ESDI controllers will swamp the system before giving
you much advantage.  Remember, standard AT controllers interrupt the system
once per SECTOR.  The interrupt code must then push or pull 256 16-bit words
to/from the controller.  Given an ESDI raw transfer rate of 800 KB/sec (not
unreasonable for large blocks) that's 1600 interrupts per second, each with
a (not real fast, due to bus delays) 256-word PIO transfer.  Try getting two
of those going at once and the system drags down REAL fast.  I've tried it on
a 20 MHz 386 and found at most a 50% improvement in aggregate throughput
using 2 ESDI controllers simultaneously.  At that point, you've got 100% of
the CPU dedicated to doing I/O and none to user code...

Two drives on a single AT-compatible controller will gain you something
in latency-reduction, as the HPDD does some cute tricks to overlap seeks.

Bus-mastering DMA SCSI adapters, like the Adaptec 154x (ISA) or 1640 (MCA)
provide MUCH better throughput.  They ARE multi-threaded, and the HPDD will
try to keep commands outstanding on each drive it can use.  The major win is
that the entire transfer is controlled by the adapter, with host intervention
only when a transfer is complete.  You get lots more USER cycles this way!
The limiting factor here is how fast you can get transfers happening between
the bus and memory.  This varies from motherboard to motherboard and is
unrelated to bus speed or processor speed.  You normally want to tune the
SCSI adapter to have no more than a 50% on-bus duty cycle, or you start
losing floppy bytes (and, in the worst case, refresh!).  On Compaq and
Micronics motherboards, you can go at 5.7 MB/sec bursts.  Some motherboards
can go at 6.7 and others will go up to 8.  Your max rate will be about half
this, given the 50% bus duty cycle limit.  Arbitration for the SCSI bus can
limit this even more if you've got a bunch of drives trying to multiplex data
through a slow pipe to memory.  I found that I couldn't get much over 1.7
MB/sec using 3 simultaneous SCSI drives on a Compaq.  Going to more drives
actually slowed things down due to extra connections and releases of the SCSI
bus.  I would imagine I'd see a big improvement if I could get the transfer
rate up to the 8 MB/sec burst rate.
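
The duty-cycle arithmetic, for concreteness (burst figures as quoted
above; the 50% cap is the tuning rule of thumb from the text):

    #include <stdio.h>

    int main(void)
    {
        double burst[] = { 5.7, 6.7, 8.0 }; /* MB/sec, per motherboard */
        double duty = 0.5;                  /* 50% on-bus duty cycle cap */
        int i;

        for (i = 0; i < 3; i++)
            printf("burst %.1f MB/sec -> sustained cap %.2f MB/sec\n",
                   burst[i], burst[i] * duty);
        return 0;
    }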

I'm still not convinced that caching controllers are a big win over a large
Unix buffer cache.  I usually use 1-2 MB of cache, and a couple-MB RAMdisk
for /tmp if I have the memory available.  Using system memory as a cache is
LOTS faster than going over the bus to cache on a controller, and I trust the
Unix disk updater more than some unknown algorithm used in a controller.
At least when you shut Unix down with a normal controller, you know you can
really power the system down.  With some controllers, there's an unknown
latency time before the final 'sync' and write of the superblock actually
gets out there.  Could get ugly.

As usual, should any opinion of mine be caught or killed, ISC will disavow
any knowledge of me...

Doug Pintar

pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/13/90)

On 11 Dec 90 15:37:41 GMT, rreiner@yunexus.YorkU.CA (Richard Reiner) said:

pcg> just using two ESDI controllers, one per each disk, will give
pcg> tremendous improvements [because of multi-threaded operation]

rreiner> What about using SCSI equipment?  Do there exist SCSI host adaptors
rreiner> for the ISA bus which support multi-threaded operation?

Ah yes, the common recommendation is the Adaptec Host Adapter 154xB. It
sings, it dances, it is a floor wax and a dessert topping. Not only is
it multithreaded, it does bus mastering without CPU involvement, does
DMA with its own fast DMA technology, and does scatter/gather in
hardware with command chaining. In other words, it is more of an IO
coprocessor than a crude disk controller. The ISC HPDD exploits all its
wonderful aspects.

The only defect of the 1542 seems to be fairly long operation setup
times, in the millisecond range, but I don't think this is terribly
important, unless you attach solid state disks to your SCSI bus.

Other SCSI controllers (OMTI, Future Domain, WD FASST) may be as
sophisticated, but I have no certain data. The Adaptec seems to be the
most popular, and can be bought fairly cheaply from Tandy. Other drivers
may be able to exploit all its wonders (the Esix one maybe), but again I
have no details.

rreiner> And what about track-buffering ESDI controllers?

A word of caution here: I have been reminded by William Bogstad by
e-mail that there is another reason (that I had already mentioned myself
long ago in comp.unix.i386) for which a multithreaded controller is
preferable to two ESDI ones. ESDI discs cannot do command chaining,
which means that their scatter/gather has to be interrupt driven by the
UNIX disc driver, not by the controller. This means that as the IO
operations per second increase, interrupt processing overhead also
increases, and can become quite severe, because disk interrupt
processing is a very high overhead activity in all 386 Unixes I know
(not many). This obscene overhead could be largely obviated, as in some
PDP/VAX drivers, with an interrupt processing fastpath in software,
called pseudo-DMA. Maybe some 386 Unix vendor has already implemented
it, but I am not aware of any.

It can take several thousand instructions (milliseconds!) for an
interrupt to be processed by a 386 Unix disk driver, and for a new block
operation to be reissued. With IO operation rates of several dozen per
second on 4 MIPS processors this can represent a significant percentage
of CPU time. For very high IO loads with many fast discs, hardware
scatter/gather is very important.

rreiner> Would their advantages go away if they were used in the setup
rreiner> you suggest (since you claim that one would get effectively
rreiner> near-zero seek times anyway)?

Track buffering is not a property of ESDI controllers alone; some
popular RLL controllers also have track buffering. Track buffering reads
an entire track when you read or write a sector on that track. This is
only a win if you access several sectors in the same track
consecutively; otherwise it is a loss, because it forces you to wait for
an entire revolution to read a sector, when on average only a third to a
half of a revolution would be enough. With old style filesystems, which
fragment fairly easily, this is usually not a win, especially for
writes; I have turned off track buffering on my RLL controller. It is
instead a definite win if you use the various styles of Fast File
System, as they usually succeed in keeping logically consecutive sectors
physically contiguous as well, and in doing multi-sector requests.

Note that track buffering only influences rotational latency, not seek
latency.
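
To put rough numbers on the rotational cost (assuming a 3600 RPM drive,
typical for the period; the figures are mine, not from the posts):

    #include <stdio.h>

    int main(void)
    {
        double rpm = 3600.0;            /* assumed drive speed */
        double rev_ms = 60000.0 / rpm;  /* one revolution: ~16.7 ms */

        /* Random single-sector read: on average you wait about half
         * a revolution before the sector comes under the head. */
        printf("average latency, plain read : %.1f ms\n", rev_ms / 2.0);

        /* Naive track-buffered read: the controller reads the whole
         * track, so the request is not complete for a revolution. */
        printf("worst case, track buffering : %.1f ms\n", rev_ms);
        return 0;
    }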

The zero-seek-time property of three well scheduled moving arm discs
must moreover be carefully understood -- it says that you get the same
number of IO operations per second from 3 arms moving in parallel over
3 discs of X capacity as you get out of a single disc with fixed heads
and 3X capacity, *if* there is enough load.

Note that this number of IO operations per second is lower than the
number you would get from 3 discs with fixed arms, because those have
two more transfer channels than a single disc with 3X capacity and a
fixed arm. Still the speedup is impressive (but you must balance the
load across the three discs!).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

jmm@eci386.uucp (John Macdonald) (12/18/90)

In article <PCG.90Dec12195835@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|rreiner> Would their advantages go away if they were used in the setup
|rreiner> you suggest (since you claim that one would get effectively
|rreiner> near-zero seek times anyway)?
|
|Track buffering is not a property of ESDI controllers alone; some
|popular RLL controllers also have track buffering. Track buffering reads
|an entire track when you read or write a sector on that track. This is
|only a win if you access several sectors in the same track
|consecutively; otherwise it is a loss, because it forces you to wait for
|an entire revolution to read a sector, when on average only a third to a
|half of a revolution would be enough. With old style filesystems, which
|fragment fairly easily, this is usually not a win, especially for
|writes; I have turned off track buffering on my RLL controller. It is
|instead a definite win if you use the various styles of Fast File
|System, as they usually succeed in keeping logically consecutive sectors
|physically contiguous as well, and in doing multi-sector requests.
|
|Note that track buffering only influences rotational latency, not seek
|latency.

It is possible to set up a controller to give most of the benefit
of track buffering without any possible loss.  Have the controller
do the following when attempting to read a sector: seek to the
right track and start reading sectors.  When the desired sector has
been read, process it (send it to the CPU using the appropriate DMA
and so on, and then interrupt the CPU to terminate the IO); while
this is being done, continue to read the track and save each sector
that is read into the track buffer.  Keep a record of which sectors
have been read and which haven't.  Whenever the CPU's device driver
handles the IO completion it will likely issue another request.

When a request comes to the controller, check to see if it can
be satisfied from any available buffered track.  If so, do that
and don't interfere with any disk reading that is still going
on filling a track buffer.  If not, terminate any ongoing track
buffer activity for a different track, seek to the desired
track, and start buffering.  When the background processing is
able to finish reading a track buffer and there is still no new
request that requires a real disk access, then additional background
activity can be done (complete filling a track buffer that has been
partially filled, read a new track when many of the sectors in the
previous track have been used, write out any buffered changed
sectors if write-through is not being used, etc.).  Since this
procedure returns a result as soon as it is available, and starts
to process a new request as soon as it is issued, there is no loss;
there is just the potential gain of using the sectors that come
under the read head during rotational latency, of using the time
between host requests, and of using the time saved by filling a
host request from a track buffer.
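
In outline, the read path of such a controller might look like the
following simulation sketch.  All names and structure are invented for
illustration, and the "background" fill is done synchronously here;
no claim is made that any real controller firmware looks like this.

    #include <stdio.h>
    #include <string.h>

    #define SECTORS 17              /* sectors per track (ST506-ish) */

    struct track_buf {
        int  track;                 /* buffered track, -1 = none */
        int  valid[SECTORS];        /* 1 = sector already captured */
        char data[SECTORS][512];
    };

    static struct track_buf tb = { -1, {0}, {{0}} };

    /* Pretend to read one sector from the medium as it passes. */
    static void capture(int sec)
    {
        memset(tb.data[sec], 'A' + (sec % 26), 512);
        tb.valid[sec] = 1;
    }

    /* Read `sec` of `track`: answer from the buffer on a hit,
     * otherwise start capturing at the current rotational position
     * and answer as soon as the target sector has passed the head. */
    static void read_sector(int track, int sec, char *dst)
    {
        int pos, i;

        if (tb.track == track && tb.valid[sec]) {
            memcpy(dst, tb.data[sec], 512);  /* hit: no disk access */
            printf("track %d sector %d: from buffer\n", track, sec);
            return;
        }

        tb.track = track;                    /* miss: start over */
        memset(tb.valid, 0, sizeof tb.valid);

        pos = 5;   /* whatever sector happens to be under the head */
        for (i = 0; i < SECTORS; i++) {
            int s = (pos + i) % SECTORS;
            capture(s);
            if (s == sec) {
                /* Complete the request now; a real controller keeps
                 * filling the rest of the track in the background. */
                memcpy(dst, tb.data[sec], 512);
                printf("track %d sector %d: from disk "
                       "(after %d sector times)\n", track, sec, i + 1);
            }
        }
    }

    int main(void)
    {
        char buf[512];
        read_sector(7, 9, buf);   /* miss: goes to the disk */
        read_sector(7, 3, buf);   /* hit: served from track buffer */
        return 0;
    }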

Off hand, I don't know whether any particular disk controller
uses this algorithm, but I wouldn't be surprised if one did.
It does require that the controller be able to do some activities
simultaneously (DMA, IO completion, and new activity startup with
the host, all at the same time as IO processing to the disk).
-- 
Cure the common code...                      | John Macdonald
...Ban Basic      - Christine Linge          |   jmm@eci386

pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/19/90)

On 11 Dec 90 22:58:39 GMT, dougp@ico.isc.com (Doug Pintar) said:

dougp> First, the use of two ESDI controllers will swamp the system
dougp> before giving you much advantage.  Remember, standard AT
dougp> controllers interrupt the system once per SECTOR.  The interrupt
dougp> code must then push or pull 256 16-bit words to/from the
dougp> controller.

This need not be a big problem. I have had e-mail discussion of these
issues in the last few days, and I take advantage of your posting to
dispel some myths publicly.

The interrupt latency and sector transfer times are quite small.
Combined, they amount to two or three hundred microseconds at most (100
usec interrupt latency, plus the time to transfer 512 bytes at 5MB/sec,
which is another 100 usec), depending on CPU speed and kernel design.

The *real* problem is that most (all, I think) 386 UNIX disc (and tape!)
drivers are poorly written, as they do not use pseudo-DMA, a standard
technique of PDP/VAX drivers (it is even mentioned in the 4.3BSD Leffler
book). This is described a bit later in this article.

dougp> Given an ESDI raw transfer rate of 800 KB/sec (not unreasonable
dougp> for large blocks) that's 1600 interrupts per second, each with a
dougp> (not real fast, due to bus delays) 256-word PIO transfer.  Try
dougp> getting two of those going at once and the system drags down REAL
dougp> fast.

A *sustained* transfer rate of 800KB/sec, that is nearly 100% of peak
transfer rate, is extremely rare. If you are pounding really hard on the
disc you may get from each disk 300KB through the filesystem in any
given second. This translates to 600 sectors per second; you can do a
sector in 200-300 microseconds, or say 4 sectors per millisecond, so we
have an overhead of 150 milliseconds in every second. 15% is high, but
not tragic.
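
Spelling out that arithmetic (same figures as above):

    #include <stdio.h>

    int main(void)
    {
        double fs_rate = 300.0 * 1024;  /* bytes/sec through the fs */
        double sector  = 512.0;
        double per_sec = 250e-6;        /* 200-300 usec, midpoint */

        double sectors_per_second = fs_rate / sector;   /* ~600 */
        double cpu_fraction = sectors_per_second * per_sec;

        printf("%.0f sectors/sec -> %.0f%% of one CPU\n",
               sectors_per_second, 100.0 * cpu_fraction);
        return 0;
    }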

dougp> I've tried it on a 20 MHz 386 and found at most a 50% improvement
dougp> in aggregate throughput using 2 ESDI controllers simultaneously.
dougp> At that point, you've got 100% of the CPU dedicated to doing I/O
dougp> and none to user code...

This is mostly because the driver is written so that each IO
transaction involves only one sector. Therefore for every sector the top
half of the driver starts the transaction and then sleeps; the bottom
half gets activated by the interrupt and wakes up the top half.

The sleep/wakeup between the top and bottom halves involves, on a busy
system, two context switches, which is already bad, and, most
importantly, calls the scheduler. There is a paper that shows that under
many UNIX ports the cost of a wakeup/sleep is not really that of the
context switches, but of the scheduler calls to decide who is going to
run next, as this takes 90% of the time of a process activation.

With pseudo-DMA the top and bottom halves of the disk driver communicate
via a queue; the top half inserts as many IO operations as it has into
the queue, marking those for whose completion it wants to be notified.
The bottom half will start the first operation in the queue, and then,
when it gets the interrupt that signals it is complete, it will
immediately start the next; then, if the just completed operation was
marked for notify, it will wake up the relevant top half. (Note that
there can be as many instances of the top half active as there are
processes with IO transactions outstanding, while there will be as many
instances of the bottom half as there are CPUs.)

This mode of operation means that the bottom half can issue IO
operations as fast as the controller will take them, synchronously with
each interrupt; that each IO operation will have a small overhead,
consisting of just the interrupt latency and sector transfer times; and
that the wakeup/sleep and reschedules will only be needed,
asynchronously, once per IO transaction, which can well involve many
IO operations. This is simulating an intelligent controller in the
driver's bottom half.

A typical IO transaction will consist of an (implied) seek command and a
list of 4-8 sectors, usually contiguous, to be transferred. A block read
via the buffer cache will typically cause two IO transactions, one for
the sectors making up the current block, one for the read ahead block.
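
A user-space rendering of this structure, reduced to its bare shape.
All names here (ioop, NOTIFY, disk_strategy() and so on) are invented
for illustration, and sleep/wakeup is reduced to a printf; the point is
only that the bottom half restarts the controller from the interrupt
itself, and the scheduler is involved once per transaction, not once
per sector.

    #include <stdio.h>
    #include <stddef.h>

    #define NOTIFY 1

    struct ioop {
        struct ioop *next;
        int blkno;
        int flags;        /* NOTIFY on the last op of a transaction */
    };

    static struct ioop *q_head, *q_tail;

    static void controller_start(struct ioop *op)
    {
        printf("  controller: start op for block %d\n", op->blkno);
    }

    /* Top half: queue every op of a transaction, notify on the last. */
    static void disk_strategy(struct ioop *ops, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            ops[i].next = NULL;
            ops[i].flags = (i == n - 1) ? NOTIFY : 0;
            if (q_tail) q_tail->next = &ops[i]; else q_head = &ops[i];
            q_tail = &ops[i];
        }
        if (q_head == &ops[0])     /* controller was idle: kick it */
            controller_start(q_head);
        printf("top half: transaction queued, would sleep once here\n");
    }

    /* Bottom half: runs at interrupt time for each completed op. */
    static void disk_intr(void)
    {
        struct ioop *done = q_head;
        q_head = done->next;
        if (q_head == NULL) q_tail = NULL;
        else controller_start(q_head);   /* restart immediately */
        if (done->flags & NOTIFY)
            printf("bottom half: block %d done, wakeup top half\n",
                   done->blkno);
    }

    int main(void)
    {
        struct ioop tx[4] = { {0,10,0}, {0,11,0}, {0,12,0}, {0,13,0} };
        int i;
        disk_strategy(tx, 4);   /* one transaction, four sectors */
        for (i = 0; i < 4; i++)
            disk_intr();        /* simulate the completion interrupts */
        return 0;
    }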

One can also do tricks in the scheduler to reduce the cost of a
reschedule. UNIX implementations are usually badly designed in this
respect, but one could use a technique from MUSS (Software Practice &
Experience, Aug 1979).

The idea is to have a short term scheduler and a long term scheduler,
where UNIX normally has only a long term scheduler.  The short term
scheduler manages, in a deterministic way, e.g. priority based or FIFO,
a fixed number of processes; the long term scheduler selects,
periodically, which processes are in the short term scheduler's set. The
real cost of scheduling is the policy decision of which processes are
eligible for scheduling. Normally this need only be changed fairly
rarely, and periodically, not on every context change.

Having a short term scheduler means that the cost of a process switch
is only marginally higher than that of a context switch, because the
short term scheduler's job is just to find the first ready-to-run
process in a fixed size list of maybe 16 entries.

A nice extra idea found in MUSS was to make the short term scheduler use
bitmap queues for strictly priority based scheduling; queues are words,
and each bit in a word represents a different process, and a different
priority. To add a process to a queue (e.g. the ready to run queue) one
just turns on its bit, and so on.
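
A sketch of such a bitmap queue: 16 processes, one bit each, with bit
position doubling as priority.  The linear scan below plays the role
that a find-first-set instruction would play in a real implementation;
the names are mine, not MUSS's.

    #include <stdio.h>

    typedef unsigned short queue;   /* 16 processes, bit = priority */

    static queue ready;             /* the ready-to-run queue */

    static void make_ready(int p)   { ready |=  (queue)(1u << p); }
    static void make_blocked(int p) { ready &= ~(queue)(1u << p); }

    /* Highest-priority ready process: position of the top set bit. */
    static int pick(void)
    {
        int p;
        for (p = 15; p >= 0; p--)
            if (ready & (1u << p))
                return p;
        return -1;                  /* nothing ready: idle */
    }

    int main(void)
    {
        make_ready(3);
        make_ready(9);
        printf("run process %d\n", pick());   /* 9 */
        make_blocked(9);
        printf("run process %d\n", pick());   /* 3 */
        return 0;
    }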

Ah, if only UNIX designers and implementors had one tenth of the insight
of the MUSS ones!

dougp> Two drives on a single AT-compatible controller will gain you
dougp> something in latency-reduction, as the HPDD does some cute tricks
dougp> to overlap seeks.

For a multiuser system, which is the scope of my posting, this is far
more important than bandwidth. Multiusers systems are seek-limited more
than bandwidth limited (for small timesharing multiuser systems, that
is).

dougp> Bus-mastering DMA SCSI adapters, like the Adaptec 154x (ISA) or
dougp> 1640 (MCA) provide MUCH better throughput.  They ARE
dougp> multi-threaded, and the HPDD will try to keep commands
dougp> outstanding on each drive it can use.  The major win is that the
dougp> entire transfer is controlled by the adapter, with host
dougp> intervention only when a transfer is complete.  You get lots more
dougp> USER cycles this way!

Yes, this is true in general. But there are twists to this argument. In
the pseudo-DMA technique described above, a multithreaded, hw DMA and
scatter gather controller is simulated by "lending" the main CPU to a
dumb controller; the bottom half of the disk driver becomes the
microcode of this "pseudo intelligent controller" and simulates the DMA
and the scatter gather.

The main CPU is usually *much* faster than the one actually put in
intelligent controllers (say a 386 vs. an 8086), so IO rates _might_ be
higher with a pseudo intelligent controller than with a real one. On
the other hand the real intelligent controller can work in parallel
with the main CPU. In IO bound systems this is of course of little or
no benefit (because there are CPU cycles to spare), unless there are
multiple intelligent controllers, which is rare.

dougp> I'm still not convinced that cacheing controllers are a big win
dougp> over a large Unix buffer cache.  I usually use 1-2 MB of cache,

Ah yes! Devoting to the cache 25% of available memory seems to be a good
rule of thumb.

dougp> and a couple-MB RAMdisk for /tmp if I have the memory available.

But /tmp should not be on a RAM disk; it should be in a normal
filesystem, even if it then actually causes almost no IO transactions,
as short lived files under /tmp should exist only in the cache.

Unfortunately the "hardening" features of the System V filesystem mean
that even short lived files will be sync'ed out (at least the inodes),
but this can be partially obviated by tweaking tunable parameters, for
example by enlarging substantially the inode cache (almost as important
as the block cache) and slowing down bdflush. Overall, instead of having
a RAM disk for /tmp, I would devote the core that would go to it to
enlarging the buffer and inode caches instead.
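
By way of illustration only, on an SVR3.2-style kernel the knobs
involved would be set in the stune file; tunable names, defaults, and
legal values vary from port to port, so treat the names and values
below as assumptions to check against your own mtune, not as a recipe
(NBUF in 1K buffers, NINODE in-core inode slots, BDFLUSHR seconds
between bdflush runs):

    NBUF            2000
    NINODE          600
    BDFLUSHR        60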
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

boyd@necisa.ho.necisa.oz.au (Boyd Roberts) (12/21/90)

In article <PCG.90Dec19145630@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|
|The *real* problem is that most (all, I think) 386 UNIX disc (and tape!)
|drivers are poorly written, as they do not use pseudo-DMA, a standard
|technique of PDP/VAX drivers (it is even mentioned in the 4.3BSD Leffler
|book). This is described a bit later in this article.

Very probably.

|
|This is mostly because the driver is written so that each IO
|transaction involves only one sector. Therefore for every sector the top
|half of the driver starts the transaction and then sleeps; the bottom
|half gets activated by the interrupt and wakes up the top half.
|

The standard technique is for xxstrategy() to sort the I/O onto a queue
of pending I/O operations and then call xxstart().  xxstart() peels
the next I/O off the queue and instructs the controller to do the I/O.

When xxintr() is called it picks up the completed I/O and calls iodone()
on the buffer, waking up anyone who's waiting for the buffer (there may
or may not be anyone waiting).  xxintr() then calls xxstart() and the
process is repeated until the queue of pending I/O's is empty.

This, of course, requires sane controllers but it's the standard way to
do the job.  More than that, it's the _textbook_ way of doing the job.
Even if you have a dumb controller, and it requires several request/interrupt
cycles, you do it at interrupt time, unless it's _really_ expensive.  It's
all a trade off.
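
A minimal skeleton of that pattern, with struct buf cut down to the
bare bones and disksort()/iodone() stubbed so the shape is visible and
the sketch runs standalone; it is not any particular driver.

    #include <stdio.h>
    #include <stddef.h>

    struct buf {
        struct buf *av_forw;     /* queue link */
        int b_blkno;             /* block number */
    };

    static struct buf *queue;    /* pending I/O operations */
    static int busy;             /* is the controller active? */

    static void xxstart(void);

    static void disksort(struct buf **qp, struct buf *bp)
    {
        /* Stub: the real disksort() orders by disk position to
         * minimise arm movement; here we just append FIFO. */
        bp->av_forw = NULL;
        while (*qp)
            qp = &(*qp)->av_forw;
        *qp = bp;
    }

    static void iodone(struct buf *bp)
    {
        printf("iodone: block %d (wake up any waiters)\n", bp->b_blkno);
    }

    static void xxstrategy(struct buf *bp)  /* sort I/O onto the queue */
    {
        disksort(&queue, bp);
        if (!busy)
            xxstart();
    }

    static void xxstart(void)    /* peel off the next I/O, start it */
    {
        if (queue == NULL) {
            busy = 0;
            return;
        }
        busy = 1;
        printf("start: block %d\n", queue->b_blkno);
        /* the controller's completion interrupt calls xxintr() */
    }

    static void xxintr(void)     /* completion interrupt */
    {
        struct buf *bp = queue;
        queue = bp->av_forw;
        iodone(bp);
        xxstart();               /* repeat until the queue is empty */
    }

    int main(void)
    {
        struct buf a = { NULL, 100 }, b = { NULL, 40 };
        xxstrategy(&a);
        xxstrategy(&b);
        xxintr();                /* simulate the two interrupts */
        xxintr();
        return 0;
    }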

|The sleep/wakeup between the top and bottom halves involves, on a busy
|system, two context switches, which is already bad, and, most
|importantly, calls the scheduler. There is a paper that shows that under
|many UNIX ports the cost of a wakeup/sleep is not really that of the
|context switches, but of the scheduler calls to decide who is going to
|run next, as this takes 90% of the time of a process activation.

Modern UNIX systems only use one context switch.  The switch to the
scheduler's context is no longer done.  The scheduler was never called
to do high level scheduling from the dispatcher.  The scheduler would
run periodically and _assist_ processes in running by swapping old
processes out and deserving processes in.

However, its context was `borrowed' to do the run queue search.  Its
_context_ and nothing more.  The search is cheap, although the switches
are usually expensive.  Modern UNIX systems search the run queue in the
context of the process that is giving up the CPU.

|Ah yes! Devoting to the cache 25% of available memory seems to be a good
|rule of thumb.

Sure.

|dougp> and a couple-MB RAMdisk for /tmp if I have the memory available.
|
|But /tmp should not be on a RAM disk; it should be in a normal
|filesystem, even if it then actually causes almost no IO transactions,
|as short lived files under /tmp should exist only in the cache.
|

Oh dear, it's RAM disk time again.  Where is that revolver?

|Unfortunately the "hardening" features of the System V filesystem mean
|that even short lived files will be sync'ed out (at least the inodes),
|but this can be partially obviated by tweaking tunable parameters, for
|example by enlarging substantially the inode cache (almost as important
|as the block cache) and slowing down bdflush. Overall, instead of having
|a RAM disk for /tmp, I would devote the core that would go to it to
|enlarging the buffer and inode caches instead.

Eh?  Writing things out doesn't cause them to be thrown away.


Boyd Roberts			boyd@necisa.ho.necisa.oz.au

``When the going gets weird, the weird turn pro...''