[comp.unix.wizards] Raw disk I/O

marc@focsys.uucp (Marc H. Morin) (02/14/90)

I am investigating using the raw disk interface to increase the performance
of our application.  The application is an imaging system, so the I/O
consists of large data transfers to and from the disk.

I need to know a little more about the behaviour
of the raw disk driver (in general) for BSD systems.  If it helps, I am
particularly interested in SCSI drives on Sony and Mips workstations.

1) Does file locking via fcntl() still apply ?

2) I understand the disk is viewed as one single flat-file,  but does
   this also mean that I must handle bad sectors within my application ?

3) Any experiences that you might have had with raw disk I/O.


Thanks for your help

Marc H. Morin							Focus Automation Systems Inc.
watmath!focsys!marc						Waterloo, Ontario
										(519)746-4918

chris@mimsy.umd.edu (Chris Torek) (02/15/90)

In article <MARC.90Feb14083116@focsys.uucp> marc@focsys.uucp (Marc H. Morin)
writes:
>I need to know a little more about the behaviour
>of the raw disk driver (in general) for BSD systems.  If it helps, I am
>particularly interested in SCSI drives on Sony and Mips workstations.

These are not `BSD systems' so much as `systems whose original source
was 4BSD'.  Both Sony and MIPS have made substantial changes to parts
of the kernel, some of which will affect raw I/O (raw I/O is tied into
the VM system, for instance).

>1) Does file locking via fcntl() still apply ?

No, because there are no files.  fcntl() does not do file locking in BSD
anyway: fcntl() for locking is a System V feature.  BSD provides only flock().

>2) I understand the disk is viewed as one single flat-file,  but does
>   this also mean that I must handle bad sectors within my application ?

No.  ECC and bad sector replacements are handled by the disk driver
(or by a software layer either above or within the driver, but in any
case, below the `raw I/O' level).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

fmayhar@hermes.ladc.bull.com (Frank Mayhar) (02/16/90)

In article <MARC.90Feb14083116@focsys.uucp>, marc@focsys.uucp (Marc H.
Morin) writes:
> I am investigating using the raw disk interface to increase performance
> of our application. The application is an imaging system,  thus the I/O
> consists of large data transfers to and from the disk.
>
> I need to know a little more about the behaviour
> of the raw disk driver (in general) for BSD systems.  If it helps, I am
> particularly interested in SCSI drives on Sony and Mips workstations.
> 
> 1) Does file locking via fcntl() still apply ?
> 
> 2) I understand the disk is viewed as one single flat-file,  but does
>    this also mean that I must handle bad sectors within my application ?
> 
> 3) Any experiences that you might have had with raw disk I/O.

We were looking into this as well.  After testing, we decided to continue to
use the block device, since it was at least three times faster than the raw
device.
I concluded then that the block device was doing some disk access optimizations
that the raw device wasn't doing.  Also, using the block device has the added
advantage that any new optimizations in the driver would automatically be used
by the application.

Of course, we're using System V.  I think you'll probably see the same thing
under BSD, though.
--
Frank Mayhar  fmayhar@hermes.ladc.bull.com (..!{uunet,hacgate}!ladcgw!fmayhar)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241

lm@snafu.Sun.COM (Larry McVoy) (02/16/90)

In article <1990Feb15.212708.19046@ladc.bull.com> fmayhar@hermes.ladc.bull.com writes:
>In article <MARC.90Feb14083116@focsys.uucp>, marc@focsys.uucp (Marc H.
>Morin) writes:
>> I am investigating using the raw disk interface to increase performance
>> of our application. The application is an imaging system,  thus the I/O
>> consists of large data transfers to and from the disk.
>We were looking into this as well.  After testing, we decided to
>continue to use
>the block device, since it was at least three times faster than the raw device.
>I concluded then that the block device was doing some disk access optimizations
>that the raw device wasn't doing.  Also, using the block device has the added
>advantage that any new optimizations in the driver would automatically be used
>by the application.

Under SunOS, at least, the raw device doesn't do read ahead.  Some applications
consider this a feature, others a bug.  To prove this, try

$ /bin/time dd if=/dev/rsd0a of=/dev/null bs=8k
$ /bin/time dd if=/dev/sd0a of=/dev/null bs=8k
---
What I say is my opinion.  I am not paid to speak for Sun, I'm paid to hack.
    Besides, I frequently read news when I'm drjhgunghc, err, um, drunk.
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

cpcahil@virtech.uucp (Conor P. Cahill) (02/18/90)

In article <1990Feb15.212708.19046@ladc.bull.com> fmayhar@hermes.ladc.bull.com writes:
>In article <MARC.90Feb14083116@focsys.uucp>, marc@focsys.uucp (Marc H.
>Morin) writes:
>> I am investigating using the raw disk interface to increase performance
>> of our application. The application is an imaging system,  thus the I/O
>> consists of large data transfers to and from the disk.

>> 3) Any experiences that you might have had with raw disk I/O.
>
>We were looking into this as well.  After testing, we decided to
>continue to use
>the block device, since it was at least three times faster than the raw device.

This kind of conclusion is very dependent upon the way you are using the
data on the device and on the specific device driver itself.

On most systems, if you have a single process that is reading very large
sections of the disk drive, you will get better performance by using
the raw partition.  For example: I wrote a program that restores a database
file to a disk partition (not to the filesys, the database used the disk 
partition directly).  Initially, restoring 300MB from tape to disk through
the block device took almost 3 hours, and overall system performance while
the restore was running was quite poor.  When I changed it to use the raw
device, the same restoration took only 45 minutes and had almost no effect
on system performance.

Some of the matters to consider are:

	1. How much data are you reading/writing at a time?  If you are only
	accessing small amounts (remember, I/O to a raw partition must be a
	minimum size depending on the device, usually 512 or 1024 bytes;
	see the sketch after this list), you probably would get better
	performance from the block device.  Even so, using the block device
	is still a performance gain over using a large file in a file system,
	since no block indirection takes place, the data blocks are shared
	amongst processes, and delayed writes are utilized.

	2. Will multiple processes be accessing the same data?  If they are
	accessing the same data frequently, the buffer caching available with
	the block device should improve performance.
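
To make point 1 concrete, here is a minimal, untested sketch of a raw read
loop; the device name and the 512-byte sector size are only assumptions, so
check your own driver for the real minimum transfer size.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define SECTOR	512

int main(int argc, char **argv)
{
	char buf[SECTOR * 16];		/* 8K, always a multiple of SECTOR */
	int fd, n;
	long total = 0;

	if ((fd = open(argc > 1 ? argv[1] : "/dev/rsd0c", O_RDONLY)) < 0) {
		perror("open");
		return 1;
	}
	/* a request that is not a multiple of the sector size would
	 * typically fail or misbehave on the raw device */
	while ((n = read(fd, buf, sizeof buf)) > 0)
		total += n;
	printf("%ld bytes read\n", total);
	close(fd);
	return 0;
}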


>Of course, we're using System V.  I think you'll probably see the same thing
>under BSD, though.

This isn't a System V vs. BSD thing; it is a driver vs. driver, hardware vs.
hardware thing.



-- 
+-----------------------------------------------------------------------+
| Conor P. Cahill     uunet!virtech!cpcahil      	703-430-9247	!
| Virtual Technologies Inc.,    P. O. Box 876,   Sterling, VA 22170     |
+-----------------------------------------------------------------------+

jiro@heights.cit.cornell.edu (02/18/90)

>>In article <MARC.90Feb14083116@focsys.uucp>, marc@focsys.uucp (Marc H.
>>Morin) writes:
>>> I am investigating using the raw disk interface to increase performance
>>> of our application. The application is an imaging system,  thus the I/O
>>> consists of large data transfers to and from the disk.
>>We were looking into this as well.  After testing, we decided to
>>continue to use
>>the block device, since it was at least three times faster than the raw 
>>device. I concluded then that the block device was doing some disk access 
>>optimizations that the raw device wasn't doing.  Also, using the block
>>device has the added advantage that any new optimizations in the driver
>>would automatically be used by the application.
>
>Under SunOS, at least, the raw device doesn't do read ahead.  Some applications
>consider this a feature, others a bug.  To prove this, try

>$ /bin/time dd if=/dev/rsd0a of=/dev/null bs=8k
>$ /bin/time dd if=/dev/sd0a of=/dev/null bs=8k
>---
>Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com


   On the NeXT with a floptical, the two above commands reveal interesting
times:
csh> /bin/time dd if=/dev/od0a of=/dev/null bs=8k	<-- Block floptical
30492 Records                                     	<-- 240 megabytes
4188.9 real  1.8 user  2927.7 sys
                       ^^^^^^

csh> /bin/time dd if=/dev/rod0a of=/dev/null bs=8k	<-- Raw floptical
30492 Records
1320.9 real  2.0 user  51.2 sys              
                       ^^^^ 

 On the NeXT, the raw device is overwhelmingly faster than the block
device.  What is this supposed to mean?  Any relationship to the disk
being optical?

   - Jiro Nakamura -
    NeXT Developer (Unregistered, Independent)

----------------------------------------------------------------
jiro@heights.cit.cornell.edu    Disclaimer: I work for no-one.

fmayhar@hermes.ladc.bull.com (Frank Mayhar) (02/24/90)

I recently replied to an article here regarding raw disk I/O, and received
the following response:

(I wrote:)
}}We were looking into this as well.  After testing, we decided to
}}continue to use the block device, since it was at least three times
}}faster than the raw device.  I concluded then that the block device
}}was doing some disk access optimizations that the raw device wasn't
}}doing.  Also, using the block device has the added advantage that
}}any new optimizations in the driver would automatically be used
}}by the application.
}}Of course, we're using System V.  I think you'll probably see the same thing
}}under BSD, though.

}Were you using the raw device in an efficient way?  Or were you just trying
}to use it like the block device?
}
}I.e., did you rewrite your application to have a writer process that blocks
}for the raw write, allowing the write to proceed asynchronously?  If you just
}wrote a big block directly to disk in the main application, then you have
}no disk cache in operation, the application blocks, and of course the
}raw throughput looks lousy.
}
}All the companies that do databases under Unix use raw partitions,
}and one of the reasons is throughput is much higher.
}
}I suggest you read Rochkind, "Advanced Unix Programming", specifically
}p. 47ff, where he states
}
}"As you can see, the speed advantages of raw I/O are enormous."
}---
}A. Lester Buck     buck@siswat.lonestar.org  ...!texbell!moray!siswat!buck

I sent Mr. Buck the following reply, but never received any response.  Can
anyone else answer my questions?

(My reply:)
Well, we don't seem to have _Advanced_Unix_Programming_ around here.  I can
say that we weren't doing anything special, just blocking I/O.  We got about
40k-45k/s.  With block-mode I/O, we get from around 90 k/s to as much as
200 k/s or a little more, depending on the test parameters.  Are you saying
that we could exceed these figures using raw I/O?

One of the real problems is that we have to do this for multiple users.  It's
not easy to justify two processes per user, nor did I want to bottleneck
everything through a single I/O process.  I _could_ do that, but how
would I get requests back and forth?  Message queues?  (I envision a single
[or even multiple] I/O processes, communicating with clients via message
queues, possibly using an elevator algorithm, certainly doing buffering.)
I still fail to see how this is faster than letting the block-mode device
driver do the same thing, though.  Plus, the device driver knows about the
idiosyncrasies of the hardware, and can take advantage of them, where my
I/O process would have to be written for the lowest common denominator.
--
Frank Mayhar  fmayhar@hermes.ladc.bull.com (..!{uunet,hacgate}!ladcgw!fmayhar)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241

lm@snafu.Sun.COM (Larry McVoy) (02/25/90)

In article <1990Feb23.223030.9851@ladc.bull.com> fmayhar@hermes.ladc.bull.com writes:
>... blocking I/O.  We got about
>40k-45k/s.  With block-mode I/O, we get from around 90 k/s to as much as
>200 k/s or a little more, depending on the test parameters.  Are you saying
>that we could exceed these figures using raw I/O?

I sure hope so.  Those are pathetic I/O rates.  

sparcstation 1, SunOS 4.1, Quantum 3.5" SCSI drive:

$ time dd if=/dev/rsd1c of=/dev/null count=400 bs=8k
400+0 records in
400+0 records out

real    0m7.61s
user    0m0.03s
sys     0m0.66s
$ bc
400*8/7.61
420			# kbytes / second for sequential
^D
$

It's probable that you are doing random I/O, so I wrote a little program
that uses rand(3) to calculate the offset into the drive (see below), and
that gives me about 180 KB/sec (you can really hear the old head seeking).
This assumes that I need all of the 8K block that I read.  If you have smaller
records, you'll get worse useful throughput.  For example, if you only need
one byte of each 8K block, your useful throughput goes down by a factor of
8192 (roughly 0.02 KB/sec of data you actually wanted).


Here's the random program (start/stop/ptime are timing utils):

#include <signal.h>

int blks;

main(ac, av)
	int ac;
	char **av;
{
	int fd;
	int disksize;
	char buf[8192];
	int done();

	disksize = atoi(av[2]);		/* drive size in 8K blocks */
	srand(3962);			/* my birthday */
	signal(SIGINT, done);		/* ^C prints the results */
	start();
	fd = open(av[1], 0);		/* raw device, read-only */
	/* seek to a random 8K block and read it, until EOF or interrupt */
	for (blks = 0; ; blks++) {
		lseek(fd, (rand() % disksize) * 8192, 0);
		if (read(fd, buf, sizeof buf) != sizeof buf)
			break;
	}
	done();
}

done()
{
	stop();
	ptime(blks);
	exit(0);
}
---
What I say is my opinion.  I am not paid to speak for Sun, I'm paid to hack.
    Besides, I frequently read news when I'm drjhgunghc, err, um, drunk.
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

cpcahil@virtech.uucp (Conor P. Cahill) (02/25/90)

In article <1990Feb23.223030.9851@ladc.bull.com> fmayhar@hermes.ladc.bull.com writes:
>}All the companies that do databases under Unix use raw partitions,
>}and one of the reasons is throughput is much higher.

And depending upon the application, some of them rarely use the raw
partition itself.  They use the block device to access data on the disk
partition so that they bypass the problems with triple indirection.  The
raw partition might only be used for special parts of the database (like
backing it up, or some forms of sequential access).

>Well, we don't seem to have _Advanced_Unix_Programming_ around here.  I can
>say that we weren't doing anything special, just blocking I/O.  We got about
>40k-45k/s.  With block-mode I/O, we get from around 90 k/s to as much as
>200 k/s or a little more, depending on the test parameters.  Are you saying
>that we could exceed these figures using raw I/O?

In applications that I have written using raw I/O, I use a very large I/O
buffer (on the order of 300K or so) and have found that the throughput can
be on the order of 10 times greater than with the same buffer size through
the block device.

>One of the real problems is that we have to do this for multiple users.  It's
>not easy to justify two processes per user, nor did I want to bottleneck
>everything through a single I/O process.

When reading/writing large amounts of data through the block device, the
overall system performance is usually dismal.  When the same amount of I/O
goes to the raw device, system performance is much less affected by it,
since there is no contention for disk buffers.

>I still fail to see how this is faster than letting the block-mode device
>driver do the same thing, though.  Plus, the device driver knows about the
>idiosyncracies of the hardware, and can take advantage of them, where my
>I/O process would have to be written for the lowest common denominator.

The problems with the block driver are:

	1. The data must be copied from the disk into kernel memory and then
	   into user memory, adding an extra copy.
	2. The throughput is limited by the contention for free/available
	   block buffers in the kernel.  This is especially apparent on
	   the output side.

The problems with the raw disk driver are:

	1. No implicit sharing of data read from disk.
	2. I/O must be a multiple of disk block size (usually 512 or 1024
	   bytes).
	3. I/O is synchronous, so small writes must wait for the actual disk
	   i/o to complete. 


So, in determining which type of access to use, you must consider how much
data will be flowing and in which direction.

I use raw I/O when reading > 1 MB from a disk in large blocks (200K+).
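
For what it's worth, here is an untested sketch of the kind of bulk transfer
I mean, along the lines of the restore program mentioned earlier.  The 320K
buffer size and the names are assumptions, not measured recommendations.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFSZ	(320 * 1024)		/* large buffer, a multiple of 512 */

int main(int argc, char **argv)
{
	char *buf;
	int in, out, n;

	if (argc != 3) {
		fprintf(stderr, "usage: rawcopy srcfile /dev/rawpartition\n");
		return 1;
	}
	if ((buf = malloc(BUFSZ)) == NULL) {
		perror("malloc");
		return 1;
	}
	if ((in = open(argv[1], O_RDONLY)) < 0 ||
	    (out = open(argv[2], O_WRONLY)) < 0) {
		perror("open");
		return 1;
	}
	/* each write() is synchronous, but it moves BUFSZ bytes in one
	 * driver request instead of being chopped up by the buffer cache;
	 * a short final read may need padding to a sector multiple */
	while ((n = read(in, buf, BUFSZ)) > 0) {
		if (write(out, buf, n) != n) {
			perror("write");
			return 1;
		}
	}
	close(in);
	close(out);
	return 0;
}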

Another thing to remember is that individual device drivers may or may not
show a performance gain with raw I/O, but they usually do.
-- 
+-----------------------------------------------------------------------+
| Conor P. Cahill     uunet!virtech!cpcahil      	703-430-9247	!
| Virtual Technologies Inc.,    P. O. Box 876,   Sterling, VA 22170     |
+-----------------------------------------------------------------------+

fmayhar@hermes.ladc.bull.com (Frank Mayhar) (02/26/90)

In article <132220@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>In article <1990Feb23.223030.9851@ladc.bull.com> fmayhar@hermes.ladc.bull.com writes:
>>... blocking I/O.  We got about
>>40k-45k/s.  With block-mode I/O, we get from around 90 k/s to as much as
>>200 k/s or a little more, depending on the test parameters.  Are you saying
>>that we could exceed these figures using raw I/O?
>I sure hope so.  Those are pathetic I/O rates.  
> [shows example giving 420k/s rate on Sparcstation 1]
>It's probable that you are doing random I/O, so I wrote a little program
>that uses rand(3) to calculate the offset into the drive (see below), and
>that gives me about 180 KB/sec (you can really hear the old head seeking).
>This assumes that I need all of the 8K block that I read.  If you have smaller
>records, you'll get worse useful throughput.  For example, if you only need
>one byte of each 8K block, your useful throughput goes down by a factor of 8192.

Well, I'm doing I/O in 1k chunks, not 8k, so that should make some difference.
(Hmmm.  I wonder what would happen if I did some kind of read-ahead?  I.e.
read in 8k chunks, and buffer them until the caller needs them, flushing
the buffer when he does a seek into another part of the disk.  Hmmm.  Maybe
multiple buffers?)  Also, the system is a Bull XPS-100 running SVR2, so that
will make a little more difference (not much, though).  I'm doing both random
and sequential I/O (in different tests).  Actually, the random I/O gets a
tiny bit better throughput, since there's a caching scheme between the
user and the disk.
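
Going back to the read-ahead idea above, the buffering itself looks simple
enough.  Here's an untested sketch; the names and the single-chunk cache are
just my own invention:

#include <string.h>
#include <unistd.h>

#define CHUNK	8192L

static char	chunk[CHUNK];
static long	chunk_off = -1;		/* disk offset of cached chunk, -1 = empty */

/*
 * Satisfy a small read at raw-device offset 'off' from an 8K chunk,
 * refilling the chunk on a miss (i.e. after a seek elsewhere).  This
 * simple version assumes the request does not straddle a chunk boundary.
 */
int cached_read(int fd, char *dst, long off, int len)
{
	long base = off - (off % CHUNK);

	if (base != chunk_off) {
		if (lseek(fd, base, 0) < 0 ||	/* 0 == SEEK_SET */
		    read(fd, chunk, sizeof chunk) != sizeof chunk)
			return -1;
		chunk_off = base;
	}
	memcpy(dst, chunk + (off - base), len);
	return len;
}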

The real problem, I guess, is that I have to do this for multiple users,
concurrently.  It's hard to have a separate I/O process for *each* of
a large number of users.  I could make an entirely separate I/O server,
which does nonblocking I/O using a separate process, but I _still_ fail
to see how that's different from just going through the block device driver.
It also has the disadvantage that there's a communication bottleneck in this
scheme, one that the OS takes care of for me when I use the device driver.
--
Frank Mayhar  fmayhar@hermes.ladc.bull.com (..!{uunet,hacgate}!ladcgw!fmayhar)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241

leres@ace.ee.lbl.gov (Craig Leres) (03/14/90)

(Sorry to jump into this so late but it seems like I never have time to
read news.)

Folks interested in raw disk I/O will also be interested in the little
disk benchmark package we've thrown together.  Eventually, I'd like it to
include a complete tutorial on How To Tune Your Filesystem; for now the
(limited) documentation really only deals with how to measure raw disk
speeds.

The package is easy to use; the primary part is the "disktest" script
which does raw reads at block sizes ranging from 4k to 128k. There's a
simple line fitting program that allows you to munch "disktest" runs
down to the throughput (in KB/sec) and fixed overhead (in ms) numbers.
For example, I've measured 2094.07 KB/sec, 1.49 ms on my Sun 3/180
which runs SunOS 3.5 and has Fujitsu M2344's on an Interphase 4400
controller. There are also some scripts for use with xgraph.

The disktest package is available via anonymous ftp from the host
ftp.ee.lbl.gov (128.3.254.68) in the file disktest.shar.Z. As usual,
use binary mode for best results.

		Craig

amoss@batata.huji.ac.il (amos shapira) (03/24/90)

chris@mimsy.umd.edu (Chris Torek) writes:

>>1) Does file locking via fcntl() still apply ?

>No, because there are no files.  fcntl() does not do file locking in BSD
>anyway: fnctl() for locking is a System V feature.  BSD provides only
>flock().

While it's true that there are no files in the partition used, the process
should still be able to lock the partition itself (e.g. /dev/rxsd0c), which
could help provide some degree of consistency.
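
Something along these lines ought to work under BSD (untested, and the
device name is only an example):

#include <stdio.h>
#include <fcntl.h>
#include <sys/file.h>

int main()
{
	int fd;

	if ((fd = open("/dev/rsd0c", O_RDWR)) < 0) {
		perror("open");
		return 1;
	}
	if (flock(fd, LOCK_EX) < 0) {	/* advisory; blocks until granted */
		perror("flock");
		return 1;
	}
	/* ... exclusive raw I/O here ... */
	flock(fd, LOCK_UN);
	return 0;
}

Note that this only coordinates cooperating processes that also call
flock(); nothing stops another process from opening the device without
locking it.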

Correct me if I'm mistaken.

>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain: chris@cs.umd.edu Path: uunet!mimsy!chris

Cheers,
- Amos Shapira - amoss@batata.huji.ac.il[.bitnet]

Super users do it without asking for permission - me.