[comp.sys.dec] SCSI to DECSystem 5400 - Prestoserve

chet@decwrl.dec.com (Chet Juszczak) (02/05/91)

I've been asked a lot of questions about Prestoserve lately
(which is great). I thought it would be a good idea to start
off with a solid base of information instead of assuming context
on the part of readers. 

*** First off, an introduction:

  I'm an engineer in the ULTRIX engineering group.
  I ported the Legato Prestoserve software to ULTRIX V4.1;
  I work with Digital h/w engineers to develop NVRAM platforms
  for this technology.

*** Next, a little Prestoserve background:

  Prestoserve is a filesystem (actually, block device) accelerator.
  It is composed of a non-volatile, battery-backed RAM (NVRAM) cache,
  kernel software, and administrative utilities.

  Prestoserve quickly turns around synchronous disk write requests by
  writing this class of data to NVRAM instead of writing it to disk.
  Instead of synchronously waiting on an electro-mechanical disk arm,
  the requesting process synchronously waits on an NVRAM copy operation.
  Later, only when necessary, data is written to disk. NVRAM cache
  management gives preference to writing older data and
  (logically) contiguous blocks; a set of blocks is handed off to the
  underlying disk subsystem for (asynchronous) writing, where head
  optimization algorithms can be put to use. Disk "hot spots" tend to
  remain in the cache; this can lead to a significant reduction in
  actual disk I/O.
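
  The behavior described above can be sketched as a toy model. This is
  not the Prestoserve implementation: the class name, the max_run limit,
  and the policy of treating a rewritten block as "recent" are all
  assumptions made for illustration. It only shows why repeated
  synchronous writes to the same block need not each reach the disk.

```python
from collections import OrderedDict


class ToyNvramCache:
    """Toy model of a write-back cache in front of a slow disk (illustrative only)."""

    def __init__(self, capacity_blocks, max_run=8):
        self.capacity = capacity_blocks   # number of blocks the "NVRAM" holds
        self.max_run = max_run            # most blocks handed to the disk at once
        self.dirty = OrderedDict()        # block number -> data, oldest first
        self.disk = {}                    # stands in for the real disk
        self.disk_writes = 0              # blocks actually written to disk

    def write(self, block, data):
        """Synchronous write: returns once the data is in the cache."""
        if block in self.dirty:
            # Rewriting a cached block costs no disk I/O; treat it as recent
            # (an assumed policy) so that hot spots tend to stay cached.
            self.dirty.move_to_end(block)
        elif len(self.dirty) >= self.capacity:
            self._drain_oldest_run()
        self.dirty[block] = data

    def _drain_oldest_run(self):
        """Write out the oldest block plus any contiguous dirty successors."""
        start = next(iter(self.dirty))
        block = start
        while block in self.dirty and block - start < self.max_run:
            self.disk[block] = self.dirty.pop(block)
            self.disk_writes += 1
            block += 1

    def flush(self):
        """Administrative flush or shutdown: write everything, lowest block first."""
        for block in sorted(self.dirty):
            self.disk[block] = self.dirty[block]
            self.disk_writes += 1
        self.dirty.clear()


cache = ToyNvramCache(capacity_blocks=4)
for i in range(100):
    cache.write(7, f"inode update {i}")   # same block rewritten 100 times
cache.flush()
print(cache.disk_writes)                  # 1
```

  In the toy model, one hundred synchronous writes to block 7 turn into
  a single disk write at flush time.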

  Although read requests can be satisfied from the NVRAM cache,
  the benefit is usually minimal unless the filesystem buffer cache
  is configured too small for the application mix. Raw (character)
  device read requests are handed off to the underlying disk driver
  after first flushing any overlapping dirty blocks in the NVRAM cache.
  Raw device writes are not accelerated.

  When Prestoserve is disabled (for a particular disk partition),
  all I/O requests are directly passed through to the underlying
  disk driver.

  The NVRAM cache is flushed upon administrative command or upon
  normal system shutdown (shutdown, halt, reboot). The NVRAM cache is
  also flushed at boot time following an abnormal shutdown, whether from
  power failure, hardware failure, or software failure. In the event
  of power failure, data integrity is retained by the battery subsystem
  (for months).

  Applications that can take advantage of Prestoserve are 
  NFS servers, some DBMSs, and the local ULTRIX filesystem.

  The reason that NFS servers are dramatically accelerated
  (general rule of thumb: twice as many clients, all served twice as fast)
  lies in the nature of the NFS protocol. As part of its crash recovery
  design, the NFS protocol requires that a server commit data to stable
  storage before responding to the client that a modification request
  has been completed. NVRAM is much faster stable storage than a disk.
  Also, inodes and indirect blocks for files being written form hot spots
  that remain in the NVRAM cache (2/3 of the disk writes generated by
  writing a large file over NFS are repeated writes to the inode and
  indirect blocks). 
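
  The 2/3 figure can be reproduced with a little arithmetic. The sketch
  below assumes a simplified model that is not taken from any real NFS
  server code: every data block written over NFS costs three synchronous
  disk writes (the data block, the inode, and one indirect block), and
  the whole file sits under a single indirect block. The function name
  is hypothetical.

```python
def nfs_large_file_writes(n_data_blocks):
    """Count disk writes for writing a large file over NFS (simplified model).

    Returns (sync_writes_without_cache, metadata_fraction, writes_with_cache).
    """
    # Assumed model: each request forces data + inode + indirect block to disk.
    writes_per_request = 3
    sync_writes = n_data_blocks * writes_per_request

    # Two of every three writes are repeats of the same inode and indirect block.
    metadata_writes = n_data_blocks * 2
    metadata_fraction = metadata_writes / sync_writes

    # If the inode and indirect block stay in the cache as hot spots, each
    # reaches the disk once; every data block is still written once.
    writes_with_cache = n_data_blocks + 2

    return sync_writes, metadata_fraction, writes_with_cache


total, fraction, cached = nfs_large_file_writes(1000)
print(total, round(fraction, 3), cached)   # 3000 0.667 1002
```

  Under these assumptions, a cache that absorbs the repeated metadata
  writes cuts the disk writes for the file to roughly a third.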

*** Digital Prestoserve platforms:

  ULTRIX V4.1 supports Prestoserve on the DECsystem 5500 (standard)
  and the DECsystem 5100 (optional).

  Digital does not make use of the Legato (VME board) NVRAM hardware.
  On the DS5500, the NVRAM cache (0.5 MB) is located on the processor
  board and is accessed via a private memory interconnect. On the DS5100,
  the NVRAM cache (1 MB) is implemented as an optional memory SIMM
  and is accessed via the main memory interconnect.

  Architecturally (in my opinion), separating NVRAM copies from actual
  transfers on the I/O bus makes sense. From a performance point of view,
  memory interconnects yield high bandwidths; e.g. on the DS5100, NVRAM
  writes occur at 40 MB/sec.
 
  It is a safe bet that ULTRIX will support Prestoserve on other platforms
  in the future.

*** Now, on comp.sys.dec:

In article <VIXIE.91Jan28210405@volition.pa.dec.com>, vixie@decwrl.dec.com (Paul A Vixie) writes:
> Richard,
> 
> >>	- "PrestoServe" synchronous file system accelerator
> >>	  *dramatically* speeds up write access to any synchronous
> >>	  mounted file system, notably including NFS server disks.
> 
> I keep meaning to ask Chet about this.  Do you know if the DEC PrestoServe
> product allows asynchronous writes to local filesystems' inode and directory
> blocks?  For Decwrl, that would be a big win since it creates and deletes a
> lot of files per second.
> 
> I'm posting this to Usenet since I think the answer would be generally
> interesting.  
> 
> Cheers,
> --
> Paul Vixie
> DEC Western Research Lab	<vixie@wrl.dec.com>
> Palo Alto, California		...!decwrl!vixie

Directory operations (create, remove, rename) in the ULTRIX (Berkeley fast)
local file system require *synchronous* disk writes to ensure protection
of the name space in case of system failure. This class of local
file system operation is directly accelerated by Prestoserve.

Here in ULTRIXland, Prestoserve systems have shown themselves to
be desirable platforms for software generation. When doing compiles,
temporary files are created and deleted. When doing loads, renames are
done. The associated directory blocks will live in the NVRAM cache.
Disk I/O and head movement are reduced, and so is the portion of the
software generation cycle spent waiting on the disk arm for these
operations. We have seen reductions of 20% to 30% in the elapsed
time needed for kernel builds, for example.

Here are the results of some experiments I've just run on a DS5500.
The system has 32 MB of memory, 25% bufcache, and RZ57 disks.

1. When I do a machine-specific kernel make (36 small compiles, one large ld):

   without presto:  
	# time make >&log
	35.0u 23.0s 3:30 27% 87+650k 4801+3082io 853pf+0w
	#

   with presto:
	# time make >&log.presto
	35.7u 22.3s 2:26 39% 86+644k 4814+3024io 715pf+0w
	#

   or, a decrease of 30% in elapsed time.

2. When I touch h/buf.h, and do a make in BINARY (175 larger compiles):

   without presto:  
	# time make >&log
	648.8u 115.4s 19:47 64% 193+743k 1577+14716io 3359pf+0w
	#

   with presto:
	# time make >&log.presto
	648.2u 116.9s 15:42 81% 192+742k 1439+14490io 3358pf+0w
	#

   or, a decrease of about 21% in elapsed time (19:47 down to 15:42).

3. When I create and destroy a /usr/include directory clone
    (approx. 400 files) via tar(8):

   without presto:  
	# umount /rz1g
	# mount /rz1g
	# cd /rz1g/chet/junk
	# time tar xf ../../inc.tar
	0.2u 1.9s 0:38 5% 52+76k 373+1543io 4pf+0w
	# time rm -rf *
	0.0u 0.5s 0:19 2% 15+30k 2+911io 0pf+0w

   with presto:
	# umount /rz1g
	# mount /rz1g
	# cd /rz1g/chet/junk
	# presto -u
	# time tar xf ../../inc.tar
	0.2u 2.4s 0:12 21% 52+76k 373+1541io 4pf+0w
	# time rm -rf *
	0.0u 0.9s 0:01 61% 15+31k 2+912io 0pf+0w

   or, a decrease of 68% in elapsed time for the create,
   and a decrease of 95% in elapsed time for the remove.
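
   The percentages can be checked mechanically from the timing lines.
   This is a minimal sketch; it assumes the csh time format used above,
   where the third whitespace-separated field is the elapsed time as
   m:ss. The function names are hypothetical.

```python
def elapsed_seconds(time_line):
    """Return elapsed wall-clock seconds from a csh `time` output line."""
    field = time_line.split()[2]          # e.g. "0:38" or "19:47"
    seconds = 0.0
    for part in field.split(":"):         # also copes with h:mm:ss
        seconds = seconds * 60 + float(part)
    return seconds


def percent_decrease(before_line, after_line):
    """Percentage reduction in elapsed time from `before` to `after`."""
    before = elapsed_seconds(before_line)
    after = elapsed_seconds(after_line)
    return 100.0 * (before - after) / before


# Experiment 3, timing lines copied from the runs above.
tar_before = "0.2u 1.9s 0:38 5% 52+76k 373+1543io 4pf+0w"
tar_after = "0.2u 2.4s 0:12 21% 52+76k 373+1541io 4pf+0w"
rm_before = "0.0u 0.5s 0:19 2% 15+30k 2+911io 0pf+0w"
rm_after = "0.0u 0.9s 0:01 61% 15+31k 2+912io 0pf+0w"

print(round(percent_decrease(tar_before, tar_after)))   # 68
print(round(percent_decrease(rm_before, rm_after)))     # 95
```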


	-chet

Chet Juszczak				chet@decvax.dec.com
Digital Equipment Corporation		decvax!chet
110 Spit Brook Rd. ZKO3-3/U14
Nashua, NH 03062