chet@decwrl.dec.com (Chet Juszczak) (02/05/91)
I've been asked a lot of questions about Prestoserve lately (which is
great).  I thought it would be a good idea to start off with a solid
base of information instead of assuming context on the part of readers.

*** First off, an introduction:

I'm an engineer in the ULTRIX engineering group.  I ported the Legato
Prestoserve software to ULTRIX V4.1; I work with Digital h/w engineers
to develop NVRAM platforms for this technology.

*** Next, a little Prestoserve background:

Prestoserve is a filesystem (actually, block device) accelerator.  It
is composed of a non-volatile, battery-backed RAM (NVRAM) cache, kernel
software, and administrative utilities.

Prestoserve quickly turns around synchronous disk write requests by
writing this class of data to NVRAM instead of writing it to disk.
Instead of synchronously waiting on an electro-mechanical disk arm, the
requesting process synchronously waits on an NVRAM copy operation.
Later, only when necessary, data is written to disk.  NVRAM cache
management gives preference to writing older data and (logically)
contiguous blocks; a set of blocks is handed off to the underlying disk
subsystem for (asynchronous) writing, where head optimization
algorithms can be put to use.  Disk "hot spots" tend to remain in the
cache; this can lead to a significant reduction in actual disk I/O.

Although read requests will be satisfied from the NVRAM cache, this
effect is usually minimal unless the filesystem buffer cache is
configured too small for the application mix.  Raw (character) device
read requests are handed off to the underlying disk driver after first
flushing any overlapping dirty blocks in the NVRAM cache.  Raw device
writes are not accelerated.  When Prestoserve is disabled (for a
particular disk partition), all I/O requests are passed directly
through to the underlying disk driver.

The NVRAM cache is flushed upon administrative command or upon normal
system shutdown (shutdown, halt, reboot).  It is also flushed at boot
time following an abnormal shutdown: power failure, hardware failure,
or software failure.  In the event of power failure, data integrity is
retained by the battery subsystem (for months).

Applications that can take advantage of Prestoserve are NFS servers,
some DBMSs, and the local ULTRIX filesystem.  The reason that NFS
servers are dramatically accelerated (general rule of thumb: twice as
many clients, all served twice as fast) lies in the nature of the NFS
protocol.  As part of its crash recovery design, the NFS protocol
requires that a server commit data to stable storage before responding
to the client that a modification request has been completed.  NVRAM is
much faster stable storage than a disk.  Also, inodes and indirect
blocks for files being written form hot spots that remain in the NVRAM
cache (2/3 of the disk writes generated by writing a large file over
NFS are repeated writes to the inode and indirect blocks).
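To make the write-behind behavior concrete, here is a toy, user-level
sketch in plain C.  It is only an illustration -- none of the names
(nvram_slot, presto_write(), presto_flush(), disk_write()) are the
actual ULTRIX driver interfaces, and the eight-slot "NVRAM" is just an
ordinary array -- but it models the essentials: a "synchronous" write
returns as soon as the data is copied into an NVRAM slot, a re-write of
a cached block (an inode, say) is absorbed with no disk I/O at all, and
a later flush hands the dirty blocks to the disk driver in block order.

/*
 * Toy model of Prestoserve-style write-behind caching.  The names and
 * structures are made up for illustration; they are not the ULTRIX
 * driver interfaces.
 */
#include <stdio.h>
#include <string.h>

#define BLKSIZE 512
#define NSLOTS    8             /* pretend the NVRAM holds 8 blocks   */

struct nvram_slot {
    int  valid;                 /* slot holds data not yet on disk    */
    long blkno;                 /* disk block this data belongs to    */
    char data[BLKSIZE];
};

static struct nvram_slot nvram[NSLOTS];

/* Stand-in for the underlying disk driver's write entry point. */
static void disk_write(long blkno, const char *data)
{
    printf("disk write:  block %ld\n", blkno);
}

/*
 * "Synchronous" write: copy the caller's buffer into NVRAM and return.
 * The caller is released as soon as the memory copy completes; the
 * disk write happens later.  Re-writing a block that is already cached
 * (a hot spot, e.g. an inode block) just overwrites the slot.
 */
static int presto_write(long blkno, const char *data)
{
    int i, free_slot = -1;

    for (i = 0; i < NSLOTS; i++) {
        if (nvram[i].valid && nvram[i].blkno == blkno) {
            memcpy(nvram[i].data, data, BLKSIZE);   /* hot spot hit   */
            return 0;
        }
        if (!nvram[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;              /* cache full; a real driver would flush */
    nvram[free_slot].valid = 1;
    nvram[free_slot].blkno = blkno;
    memcpy(nvram[free_slot].data, data, BLKSIZE);
    return 0;
}

/* Hand the dirty blocks to the disk driver, lowest block number first. */
static void presto_flush(void)
{
    int i, j;

    for (i = 0; i < NSLOTS; i++)            /* simple selection sort   */
        for (j = i + 1; j < NSLOTS; j++)
            if (nvram[j].valid &&
                (!nvram[i].valid || nvram[j].blkno < nvram[i].blkno)) {
                struct nvram_slot t = nvram[i];
                nvram[i] = nvram[j];
                nvram[j] = t;
            }
    for (i = 0; i < NSLOTS; i++)
        if (nvram[i].valid) {
            disk_write(nvram[i].blkno, nvram[i].data);
            nvram[i].valid = 0;
        }
}

int main(void)
{
    char buf[BLKSIZE];

    memset(buf, 0, sizeof buf);
    presto_write(100, buf);     /* data block                          */
    presto_write(7, buf);       /* inode block                         */
    presto_write(101, buf);     /* next data block                     */
    presto_write(7, buf);       /* inode re-written: absorbed in NVRAM */
    presto_flush();             /* only three disk writes result       */
    return 0;
}

In the main() example, four synchronous writes turn into three deferred
disk writes, and the writer never waits on the disk arm.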
*** Digital Prestoserve platforms:

ULTRIX V4.1 supports Prestoserve on the DECsystem 5500 (standard) and
the DECsystem 5100 (optional).  Digital does not make use of the Legato
(VME board) NVRAM hardware.  On the DS5500, the NVRAM cache (.5 MB) is
located on the processor board and is accessed via a private memory
interconnect.  On the DS5100, the NVRAM cache (1 MB) is implemented as
an optional memory SIMM and is accessed via the main memory
interconnect.  Architecturally (in my opinion), separating NVRAM copies
from actual transfers on the I/O bus makes sense.  From a performance
point of view, memory interconnects yield high bandwidths; e.g. on the
DS5100, NVRAM writes occur at 40 MB/sec.  It is a safe bet that ULTRIX
will support Prestoserve on other platforms in the future.

*** Now, on comp.sys.dec:

In article <VIXIE.91Jan28210405@volition.pa.dec.com>,
vixie@decwrl.dec.com (Paul A Vixie) writes:
> Richard,
>
> >> - "PrestoServe" synchronous file system accelerator
> >>   *dramatically* speeds up write access to any synchronous
> >>   mounted file system, notably including NFS server disks.
>
> I keep meaning to ask Chet about this.  Do you know if the DEC
> PrestoServe product allows asynchronous writes to local filesystems'
> inode and directory blocks?  For Decwrl, that would be a big win since
> it creates and deletes a lot of files per second.
>
> I'm posting this to Usenet since I think the answer would be generally
> interesting.
>
> Cheers,
> --
> Paul Vixie
> DEC Western Research Lab       <vixie@wrl.dec.com>
> Palo Alto, California          ...!decwrl!vixie

Directory operations (create, remove, rename) in the ULTRIX (Berkeley
fast) local file system require *synchronous* disk writes to ensure
protection of the name space in case of system failure.  This class of
local file system operation is directly accelerated by Prestoserve.

Here in ULTRIXland, Prestoserve systems have shown themselves to be
desirable platforms for software generation.  When doing compiles,
temporary files are created and deleted.  When doing loads, renames are
done.  The associated directory blocks will live in the NVRAM cache.
Disk I/O and head movement are reduced; that portion of the software
generation cycle that was spent waiting on the disk arm for these
operations is reduced.  We have seen reductions of 20% to 30% in the
elapsed time needed for kernel builds, for example.

Here are the results of some experiments I've just run on a DS5500.
The system has 32 MB of memory, a 25% bufcache, and RZ57 disks.

1. When I do a machine-specific kernel make (36 small compiles, one
   large ld):

   without presto:
        # time make >&log
        35.0u 23.0s 3:30 27% 87+650k 4801+3082io 853pf+0w
        #
   with presto:
        # time make >&log.presto
        35.7u 22.3s 2:26 39% 86+644k 4814+3024io 715pf+0w
        #

   or, a decrease of 30% in elapsed time.

2. When I touch h/buf.h, and do a make in BINARY (175 larger compiles):

   without presto:
        # time make >&log
        648.8u 115.4s 19:47 64% 193+743k 1577+14716io 3359pf+0w
        #
   with presto:
        # time make >&log.presto
        648.2u 116.9s 15:42 81% 192+742k 1439+14490io 3358pf+0w
        #

   or, a decrease of 17% in elapsed time.

3. When I create and destroy a /usr/include directory clone (approx.
   400 files) via tar(8):

   without presto:
        # umount /rz1g
        # mount /rz1g
        # cd /rz1g/chet/junk
        # time tar xf ../../inc.tar
        0.2u 1.9s 0:38 5% 52+76k 373+1543io 4pf+0w
        # time rm -rf *
        0.0u 0.5s 0:19 2% 15+30k 2+911io 0pf+0w
   with presto:
        # umount /rz1g
        # mount /rz1g
        # cd /rz1g/chet/junk
        # presto -u
        # time tar xf ../../inc.tar
        0.2u 2.4s 0:12 21% 52+76k 373+1541io 4pf+0w
        # time rm -rf *
        0.0u 0.9s 0:01 61% 15+31k 2+912io 0pf+0w

   or, a decrease of 68% in elapsed time for the create, and a decrease
   of 95% in elapsed time for the remove.

-chet

Chet Juszczak                           chet@decvax.dec.com
Digital Equipment Corporation           decvax!chet
110 Spit Brook Rd. ZKO3-3/U14
Nashua, NH 03062