mikes@3comvax.UUCP (Mike Shannon) (08/23/85)
I'm curious about the 'suitability' of UNIX with respect to database implementations. I've heard criticisms about the file system, because finding a given sector in a file may require looking at up to 3 other sectors associated with the file inode. Some people say that 'extent-based' file systems are inherently faster than unix's.

I've also heard that databases usually rely on locking primitives which aren't present in UNIX. (So how do UNIX databases perform locking?) I've heard that System V UNIX (release 2?) has now defined some locking primitives. Is this true? Doesn't this imply some important changes to the kernel?

What about different file access methods (ISAM, hashed, etc.)? Can these always be satisfactorily layered on unix's bytestream file system as a library? Isn't this expensive in terms of performance?

Personally, I think UNIX is the best development system anywhere. I'd like to see it become a major success in the business arena, and I think UNIX's performance in supporting databases is a key issue for that success. So, what are your opinions?
	-thanks, Michael Shannon ({ihnp4,hplabs}!oliveb!3comvax!mikes)
dws@tolerant.UUCP (Dave W. Smith) (08/23/85)
> Some people say that 'extent-based' file systems are inherently
> faster than unix's.
> ... So, what are your opinions?
> Michael Shannon ({ihnp4,hplabs}!oliveb!3comvax!mikes)

We built extents into our rewrite of the Unix file system, and allow primary and secondary extent sizes of 1K to a current max of 8Mb. Avoiding indirect blocks when dealing with big files is *really* nice.
-- 
David W. Smith          {ucbvax}!tolerant!dws
Tolerant Systems, Inc.
408/946-5667
guy@sun.uucp (Guy Harris) (08/24/85)
> I'm curious about the 'suitability' of UNIX with respect to
> database implementations. I've heard criticisms about the file system
> because to find a given sector in a file may require looking at up to
> 3 other sectors associated with the file inode.
> Some people say that 'extent-based' file systems are inherently
> faster than unix's.

If the extents are big enough, this may be true. The UNIX file map for the V7 file system (used by all standard UNIX releases except 4.2/4.3BSD) has, in the inode (which is always in core), 10 direct block pointers, one singly-indirect block pointer, one doubly-indirect block pointer, and one triply-indirect block pointer. The 10 direct block pointers point to the first 10 blocks of the file (usually 512 or 1024 bytes each); the singly-indirect block pointer points to a block which points to the next 128 512-byte blocks or the next 256 1024-byte blocks; the doubly-indirect block pointer points to a block which points to the appropriate number of singly-indirect blocks; etc. At worst, you have to fetch the triply-indirect block, the appropriate doubly-indirect block it points to, and the appropriate singly-indirect block it points to, in order to fetch a block of the file. However, there is a fair chance that the triply-indirect block is in core if you've just referenced another block using it; the same applies to the other indirect blocks.

The 4.2BSD file system has 12 direct pointers, and the blocks are usually 4096 or 8192 bytes; it has triply-indirect blocks, but they've never been tested, as files can get a lot bigger before they require them.

The file map for an extent-based file system typically consists of entries which say "the next N blocks of the file are located in the N blocks on the disk starting at block M". Thus, a map entry can map more blocks than a UNIX map entry, which always maps in units of the file system's block size.
However, in an extent-based scheme you can't directly compute the location of the map entry for a given block of a file; you have to search the map for the entry. This could, in principle, require more blocks to be read than the UNIX scheme. I don't know that it does so in practice.

I don't think there's a simple answer to the question "is an extent-style file map faster than a UNIX-style file map?" It depends on the pattern of accesses to blocks of the file. The worst case would be purely random access; however, both the UNIX scheme and the extent map scheme perform much better if the locus of file access doesn't move too fast (the UNIX scheme is more likely to find the desired indirect block in the buffer cache, and the extent map scheme is more likely not to have to "turn the window" and pull in another map block). I don't know the locality of "typical" block accesses in a "typical" database system.

Another issue is "how much is a large file scattered across the disk?" The V7 file system uses a linked list to record unused blocks; when writing a file, this makes it harder to allocate block N+1 of a file on a cylinder near block N. The 4.2BSD file system (and some other file systems used on UNIX-based systems), and most (if not all) extent-based file systems, use a bit map and can more easily allocate a block near its neighbor in the file.

Does anybody have any numbers comparing the behavior of a database system on the V7 file system, the 4.2BSD file system, and some extent-based file systems?

> I've also heard that databases usually rely on locking primitives
> which aren't present in UNIX. (So how do UNIX databases perform locking?).

Some of them (such as INGRES) require you to add those primitives to UNIX (INGRES has a device driver which you install into your system).

> I've heard that SystemV UNIX (release 2?) has now defined some locking
> primitives. Is this true? Doesn't this imply some important changes to the
> kernel?
S5R2 Version 2 (for the VAX; lord knows what version for other systems) has a set of locking primitives. They are influenced by a set of locking primitives originally developed by John Bass, which permit you to lock a region of a file defined by a file offset and a byte count. Most of the UNIX file/record locking primitives descend from these primitives. The /usr/group UNIX-based OS standard includes one such set; the S5R2V2 system implements these on top of its own similar primitives as a compatibility layer.

Yes, it implies important changes to the kernel; so what? One would expect it to require some changes...

> What about different file access methods (ISAM, hashed, etc)? Can
> these always be satisfactorily layered on unix's bytestream file system
> as a library? Isn't this expensive in terms of performance?

Why would it be any more expensive than, say, layering it on VMS's block-array file system as a library (which is exactly what VMS does)? Many OSes out there have a low-level set of file I/O primitives which imply no file structure, and build text file, or ISAM, or hashed, or whatever file structure on top of them, just as UNIX does.

	Guy Harris
anton@ucbvax.ARPA (Jeff Anton) (08/24/85)
UNIX & Databases: a quick overview.

UNIX provides a great environment for program development, but in building a powerful and efficient DBMS for it you are going to run into a fair number of problems.

Small address space: This problem is quickly disappearing. PDP's are going out of style, and 32-bit micros which can page are eliminating it. However, physical memory is not growing on most machines, and this will affect performance.

Relational model dominant: Most DBMS's on UNIX are relational. The relational model allows a nonprocedural language to build queries. It's very easy to write a query that will not have an easy solution, but since the databases on UNIX are in general small this is ok. Network and hierarchical DBMS's have strict methods of accessing the data and will be faster and stored more efficiently in consequence. Relational systems will not just scale up to gigabyte sizes and keep their performance. (Try sorting a billion names sometime; n log n is still the rule, and sort-merge requires as much free space as what you are sorting.)

Locking (concurrency): Locking, what locking? UNIX file locking schemes are, in general, bad. V7: kludgey file locking by making links. BSD: file locking with 2 levels. ATT: the /usr/group lockf call, which can lock ranges of files but is only 2 levels. A DBMS would be well off with three (3) levels of locks (operating system people LISTEN!). We need shared, update, and exclusive locks. (These may be advisory, since the data is accessed by special routines.)

Conflict table:
	    S  U  E
	S   +  +  -
	U   +  -  -
	E   -  -  -

Reasoning: If you think you might have to change some data in a file, you get an update lock. Read-only routines can have shared locks. If you find you don't have to change data, you remove the lock or lower it to shared, allowing someone else to get an update lock.
If you do have to change the data, you raise the lock to exclusive and wait until all the shared locks go away. Know what else goes away? Deadlock. (Well, not completely, but the problem is easier.) Since no UNIX provides adequate locking, you see things like INGRES's lock-driver kernel mods. I'd like to know how any UNIX DBMS can be concurrent without such a beast.

Filesystem (recovery): The UNIX filesystem (new or old) can be treated as blocks instead of bytes by starting and moving data at multiples of 512 bytes. This is more efficient for UNIX and good for DBMS's. The Berkeley fast filesystem does a good job of putting blocks of the same file together. However, there are no synchronous disk writes: the system may crash with data supposedly written to disk that is not there when the system comes up again. Also, the system might drop inodes (files) on the floor during reboot disk auditing. INGRES uses a system of batching updates; transactions can be completed intact if the batch file is complete, and if it is not, what is there can be used to back out of partially complete transactions. There is a system call, fsync, which will cause a file to be flushed out of the system buffers and onto the disk, but it's very slow.

Random access of large files on UNIX implies an increase in disk I/O in order to fetch indirect blocks which point to indirect blocks which point to data. And again, when you cause a new block to be allocated to a file and change the indirect block, does the system write that indirect block before or after it writes the data block?

The solution to these problems: "Don't Panic." UNIX DBMS's are reliable up to the last dump of the system, and from then on it's unlikely you'll lose. But for the banks and the purists, we're working on what may be a way around the problems. I'm not sure enough right now to publicize yet.

Security (IPC): Microcomputers can't really worry about such things.
Any program can reach into MSDOS and convince it that things are fine. On UNIX we must be more careful. If you are going to allow programs to be written that interact with the DBMS, you will have to either trust that program not to touch things it should not, or put IPC between that program and the DBMS. Pipes can do the job, but they are simple and do not lend themselves to sequenced packets, which would be ideal. The early days of INGRES used pipes as the centerpiece of a complex system allowing the DBMS to be broken into many pieces connected by pipes; seven (7) processes were connected at one time on a PDP that did not have separate text and data space and only 64k bytes in a proc. Today, remote procedure calls are a fad and are likely to handle the job fine, but they are only in BSD, and a commercial DBMS would be limiting its market too much. Shared memory could also work with some coercion. Privileged sections of processes (programs whose effective uid could change while the program counter is on certain memory pages) would be best, since they would allow real procedure calls plus the security that unauthorized data will not be examined.

Devices: The number of physical devices on a UNIX system is a constraint. 4.3BSD has increased the number of filesystems from 16 to 256, but files can not extend beyond a filesystem, and a filesystem can only be 4 gigabytes (2^32)? (Something makes me think 1G.) Beyond these numbers you don't have many UNIX machines. Few C compilers support integers > 32 bits; it's hardware dependent.

The future: I believe UNIX can grow to meet these problems. However, it may take lots of argument to get the operating systems people to move on these matters; they are 'dull' database stuff. Databases run the modern world. Windowing and networking are the current fads, and perhaps the database community should move into the OS world with its own ideas.
We've had to work around the OS for a long time; perhaps the effort of working around could change into working with, and then our jobs would be much easier for a long time to come. And the OS world might learn something too.
-- 
C knows no bounds.
					Jeff Anton
					U.C.Berkeley Ingres Group
					ucbvax!anton
					anton@BERKELEY.EDU
hachong@watmath.UUCP (Herb Chong [DCS]) (08/25/85)
In article <2693@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>I don't think there's a simple answer to the question "is an extent-style
>file map faster than a UNIX-style file map?" It depends on the pattern of
>accesses to blocks of the file. The worst case would be purely random
>access; however, both the UNIX scheme and the extent map scheme perform much
>better if the locus of file access doesn't move too fast (the UNIX scheme is
>more likely to find the desired indirect block in the buffer cache, and the
>extent map scheme is more likely not to have to "turn the window" and pull
>in another map block). I don't know the locality of "typical" block
>accesses in a "typical" database system.

IBM's IMS dbms is overlaid on top of their VSAM file structure. IMS is a hierarchical system, although it is possible to construct network and ISAM-like databases with a little bit of work. unless they have changed it again in recent years, IMS uses a B-tree method of storing indices into a database, with the leaf nodes themselves being the actual "segments" of data.

VSAM itself allows three physical access methods depending on how the VSAM file was defined: byte offset (like unix), record offset (like IBM direct access, not ISAM), and keyed access. IMS uses only keyed access. a VSAM key is a character string up to 256 characters long. using multiple levels of indirection through memory-buffered indices, fast access is allowed into the VSAM file system itself. the VSAM routines provide an additional layer of buffering for data and indices for you; all you have to do is specify the size of each, or use the defaults assigned at file creation time. the relative sizes and the absolute sizes can be tuned for the particular application (i.e. mostly sequential access requires large data but small index buffers, while mostly random requires large index but small data buffers).
obviously, we are talking a lot of memory here to hold all the indexing information, but IMS checkpoints the index information to disk periodically in case of system crashes, and of course a complete log file is maintained to back out changes as well as for billing. one large database used for online banking running on a single processor can support sustained transaction rates of about 1000/second. applications run under the control of IMS and request services from it to perform database transactions. only one copy of the database index is actually kept by IMS, and most of it is paged out when not needed anyway.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....

UUCP:  {decvax|utzoo|ihnp4|allegra|clyde}!watmath!hachong
CSNET: hachong%watmath@waterloo.csnet
ARPA:  hachong%watmath%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs
henry@utzoo.UUCP (Henry Spencer) (08/26/85)
> Locking (concurrency): Locking, what locking?
> UNIX file locking schemas are, in general, bad.
> ...
> A DBMS would be well off with three (3) levels of locks,
> (operating system people LISTEN!) ...

If you database people could manage to get your act together and agree on something for more than 15 milliseconds, maybe we *would* listen. The main reason nobody wants to put database-oriented locking into an operating system is that no two database people agree on what they want.

The existing Unix locking schemes are simple, hence often inadequate, because nobody wants to go through the hassle of implementing a complex mechanism when there is no consensus on what it should look like.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry
roy@phri.UUCP (Roy Smith) (08/27/85)
> From: Jeff Anton, Ingres Group <ucbvax!anton>
>
> A DBMS would be well off with three (3) levels of locks [...] shared,
> update, and exclusive. [...] If you think you might have to change some
> data in a file you get an update lock. [...] If you do have to change
> the data you raise the lock to exclusive and wait until all the shared
> locks go away. Know what else goes away? Deadlock.

I'm not sure I follow. I don't see how the update lock is anything more than a hint that you might want an exclusive lock later. Like all hints (vadvise, for example), this may improve efficiency, but I don't see how it really changes anything in the end. In particular, how does this eliminate deadlocks (and why doesn't the 2-level 4.2 flock)?

Since we're on the subject, there's something I don't understand about the 4.2 flock(2) call. As I understand it, having a shared lock guarantees that the file can't change beneath you. But when you upgrade your shared lock to an exclusive one, "this results in the previous lock being released and the new lock applied (POSSIBLY AFTER OTHER PROCESSES HAVE GAINED AND RELEASED THE LOCK)" [my emphasis]. This doesn't make sense; you poke around in the data base to figure out what you want to do and then, just before you are ready to write your changes, you give somebody else a chance to write theirs? Doesn't that mean that you then have to go back and re-compute your whole update?

I can see how *not* giving others a shot when you upgrade a lock might lead to deadlocks (which would answer my previous question nicely, thank you). As I read the manual, though, it just doesn't make sense. Can somebody tell me what I'm missing (if anything)?

> [talking about reliability and limitations of the file system]
> But for the banks and the purests, we're working on what may be a way
> around the problems. I'm not sure enough right now to publicize yet.
If there are problems with the file system, why can't you just open up a raw disk and implement your own, like was done for testing the 4.2 F.S. under a 4.1 kernel? I don't see any way to get around the write-behind strategy in the block I/O system, but at least you can make multi-volume files if you want. You could even implement shadow pages and volumes (to keep "the banks and the purests" happy) in the user-level file system routines. Cpu cycles *are* cheaper than HSC-50's, aren't they? :-)
-- 
Roy Smith <allegra!phri!roy>
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
jeff@rtech.UUCP (Jeff Lichtman) (08/28/85)
> A DBMS would be well off with three (3) levels of locks,
> (operating system people LISTEN!) We need shared, update, and
> exclusive locks. (This may be advisory since the data is accessed
> by special routines)
> Conflict table:
>	    S  U  E
>	S   +  +  -
>	U   +  -  -
>	E   -  -  -
> Reasoning: If you think you might have to change some data
> in a file you get an update lock. Read only routines can have
> shared locks. If you find you don't have to change data you
> remove the lock or lower it to shared, allowing someone else to
> get an update lock. If you do have to change the data you
> raise the lock to exclusive and wait until all the shared locks go away.
> Know what else goes away? Deadlock.
> (Well, not completely but the problem is easier.)

I'm curious about this proposed scheme. I can't see how it would make deadlock less frequent, and why would it make it easier to deal with?

With two-level locking, deadlock happens if two transactions hold locks and each tries to get locks that conflict with the other's. For example, t1 gets an exclusive lock on a, t2 gets a shared lock on b, t1 gets a shared lock on b but blocks trying to escalate it to exclusive, and t2 blocks trying to get a shared lock on a. Using update locks, t1 would get an update lock on a and escalate it to exclusive, t2 would get an update lock on b and de-escalate it to shared, t1 would get an update lock on b but would block trying to escalate it to exclusive, and t2 would block trying to get an update lock on a. You still get deadlocked, even though the exact locks involved are not the same. And since deadlock is possible, one has to deal with it somehow, so I don't see why the problem is easier.

Also, how do you decide when it is safe to de-escalate a lock from update to shared? This would be easy to do with single-statement transactions, but RTI's version of INGRES has multi-statement transactions (that is, one can have more than one QUEL statement within a transaction).
The only solution I can see is to de-escalate to shared for each object touched but not altered in each statement, then re-escalate to update if another statement re-reads the object with the possibility of changing it. Under some circumstances, this could lead to a lot more locking activity than one would encounter with two-level locking, which could be costly.
-- 
Jeff Lichtman at rtech (Relational Technology, Inc.)
aka Swazoo Koolak
{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff
peter@baylor.UUCP (Peter da Silva) (08/29/85)
> Some people say that 'extent-based' file systems are inherently
> faster than unix's.

Depends on the file size. If by an 'extent based' system you mean a linked list of large extents, as opposed to a tree of small extents, which is what the UNIX file system actually is (an extent is just a single indirect block, in UNIX terminology), then for small (1-2 extent) files UNIX will be slower. Once you have more than 2 extents you're going to be doing 3 indirects with your extent based system anyway, whereas UNIX will still be accessing the first or second indirect block (I'm not sure of the terminology).

Assuming buffering and equal blocksize, with N = #ptrs/block:

UNIX:	INODE
		10 * ^block
		1  * ^indirect block (<n> of ^block)
		1  * ^double indirect block (<n> of ^indirect block)
		1  * ^triple indirect block (<n> of ^double indirect block)

E-SYS:	EXTENT
		<n>-1 * ^block
		1     * ^EXTENT

So the first 10 blocks take 1 indirect in each. The next <n>-11 blocks take 1 indirect in your extent system, 2 in UNIX. The next 9 blocks take 2 indirects in each. Then there are another <n>-11 with 3 indirects. After that UNIX is ahead... 3 indirects for UNIX, 4 or more for extents. That may be screwed up somewhere, but by inspection you should be able to see that after the third extent UNIX is ahead.

I wonder if that's why this damn IBM-PC is seeking all over the place? Nah. My files aren't that big.
-- 
	Peter (Made in Australia) da Silva
	UUCP: ...!shell!neuro1!{hyd-ptd,baylor,datafac}!peter
	MCI: PDASILVA; CIS: 70216,1076
anton@ucbvax.ARPA (Jeff Anton) (08/29/85)
In article <423@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>> A DBMS would be well off with three (3) levels of locks [...] shared,
>> update, and exclusive. [...] If you do have to change
>> the data you raise the lock to exclusive and wait until all the shared
>> locks go away. Know what else goes away? Deadlock.
>
> I'm not sure I follow. I don't see how the update lock is anything
>more than a hint that you might want an exclusive lock later. ...
>In particular, how does this eliminate deadlocks (and why doesn't the
>2-stage 4.2 flock)?

Note that an update lock conflicts with other update locks but allows shared locks. This avoids the possible deadlock case of two shared locks trying to promote to exclusive locks, as only update locks should promote to exclusive, and there can only be one update lock.

> Since we're on the subject, there's something I don't understand
>about the 4.2 flock(2) call. ... This doesn't make sense; you poke
>around in the data base to figure out what you want to do and then,
>just before you are ready to write your changes, you give somebody
>else a chance to write theirs? Doesn't that mean that you then have
>to go back and re-compute your whole update?
>
> I can see how *not* giving others a shot when you upgrade a lock
>might lead to deadlocks (which would answer my previous question nicely,
>thank you). As I read the manual, though, it just doesn't make sense. Can
>somebody tell me what I'm missing (if anything)?

The funny behavior of flock avoids the deadlock problem I mentioned above, with dangerous consequences. If two shared locks tried to promote to exclusive, you would have a problem. If I understand the internals of 4.3 on the vax correctly, once all shared locks disappear, the proc which first requested the exclusive lock should get it, and then the next. To protect yourself, you should re-read whatever data you had during the shared lock before you do your write. This just slows things down more and introduces all kinds of problems, including the concurrency conflicts that locks are specifically used to avoid. (You're right on both counts.)

>> [talking about reliability and limitations of the file system]
>> But for the banks and the purests, we're working on what may be a way
>> around the problems. I'm not sure enough right now to publicize yet.
>
> If there are problems with the file system, why can't you just open
>up a raw disk and implement your own like was done for testing the 4.2 F.S.
>under a 4.1 kernel? ... Cpu cycles *are* cheaper than HSC-50's, aren't
>they? :-)

I should have said "an elegant solution." (i.e. new)
-- 
C knows no bounds.
					Jeff Anton
					U.C.Berkeley Ingres Group
					ucbvax!anton
					anton@BERKELEY.EDU
anton@ucbvax.ARPA (Jeff Anton) (08/29/85)
In article <5909@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> Locking (concurrency): Locking, what locking?
>> UNIX file locking schemas are, in general, bad.
>> ...
>> A DBMS would be well off with three (3) levels of locks,
>> (operating system people LISTEN!) ...
>
>If you database people could manage to get your act together and agree
>on something for more than 15 milliseconds, maybe we *would* listen.
>The main reason nobody wants to put database-oriented locking into an
>operating system is that no two database people agree on what they want.
>
>The existing Unix locking schemes are simple, hence often inadequate,
>because nobody wants to go through the hassle of implementing a complex
>mechanism when there is no consensus on what it should look like.

Ok, no name calling. File locking is a feature of almost every major operating system. UNIX avoids the issue, and I can appreciate the reasons: UNIX has survived, in part, because what it does, it does in a generalized way. What is needed is a generalized view of file locking. Simple applications such as mail (mail is not that simple, but stuffing mailboxes is) don't need more than flock(2). Complex applications, DBMS's and custom databases, will need better locking. How about range locking with arbitrary levels and user-defined conflict conditions? This should cover all the bases, and I can imagine how to do it as well. If the /usr/group lockf call could be so generalized, we could have backward compatibility and forward progress. I know it's not all that simple. But if I did build the above locking call, would it be accepted by the OS world? Maybe.
-- 
C knows no bounds.
					Jeff Anton
					U.C.Berkeley Ingres Group
					ucbvax!anton
					anton@BERKELEY.EDU
larry@ucbingres.ARPA (Larry Rowe) (08/30/85)
In article <16271@watmath.UUCP> hachong@watmath.UUCP (Herb Chong [DCS]) writes:
>.... one
>large database used for online banking running on a single processor
>can support sustained transaction rates of about 1000/second.
>applications run under the control of IMS and request services from it
>to perform database transactions. only one copy of the database index
>is actually kept by IMS, and most of it is paged out when not needed
>anyway.

i am somewhat surprised to hear you quoting an ims benchmark that runs 1000 xacts a second. i thought ims was only capable of running around 100 xacts/sec when run on a dual 3084 running fastpath. could you give more info on the application you cite, or provide a pointer to an article that describes it?

thanks,
	larry rowe