lindner@cs.umn.edu (Paul Lindner) (11/11/90)
Does anyone know of any public domain code for doing relational database work?  Basically my requirements are variable-length records, searching on different keys in the database, and multiple concurrent accesses to the same database.

My prototype just used a directory structure and ASCII files, but I now need better performance and the ability to search on multiple fields.  I've looked at the dbm and ndbm libraries and found them lacking (i.e. poor multi-user access, only one key per database, and the limit on record size).  I imagine I could do like NIS (aka YP) and have multiple database files, but I would really rather not.  Also, I don't want to have to use Ingres, Sybase, or any other large database; I want my code to be standalone and free of any legal entanglements.

So basically, if you've got some database code lying around that you'd like to share, by all means let me know!  Otherwise I'd appreciate any references to a practical text on databases.  Thanks!
--
Paul Lindner, Univ. of MN       \ Microcomputer /   Paul's Law: You can't
IT Sun dude, & UofM ACM pres     \ Workstation /    fall off the floor.
lindner@boombox.micro.umn.edu     \  Networks /
{...!rutgers!umn-cs!lindner}  |||||\  Center  /|||||
davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (11/15/90)
The secret to having multiple keys in a dbm database is to use the first byte as the key type.  It's that easy, and it works very nicely, thanks.

I did some limited testing of dbm and gdbm, and found that while gdbm kept the file size way down, the build time was a lot longer.
--
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me
rang@cs.wisc.edu (Anton Rang) (11/16/90)
In article <2300@sixhub.UUCP> davidsen@sixhub.UUCP (Wm E. Davidsen Jr) writes:
>  The secret to having multiple keys in a dbm database is to use the
>first byte as the key type.  It's that easy, and works very nicely
>thanks.

  Actually, I thought the original poster meant `multiple keys' in the sense of one record having several access paths to it (for instance, being able to retrieve employees by either last name or SSN).  I don't know of any *dbm package which does this; it's not designed for it.

  Dbm/ndbm/gdbm works fine if you only have one key per record and don't need the records to be ordered; for slightly more complicated things, there's probably public-domain code floating around.  For a real relational DB, unless one can use University INGRES, I suspect one would need to go the commercial route.

	Anton

+---------------------------+------------------+-------------+
| Anton Rang (grad student) | rang@cs.wisc.edu | UW--Madison |
+---------------------------+------------------+-------------+
johnl@iecc.cambridge.ma.us (John R. Levine) (11/16/90)
In article <RANG.90Nov15214357@nexus.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
>In article <2300@sixhub.UUCP> davidsen@sixhub.UUCP (Wm E. Davidsen Jr) writes:
>> The secret to having multiple keys in a dbm database is to use the
>>first byte as the key type.  It's that easy, and works very nicely thanks.
>
> Actually, I thought the original poster meant 'multiple keys' in the
>sense of one record having several access paths to it.

You can fake it by making one key the primary key and storing the actual data in its record; the secondary keys point to the primary key.  To keep sizes down, you might want to make the primary key something small and boring, like a three- or four-byte serial number.

Logically each key type is in a separate file, but you can combine them into one big dbm file using the prefix-byte trick.  It's gross, but no grosser than what real databases do every day.

You may need to keep copies of the secondary keys in the primary record if you want to be able to delete the record or change the primary key.  A perfectly workable alternative is to delete the primary-key record when you delete the record, be prepared for secondary keys to be left dangling, and delete them in a batch run when you scan the file and pick out the dead wood.

If you do anything like this, the first thing you should write is a consistency checker that reports secondary keys with no primary key and (if it's an error) vice versa.
--
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl
"Typically supercomputers use a single microprocessor." -Boston Globe
chris@mimsy.umd.edu (Chris Torek) (11/19/90)
>In article <RANG.90Nov15214357@nexus.cs.wisc.edu> rang@cs.wisc.edu
>(Anton Rang) writes:
>>... I thought the original poster meant 'multiple keys' in the
>>sense of one record having several access paths to it.

(This was the impression I got as well.)

In article <1990Nov16.145542.22326@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>You can fake it by making one key the primary key and storing the actual data
>in its record.  The secondary keys point to the primary key.  To keep sizes
>down you might want to make the primary key something small and boring like a
>three or four byte serial number.
>
>Logically each key type is in a separate file, but you can combine them into
>one big dbm file using the prefix byte trick.  It's gross, but no grosser
>than what real data bases do every day.

You also need to store the `next serial number', so you have to reserve a key for this as well.  This is indeed rather gross.

A few years ago I thought about this problem for a while, and came up with a different idea based on the same hashing technique.  To explain it I need to describe the idea behind dbm.

The way dbm works is to take each `key' string and turn it into a 32-bit number, something like a CRC-32 or a checksum.  This number is not necessarily unique (two different strings can produce the same 32-bit value), but collisions should be rare if the string-to-number algorithm is chosen carefully.  In any case, the same string always produces the same number.

Now, to store a `datum', we turn its `key' into our 32-bit number.  This number is taken as the block number in a file of blocks (typically 4096 bytes each).  To avoid starting out with 2^32 blocks, we ignore `some' of the bits in this number.  Initially we ignore *all* the bits, storing all key/datum pairs in block 0; as the database grows, we ignore fewer bits.  We can leave a binary decision tree behind as a record of when we stopped ignoring each bit.
This `tree' is simply a string of bits: 0 where we are still ignoring, and 1 where we are not.  Given a key hash `h', we start by saying `ignore all the bits', then we ignore fewer bits as long as bit (h&mask)+mask is set:

	/*
	 * NB: the following is `off the top of my head'.
	 * It may contain errors.
	 */
	hash = calchash(key);
	for (hmask = 0; bit(map, (hash & hmask) + hmask) != 0;
	    hmask = (hmask << 1) | 1)
		/* void */;

Now if the key is anywhere in the database, it is in the block whose number is (hash & hmask).  The datum for that key is stored immediately after the key, so all we have to do is read that block and march through every even-numbered entry, comparing against the key.  If there is a match, return the corresponding odd-numbered entry.

Under this implementation, storing key=value pairs such as

	mimsy=128.8.128.8 mimsy.umd.edu
	mimsy.umd.edu=128.8.128.8 mimsy.umd.edu
	mimsy.cs.umd.edu=128.8.128.8 mimsy.umd.edu

(a name to number-and-official-name mapping, as might be found in a host number database) requires storing `128.8.128.8 mimsy.umd.edu' three times---51 bytes, at 17 bytes each, counting the host number as 4 bytes.  Using John Levine's technique we might instead say

	Nmimsy=17
	Nmimsy.umd.edu=17
	Nmimsy.cs.umd.edu=17
	#17=128.8.128.8 mimsy.umd.edu

so that we can store `128.8.128.8 mimsy.umd.edu' only once, saving 34 bytes (and costing 1+4 per key plus 5 for the intermediate key; we wind up with a net savings of 14 bytes---minus 2 more hidden inside the dbm implementation, used when storing that intermediate key).  But this means that to look up a name we have to say:

	*buf = 'N';
	strcpy(buf + 1, name);
	key.dptr = buf;
	key.dsize = strlen(buf);
	datum = fetch(key);
	if (datum.dptr != NULL) {
		*buf = '#';
		bcopy(datum.dptr, buf + 1, sizeof(int));
		key.dsize = 1 + sizeof(int);
		datum = fetch(key);
		if (datum.dptr == NULL)
			panic("lost target");
		printf("officially, %s = %.*s\n", name,
		    datum.dsize, datum.dptr);
	} else
		printf("%s does not exist\n", name);

There *is* another way.
Instead of storing keys and data in pairs, we can store each individually.  In other words, we can dissociate them---put each in its own block.  All we have to do is somehow tie one to the other (otherwise the database is no good!).  The 32-bit hash value is almost good enough, but not quite: two different strings can produce the same hash.  But what if we augment it?

Instead of viewing the database as a series of `key/datum' pairs, suppose we view it as a collection of `items'.  Each item has three attributes:

 A. A number that is unique to this item within its block.  (Blocks are
    typically 4096 bytes or so, so there cannot possibly be more than
    4096 or so items; a 16-bit number suffices.)  I called this an
    `index' for want of a better name; this particular index is the
    item's `in index'.  (`Key' might be a better name in some other
    context :-) .)  This exists so that the item can be a datum.

 B. A `link' record consisting of a 32-bit value and another index.
    This is only used if the item is considered a `key'.  The index
    here is an `out index'; it is zero if the item is not acting as a
    key.  The 32-bit value is an `out hash'.

 C. A `link count' telling how many times this item is a key or datum
    (it can only be a key once).

Now, to store an item, whether as key or datum, we compute its hash and find its block as above.  Then, if the item is a key (has a nonzero `out index') and we are replacing its current content (link target), we `unlink' its current target, store the new datum, and `link' this key to the item produced here.  If the item is not a key and is being made one, it simply acquires an `out index'.  If the item is a datum, we just increment its link count.

To find a key's datum, we `follow the link': the out hash is the hash value for the key's datum.  Just as with a key, this (plus the decision bit tree) tells us exactly which block will contain the key's datum (which in this case is certain to exist).
Since the item index values are unique within each block, once we have the correct block all we need to do is find the item whose `in index' matches the key's `out index'.  Thus, instead of

	Nmimsy=#17
	Nmimsy.umd.edu=#17
	Nmimsy.cs.umd.edu=#17
	#17=128.8.128.8 mimsy.umd.edu

we might have a database that looks like this:

	<block 1>:
		<in=1, out=2, outhash=3ab197c4, links=1, "mimsy">
	<block 3>:
		<in=3, out=2, outhash=3ab197c4, links=1, "mimsy.umd.edu">
		<in=4, out=2, outhash=3ab197c4, links=1, "mimsy.cs.umd.edu">
	<block 4>:
		<in=2, out=0, outhash=0, links=3, "128.8.128.8 mimsy.umd.edu">

More interestingly, if we create a new database and stick

	hello=hello

into it, we get:

	<block 0>:
		<in=1, out=1, outhash=195e094c, links=2, "hello">

This key points to itself!  There is no problem with circular references, because a key does not point to another `key', but rather to an `item'.  This item has two references (itself as a key, and itself as a datum), not infinitely many.  Storing

	hello=world

causes the count to go to 1 (when the as-datum link vanishes); removing `hello' as a key then makes the count go to 0.

The main problem with this scheme is that, as with John Levine's, it takes two database accesses to locate an item (though this time the work is hidden from the programmer).  There is also a problem with the index values.  As each block fills and splits, a new block is created with some index values `used up', and it can become difficult to decide on the next `available in index'.  I got around this by recording minimum and maximum `in use' index numbers in each block header, so that typically simply taking the next number suffices.  When the min and max values reach 1 and 65535 (actually USHRT_MAX), however, it is necessary to scan the block for a free number.

Also, this technique imposes a fair amount of overhead: 12 bytes per item (for a size offset, link count, two index values, and the out hash) versus 2 bytes per item in dbm/ndbm (for the size offset).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris