tkevans@fallst.UUCP (Tim Evans) (12/31/88)
Please forgive me if this has been discussed before. :-) The 'paths' database produced by 'pathalias' has grown to nearly 900K bytes in size. Granted that I'm not real well connected, needing 4 hops to get to a backbone, but I've started wondering whether the 'paths' database couldn't be a _binary_ file of some sort so that searches of it might go more quickly. I'm not very proficient at C, but it seems that since 'paths' is essentially a two-field set of records it ought not be terribly hard to make 'pathalias' generate a binary database file. Can anyone offer any comments on this idea? (One problem that comes to my mind right off is that my 'paths' database is produced for me by a neighbor machine's sysadmin and periodically UUCP'd to me: would one machine's binary 'paths' file be usable on another?) Thanks -- UUCP: ...!{rutgers|ames|uunet}!mimsy!aplcen!wb3ffv!fallst!tkevans INTERNET: tkevans%fallst@wb3ffv.ampr.org OTHER: ...!attmail!fallst!tkevans Tim Evans 2201 Brookhaven Court, Fallston, MD 21047 (301) 965-3286
erik@mpx2.UUCP (Erik Murrey) (01/02/89)
In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: >Please forgive me if this has been discussed before. :-) > >The 'paths' database produced by 'pathalias' has grown to nearly >900K bytes in size. Granted that I'm not real well connected, I think that a lot of people have forgotten about a neat little program called "pathprune", posted a little while back. It essentially strips out the un-needed paths from your pathalias output. For instance, pathalias might produce these two paths for one of my neighbors: .csee.lehigh.edu lehi3b15!%s 500 lehi3b15.csee.lehigh.edu lehi3b15!%s 500 These two paths are redundant, and pathprune would remove the second. The resulting paths file, on my system, shrinks about 20-30%. "pathprune" was posted to net.sources (now comp.unix.sources). Here is part of the header: From: matt@ncr-sd.SanDiego.NCR.COM (Matt Costello) Newsgroups: net.sources Subject: Pathprune (to shorten pathalias files) Date: 7 Mar 87 01:49:31 GMT Hope this helps... -- Erik Murrey /| // /~~~~/ | / MPX Data Systems, Inc. / | / / /____/ |/ erik@mpx2.UUCP / / / / /| Data Systems, Inc. {spl1,vu-vlsi,bpa}!mpx1!erik / / / / |====================
david@ms.uky.edu (David Herron -- One of the vertebrae) (01/03/89)
In article <318@mpx2.UUCP> erik@mpx2.UUCP (Erik Murrey) writes: >In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: >>Please forgive me if this has been discussed before. :-) >> >>The 'paths' database produced by 'pathalias' has grown to nearly >>900K bytes in size. Granted that I'm not real well connected, > >I think that a lot of people have forgotten about a neat little >program called "pathprune", posted a little while back. It >essentially strips out the un-needed paths from your pathalias output. well ok ... but what this guy really wants, I think, is to put the paths file into a dbm file. Then when smail wants to find a path it only has to open up the database and do a few quick things to find the path. (dbm is fast) It could even be fixed up to heuristically do a few different lookups, for instance if it's got a full domain name and the lookup fails it could repeatadly chop off hunks from the LHS until it found a matching gateway. Building a dbm file is real simple ... oh, and one of the nifty things in B-News v3.0 is a public domain implementation of dbm so the old complaint that SysV doesn't have dbm doesn't wash anymore. Yay! (I just looked through the code for the beta version last night and really like what I saw). -- <-- David Herron; an MMDF guy <david@ms.uky.edu> <-- ska: David le casse\*' {rutgers,uunet}!ukma!david, david@UKMA.BITNET <-- Now I know how Zonker felt when he graduated ... <-- Stop! Wait! I didn't mean to!
daveb@gonzo.UUCP (Dave Brower) (01/04/89)
In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: >I'm not very proficient at C, but it seems that since 'paths' is >essentially a two-field set of records it ought not be terribly >hard to make 'pathalias' generate a binary database file. > The first question is "what you would gain?" The only obvious overhead in the existing format are the \n separator between records and a one character field separator. A binary format that was significantly smaller would also be much more cpu and disk i/o intensive to access. The existing format, using "look" usually takes < 10 disk reads to locate a path or give up. Straight compress(1)-ion would be incredcibly slow, since you'd need to uncompress 1/2 the file on average to find a path. Other schemes are possible, but most seem to me to bne performance killers too. -dB -- If life was like the movies, the music would match the picture. {sun,mtxinu,hoptoad}!rtech!gonzo!daveb daveb@gonzo.uucp
soley@ontenv.UUCP (Norman S. Soley) (01/04/89)
In article <318@mpx2.UUCP>, erik@mpx2.UUCP (Erik Murrey) writes: > > I think that a lot of people have forgotten about a neat little > program called "pathprune", posted a little while back. It > essentially strips out the un-needed paths from your pathalias output. > > For instance, pathalias might produce these two paths for one of my > neighbors: > > .csee.lehigh.edu lehi3b15!%s 500 > lehi3b15.csee.lehigh.edu lehi3b15!%s 500 > > These two paths are redundant, and pathprune would remove the second. > The resulting paths file, on my system, shrinks about 20-30%. I have not seen this program but if you plan on putting it in make sure first that it behaves properly, the elimination of the second line in the example above is OK but consider the following example I cooked up: My uucp neighbour is ncrcan, they also have a domain name (ncrcan.toronto.ncr.com) but for the sake of arguement lets imagine that there are no regional subdomains so it's actually ncrcan.ncr.com (lots of large orginzations are like that, AT&T for example) my paths file would look like: .ncr.com ontmoh!mnetor!uunet!ncrlnk!%s 195 ncrcan.ncr.com ncrcan!%s 195 If I just blindly strip out the second line without comparing the second fields for an exact match first then any mail I send to user@ncrcan.ncr.com gets sent all the way to Dayton then back to Toronto again, hardly a good idea, especially considering that someone else is footing the bill for that little trip to Ohio. Sure I can mail to ncrcan!user but this usually requires manual intervention. -- Norman Soley - Data Communications Analyst - Ontario Ministry of the Environment UUCP: uunet!attcan!lsuc!ncrcan!ontenv!soley VOICE: +1 416 323 2623 OR: soley@ontenv.UUCP " Stay smart, go cool, be happy, it's the only way to get what you want"
ulmo@ssyx.ucsc.edu (Brad Allen) (01/05/89)
>Other schemes are possible, but most seem to me to bne performance >killers too. RANDOM MUMBLING MODE: 19643 39286 753869 paths.txt # of !'s: 51132 261207 paths.txt.Z (compact and pack did 40%) 2 bytes for a unique # (less using Hoffman coding techniques) 6 bytes for sitename (if you compress this a little ... comparing is the same cost after 1 quick compression) 8 bytes for its hops 16 bytes total per entry ... oh, that's 320000 bytes. Amazing. Not much of an advantage. That's not even counting the search into the filename. Well, take some of this back. If on every sitename you gave diff from last sitename, this figure might drop to 14 bytes? 10 bytes with path compression? Then the indexes would make it a lot. So, as a rough estimate, pressing really hard, we could get it down to under 200000 bytes. Time to replace UUCP with SLIP ...
erik@mpx2.UUCP (Erik Murrey) (01/06/89)
In article <334@ontenv.UUCP> soley@ontenv.UUCP (Norman S. Soley) writes: >I have not seen this program but if you plan on putting it in make >sure first that it behaves properly, the elimination of the second >line in the example above is OK but consider the following example >I cooked up: > >My uucp neighbour is ncrcan, they also have a domain name >(ncrcan.toronto.ncr.com) but for the sake of arguement lets imagine >that there are no regional subdomains so it's actually ncrcan.ncr.com >(lots of large orginzations are like that, AT&T for example) my paths >file would look like: > >.ncr.com ontmoh!mnetor!uunet!ncrlnk!%s 195 >ncrcan.ncr.com ncrcan!%s 195 No, pathprune would not strip this out. It will only strip out machines if their domain gateway has the same leading path. If you are interested, I'll send you a shar of the whole thing, or you can pick it up from your nearest net.sources archive. Here is an excerpt from the man page: ... This can more easily explained by example. Assume the input of: .com ncr-sd!scubed!seismo!%s .ncr.com ncr-sd!%s .sandiego.ncr.com ncr-sd!%s .other ncr-sd!%s .uucp ncr-sd!%s falstaf.sandiego.ncr.com ncr-sd!falstaf!%s ncr-sd.sandiego.ncr.com ncr-sd!%s se-sd.sandiego.ncr.com %s tower5.sandiego.ncr.com tower5!%s After processing, the output will be: .com ncr-sd!scubed!seismo!%s .ncr.com ncr-sd!%s .other ncr-sd!%s .uucp ncr-sd!%s se-sd.sandiego.ncr.com %s tower5.sandiego.ncr.com tower5!%s The domain gateway .sandiego.ncr.com was removed because .ncr.com will get us to the same place. If the -t option had been specified the .com and .ncr.com gateways would also have been removed. All hosts that lie beyond the .ncr.com (or .other for -t) gateway were also removed as being unnecessary. --- -- Erik Murrey /| // /~~~~/ | / MPX Data Systems, Inc. / | / / /____/ |/ erik@mpx2.UUCP / / / / /| Data Systems, Inc. {spl1,vu-vlsi,bpa}!mpx1!erik / / / / |====================
bill@twwells.uucp (T. William Wells) (01/08/89)
In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes: : In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: : >I'm not very proficient at C, but it seems that since 'paths' is : >essentially a two-field set of records it ought not be terribly : >hard to make 'pathalias' generate a binary database file. : : The first question is "what you would gain?" The answer is: I designed a compression method just for the paths database. While I haven't actually implemented it, my back-of-the-envelope estimate of the compressed size of the database is 175K, as compared to the 873K it currently takes (on my system). : A binary format that was significantly smaller would also be much more : cpu and disk i/o intensive to access. Only if one isn't very clever about it. Here: There are two databases. The first database is an index of addresses/location in the second database. The second database contains entries of: link to predecessor/element of route. You use it as follows: look up the address in the first database. This gives you a location in the second database. Decode the element of the route; that becomes the last element of the route to that address. Follow the predecessor link and decode the route element; that becomes the second-to-last element of the route. Repeat this till the predecessor link points to itself. You also do one other trick: you have two kinds of predecssor links, one that points within the current block and a longer one that points to other blocks. Of course, you arrange things so that most predecessor pointers point within the current block. Oh yes, the names are Huffmann coded before being put into the database. (Each name separately.) Disk accesses? The first database is about 80K; binary searching it with a .5K block size would require about 6 disk accesses. (Actually, there are better ways; consider using the Huffmann code as a hash. Another way uses the first bits of the Huffmann code to index a table of pointers into the first database, each bin of the first database contains only the rest of the code to represent the address. Either should work well.) The second database gets one read per element of a path; however, several of the reads come from the same block. Assuming an average path length of 9 and 3 levels of the tree per block, this is an additional 3 accesses. The total is 9 accesses. (If you use a hash or something similar, this is likely to be about 5.) For the existing database, the number of accesses is 11. CPU time? The original address needs to be Huffmann coded; each element of the path needs to be decoded. The links need to be bit-extracted from the database. These are all trivial, in that about 3 path decompressions + link extractions need to be done per disk access. --- Bill { uunet!proxftl | novavax } !twwells!bill
honey@mailrus.cc.umich.edu (peter honeyman) (01/09/89)
you're trading cpu time for disk space. i think it's a lousy deal. peter
sl@van-bc.UUCP (pri=-10 Stuart Lynne) (01/09/89)
In article <307@twwells.uucp> bill@twwells.UUCP (T. William Wells) writes: >In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes: >: In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: >: >I'm not very proficient at C, but it seems that since 'paths' is >: >essentially a two-field set of records it ought not be terribly >: >hard to make 'pathalias' generate a binary database file. >: >: The first question is "what you would gain?" > >The answer is: > >I designed a compression method just for the paths database. While I >haven't actually implemented it, my back-of-the-envelope estimate of >the compressed size of the database is 175K, as compared to the 873K >it currently takes (on my system). > >: A binary format that was significantly smaller would also be much more >: cpu and disk i/o intensive to access. > Actually there is a simple hack to pathalias and smail type programs which can reduce the size of the paths file substantially, leaving it in ascii. The tradeoff is access time vs. size. Specifically the paths file consists of lines like: a e!d!c!b!a!%s b e!d!c!b!%s c e!d!c!%s d e!d!%s e %s This can be replaced by a recursive definition like: a b!a!%s b c!b!%s c d!%!s d e!%!s e %s To produce this file a small change to pathalias is required to simply trim all prefix's from the beginning of the path (except for one hop). To use the file modify the client program to recursively search until a terminal node is reached (in this case node e). For example mail to a!aperson: b!a!%s -> a!aperson and search for b c!b!%s -> b!a!aperson and search for c d!c!%s -> c!b!a!aperson and search for d e!d!%s -> d!c!b!a!aperson This method is quite suitable for a small site with limited disk resources that doesn't forward a great quantity of mail. I tried this about a year and a half back. It worked fine. Saved about 50%. Took me a couple of hours. Probably take a competant person somewhat less. The mod's to the client program can be done so that it will work with either the original or modified format. -- Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl Vancouver,BC,604-937-7532
bill@twwells.uucp (T. William Wells) (01/09/89)
In article <860@mailrus.cc.umich.edu> honey@citi.umich.edu (Peter Honeyman) writes:
: you're trading cpu time for disk space. i think it's a lousy deal.
I doubt that that is true. Since the thing would reference the disk
less often, thus doing fewer block copies (not to mention kernel traps
and disk reads), it is likely to use less CPU time. Also, the
existing code has to do a fair amount of data-skipping which may well
be as costly as the decoding of my database format.
This, however, is a matter for experiment.
---
Bill
{ uunet!proxftl | novavax } !twwells!bill
jim@tiamat.FSC.COM (Jim O'Connor) (01/09/89)
In article <2126@van-bc.UUCP>, sl@van-bc.UUCP (pri=-10 Stuart Lynne) writes: > > Actually there is a simple hack to pathalias and smail type programs which > can reduce the size of the paths file substantially, leaving it in ascii. > > The tradeoff is access time vs. size. > [ discussion of recursive path search deleted ] > > This method is quite suitable for a small site with limited disk resources > that doesn't forward a great quantity of mail. Yes, it does satisfy those conditions, but IMHO, would it not be easier for a small site as described above to use a paths file that has only neighbor links, and a "smart-host" entry to your nearest friendly "bigger" machine, who can do the full path look-up for you. Granted, you may not always get the "best" path this way, but your mail should get delivered, and you should be able to keep your paths file small. --jim ------------- James B. O'Connor jim@FSC.COM Filtration Sciences Corporation 615/821-4022 x. 651
frank@Morgan.COM (Frank Wortner) (01/09/89)
I beg to differ with Peter Honeyman: trading CPU time for disk space can be a good deal. I've found that a workstation network often has lots of wasted CPU power and too little disk space. To each his own. -- Frank "Computers are mistake amplifiers."
stuart@bms-at.UUCP (Stuart Gathman) (01/11/89)
In article <490@gonzo.UUCP>, daveb@gonzo.UUCP (Dave Brower) writes: > In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes: > >I'm not very proficient at C, but it seems that since 'paths' is > >essentially a two-field set of records it ought not be terribly > >hard to make 'pathalias' generate a binary database file. > The first question is "what you would gain?" The only obvious overhead > in the existing format are the \n separator between records and a one > character field separator. Use a (good) btree (not isam) with leading duplicate compression and variable length records and keys. Use only the unique portion of the keys. Store the path section with leading duplicate compression compared to the previous record in key sequence. (This works well on our system, a more complicated method might be necessary for sites with a large number of connections.) This cuts the size in half, reduces seeks to three, and takes less CPU to boot. (Using our propietary Btree system; the unique key & leading dup compression are done automatically. Sorry, no source. :-( ) A system for sites with large numbers of nodes might assign a binary number to neighboring site names (or the first few tree levels) for storing the paths. -- Stuart D. Gathman <stuart@bms-at.uucp> <..!{vrdxhq|daitc}!bms-at!stuart>
sl@van-bc.UUCP (pri=-10 Stuart Lynne) (01/11/89)
In article <208@tiamat.FSC.COM> jim@tiamat.FSC.COM (Jim O'Connor) writes: >In article <2126@van-bc.UUCP>, sl@van-bc.UUCP (pri=-10 Stuart Lynne) writes: >> >> This method is quite suitable for a small site with limited disk resources >> that doesn't forward a great quantity of mail. > >Yes, it does satisfy those conditions, but >IMHO, would it not be easier for a small site as described above to use a >paths file that has only neighbor links, and a "smart-host" entry to your >nearest friendly "bigger" machine, who can do the full path look-up for you. >Granted, you may not always get the "best" path this way, but your mail >should get delivered, and you should be able to keep your paths file small. I agree totally. But in some cases a smaller site might want to do their own thing and be able to route. If you arn't passing a large amount of mail (ie say less than 100 messages per day) and the system is not otherwise fully loaded the tradeoff off cpu cycles for disk space is very valid one. Extending your arguement would mean that only a few large central sites like uunet would need to run full pathalias files. But that would run them into the ground trying to do everybody's routing. My point was and is that this is a simply hack that maintains a simple ascii database, but is much smaller than the original; and allows a system administrator to save 300 to 500 kb of disk space by consuming some cpu cycles. It isn't for everyone; I don't do it here, I have enough disk storage and a distinct lack of cpu cycles on my poor little 68010. -- Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl Vancouver,BC,604-937-7532