paths a _binary_ database

tkevans@fallst.UUCP (Tim Evans) (12/31/88)

Please forgive me if this has been discussed before.  :-)

The 'paths' database produced by 'pathalias' has grown to nearly
900K bytes in size.  Granted that I'm not real well connected,
needing 4 hops to get to a backbone, but I've started wondering 
whether the 'paths' database couldn't be a _binary_ file of some
sort so that searches of it might go more quickly.

I'm not very proficient at C, but it seems that since 'paths' is
essentially a two-field set of records it ought not be terribly
hard to make 'pathalias' generate a binary database file.

Can anyone offer any comments on this idea?  (One problem that
comes to my mind right off is that my 'paths' database is produced
for me by a neighbor machine's sysadmin and periodically UUCP'd
to me:  would one machine's binary 'paths' file be usable on 
another?)

Thanks

-- 
UUCP:  ...!{rutgers|ames|uunet}!mimsy!aplcen!wb3ffv!fallst!tkevans
INTERNET:  tkevans%fallst@wb3ffv.ampr.org
OTHER: ...!attmail!fallst!tkevans
Tim Evans  2201 Brookhaven Court, Fallston, MD  21047   (301) 965-3286

erik@mpx2.UUCP (Erik Murrey) (01/02/89)

In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
>Please forgive me if this has been discussed before.  :-)
>
>The 'paths' database produced by 'pathalias' has grown to nearly
>900K bytes in size.  Granted that I'm not real well connected,

I think that a lot of people have forgotten about a neat little
program called "pathprune", posted a little while back.  It
essentially strips out the un-needed paths from your pathalias output.

For instance, pathalias might produce these two paths for one of my
neighbors:

.csee.lehigh.edu		lehi3b15!%s	500
lehi3b15.csee.lehigh.edu	lehi3b15!%s	500

These two paths are redundant, and pathprune would remove the second.
The resulting paths file, on my system, shrinks about 20-30%.

"pathprune" was posted to net.sources (now comp.unix.sources).  Here
is part of the header:

	From: matt@ncr-sd.SanDiego.NCR.COM (Matt Costello)
	Newsgroups: net.sources
	Subject: Pathprune (to shorten pathalias files)
	Date: 7 Mar 87 01:49:31 GMT

Hope this helps...

-- 
Erik Murrey                            /|   //  /~~~~/  |  /
MPX Data Systems, Inc.                / | / /  /____/   |/
erik@mpx2.UUCP                       /  /  /  /        /|  Data Systems, Inc. 
{spl1,vu-vlsi,bpa}!mpx1!erik        /     /  /       /  |====================

david@ms.uky.edu (David Herron -- One of the vertebrae) (01/03/89)

In article <318@mpx2.UUCP> erik@mpx2.UUCP (Erik Murrey) writes:
>In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
>>Please forgive me if this has been discussed before.  :-)
>>
>>The 'paths' database produced by 'pathalias' has grown to nearly
>>900K bytes in size.  Granted that I'm not real well connected,
>
>I think that a lot of people have forgotten about a neat little
>program called "pathprune", posted a little while back.  It
>essentially strips out the un-needed paths from your pathalias output.

well ok ...

but what this guy really wants, I think, is to put the paths file
into a dbm file.  Then when smail wants to find a path it only has
to open up the database and do a few quick things to find the path.
(dbm is fast)  It could even be fixed up to heuristically do a
few different lookups, for instance if it's got a full domain name
and the lookup fails it could repeatadly chop off hunks from the LHS
until it found a matching gateway.

Building a dbm file is real simple ...

oh, and one of the nifty things in B-News v3.0 is a public domain
implementation of dbm so the old complaint that SysV doesn't have
dbm doesn't wash anymore.  Yay!  (I just looked through the code
for the beta version last night and really like what I saw).

-- 
<-- David Herron; an MMDF guy                              <david@ms.uky.edu>
<-- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<-- Now I know how Zonker felt when he graduated ...
<--          Stop!  Wait!  I didn't mean to!

daveb@gonzo.UUCP (Dave Brower) (01/04/89)

In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
>I'm not very proficient at C, but it seems that since 'paths' is
>essentially a two-field set of records it ought not be terribly
>hard to make 'pathalias' generate a binary database file.
>

The first question is "what you would gain?"  The only obvious overhead
in the existing format are the \n separator between records and a one
character field separator.

A binary format that was significantly smaller would also be much more
cpu and disk i/o intensive to access.  The existing format, using "look"
usually takes < 10 disk reads to locate a path or give up.

Straight compress(1)-ion would be incredcibly slow, since you'd need to
uncompress 1/2 the file on average to find a path.

Other schemes are possible, but most seem to me to bne performance
killers too.

-dB

-- 
If life was like the movies, the music would match the picture.

{sun,mtxinu,hoptoad}!rtech!gonzo!daveb		daveb@gonzo.uucp

soley@ontenv.UUCP (Norman S. Soley) (01/04/89)

In article <318@mpx2.UUCP>, erik@mpx2.UUCP (Erik Murrey) writes:
> 
> I think that a lot of people have forgotten about a neat little
> program called "pathprune", posted a little while back.  It
> essentially strips out the un-needed paths from your pathalias output.
> 
> For instance, pathalias might produce these two paths for one of my
> neighbors:
> 
> .csee.lehigh.edu		lehi3b15!%s	500
> lehi3b15.csee.lehigh.edu	lehi3b15!%s	500
> 
> These two paths are redundant, and pathprune would remove the second.
> The resulting paths file, on my system, shrinks about 20-30%.

I have not seen this program but if you plan on putting it in make
sure first that it behaves properly, the elimination of the second
line in the example above is OK but consider the following example 
I cooked up:

My uucp neighbour is ncrcan, they also have a domain name
(ncrcan.toronto.ncr.com) but for the sake of arguement lets imagine
that there are no regional subdomains so it's actually ncrcan.ncr.com 
(lots of large orginzations are like that, AT&T for example) my paths 
file would look like: 

.ncr.com	ontmoh!mnetor!uunet!ncrlnk!%s	195
ncrcan.ncr.com	ncrcan!%s	195

If I just blindly strip out the second line without comparing the
second fields for an exact match first then any mail I send to 
user@ncrcan.ncr.com gets sent all the way to Dayton then back to 
Toronto again, hardly a good idea, especially considering that someone
else is footing the bill for that little trip to Ohio. Sure I can mail 
to ncrcan!user but this usually requires manual intervention.
-- 
Norman Soley - Data Communications Analyst - Ontario Ministry of the Environment
UUCP:	uunet!attcan!lsuc!ncrcan!ontenv!soley	VOICE:	+1 416 323 2623
OR:     soley@ontenv.UUCP 
  " Stay smart, go cool, be happy, it's the only way to get what you want"

ulmo@ssyx.ucsc.edu (Brad Allen) (01/05/89)

>Other schemes are possible, but most seem to me to bne performance
>killers too.

RANDOM MUMBLING MODE:

   19643   39286  753869 paths.txt
# of !'s:  51132
		  261207 paths.txt.Z (compact and pack did 40%)

2 bytes for a unique # (less using Hoffman coding techniques)
6 bytes for sitename (if you compress this a little ... comparing is
		      the same cost after 1 quick compression)
8 bytes for its hops

16 bytes total per entry ... oh, that's 320000 bytes.

Amazing.  Not much of an advantage.  That's not even counting the search into
the filename.  Well, take some of this back.  If on every sitename you gave
diff from last sitename, this figure might drop to 14 bytes?  10 bytes with
path compression?  Then the indexes would make it a lot.  So, as a rough
estimate, pressing really hard, we could get it down to under 200000 bytes.

Time to replace UUCP with SLIP ...

erik@mpx2.UUCP (Erik Murrey) (01/06/89)

In article <334@ontenv.UUCP> soley@ontenv.UUCP (Norman S. Soley) writes:
>I have not seen this program but if you plan on putting it in make
>sure first that it behaves properly, the elimination of the second
>line in the example above is OK but consider the following example 
>I cooked up:
>
>My uucp neighbour is ncrcan, they also have a domain name
>(ncrcan.toronto.ncr.com) but for the sake of arguement lets imagine
>that there are no regional subdomains so it's actually ncrcan.ncr.com 
>(lots of large orginzations are like that, AT&T for example) my paths 
>file would look like: 
>
>.ncr.com	ontmoh!mnetor!uunet!ncrlnk!%s	195
>ncrcan.ncr.com	ncrcan!%s	195

No, pathprune would not strip this out.  It will only strip out machines
if their domain gateway has the same leading path.  If you are interested,
I'll send you a shar of the whole thing, or you can pick it up from
your nearest net.sources archive.

Here is an excerpt from the man page:
...
          This can more easily explained by example.  Assume the input
          of:
               .com           ncr-sd!scubed!seismo!%s
               .ncr.com       ncr-sd!%s
               .sandiego.ncr.com   ncr-sd!%s
               .other              ncr-sd!%s
               .uucp               ncr-sd!%s
               falstaf.sandiego.ncr.com ncr-sd!falstaf!%s
               ncr-sd.sandiego.ncr.com  ncr-sd!%s
               se-sd.sandiego.ncr.com   %s
               tower5.sandiego.ncr.com  tower5!%s
          After processing, the output will be:
               .com           ncr-sd!scubed!seismo!%s
               .ncr.com       ncr-sd!%s
               .other              ncr-sd!%s
               .uucp               ncr-sd!%s
               se-sd.sandiego.ncr.com   %s
               tower5.sandiego.ncr.com  tower5!%s
          The domain gateway  .sandiego.ncr.com  was  removed  because
          .ncr.com  will  get  us to the same place.  If the -t option
          had been specified the .com and .ncr.com gateways would also
          have  been  removed.  All hosts that lie beyond the .ncr.com
          (or .other for  -t)  gateway  were  also  removed  as  being
          unnecessary.
---




-- 
Erik Murrey                            /|   //  /~~~~/  |  /
MPX Data Systems, Inc.                / | / /  /____/   |/
erik@mpx2.UUCP                       /  /  /  /        /|  Data Systems, Inc. 
{spl1,vu-vlsi,bpa}!mpx1!erik        /     /  /       /  |====================

bill@twwells.uucp (T. William Wells) (01/08/89)

In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes:
: In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
: >I'm not very proficient at C, but it seems that since 'paths' is
: >essentially a two-field set of records it ought not be terribly
: >hard to make 'pathalias' generate a binary database file.
:
: The first question is "what you would gain?"

The answer is:

I designed a compression method just for the paths database. While I
haven't actually implemented it, my back-of-the-envelope estimate of
the compressed size of the database is 175K, as compared to the 873K
it currently takes (on my system).

: A binary format that was significantly smaller would also be much more
: cpu and disk i/o intensive to access.

Only if one isn't very clever about it. Here:

There are two databases. The first database is an index of
addresses/location in the second database. The second database
contains entries of: link to predecessor/element of route.

You use it as follows: look up the address in the first database.
This gives you a location in the second database. Decode the element
of the route; that becomes the last element of the route to that
address. Follow the predecessor link and decode the route element;
that becomes the second-to-last element of the route. Repeat this
till the predecessor link points to itself.

You also do one other trick: you have two kinds of predecssor links,
one that points within the current block and a longer one that points
to other blocks. Of course, you arrange things so that most
predecessor pointers point within the current block.

Oh yes, the names are Huffmann coded before being put into the
database. (Each name separately.)

Disk accesses? The first database is about 80K; binary searching it
with a .5K block size would require about 6 disk accesses.
(Actually, there are better ways; consider using the Huffmann code as
a hash.  Another way uses the first bits of the Huffmann code to index
a table of pointers into the first database, each bin of the first
database contains only the rest of the code to represent the address.
Either should work well.) The second database gets one read per
element of a path; however, several of the reads come from the same
block. Assuming an average path length of 9 and 3 levels of the tree
per block, this is an additional 3 accesses. The total is 9 accesses.
(If you use a hash or something similar, this is likely to be about
5.)

For the existing database, the number of accesses is 11.

CPU time? The original address needs to be Huffmann coded; each
element of the path needs to be decoded. The links need to be
bit-extracted from the database. These are all trivial, in that about
3 path decompressions + link extractions need to be done per disk
access.

---
Bill
{ uunet!proxftl | novavax } !twwells!bill

honey@mailrus.cc.umich.edu (peter honeyman) (01/09/89)

you're trading cpu time for disk space.  i think it's a lousy deal.

	peter

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (01/09/89)

In article <307@twwells.uucp> bill@twwells.UUCP (T. William Wells) writes:
>In article <490@gonzo.UUCP> daveb@gonzo.UUCP (Dave Brower) writes:
>: In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
>: >I'm not very proficient at C, but it seems that since 'paths' is
>: >essentially a two-field set of records it ought not be terribly
>: >hard to make 'pathalias' generate a binary database file.
>:
>: The first question is "what you would gain?"
>
>The answer is:
>
>I designed a compression method just for the paths database. While I
>haven't actually implemented it, my back-of-the-envelope estimate of
>the compressed size of the database is 175K, as compared to the 873K
>it currently takes (on my system).
>
>: A binary format that was significantly smaller would also be much more
>: cpu and disk i/o intensive to access.
>

Actually there is a simple hack to pathalias and smail type programs which
can reduce the size of the paths file substantially, leaving it in ascii.

The tradeoff is access time vs. size.

Specifically the paths file consists of lines like:

	a 	e!d!c!b!a!%s
	b	e!d!c!b!%s
	c	e!d!c!%s
	d	e!d!%s
	e	%s

This can be replaced by a recursive definition like:

	a	b!a!%s
	b	c!b!%s
	c	d!%!s
	d	e!%!s
	e	%s

To produce this file a small change to pathalias is required to simply
trim all prefix's from the beginning of the path (except for one hop).

To use the file modify the client program to recursively search until a
terminal node is reached (in this case node e).

For example mail to a!aperson:

	b!a!%s	->	a!aperson and search for b
	c!b!%s	->	b!a!aperson and search for c
	d!c!%s	->	c!b!a!aperson and search for d
	e!d!%s	->	d!c!b!a!aperson

This method is quite suitable for a small site with limited disk resources
that doesn't forward a great quantity of mail.

I tried this about a year and a half back. It worked fine. Saved about 50%.
Took me a couple of hours. Probably take a competant person somewhat less.

The mod's to the client program can be done so that it will work with
either the original or modified format.

-- 
Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl     Vancouver,BC,604-937-7532

bill@twwells.uucp (T. William Wells) (01/09/89)

In article <860@mailrus.cc.umich.edu> honey@citi.umich.edu (Peter Honeyman) writes:
: you're trading cpu time for disk space.  i think it's a lousy deal.

I doubt that that is true. Since the thing would reference the disk
less often, thus doing fewer block copies (not to mention kernel traps
and disk reads), it is likely to use less CPU time.  Also, the
existing code has to do a fair amount of data-skipping which may well
be as costly as the decoding of my database format.

This, however, is a matter for experiment.

---
Bill
{ uunet!proxftl | novavax } !twwells!bill

jim@tiamat.FSC.COM (Jim O'Connor) (01/09/89)

In article <2126@van-bc.UUCP>, sl@van-bc.UUCP (pri=-10 Stuart Lynne) writes:
> 
> Actually there is a simple hack to pathalias and smail type programs which
> can reduce the size of the paths file substantially, leaving it in ascii.
> 
> The tradeoff is access time vs. size.
> 
  [ discussion of recursive path search deleted ]
> 
> This method is quite suitable for a small site with limited disk resources
> that doesn't forward a great quantity of mail.

Yes, it does satisfy those conditions, but
IMHO, would it not be easier for a small site as described above to use a
paths file that has only neighbor links, and a "smart-host" entry to your
nearest friendly "bigger" machine, who can do the full path look-up for you.
Granted, you may not always get the "best" path this way, but your mail
should get delivered, and you should be able to keep your paths file small.

--jim
------------- 
James B. O'Connor			jim@FSC.COM
Filtration Sciences Corporation		615/821-4022 x. 651

frank@Morgan.COM (Frank Wortner) (01/09/89)

I beg to differ with Peter Honeyman:  trading CPU time for disk space can be
a good deal.  I've found that a workstation network often has lots of wasted
CPU power and too little disk space.

To each his own.
-- 
						Frank

"Computers are mistake amplifiers."

stuart@bms-at.UUCP (Stuart Gathman) (01/11/89)

In article <490@gonzo.UUCP>, daveb@gonzo.UUCP (Dave Brower) writes:
> In article <486@fallst.UUCP> tkevans@fallst.UUCP (Tim Evans) writes:
> >I'm not very proficient at C, but it seems that since 'paths' is
> >essentially a two-field set of records it ought not be terribly
> >hard to make 'pathalias' generate a binary database file.

> The first question is "what you would gain?"  The only obvious overhead
> in the existing format are the \n separator between records and a one
> character field separator.

Use a (good) btree (not isam) with leading duplicate compression and
variable length records and keys.  Use only the unique portion of the
keys.  Store the path section with leading duplicate
compression compared to the previous record in key sequence.  (This works
well on our system, a more complicated method might be necessary for sites
with a large number of connections.)  This cuts the size in half, reduces
seeks to three, and takes less CPU to boot.  (Using our propietary Btree
system; the unique key & leading dup compression are done automatically.  
Sorry, no source. :-( )

A system for sites with large numbers of nodes might assign a binary number
to neighboring site names (or the first few tree levels) for storing
the paths.
-- 
Stuart D. Gathman	<stuart@bms-at.uucp>
			<..!{vrdxhq|daitc}!bms-at!stuart>

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (01/11/89)

In article <208@tiamat.FSC.COM> jim@tiamat.FSC.COM (Jim O'Connor) writes:
>In article <2126@van-bc.UUCP>, sl@van-bc.UUCP (pri=-10 Stuart Lynne) writes:
>> 
>> This method is quite suitable for a small site with limited disk resources
>> that doesn't forward a great quantity of mail.
>
>Yes, it does satisfy those conditions, but
>IMHO, would it not be easier for a small site as described above to use a
>paths file that has only neighbor links, and a "smart-host" entry to your
>nearest friendly "bigger" machine, who can do the full path look-up for you.
>Granted, you may not always get the "best" path this way, but your mail
>should get delivered, and you should be able to keep your paths file small.

I agree totally. But in some cases a smaller site might want to do their own
thing and be able to route. If you arn't passing a large amount of mail (ie
say less than 100 messages per day) and the system is not otherwise fully
loaded the tradeoff off cpu cycles for disk space is very valid one.

Extending your arguement would mean that only a few large central sites like
uunet would need to run full pathalias files. But that would run them into
the ground trying to do everybody's routing.

My point was and is that this is a simply hack that maintains a simple ascii
database, but is much smaller than the original; and allows a system
administrator to save 300 to 500 kb of disk space by consuming some cpu
cycles.

It isn't for everyone; I don't do it here, I have enough disk storage and a
distinct lack of cpu cycles on my poor little 68010.

-- 
Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl     Vancouver,BC,604-937-7532