[comp.lang.perl] inverted indexes -- help in optimizing sorta LONG

rodney@sun.ipl.rpi.edu (Rodney Peck II) (02/03/91)
Since I got the perl book, I've been playing with some ideas that I
have had for a few months.  One of them was to see how perl would 
do at generating and manipulating inverted indices.  A small group of
us have been working on this idea (in C) for some time now.  If you
are interested in searching large free text databases quickly, you
might want to drop a line to "para-request@cs.cmu.edu" to get added
to the list.

If you aren't interested in searching large files, or in how this 
could be done better in perl, go to the next article now.  There's
a shar file in here. ;-)

The meat of the question for the perl experts is: how can I make the
indexing routine faster?

Basically, it does a sysread to get a big chunk of text, then splits
it into words with split(/\b/).  Then, the words are put into the
dbm array and their position in the file is remembered by packing
the offset into a "L" and appending it to the string of other offsets.
When the string gets too long and there is a danger of overruning the
ndbm limit, it makes a new key by prepending a number.  Words are 
guarenteed not to start with numbers, so this works out well.

Printing out the context around a word is simple.  You just get the
string that is the "L" packed list of pointers into the file, stick
them together, unpack it, and fseek to the positions.  (most of this
fseeking stuff is not my idea originally, the current para browser/
indexer uses this method as well.  The perl part and ndbm part is mine
though.)

In addition to making the indexing as fast as possible, I'm interested
in using the ndbm file as efficiently as possible.  I don't see why
it should be twice as long as the input file.  Maybe there's a better
way to save this info in ndbm records?

btw: my Programming Perl book is officially worn and nearly dogeared
after two days of being in my hands.  Thanks again guys!

ok.  here's the shar...
unshar it, find a nice file to index, and say "index file" then say
"findw file words...."

------- 8< --------- cut here -------------- 8< --------------------
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  findw index
# Wrapped by rodney@sun.ipl.rpi.edu on Sun Feb  3 03:01:47 1991
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'findw' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'findw'\"
else
echo shar: Extracting \"'findw'\" \(1733 characters\)
sed "s/^X//" >'findw' <<'END_OF_FILE'
X#!/usr/local/bin/perl
X'di';
X'ig00';
X#
X# $Header: /h1/rodney/para/RCS/findw,v 1.1 91/02/03 02:59:32 rodney Net $
X#
X# $Log:	findw,v $
X# Revision 1.1  91/02/03  02:59:32  rodney
X# Initial revision
X# 
X
X$infile = shift(@ARGV);
X
Xdie "usage: $0 file words...\n" if ($infile eq '');
X
X# open the file
Xdbmopen(%word,$infile,undef) || die "$0: couldn't open $infile index: $!\n";
Xopen (TEXT,$infile) || die "$0: couldn't open $infile: $!";
X
Xwhile ($_ = shift (@ARGV))
X{
X  &printword ($_);
X}
X
Xsub printword {
X  local($keyw) = @_;
X  $keynum = '';
X  $str = '';
X  while ($_ = $word{$keynum++.$keyw})
X  {
X    $str .= $_;
X  }
X  @offsets = unpack ("L*",$str);
X  print $keyw,"  occurs ",1+$#offsets," times.\n----\n";
X
X  for (@offsets)
X  {
X     seek(TEXT,$_-30,0);
X     sysread (TEXT,$_,60);
X     tr/\n\t/  /;
X     printf ("%7d: ",$_-30);
X     print $_,"\n";
X  }
X  print "----\n\n";
X}
X
X###############################################################
X
X    # These next few lines are legal in both Perl and nroff.
X
X.00;                       # finish .ig
X 
X'di           \" finish diversion--previous line must be blank
X.nr nl 0-1    \" fake up transition to first page again
X.nr % 0         \" start at page 1
X'; __END__ ##### From here on it's a standard manual page #####
X
X.TH FINDW 1 "February 3, 1991"
X.AT 3
X.SH NAME
Xfindw \- find word in inverted index
X.SH SYNOPSIS
X.B findw 
X.I index words...
X.SH DESCRIPTION
X.I Findw
XPrints context surrounding words from the text file index.
XThis text file must have been indexed with the index program.
X.SH ENVIRONMENT
XNo environment variables are used.
X.SH FILES
Xindex.dir, index.pag   the inverted index data
X
Xindex  the straight text
X.SH AUTHOR
XRodney Peck II
X.SH "SEE ALSO"
Xindex
X.SH DIAGNOSTICS
X
X.SH BUGS
X
END_OF_FILE
if test 1733 -ne `wc -c <'findw'`; then
    echo shar: \"'findw'\" unpacked with wrong size!
fi
chmod +x 'findw'
# end of 'findw'
fi
if test -f 'index' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'index'\"
else
echo shar: Extracting \"'index'\" \(2342 characters\)
sed "s/^X//" >'index' <<'END_OF_FILE'
X#!/usr/local/bin/perl
X'di';
X'ig00';
X#
X# $Header: /h1/rodney/para/RCS/index,v 1.1 91/02/03 02:59:38 rodney Net $
X#
X# $Log:	index,v $
X# Revision 1.1  91/02/03  02:59:38  rodney
X# Initial revision
X# 
X
X# inverted tree indexer in perl
X
Xwhile ($_ = shift(@ARGV))
X{
X  &buildindex($_);
X}
X
X# basic plan.  read a chunk, split into words. remember the byte position
X# of each word.  words are held in an assoc array which is a dbm file.
X# positions of words could get tricky since there are only 4096 bytes per
X# dbm record.  If we start to use up a block, create a new key with a 
X# number in front of the word. Increment that number and keep moving.
X
Xsub buildindex {
Xlocal($infile) = @_;
Xunlink "$infile.dir", "$infile.pag";
X# open the file
Xdbmopen(%word,$infile,0666);
X
Xopen (TEXT,$infile) || die "$0: couldn't open $infile: $!";
Xwhile ($len = sysread(TEXT, $_, 500000))
X{
X  if (!defined $len) {
X    next if $! =~ /^Interrupted/;
X    die "System read error: $!\n";
X  }
X
X  tr/A-Z/a-z/;
X  @words = split(/\b/);
X  for (@words) 
X  {
X    if (/^[a-z]\w*$/)
X    {
X					# the record might be filling up...
X      $k = $rec{$_};
X      if (length $word{$k.$_} > 200)
X      {
X	$rec{$_} = ++$k;
X      }
X      $word{$k.$_} .= pack("L",$offset);
X    }
X    $offset += length $_;
X  }
X  dbmclose(%word);
X  close(TEXT);
X}
X}###############################################################
X
X    # These next few lines are legal in both Perl and nroff.
X
X.00;                       # finish .ig
X 
X'di           \" finish diversion--previous line must be blank
X.nr nl 0-1    \" fake up transition to first page again
X.nr % 0         \" start at page 1
X'; __END__ ##### From here on it's a standard manual page #####
X
X.TH INDEX 1 "February 3, 1991"
X.AT 3
X.SH NAME
Xindex \- generates an inverted index of a file
X.SH SYNOPSIS
X.B index file
X.SH DESCRIPTION
X.I Index
Xmakes file.dir and file.pag which are the inverted index for file.  That is,
Xthere is a pointer to every occurance of every word in the file.  This file
Xcan be searched and displayed with the accompanying script findw.
X.SH ENVIRONMENT
XNo environment variables are used.
X.SH FILES
Xfile.dir, file.pag    generated by this program
X
Xfile	the text
X.SH AUTHOR
XRodney Peck II
X.SH "SEE ALSO"
Xfindw
X.SH DIAGNOSTICS
X
X.SH BUGS
X
X.SH WARNINGS
XThe index is typically about twice as big as the input file.  But it's
Xworth it.
X
END_OF_FILE
if test 2342 -ne `wc -c <'index'`; then
    echo shar: \"'index'\" unpacked with wrong size!
fi
chmod +x 'index'
# end of 'index'
fi
echo shar: End of shell archive.
exit 0

------------------- cut here ----------------------------------
-- 
Rodney