rodney@sun.ipl.rpi.edu (Rodney Peck II) (02/03/91)
Since I got the perl book, I've been playing with some ideas that I have had for a few months. One of them was to see how perl would do at generating and manipulating inverted indices. A small group of us have been working on this idea (in C) for some time now. If you are interested in searching large free text databases quickly, you might want to drop a line to "para-request@cs.cmu.edu" to get added to the list. If you aren't interested in searching large files, or in how this could be done better in perl, go to the next article now. There's a shar file in here. ;-) The meat of the question for the perl experts is: how can I make the indexing routine faster? Basically, it does a sysread to get a big chunk of text, then splits it into words with split(/\b/). Then, the words are put into the dbm array and their position in the file is remembered by packing the offset into a "L" and appending it to the string of other offsets. When the string gets too long and there is a danger of overruning the ndbm limit, it makes a new key by prepending a number. Words are guarenteed not to start with numbers, so this works out well. Printing out the context around a word is simple. You just get the string that is the "L" packed list of pointers into the file, stick them together, unpack it, and fseek to the positions. (most of this fseeking stuff is not my idea originally, the current para browser/ indexer uses this method as well. The perl part and ndbm part is mine though.) In addition to making the indexing as fast as possible, I'm interested in using the ndbm file as efficiently as possible. I don't see why it should be twice as long as the input file. Maybe there's a better way to save this info in ndbm records? btw: my Programming Perl book is officially worn and nearly dogeared after two days of being in my hands. Thanks again guys! ok. here's the shar... unshar it, find a nice file to index, and say "index file" then say "findw file words...." ------- 8< --------- cut here -------------- 8< -------------------- #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of shell archive." # Contents: findw index # Wrapped by rodney@sun.ipl.rpi.edu on Sun Feb 3 03:01:47 1991 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'findw' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'findw'\" else echo shar: Extracting \"'findw'\" \(1733 characters\) sed "s/^X//" >'findw' <<'END_OF_FILE' X#!/usr/local/bin/perl X'di'; X'ig00'; X# X# $Header: /h1/rodney/para/RCS/findw,v 1.1 91/02/03 02:59:32 rodney Net $ X# X# $Log: findw,v $ X# Revision 1.1 91/02/03 02:59:32 rodney X# Initial revision X# X X$infile = shift(@ARGV); X Xdie "usage: $0 file words...\n" if ($infile eq ''); X X# open the file Xdbmopen(%word,$infile,undef) || die "$0: couldn't open $infile index: $!\n"; Xopen (TEXT,$infile) || die "$0: couldn't open $infile: $!"; X Xwhile ($_ = shift (@ARGV)) X{ X &printword ($_); X} X Xsub printword { X local($keyw) = @_; X $keynum = ''; X $str = ''; X while ($_ = $word{$keynum++.$keyw}) X { X $str .= $_; X } X @offsets = unpack ("L*",$str); X print $keyw," occurs ",1+$#offsets," times.\n----\n"; X X for (@offsets) X { X seek(TEXT,$_-30,0); X sysread (TEXT,$_,60); X tr/\n\t/ /; X printf ("%7d: ",$_-30); X print $_,"\n"; X } X print "----\n\n"; X} X X############################################################### X X # These next few lines are legal in both Perl and nroff. X X.00; # finish .ig X X'di \" finish diversion--previous line must be blank X.nr nl 0-1 \" fake up transition to first page again X.nr % 0 \" start at page 1 X'; __END__ ##### From here on it's a standard manual page ##### X X.TH FINDW 1 "February 3, 1991" X.AT 3 X.SH NAME Xfindw \- find word in inverted index X.SH SYNOPSIS X.B findw X.I index words... X.SH DESCRIPTION X.I Findw XPrints context surrounding words from the text file index. XThis text file must have been indexed with the index program. X.SH ENVIRONMENT XNo environment variables are used. X.SH FILES Xindex.dir, index.pag the inverted index data X Xindex the straight text X.SH AUTHOR XRodney Peck II X.SH "SEE ALSO" Xindex X.SH DIAGNOSTICS X X.SH BUGS X END_OF_FILE if test 1733 -ne `wc -c <'findw'`; then echo shar: \"'findw'\" unpacked with wrong size! fi chmod +x 'findw' # end of 'findw' fi if test -f 'index' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'index'\" else echo shar: Extracting \"'index'\" \(2342 characters\) sed "s/^X//" >'index' <<'END_OF_FILE' X#!/usr/local/bin/perl X'di'; X'ig00'; X# X# $Header: /h1/rodney/para/RCS/index,v 1.1 91/02/03 02:59:38 rodney Net $ X# X# $Log: index,v $ X# Revision 1.1 91/02/03 02:59:38 rodney X# Initial revision X# X X# inverted tree indexer in perl X Xwhile ($_ = shift(@ARGV)) X{ X &buildindex($_); X} X X# basic plan. read a chunk, split into words. remember the byte position X# of each word. words are held in an assoc array which is a dbm file. X# positions of words could get tricky since there are only 4096 bytes per X# dbm record. If we start to use up a block, create a new key with a X# number in front of the word. Increment that number and keep moving. X Xsub buildindex { Xlocal($infile) = @_; Xunlink "$infile.dir", "$infile.pag"; X# open the file Xdbmopen(%word,$infile,0666); X Xopen (TEXT,$infile) || die "$0: couldn't open $infile: $!"; Xwhile ($len = sysread(TEXT, $_, 500000)) X{ X if (!defined $len) { X next if $! =~ /^Interrupted/; X die "System read error: $!\n"; X } X X tr/A-Z/a-z/; X @words = split(/\b/); X for (@words) X { X if (/^[a-z]\w*$/) X { X # the record might be filling up... X $k = $rec{$_}; X if (length $word{$k.$_} > 200) X { X $rec{$_} = ++$k; X } X $word{$k.$_} .= pack("L",$offset); X } X $offset += length $_; X } X dbmclose(%word); X close(TEXT); X} X}############################################################### X X # These next few lines are legal in both Perl and nroff. X X.00; # finish .ig X X'di \" finish diversion--previous line must be blank X.nr nl 0-1 \" fake up transition to first page again X.nr % 0 \" start at page 1 X'; __END__ ##### From here on it's a standard manual page ##### X X.TH INDEX 1 "February 3, 1991" X.AT 3 X.SH NAME Xindex \- generates an inverted index of a file X.SH SYNOPSIS X.B index file X.SH DESCRIPTION X.I Index Xmakes file.dir and file.pag which are the inverted index for file. That is, Xthere is a pointer to every occurance of every word in the file. This file Xcan be searched and displayed with the accompanying script findw. X.SH ENVIRONMENT XNo environment variables are used. X.SH FILES Xfile.dir, file.pag generated by this program X Xfile the text X.SH AUTHOR XRodney Peck II X.SH "SEE ALSO" Xfindw X.SH DIAGNOSTICS X X.SH BUGS X X.SH WARNINGS XThe index is typically about twice as big as the input file. But it's Xworth it. X END_OF_FILE if test 2342 -ne `wc -c <'index'`; then echo shar: \"'index'\" unpacked with wrong size! fi chmod +x 'index' # end of 'index' fi echo shar: End of shell archive. exit 0 ------------------- cut here ---------------------------------- -- Rodney