[alt.sources] lq-text Full Text Retrieval Database Part 01/13

lee@sq.sq.com (Liam R. E. Quin) (03/04/91)

: cut here --- cut here --
: To unbundle, sh this file
#! /bin/sh
# make the directory structure:
test -d lq-text || mkdir lq-text
test -d lq-text/doc || mkdir lq-text/doc
test -d lq-text/Sample || mkdir lq-text/Sample
test -d lq-text/src || mkdir lq-text/src
test -d lq-text/src/filters || mkdir lq-text/src/filters
test -d lq-text/src/h || mkdir lq-text/src/h
test -d lq-text/src/liblqtext || mkdir lq-text/src/liblqtext
test -d lq-text/src/lqtext || mkdir lq-text/src/lqtext
test -d lq-text/src/menu || mkdir lq-text/src/menu
test -d lq-text/src/test || mkdir lq-text/src/test
test -d lq-text/src/ozmahash || mkdir lq-text/src/ozmahash

echo x - lq-text/README 1>&2
sed 's/^X//' >lq-text/README <<'@@@End of lq-text/README'
XLiam Quin's text retrieval package (lq-text) Sun Mar  3 17:18:26 EST 1991
Xsrc/h/Revision.h defines this as Revision 1.10.
X
Xlq-text is copyright 1990, 1991 Liam R. E. Quin; see src/COPYRIGHT for details.
X
X
XWhat It Does:
X    Lets you search for phrases in text that you previously indexed.
X    The necessary indexing program (lqaddfile) is enclosed.  Indexes are
X    usually less than the size of the data, and sometimes half that.
X    There is a browser (lqtext) for System V, and a shell script (lq) for
X    any Unix system.  There is also a program (lqkwik) that turns the
X    output of lqphrase or "lqword -l" into a keyword-in-context (KWIC) list.
X
XHow to Install It
X    unpack this archive
X    cd lq-text/src
X    edit h/globals.h (following the instructions in there.  Use ozmahash)
X    edit Makefile
X    make depend # If you have mkdep.  If you don't, and you can't get it
X		# -- e.g. from the tahoe BSD distribution -- you'll have
X		# to edit all of the makefiles to delete everything
X		# below the DO NOT DELETE pair of lines (leave the ones
X		# that say "DO NOT DELETE", though).
X    make all # this will put things in src/bin and src/lib
X    make install # This will put things in $BINDIR and $LIBDIR.
X
X    You might want to try
X    make local   # This will put stripped executables in src/bin and src/lib;
X		 # I find this convenient for testing.
X    before doing a make install.
X
X    See below for possible problems.
X
X
XHow to Use It
X    (see doc/*)
X    Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
X    currently empty directory you want to use instead).
X    Put lq-text/src/bin and lq-text/src/lib in your path.
X    Put a README file in $LQTEXTDIR:
X	docpath /my/login/directory:/or/somewhere/else
X	common Common
X    and make an empty file called Common (or include words like "uucp"
X    that you don't want indexed) in the same directory.
X    Find some files (e.g. your mailbox) and say
X	lqaddfile -t2 file [...]
X    You should see some diagnostic output... (this is what -t2 does).
X    lqaddfile may take several minutes to write out its data, depending
X    on the system.  Try a small file first -- you can add more later!
X    Another fun thing to try is setting DOCPATH to /usr/man and running
X	cd /usr/man
X	find man* -type f -print | lqaddfile -t2 -f -
X    to make an index of the manual pages (use cat* instead of man* if you
X    prefer).  If you have less than 10 meg or so of RAM, give lqaddfile the
X    -w100000 option -- this is the number of words to keep in memory before
X    writing to the database.  The idea is that the number should be small
X    enough to prevent frantic paging activity!
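X
X    Put together, a first session might look like the sketch below.  It
X    assumes a Bourne shell, that the lq-text programs are already in your
X    path, and uses /home/lee/text and notes.txt as made-up example names;
X    substitute your own.
X	mkdir $HOME/LQTEXTDIR
X	cd $HOME/LQTEXTDIR
X	echo "docpath /home/lee/text" > README
X	echo "common Common" >> README
X	: > Common		# an empty common-word list
X	cd /home/lee/text
X	lqaddfile -t2 notes.txt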
X
X
X    Now try
X	lqword		---> an unsorted list of all known words
X	lq		---> type phrases and browse through them
X	lqtext		---> curses-based browser, if it compiled.
X
X	lqshow `lqphrase "floppy disk"`   ---> lq does this for you
X	lqkwik `lqphrase "floppy disk"`   ---> this is the most fun.
X
X
X    If the files you are indexing have pathnames with leading bits in
X    common (e.g. indexing a directory such as  /usr/spool/news, or
X    /home/lee/text/humour), make use of DOCPATH.  This is searched
X    linearly, so a dozen or so entries is the practical limit at the
X    moment.
X
X    Every indexed pathname must fit into a dbm page, which is 4KBytes
X    with sdbm but probably much less (e.g. 512) with dbm.  With ozmahash
X    this problem has gone away.
X
X
XKnown Problems
X    lqaddfile runs extraordinarily slowly if the database directory is
X    mounted over a network with NFS.  Run lqaddfile on the NFS server --
X    there's no problem with having the data files on a remote system.
X
X    With this distribution I am including both Ozan Yigit's sdbm package
X    and the BSD hash package written by Ozan Yigit and Margo Seltzer.  The
X    latter is called "ozmahash" here, to avoid confusion with System V hash.
X    Try using ozmahash first, and if that doesn't work use sdbm.  The hash
X    package seems to work on all the systems here, but it might not do so
X    well on System V.  Sdbm has been ported extensively, but is slower.
X
X    If you end up with one or more empty .dir or .pag files in the
X    LQTEXTDIR directory, you probably have a broken sdbm/ndbm/dbm.  Try
X    recompiling with a different dbm package if possible.  In particular,
X    early versions of sdbm had this problem.
X
X    There are some tests, but it is not always
X    clear how to run them.  I intend to make a little test suite...
X    If you get strange error messages, try
X	testbin/dbmtry 5000
X    (this will make and leave behind either one or two files in /tmp).
X    Then try testbin/dbmtry 10000.  If that gives errors, the most likely
X    problem is that you have a faulty bcopy.  I have included a version
X    of bcopy() that is linked in by default -- perhaps you aren't using
X    it?  Do _not_ use memcpy(), as it doesn't handle overlapping regions
X    correctly.
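X
X    If you do substitute a library routine, note that ANSI C defines
X    memmove() to handle overlapping regions.  The sketch below is only an
X    illustration (it is not part of lq-text, and assumes an ANSI C
X    library); it shifts a string right by two bytes within one buffer:
X	#include <stdio.h>
X	#include <string.h>
X
X	int main()
X	{
X	    char buf[9];
X
X	    strcpy(buf, "abcdefgh");
X	    /* source and destination overlap by four bytes */
X	    memmove(buf + 2, buf, 6);
X	    printf("%s\n", buf);	/* prints "ababcdef" */
X	    return 0;
X	}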
X
X    If -lmalloc fails, simply remove it in Makefile.
X    If you don't have <malloc.h>, you can make an empty file called
X    h/malloc.h (ugh).  I ship a Makefile with -lmalloc because it's such a
X    big win when it is available, and I wouldn't want anyone to forget it!
X
X    On a Sun, gcc might have some strange problems with libraries.  If so,
X    use cc.  Sorry.
X    You can use -O on all systems I've tried, and -O4 seems OK on the Sun --
X    at any rate I have done this on my Sun 4/110 under SunOS 4.0.3 here.
X
X    In ancient history, I used gcc -Wall under 386/ix.  I still port
X    lq-text to 386/ix (2.0.2 most recently, October 1990), but can no
X    longer use gcc there because of disk space, so I don't know if gcc
X    will produce messages.  Versions of Unix predating the Norman Conquest
X    may cause problems too.
X
X    For serious debugging, I have included "saber.project", so Saber-C
X    users can get started quickly.  If you are debugging without Saber-C,
X    the first thing to do is to buy it.  It's worth it...
X
X
X    Otherwise, compile with -DASCIITRACE.  You could also use
X    -DMALLOCTRACE, which makes the malloc() routines print messages to
X    stderr, which can be processed with awk -- see test/malloctrace.
X
X
X    Oh, and the common word list is searched linearly, so it is worth
X    keeping it fairly short.  Usually about a dozen words is plenty.
X
X
XLee
X
Xlee@sq.com
Xlee%sq.com@cs.toronto.edu
X{uunet,utzoo,cs.toronto.edu}!sq!lee
@@@End of lq-text/README
echo x - lq-text/doc/lqtext.1 1>&2
sed 's/^X//' >lq-text/doc/lqtext.1 <<'@@@End of lq-text/doc/lqtext.1'
X.\" use sqtbl % | troff -man
X.de r2
X.RS
X.RS
X..
X.de re
X.RE
X.RE
X..
X.TH LQ-TEXT 1 "\(co copyright Liam Quin 1989, 1990"
X.SH NAME
Xlqtext, lqword, lqphrase, lqaddfile, lqfile, lqkwik, lqshow, lq \- text retrieval package
X.SH SYNOPSIS
X.B lqtext
X[
X.B \-vVx
X] [
X.BI \-c cfile
X] [
X.BI \-d dir
X] [
X.BI \-m c
X]
X.br
X.B lqword
X[
X.B \-aAlsvVx
X] [
X.BI \-c cfile
X] [
X.BI \-d dir
X] [
X.BI \-m c
X] [
X.BI \-t n
X]
X.I word
X\&.\|.\|.
X.br
X.B lqphrase
X[
X.B \-lsvVx
X] [
X.BI \-c cfile
X] [
X.BI \-d dir
X] [
X.BI \-t n
X] [
X.BI \-m c
X]
X.I phrase
X\&.\|.\|.
X.br
X.B lqaddfile
X[
X.BI \-xvV
X] [
X.BI \-d dir
X] [
X.BI \-c cfile
X] [
X.BI \-t n
X]
X.I file
X\&.\|.\|.
X.br
X.B lqfile
X[
X.BI \-aAxvV
X] [
X.I file
X]
X\&.\|.\|.
X.br
X.B lqshow
X.I match
X\&.\|.\|.
X.br
X.B lq
X.SH DESCRIPTION
X.I lq-text
Xis a text retrieval database.
XYou can retrieve files based on words (with
X.IR lqword )
Xor phrases (with
X.IR lqphrase ).
X.I Lq-text
Xkeeps a database containing all of the known words and listing the
Xfiles in which they are found.
XThis database is typically between half and three-quarters of the total
Xsize of the actual data, and enables searching to be rapid.
XFiles can be added to the database at any time (with
X.IR lqaddfile ).
X.PP
XThe retrieval programs will give you the names of files containing the
Xwords about which you enquired, but will not show the actual text.
XThis means that you can archive or remove files, and
X.I lq-text
Xcan still find them.
X.I lqshow
Xwill display the matches directly.
XIf it is installed on your system, the
X.I lqtext
Xprogram provides an interactive front end.  (This program is generally
Xonly available under System V Release 3.2 or later at the time of writing.)
XIf not, there is a shell-script called
X.I lq
Xwhich is rather slower but which provides much of the same functionality.
X.I Lqkwik
Xtakes a list of matches as produced by
X.I lqword
Xor
X.I lqphrase
Xand prints a few words either side of each match, formatted so that the
Xmatched phrases are all in the same column.
XThere are options to alter the sizes of the various columns.
XSince
X.I lqkwik
Xis new and experimental, it is not yet otherwise documented here.
X.SH "OPTIONS (all programs)"
X.TP
X.BI \-m c
XSet the matching level.  If
X.I c
Xis
X.BR p ,
Xprecise matching is used;
X.B \-mh
Xinvokes heuristic matching, and
X.B \-ma
Xallows approximate matching.
XSee below for more explanation of word and phrase matching.
X.TP
X.BI \-t n
XSet the trace-level to
X.IR n .
XThis is mainly used for debugging.
XThe default trace level is zero, giving no debugging trace at all.
X.TP
X.B \-v
Xverbose mode \- this is exactly the same as using
X.BR \-t1 .
X.TP
X.B \-V
Xprint version information.
X.TP
X.B \-x
XPrint an explanation of options.  This is the single most important
Xoption to remember (and arguably the
X.I only
Xone worth remembering!), as the programs may get updated more
Xoften than the documentation.
X.br
XThe
X.B \-x
Xand
X.B \-v
Xoptions can be combined, so that
X.B \-xv
Xgives a slightly longer explanation.
X.TP
X.BI \-d dir
XLook in the named directory
Xfor the database files.
XIf this is not given, the environment variable
X.SM LQTEXTDIR
Xis inspected, and either this or a built-in default is used.
X.TP
X.BI \-c file
XThe named file should contain a list of words that will not
Xbe included in the index.  A good starting point might
Xbe
X.I /usr/lib/eign
Xif your system has it.  If not, see the
X.I FindCommon
Xscript for a way to generate one.
XIf this option is not given, the programs search for the file named
Xin the
X.SM LQCOMMON
Xenvironment variable, and then look in the file
X.SM README
Xin the
X.I lq-text
Xdatabase directory for a line of the form
X.br \" is this .br needed?
X.r2
X.B common
X.I " filename"
X.re
X.br\" and this one too?
Xbefore the first
X.I end
Xkeyword.
X.SH "LQADDFILE OPTIONS"
X.TP
X.BI \-w n
XNormally
X.I lqaddfile
Xkeeps a cache of the words it has seen, and writes them out only
Xoccasionally.  The less often the cache is written, the faster the
Xprogram will run.  On the other hand, as soon as
X.I lqaddfile
Xgrows large enough to fill all of available physical memory, it starts
Xto run very, very slowly and to impose a noticeable system overhead.
X.PP
XThe total number of words in the cache determines (approximately) the
Xtotal size of
X.I lqaddfile
Xwhen it runs.  Allow about twenty bytes per word.
XValues of from 30,000 to 100,000 appear suitable for machines with from
Xfour to twelve megabytes of memory, as a rough guide.
X.TP
X.BI \-f \^file
XThe list of files to index is read from the named
X.IR file .
XIf this file is `\-', standard input is read.
X.SH "LQFILE OPTIONS"
X.TP
X.B \-a
Xproduces a list of all files in the database
X.TP
X.B \-A
Xtreats each of the remaining arguments as files to add to the file list.
XNo indexing is done, so the main effect of this is that the named files
Xwill not be added to the database until they have changed.
X.SH "LQSHOW OPTIONS"
X.TP
X.BI \-a above
XDisplay
X.I above
Xlines of text above each match.
XThe default is to display up to six lines preceding each match from each file.
X.TP
X.BI \-b below
XDisplay
X.I below
Xlines of text following each match.
XThe default is to display six lines of text after the line containing
Xthe first matched keyword in a phrase.
XIf there are too many lines to fit on the screen, they will wrap around
Xto the top of the screen.
X.TP
X.BI \-f \^file
XThe named
X.I file
Xis assumed to contain a list of matches in the form produced
Xby
X.I lqword
X.IR \-l ,
Xand allows browsing of a much longer list of matches than the form given below.
X.PP
XRemaining arguments are taken to be matches.
XThese are groups of three strings:
Xa number representing the block in the file, another representing the
Xword in the block, and finally a file name (or path).
XFiles will be found if they are absolute (starting with a /), or if they
Xare in a directory which is specified in
Xthe
X.SM DOCPATH
Xenvironment variable, as described under `environment' below.
X.PP
XThere are also some (deliberately) undocumented options used by the
X.I lq
Xshell script.
X.SH "LQWORD OPTIONS"
XWith no options at all,
X.I lqword
Xwill list all of the words in the database, one per line.
XIf it is given the
X.B \-a
Xflag, it will print statistics about each word as well as the word itself.
XIf given the
X.B \-A
Xoption, it will print out the pathname, block and word-in-block of every
Xoccurrence of every word in the database.
XThis can take some time (from one to two minutes per megabyte of database
Xon a typical 386/ix system, for example).
X.PP
XOther options are
X.TP
X.B \-l
Xlist format \- list matches without attempting to format them for human
Xreadability.  This allows one to use
X.r2
Xlqshow  \`lqword  \-l  word1  word2 ...\`
X.re
Xto view files immediately.
X.PP
XOther options to
X.I lqword
Xare:
X.TP
X.BI \-d word
Xdelete mode \- delete the given
X.I word
Xfrom the database.
XThis should be used with caution, and will be moved to the
Xnew (and unreleased)
X.I lqadmin
Xcommand in the next release.
X.TP
X.B \-s
Xsilent mode.
XIn this mode,
X.I lqword
Xdoes not produce any output, but the exit status is zero if at least
Xone of the given words was found, and one otherwise.
XIf no words are given in this mode,
X.I lqword
Xwill exit with non-zero status.
X.SH MATCHING
XThe
X.I "matching level"
Xwas mentioned briefly under
X.SM OPTIONS
Xabove.
XThe following table summarises the differences between the three available
Xlevels.
X.br
X.\" the .ne lines are for broken versions of tbl...
X.TS
Xallbox doublebox;
XlB lB lB
Xl l l.
X\-m Option	Meaning	Description
X=
X.ne 4
X\-m\^p	Precise	T{
X.ll 3i
XPhrases in the data must have the same
XCapitalisation as you type, and words must be the same distance apart.
X.br
XUse this only if you get too many matches otherwise.
XT}
X.ne 4
X\-m\^h	Heuristic	T{
X.ll 3i
XWords that you give starting with Capital Letters will only match
Xsimilar words in the database; lower case words will match either.
X.br
XPlurals that you give will only match plurals in the data, but
Xa singular word will match either.  For example, if you type `sock',
Xyou will find both `sock' and `socks'.
XT}
X.ne 4
X\-m\^\&a	Any	T{
X.ll 3i
XWith this option, \&\fIlq-text\fP
Xmatching programs will try as hard as possible to match words.
X.br
XPlurals, possessives, case and word separation are all treated loosely.
XT}
X.TE
X.SH EXAMPLES
X.r2
Xlqword martin
X.re
Xfinds all matches of the word `martin' in the default database.
X.r2
Xlqphrase -d sources/unix "text retrieval" "word searching"
X.re
Xlooks for the two named phrases.
X.SH ENVIRONMENT
XAll of the programs recognise the environment variables
X.BR LQTEXTDIR ,
X.B  DOCPATH
Xand
X.BR LQCOMMON .
XThe first of these contains the directory in which to look for the
Xdatabase files.  If this is not given, you may find that there is
Xa further default built in to the programs when they were compiled.
XThe
X.B -d
Xoption overrides the
X.SM LQTEXTDIR
Xvariable.
X.br
XThe second,
X.SM DOCPATH ,
Xis a colon-separated list of places to look for files when adding or
Xdisplaying.
XThis is normally set in the file
X.SM README
Xin the database directory.
X.br
XFinally,
X.SM LQCOMMON
Xcan be set to a list of words to ignore when adding or retrieving files.
XAgain, this is usually set in the
X.SM README
Xfile, but you might want to treat certain files differently.
XThe default Common Word List is called `Common', and lives in the
Xdatabase directory.
XIf
X.SM LQCOMMON
Xstarts with a `/', it will be taken to be an absolute pathname;
Xotherwise, it is assumed to name a file in the database directory.
X.SH BUGS
XThis is a beta (test) release.
XPlease don't hesitate to fix bugs and let me know what you did\^.\^.\^.
X.sp
XThis documentation is at best preliminary.
X.SH AUTHOR
XLiam R. Quin, 1990
X.
X.\" $Log:	lqtext.1,v $
X.\" Revision 1.3  91/03/03  00:23:19  lee
X.\" Mentioned lqkwik
X.\" 
X.\" Revision 1.2  90/10/06  02:32:22  lee
X.\" Prepared for first beta release.
X.\" 
X.\" Revision 1.1  90/10/04  17:38:34  lee
X.\" Initial revision
X.\" 
X.\"
@@@End of lq-text/doc/lqtext.1
echo x - lq-text/Sample/CommonWords 1>&2
sed 's/^X//' >lq-text/Sample/CommonWords <<'@@@End of lq-text/Sample/CommonWords'
X# lq-text common word stop-list
X
X# Keep this list short -- at most 50 words --  or you will pay a penalty
X# in performance when you add documents to the index -- the list is
X# searched linearly (but is kept sorted internally, so it's OK to have
X# duplicates in here).
X
X# First index some text with everything commented out, and then use
X# FindCommon to determine which are very common words.  You don't gain
X# all that much space by deleting them, so I don't usually bother.
X
X# the	# 27880 <-- number of times this word appeared in a sample run
X# and	# 23857 <-- on part (or all? I forget) of the King James Bible..
X# that	# 4705
X# for	# 3011
@@@End of lq-text/Sample/CommonWords
echo x - lq-text/Sample/README 1>&2
sed 's/^X//' >lq-text/Sample/README <<'@@@End of lq-text/Sample/README'
X# This file is read (up to "end") by all lq-text programs.
X
Xcommon CommonWords
X
X# where to find documents:
Xdocpath /usr/spool/news:/home/lee/text:
X
Xend
X# end of machine-readable configuration (the computer reads no further! --
X# this is an optimisation for start-up speed...!)
X
X# Common common-file
X# 	--- giving the name of a file of common words
X# Docpath "path" (the quotes are optional)
X#	--- giving a list of places to look for files, separated by ":"
X#	    Useful tip: avoid putting :: or "." in DOCPATH, as you'll
X#	    then get files that might or might not be found, depending
X#	    on where you happen to be.  $HOME is NOT understood in here.
X#
X#	    Docpath can be replaced by the environment variable $DOCPATH.
X#
X# I'll be adding other keywords soon...  and I am open to suggestions!
X# Lee
X# Liam R. E. Quin    lee@sq.com
X
X/*
X * LQ-TEXT Copyright 1990 Liam Russell Eric Quin.  All rights reserved.
X * Written by Liam Quin.
X *
X * This software is not subject to any license of the American Telephone
X * and Telegraph Company or of the Regents of the University of California,
X * or of the X Consortium, or of the Free Software Foundation.
X *
X * Permission is granted to anyone to use this software for any purpose on
X * any computer system, and to alter it and redistribute it freely, subject
X * to the following restrictions:
X *
X * 1. The author is not responsible for the consequences of use of this
X *    software, no matter how awful, even if they arise from flaws in it.
X *
X * 2. The origin of this software must not be misrepresented, either by
X *    explicit claim or by omission.  Since few users ever read sources,
X *    credits must appear in the documentation.
X *
X * 3. Altered versions must be plainly marked as such, and must not be
X *    misrepresented as being the original software.  Since few users
X *    ever read sources, credits must appear in the documentation.
X *
X * 4. Permission must be obtained for any commercial use of this software
X *    which involves resale of all or part of the software, whether
X *    modified or not.
X *
X * 5. This notice may not be removed or altered.
X *
X */
X
X/*
X * Acknowledgements to Henry Spencer for permission to modify and use his
X * and Geoff Collyer's C News copyright notice.
X *
X */
X
X
@@@End of lq-text/Sample/README
echo x - lq-text/src/COPYRIGHT 1>&2
sed 's/^X//' >lq-text/src/COPYRIGHT <<'@@@End of lq-text/src/COPYRIGHT'
X/*
X * Copyright 1989 Liam Russell Eric Quin.  All rights reserved.
X * Written by Liam Quin.
X *
X * This software is not subject to any license of the American Telephone
X * and Telegraph Company or of the Regents of the University of California,
X * or of the X Consortium, or of the Free Software Foundation.
X *
X * Permission is granted to anyone to use this software for any purpose on
X * any computer system, and to alter it and redistribute it freely, subject
X * to the following restrictions:
X *
X * 1. The author is not responsible for the consequences of use of this
X *    software, no matter how awful, even if they arise from flaws in it.
X *
X * 2. The origin of this software must not be misrepresented, either by
X *    explicit claim or by omission.  Since few users ever read sources,
X *    credits must appear in the documentation.
X *
X * 3. Altered versions must be plainly marked as such, and must not be
X *    misrepresented as being the original software.  Since few users
X *    ever read sources, credits must appear in the documentation.
X *
X * 4. Permission must be obtained for any commercial use of this software
X *    which involves resale of all or part of the software, whether
X *    modified or not.
X *
X * 5. This notice may not be removed or altered.
X *
X */
X
X/*
X * Acknowledgements to Henry Spencer for permission to modify and use his
X * and Geoff Collyer's C News copyright notice.
X *
X */
X
X
@@@End of lq-text/src/COPYRIGHT
echo x - lq-text/src/Makefile 1>&2
sed 's/^X//' >lq-text/src/Makefile <<'@@@End of lq-text/src/Makefile'
X# Makefile for LQ-Text, a full text retrieval package by Liam R. Quin
X#
X# $Id: Makefile,v 1.13 91/03/02 20:23:02 lee Exp $
X#
X
X# Do this first for sanity...:
XSHELL=/bin/sh
X
X### Some global configuration options.
X
X# You should also look at h/globals.h for more things to change.
X#
X# DEFS are included in CFLAGS, passed to the C compiler:
X# If ASCIITRACE is defined, you can get extra debugging output using -t99
X# (or some other number), but there is a slight performance penalty for
X# including this, and you'd need to understand the code.
X
X# DEFS:
X# Use either -UBSD -DSYSV or -USYSV -DBSD as appropriate...
X# This affects
X# * the choice of default pager ($PAGER) in globals.h
X# * whether some extra declarations are used to make lint and gcc -Wall
X#   happy about SysV stdio.h
X# It isn't very important... if you have System V stdio and curses, you
X# might as well use -DSYSV -UBSD even on BSD.  Ultrix diffs are included
X# inside #ifdef ultrix; use BSD on Ultrix.
X# Lqtext doesn't do explicit locking or signal handling at the moment.
X# If you are using sdbm (this is what I use) and get messages about L_SET
X# or L_SEEK being undefined, add -DSVID (there are other changes, but
X# this is a useful symptom...)
X#
X# -DMALLOCTRACE makes malloc.c produce masses of output...
X#
X# -DCURSESX, if present, says that we have the System V.3.1 or later
X# curses that has A_STANDOUT and in which box(win, 0, 0) draws a neat box
X# with vt100 characters.  If you're not sure, if the string ACS appears
X# in /usr/include/curses.h you should probably use -DCURSESX.
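X# A quick way to check, assuming the header lives in the usual place:
X#	grep ACS /usr/include/curses.h
X# Any output at all suggests that -DCURSESX is worth trying.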
X#
X# DEFS= -DASCIITRACE -UBSD -DSYSV -DMALLOCTRACE -DCURSESX ### for BIG testing
XDEFS= -UASCIITRACE -DBSD -USYSV   ### Try this on BSD-like Unix ...
X# DEFS= -UASCIITRACE -UBSD -DSYSV -DCURSESX -DSVID ### ...and this on Sys V
X
X# Who owns the installed binaries?
XOWNER=lee
X# and what group are they in?
XGROUP=other
X# and where do they go?
XBINDIR=/usr/local/bin
XLIBDIR=/usr/local/lib/lqtext
X# and the file mode for executables?
XMODE=751
X
X# NewsFilter and MailFilter are programs which read news/mail files and
X# turn unwanted words (e.g. Received-By lines inside mail headers) into
X# "qxxxxx", with the right number of x's so that the total byte count is
X# unchanged....
X# Lqshow is the document browser.
XMAILFILTER=$(LIBDIR)/MailFilter
XNEWSFILTER=$(LIBDIR)/NewsFilter
XLQSHOW=$(BINDIR)/lqshow
XLQFILE=$(BINDIR)/lqfile
X
X# If you have -lmalloc, use it...
XMALLOC=-lmalloc # faster version of malloc
X# MALLOC= # BSD Unix doesn't have malloc(3X), only malloc(3)
X
X# Choose between ozmahash, ndbm, sdbm, gdbm or dbm -- if you only have dbm,
X# you'll have some work to do -- see PORTING for gdbm or dbm.
X# The necessary changes are in h/smalldb.h and h/Liamdbm.h if you need them.
X# If you use ozmahash or sdbm you must use the fixed versions -- sdbm was
X# posted to netnews in 1991, and ozmahash is included with this distribution.
X# If you use ozmahash, copy ozmahash/*.h into h.
X
X# WHICHDBM=ndbm
X# DBMLIBS=-lndbm
X# MKDBM= # this is the target if we have to build ndbm...
XWHICHDBM=ozmahash
XDBMLIBS=../lib/libhash.a
XMKDBM=mkozmahash # this is the target if we have to build ozmahash...
X
X# On BSD systems you need -ltermcap as well as libcurses for "lqshow".
X# TERMCAP=-lcursesX -ltermcap # ultrix -- cursesx
XTERMCAP=-lcurses -ltermcap
X# TERMCAP=-lcurses
X
X# on SYSV ranlib is usually "echo"
X# RANLIB=echo
XRANLIB=ranlib
X
X# Choose a C compiler -- GNU's gcc if you have it, or the standard cc.
X# GNU cc won't compile lqaddfile.c on some machines, but I don't know why.
X## for gcc:
X#CC=gcc
X#GCCF= -fwritable-strings -Wall -I/usr/include
X## for anything else:
XCC=cc
XGCCF=
X##
X# Use -O or -O -g for the optimiser.  Or -ql or -p for profiling (sysV)
X# -O3 (or -O4 if you are feeling brave) is for SunOS
X# With gcc or recent System V compilers, you can use OPT=-O -g
X# NOTE to profilers: do not mix -g with -p -- this is often broken!
XOPT=-O
X
XCFLAGS= $(OPT) $(DEFS) $(GCCF) -D$(WHICHDBM) $$(EXTRA)
X
X# Lint flags vary wildly between systems.
X# LINTFLAGS=-xv
XLINTFLAGS=-a -b -c -h -x 
X
X
X### End of configuration section.  See also PORTING in this directory.
X
XTARGETS=mklib mkbin libs mkfilters mktest mkmenu
XDIRS=mkfilters mkliblqtext mklqtext mktest mkmenu
XMKTARGETS=$(MKDBM) $(DIRS)
X
X# Make all does a local install (in src/bin src/lib src/testbin) too...
Xall: local
X
X.SUFFIXES: .c .o .src .obj
X
X.c.src:
X	#load $(CFLAGS) $<
X
X.o.obj:
X	#load $(CFLAGS) $<
X
Xsaber_src:
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=saber_src $(MKTARGETS)
X
Xsaber_obj:
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=saber_obj $(MKTARGETS)
X
Xlqaddfile.src:
X	#cd lqtext
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' RANLIB='$(RANLIB)' DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' lqaddfile.src
X	#cd ..
X
Xtidy:
X	$(MAKE) -i$(MAKEFLAGS) MAKEWHAT=tidy $(MKTARGETS)
X
Xclean:
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=clean $(MKTARGETS)
X	rm -f lib/* bin/* testbin/* core *.o m.log
X
Xdepend:
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=depend $(MKTARGETS)
X
Xlocal: mklib mkbin libs
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKTARGETS)
X
Xinstall: libs
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKTARGETS)
X	( cd bin ; for i in *; do \
X	    test -f ${BINDIR}/$$i && /bin/mv ${BINDIR}/$$i ${BINDIR}/$$i.old; \
X	    /bin/cp $$i ${BINDIR}; \
X	    chmod 711 ${BINDIR}/$$i; chgrp ${GROUP} ${BINDIR}/$$i; \
X	    chown ${OWNER} ${BINDIR}/$$i;\
X	  done; \
X	)
X	( cd lib ; for i in `ls | grep -v '\.a$'`; do \
X	    test -f ${LIBDIR}/$$i && /bin/mv ${LIBDIR}/$$i ${LIBDIR}/$$i.old; \
X	    /bin/cp $$i ${LIBDIR}; \
X	    chmod 711 ${LIBDIR}/$$i; chgrp ${GROUP} ${LIBDIR}/$$i; \
X	    chown ${OWNER} ${LIBDIR}/$$i; \
X	  done; \
X	)
X	@-echo Binary Installation complete
X	@-echo Now install manual pages from ../doc if appropriate.
X
Xlint:
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=lint $(MKTARGETS)
X
Xlibs:
X	-/bin/test -d ${LIBDIR} || mkdir ${LIBDIR}
X	$(MAKE) -$(MAKEFLAGS) MAKEWHAT=install $(MKDBM) mkliblqtext
X	-@echo libraries up to date
X
X# Note to mklib and mkbin:
X# If the mkdir -p bombs out, there is a shell-script mkdir you can use
X# in the utils directory.  The -p means to create parent directories as needed.
X
Xmklib: # see note above about mkdir
X	-@test -d lib || mkdir lib
X	-@test -d $(LIBDIR) || mkdir -p $(LIBDIR)
X
Xmkbin:  # see note above about mkdir
X	-@test -d bin || mkdir bin
X	-@test -d $(BINDIR) || mkdir -p $(BINDIR)
X
Xmkfilters:
X	cd filters; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
X	CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT) 
X
Xmkliblqtext:
X	cd liblqtext; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
X	CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
Xmklqtext:
X	cd lqtext; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' \
X	CFLAGS='$(CFLAGS) -DMAILFILTER=\"$(MAILFILTER)\" -DNEWSFILTER=\"$(NEWSFILTER)\" ' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
Xmksdbm:
X	cd sdbm; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
Xmkozmahash:
X	cd ozmahash; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='-I. $(CFLAGS)' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
Xmktest:
X	cd test; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' CFLAGS='$(CFLAGS)' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	OWNER='$(OWNER)' RANLIB='$(RANLIB)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
Xmkmenu:
X	cd menu; \
X	$(MAKE) -$(MAKEFLAGS) MALLOC='$(MALLOC)' RANLIB='$(RANLIB)' \
X	CFLAGS='$(CFLAGS) -DLQSHOW=\"$(LQSHOW)\" -DLQFILE=\"$(LQFILE)\" ' \
X	CC='$(CC)' WHICHDBM='$(WHICHDBM)' TERMCAP='$(TERMCAP)' \
X	DBMLIBS="${DBMLIBS}" LINTFLAGS='$(LINTFLAGS)' $(MAKEWHAT)
X
X
X#
X# $Log:	Makefile,v $
X# Revision 1.13  91/03/02  20:23:02  lee
X# Improved install entry.
X# 
X# Revision 1.12  91/03/02  19:34:13  lee
X# More comments and changed some defaults.
X# 
X# Revision 1.11  91/03/02  19:16:27  lee
X# Now makes ozmahash if necessary, and uses SHELL=/bin/sh.
X# 
X# Revision 1.10  91/02/20  19:33:37  lee
X# Removed duplicate definitions of LIB/LIBDIR, BIB/BINDIR and OWNER;
X# OWNER now passed down in mktest correctly.
X# 
X# Revision 1.9  90/10/05  23:40:41  lee
X# More comments for easier configuration -- and added more examples.
X# 
X# Revision 1.8  90/10/03  21:39:41  lee
X# added CURSESX and more comments.
X# ;.
X# 
X# Revision 1.7  90/10/03  21:14:08  lee
X# Now passes MAILFILTER and NEWSFILTER.
X# 
X# Revision 1.6  90/10/01  18:28:43  lee
X# Added MAILFLTER and NEWSFILTER to mkfilter
X# 
X# Revision 1.5  90/09/28  21:52:14  lee
X# Now does installation itself...
X# 
X# Revision 1.4  90/09/10  13:25:31  lee
X# Added some saber-C hooks.
X# 
X# Revision 1.3  90/07/27  17:50:51  lee
X# alpha test version shipped
X# 
X# Revision 1.2  90/03/23  18:57:13  lee
X# Added entries for lint and depend.
X# 
X# Revision 1.1  90/03/23  15:11:30  lee
X# Initial revision
X# 
X#
@@@End of lq-text/src/Makefile
echo x - lq-text/src/PORTING 1>&2
sed 's/^X//' >lq-text/src/PORTING <<'@@@End of lq-text/src/PORTING'
XNotes for porting lq-text.
X
XThis is the free version.  It is not public domain, but you can use it
Xfreely for non-commercial purposes.
XIf you want to sell it, or something derived from it, you should get in
Xtouch with the author (me!), who will almost always give permission.
X
XYou can contact me as follows:
X  lee@sq.com,
X  Liam Quin, SoftQuad Inc. 720 Spadina Ave., Toronto, ONT., Canada
X  (+1) 416 963-8337
X
X==============================
X
XPORTING NOTES
X
XWell, I haven't done much porting.
XCurrently the stuff works on
X* System V Release 3.2 (Interactive's 386/ix 2.0.2), 80386
X* System V Release 2 (Honeywell Bull XPX 100 X20), 68020
X* SunOS 3 and 4 (Sun 3/60 and 3/75), 68020
X
XSince the 68020 and 80386 differ radically, most of the work has probably
Xbeen done.  Don't even *think* about non-Unix systems, though.
X
XLikely problems:
X* the calls to lockf() in FileList.c and WordInfo.c
X  You could comment them out on a single-user system.
X  On BSD you could use flock() instead.  It should be mandatory locking.
X  Actually, since the individual word entries are not locked, you could
X  simply delete the locking code.
X
X* if you have multiple machines sharing the same database, and they do not
X  all use the same byte-ordering, you will need to do some hacking.
X  The headers in pblock.{c,h}, FileList.c and WordInfo.c will all need
X  changing.  They have to read and write fixed-length unsigned longs,
X  and very quickly too!
X  Alternatively, use sReadNumber() and sWriteNumber(), and always allow
X  four bytes.
X
X* you need ndbm.  If you don't have it, you can use sdbm.
X  Look at Liamdb.h and smalldb.h, and compile without -DNDBM.
X  If you are on Xenix, you can buy 386/ix from your nearest Interactive
X  dealer... Or use dbm().
X  If you are on 386/ix, well, as 386/ix doesn't include dbm,
X  use sdbm or gdbm.
X
X* On mixed model architectures, make sure that you can have arrays larger
X  than 64KBytes (if possible), that a pointer (char *) fits into an unsigned
X  long, and that you have a good supply of coffee.
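X
XFor the byte-ordering problem above, one possible approach is to store each
Xheader value as four bytes in a fixed order, whatever the host byte order.
XA rough sketch only: Put4 and Get4 are names made up for this note and are
Xnot part of lq-text, and the most-significant-byte-first layout is an
Xassumption.
X
X	/* store n at p as four bytes, most significant byte first */
X	void
X	Put4(p, n)
X	    unsigned char *p;
X	    unsigned long n;
X	{
X	    p[0] = (n >> 24) & 0377;
X	    p[1] = (n >> 16) & 0377;
X	    p[2] = (n >> 8) & 0377;
X	    p[3] = n & 0377;
X	}
X
X	/* read back a value written by Put4 */
X	unsigned long
X	Get4(p)
X	    unsigned char *p;
X	{
X	    return ((unsigned long) p[0] << 24) | ((unsigned long) p[1] << 16) |
X		   ((unsigned long) p[2] << 8) | (unsigned long) p[3];
X	}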
X
XBy all means mail me with questions, providing that you have at least tried
Xto get somewhere yourself.
X
XLee
Xutzoo!sq!lee
Xlee@sq.com
@@@End of lq-text/src/PORTING
echo x - lq-text/src/TODO 1>&2
sed 's/^X//' >lq-text/src/TODO <<'@@@End of lq-text/src/TODO'
XKEY:
X* -- easy change
X** - harder, needs more understanding
X@ -- needs understanding of internals
X@@ - mail me if you need this!
X
X* give lqshow the ability to page a file
X
X** make a list of matches showing words nearby in a KWIC style
X   so that lqtext can show (say) a dozen at a time...
X   (see lqkwik... this will appear fairly soon, I expect)
X
X@@ special treatment of dates
X
X** table of pagers for browsing by file/type
X
X**@@ Better ranking of queries
X
X**@@ write a manual :-(
X
X**@ "this" can't be accessed by lqword, but can be by lqshow[???].  The
X   entire plural code (Root.c) needs a rethink.
X   I have started Plurals.c, but it's not ready yet.  Yell if you have any
X   ideas, I need them!
X
X* The various Filter routines should be incorporated into showfile. 
X* Automatic uncompression should be added.  Should look at the magic number
X  as well as the file extension
X  ** -- lqshow would need big changes in one (obvious) routine... as it
X	currently uses lseek...
X
X** Showfile should be made a routine (BrowseList() I suppose) that takes
X** a list of Phrases with their matches...
X
X**@ should abandon dbm for the list of filenames.  A better approach would
X   be to store path components as words in the database!  This would make
X   / a common-word, though.  Needs some thought.
X   A btree might be a good compromise.  For now, at least ozmahash doesn't
X   have overflow problems.
X
X*@ should use six-bit encoding for strings.  This would save a lot of space
X   with relatively little overhead.
X   Actually it doesn't save space at all, based on my experiments.  Sigh.
X
X**@@ the ability to delete a file.  Two ways:
X     1) read the file -- could be an option to addfile, in fact.  This
X	would have to check the time-stamps, of course.
X     2) schedule (perhaps overnight) a process to go through the entire
X	database and delete old files.
X	Also, addfile (and SortWordPlaces()) could remove deleted FIDs 
X	automatically, which would help.
X
X*@ the algorithm to add a new entry to a WID is too slow, because of the
X  requirement that the list be kept sorted.  I should instead keep a
X  SortedToHere counter in the header, and simply append new words.
X  The next time someone does a getpblock() and a sort, it could be written
X  back sorted.  Or there could be a daemon sorter!!
X
X* need better documentation!
X
X* README should be used more, allowing more configuration.
X  (README can now be called something else by recompiling)
X
X** allow dynamic definition of word start/mid/end, in README.
X   Must be at least as fast as isupper() etc.
X   Perhaps per-file-type rules, though?  Makes Phrase Matching hard.
X
X** Better file locking
X   (no file locking or signal handling at all at the moment -- I ripped it
X    all out when I discovered that it was broken on many systems, and
X    this gave a false sense of security.)
X
X* Finish the ReadAhead daemon.   See if it makes an improvement.
X  The idea is that whenever ReadBlock reads a block, it should ask the
X  daemon to read the next block, thus ensuring that it is in the buffer
X  cache.
X  It might be better simply to give it the WID and have it read the entire
X  chain itself.
X
X* Phrase Matching would be orders of magnitude faster if it did not involve
X  reading the tables of matches until they are needed, as many of them
X  won't be!  It should extend the lists of matches for each word in the
X  phrase only as necessary.
X
X** save large FIDs for large files.
X
X** Use a better number scheme (numbers.c)
X   make some more of the number routines inline -- especially use
X   a #define for the common case of sReadNumber and sWriteNumber, eliminating
X   millions (!) of function calls when the numbers are only 1 byte long.
X   Note that most numbers turn out to fit in one byte (more than 90%) at
X   the moment, so if I had another bit I could improve the flags stuff..
X   The current scheme sets the top bit in each byte if there is more to
X   follow.  Another alternative would be to use magic values (e.g.
X   256 - (number of bytes)) for the first byte if there's more than one
X   byte. Hence 255 255 would be used to store 255.  This would roughly
X   double the number of numbers (!) fitting into one byte... hmm...
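X
X   For illustration only, a rough sketch of the top-bit scheme in C.
X   PutNumber and GetNumber are names made up for this note, and this is
X   not the code in numbers.c; storing the low 7 bits first is an
X   assumption.
X
X	/* store n at p, 7 bits per byte; return the number of bytes used */
X	int
X	PutNumber(n, p)
X	    unsigned long n;
X	    unsigned char *p;
X	{
X	    int len = 0;
X
X	    while (n > 0177) {
X		p[len++] = (n & 0177) | 0200; /* top bit set: more to follow */
X		n >>= 7;
X	    }
X	    p[len++] = n; /* last byte has the top bit clear */
X	    return len;
X	}
X
X	/* read a number back from p; return the number of bytes consumed */
X	int
X	GetNumber(p, np)
X	    unsigned char *p;
X	    unsigned long *np;
X	{
X	    unsigned long n = 0;
X	    int len = 0, shift = 0;
X
X	    while (p[len] & 0200) {
X		n |= (unsigned long) (p[len++] & 0177) << shift;
X		shift += 7;
X	    }
X	    n |= (unsigned long) p[len++] << shift;
X	    *np = n;
X	    return len;
X	}
X
X   With this layout, 0 to 127 take one byte and 128 to 16383 take two.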
X
XKnown Bugs
X==========
X* lqshow only marks the first word of the phrase.
X* lqshow does not know about file types!!!
X* there is no troff (or sqtroff) file type
X* the C filter got lost in history (sigh)
X* I cannot distribute the CDMS and Uniplex filters
@@@End of lq-text/src/TODO
echo x - lq-text/src/filters/FilterMain.c 1>&2
sed 's/^X//' >lq-text/src/filters/FilterMain.c <<'@@@End of lq-text/src/filters/FilterMain.c'
X/* FilterMain.c -- Copyright 1989, 1990 Liam R. Quin.  All Rights Reserved.
X * This code is NOT in the public domain.
X * See the file COPYRIGHT for full details.
X */
X
X/* $Id: FilterMain.c,v 1.3 90/10/06 00:57:16 lee Rel1-10 $
X */
X
X/* FilterMain is intended to make writing filters easier; one
X * simply writes the Filter() routine and links with FilterMain.o to
X * produce a new filter.
X *
X * The filter should use "wordrules.h", and should transform its input
X * into words, spaces and newlines, with all other characters turned into
X * spaces.
X *
X * A simple filter might be something like (on System V):
X *
X *	system("tr -c '[a-z][A-Z][0-9]_' '[ *]'");
X *
X * except that a word shouldn't start with a digit or _.
X *
X * Addfile itself maps upper case to lower, and may also check on the length
X * of words (min is currently 3, max 20, for example).
X *
X * A News or Mail filter might delete things from the header (turning them
X * into spaces to preserve file offsets), so that the index doesn't fill
X * up with ihnp4!decwrl!seismo!utzoo!henry everywhere.  Of course, it
X * would retain the utzoo!henry at the end of the From line.
X *
X * A filter for the Crystal Word Processor might turn accented characters
X * into their ASCII non-accented equivalents, (although NX-Text is 8-bit
X * transparent, so one could also decide to use an 8-bit character set),
X * and remove style information, non-local object banners, etc.
X *
X * This file must fork "compress -d" if appropriate, to read compressed files.
X * Note -- compress -d is the same as uncompress, but more likely to work.
X * Some sites also have zcat, but this is even rarer.
X *
X */
X
X/** Unix system calls used in this file: **/
Xextern void exit();
X/** C Library functions used in this file: **/
Xextern void perror();
X
X#include <stdio.h>
X
Xchar *progname;
Xvoid Filter();
X
Xextern int AsciiTrace;
X
Xint
Xmain(ac, av)
X    int ac;
X    char *av[];
X{
X    progname = av[0];
X
X    if (ac == 1) {
X	Filter(stdin, "(standard input)");
X    } else {
X	while (--ac) {
X	    FILE *f = fopen(*++av, "r");
X
X	    if (f == (FILE *) 0) {
X		fprintf(stderr, "%s: can't open ", progname);
X		perror(*av);
X		exit(1);
X	    }
X
X	    Filter(f, *av);
X
X	    (void) fclose(f);
X	}
X    }
X    return 0;
X}
X
X/*
X * $Log:	FilterMain.c,v $
X * Revision 1.3  90/10/06  00:57:16  lee
X * Prepared for first beta release.
X * 
X * Revision 1.2  90/09/20  18:32:39  lee
X * Removed extra variable declarations...
X * 
X * Revision 1.1  90/08/09  19:17:54  lee
X * Initial revision
X * 
X * Revision 1.2  89/09/16  21:15:58  lee
X * First demonstratable version.
X * 
X * Revision 1.1  89/09/07  21:01:54  lee
X * Initial revision
X * 
X */
@@@End of lq-text/src/filters/FilterMain.c
echo x - lq-text/src/filters/MailFilter.c 1>&2
sed 's/^X//' >lq-text/src/filters/MailFilter.c <<'@@@End of lq-text/src/filters/MailFilter.c'
X/* MailFilter.c -- Copyright 1989 Liam R. Quin.  All Rights Reserved.
X * This code is NOT in the public domain.
X * See the file COPYRIGHT for full details.
X */
X
X/* $Id: MailFilter.c,v 1.5 90/10/06 00:57:24 lee Rel1-10 $
X */
X
X/* Filter for mail articles.
X * Throw away all of the header except
X *	Subject
X *	From
X *	Date
X *	Cc:
X *	Organi[sz]ation
X *	To:
X *
X * See FilterMain and wordrules.h for more info.
X *
X */
X
X#ifdef SYSV
X extern int _filbuf(), _flsbuf();
X#endif
X#include <stdio.h>
X#include <malloc.h>
X#include <ctype.h>
X#include "wordrules.h"
X
X#include "emalloc.h"
X
X#define STREQ(boy, girl) ((*(boy) == *(girl)) && !strcmp(boy, girl))
X
Xextern char *progname;
X
X/** Unix system calls used in this file **/
X    /* (none) */
X/** Unix Library Functions used in this file: **/
X#ifndef tolower
X extern int tolower();
X#endif
Xextern int strcmp();
X
X/** Functions within this file used before they're defined: **/
Xvoid Header(), Body();
Xint GetChar();
X
X/** **/
X
Xvoid Filter();
X
Xchar *KeepThese[] = { /* keep this list in lower case, sorted! */
X    "cc",
X    "date",
X    "from",
X    "organisation",
X    "organization",
X    "subject",
X    "to",
X    0
X};
X
Xint icstreq(s1, s2) /* case-insensitive equality test: 1 if equal, else 0 */
X    char *s1, *s2;
X{
X    register char ch1, ch2;
X
X    while (*s1 && *s2) {
X	if (*s1 != *s2) {
X	    if (isupper(*s1)) {
X		ch1 = tolower(*s1);
X		ch2 = (*s2);
X	    } else  if (isupper(*s2)) {
X		/* Note that we only have to test one character for case! */
X		ch1 = (*s1);
X		ch2 = tolower(*s2);
X	    } else {
X		return 0; /* not the same */
X	    }
X	    if (ch1 != ch2) return 0; /* the strings differ */
X	}
X	s1++; s2++;
X    }
X    if (!*s1 && !*s2) {
X	return 1;
X    }
X    return 0; /* they are different */
X}
X
Xint
XIsWanted(String)
X    char *String;
X{
X    char **pp;
X    int ch = String[0];
X
X    if (isupper(ch)) ch = tolower(ch);
X
X    for (pp = KeepThese; *pp && **pp; pp++) {
X	if (**pp > ch) break; /* gone too far */
X	else if (icstreq(String, *pp)) return 1;
X    }
X    return 0;
X}
X
Xvoid
XFilter(InputFile, Name)
X    FILE *InputFile;
X    char *Name;
X{
X    Header(InputFile, Name);
X    Body(InputFile, Name);
X}
X
Xtypedef enum {
X    F_NotSeenAnythingYet,
X    F_InTheFirstWord,
X    F_AfterTheFirstWord
X} t_FirstWord;
X
Xint InWord = 0;
X
X/* For a mail article, the Header ends at the first line which is not
X * a valid mail header -- i.e., is not indented and doesn't start with
X * a capitalised word followed by a single space (uucp) or colon (RFC822).
X * A blank line also ends the header.
X */
Xvoid
XHeader(InputFile, Name)
X    FILE *InputFile;
X    char *Name;
X{
X    int AtStartOfLine = 1;
X    int IgnoreLine = 0; /* initialised for lint... */
X    t_FirstWord FirstWord = F_NotSeenAnythingYet;
X    int ch;
X    static int BufLen;
X    static char *Buffer = 0;
X    int AtStartOfWord = 1; /* initialised: may be read before a newline sets it */
X    register char *q;
X
X    if (Buffer == 0) {
X	BufLen = 24;
X	Buffer = emalloc(BufLen);
X    }
X
X    q = Buffer;
X    InWord = 0;
X
X    while ((ch = GetChar(InputFile)) != EOF) {
X	if (ch == '\n') {
X	    if (AtStartOfLine) { /* a blank line */
X		putchar('\n');
X		return;
X	    }
X	}
X
X	InWord = InWord ? WithinWord(ch) : StartsWord(ch);
X
X	switch (FirstWord) {
X	case F_NotSeenAnythingYet:
X	    if (InWord) {
X		FirstWord = F_InTheFirstWord;
X		if (q - Buffer >= BufLen - 1) {
X		    int where = q - Buffer;
X
X		    BufLen += 24;
X		    Buffer = erealloc(Buffer, BufLen);
X		    q = &Buffer[where];
X		}
X		*q++ = ch;
X	    } else {
X		if (AtStartOfLine && ch != ' ' && ch != '\t') {
X		    putchar(ch);
X		    return;
X		}
X		putchar(' ');
X	    }
X	    break;
X	case F_InTheFirstWord:
X	    if (InWord) {
X		if (q - Buffer >= BufLen - 1) {
X		    int where = q - Buffer;
X
X		    BufLen += 24;
X		    Buffer = erealloc(Buffer, BufLen);
X		    q = &Buffer[where];
X		}
X		*q++ = ch;
X		break;
X	    } else { /* reached the end of the first word on the line */
X		*q = '\0';
X		/* See if it's a keyword */
X		if ((IgnoreLine = !IsWanted(Buffer)) != 0) {
X		    /* Turn the word into one that won't get indexed,
X		     * so that word counts are unaffected:
X		     * We use qxxxxxxx (any number of x's) for this.
X		     */
X		    for (q = Buffer; *q; q++) {
X			putchar((q == Buffer) ? 'q' : 'x');
X		    }
X		    putchar (ch == '\n' ? '\n' : ' ');
X		} else {
X		    printf("%s%c", Buffer, ch == '\n' ? ch : ' ');
X		}
X		FirstWord = F_AfterTheFirstWord;
X	    }
X	    break;
X	default:
X	    if ((AtStartOfLine = (ch == '\n'))) {
X		IgnoreLine = 0;
X		q = Buffer;
X		FirstWord = F_NotSeenAnythingYet;
X		AtStartOfWord = 1;
X	    }
X	    if (InWord && !IgnoreLine) {
X		putchar(ch);
X	    } else {
X		if (AtStartOfWord && InWord) {
X		    putchar('q');
X		    AtStartOfWord = 0;
X		} else if (InWord) {
X		    putchar('x');
X		} else if (isspace(ch)) {
X		    putchar(ch);
X		} else {
X		    putchar(' ');
X		}
X	    }
X	    if (!InWord) AtStartOfWord = 1;
X	}
X	if ((AtStartOfLine = (ch == '\n'))) {
X	    IgnoreLine = 0;
X	    q = Buffer;
X	    FirstWord = F_NotSeenAnythingYet;
X	    AtStartOfWord = 1;
X	}
X    }
X    if (ch == EOF) {
X	fprintf(stderr, "%s: warning: Mail folder %s has no message body\n",
X			progname, Name);
X    }
X}
X
Xvoid
XBody(InputFile, Name)
X    FILE *InputFile;
X    char *Name;
X{
X    int ch;
X
X    while ((ch = GetChar(InputFile)) != EOF) {
X	if ((InWord = InWord ? WithinWord(ch) : StartsWord(ch)) != 0) {
X	    putchar(ch);
X	} else {
X	    putchar((ch == '\n') ? '\n' : ' ');
X	}
X    }
X}
X
X#ifdef __GNU__
Xinline
X#endif
Xint
XGetChar(fd)
X    FILE *fd;
X{
X    static int LastChar = 0;
X
X    if (LastChar) {
X	int ch = LastChar;
X	LastChar = 0;
X	return ch;
X    }
X
X    /* Only return a single quote if it is surrounded by letters */
X    if ((LastChar = getc(fd)) == '\'') {
X	LastChar = getc(fd);
X	if (InWord && isalpha(LastChar)) return '\'';
X	else return ' ';
X    } else {
X	int ch = LastChar;
X	LastChar = 0;
X	return ch;
X    }
X}
X
X/*
X * $Log:	MailFilter.c,v $
X * Revision 1.5  90/10/06  00:57:24  lee
X * Prepared for first beta release.
X * 
X * Revision 1.4  90/09/20  16:35:40  lee
X * Fixed icstrcmp() and IsWanted() so that the unwanted parts of headers
X * get deleted again.... (oops!)
X * 
X * Revision 1.3  90/09/19  21:11:54  lee
X * Improved end-of-header detection.
X * Now supports turning unindexed stuff into qxxxxx-words.
X * 
X * Revision 1.2  90/08/29  21:55:57  lee
X * Now handles mh mail better.
X * 
X * Revision 1.1  90/08/09  19:17:56  lee
X * Initial revision
X * 
X * Revision 1.2  89/09/16  21:16:01  lee
X * First demonstratable version.
X * 
X * Revision 1.1  89/09/07  21:05:48  lee
X * Initial revision
X * 
X */
@@@End of lq-text/src/filters/MailFilter.c
echo end of part 01
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337