[comp.sources.misc] v09i078: newsclip 1.1, part 9 of 15

brad@looking.ON.CA (Brad Templeton) (12/20/89)

Posting-number: Volume 9, Issue 78
Submitted-by: brad@looking.ON.CA (Brad Templeton)
Archive-name: newsclip/part09

#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of archive 9 (of 15)."
# Contents:  patch/filter.man scanbody.c
# Wrapped by allbery@uunet on Tue Dec 19 20:10:02 1989
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'patch/filter.man' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'patch/filter.man'\"
else
echo shar: Extracting \"'patch/filter.man'\" \(18501 characters\)
sed "s/^X//" >'patch/filter.man' <<'END_OF_FILE'
X.if n .ds La '
X.if n .ds Ra '
X.if t .ds La `
X.if t .ds Ra '
X.if n .ds Lq "
X.if n .ds Rq "
X.if t .ds Lq ``
X.if t .ds Rq ''
X.de Ch
X\\$3\\*(Lq\\$1\\*(Rq\\$2
X..
X.TH FILTER 5 "May 10, 1989"
X.ds ]W News Filter
X.SH NAME
Xfilter \- reader/newsfilter communications protocol.
X.SH SYNOPSIS
XThis document describes version 1.00 of the protocol used to implement
Xnewsreader communication with newsfilter processes. The intent is to support
Xconstruction of newsfilters that can communicate with any standard reader.
X.SH DESCRIPTION
XDuring initialization, newsreaders may attempt to establish communication
Xwith a newsfilter process. If it succeeds, information on each article will
Xbe passed to the newsfilter using a protocol described in this document;
Xthe newsfilter will pass back an `interest score' which the reader may use
Xto determine whether and how the article should be presented.
X.PP
XThe news sources distribution provides C libraries for both the `top' (reader)
Xend, and the `bottom' (filter) end.
X.SH THEORY OF OPERATION
XProtocol execution may be thought of as a series of \fItransactions\fR; in each
Xtransaction, the newsreader sends a command down to the newsfilter, and
Xblocks awaiting a response. Optionally, the reader may time out if a response
Xis not received within some maximum time.
X.PP
XSome transactions group into \fIdialogues\fR; these are logical sequences 
Xof transactions which share state (i.e. affect shared data in the protocol
Xservice routines).
X.PP
XA protocol session consists of a \fIstart dialogue\fR, followed by any
Xnumber of \fIcommand dialogues\fR, terminated by an \fIend dialogue\fR.
XThere are presently two kinds of command dialogue;
X\fInewsgroup dialogues\fR and \fIarticle dialogues\fR. Specifications for
Xeach of these are given below.
X.SH COMMAND AND RESPONSE FORMAT
XAll messages start with a fixed-size \fIheader\fR. Some messages
Xmay be followed by an \fIargument list\fR, the length of which is provided as
Xpart of the header. In a few cases, the argument list may be followed by an
Xadditional variable-length \fItext section\fR.
X.PP
XHere is the format of a header:
X.nf
X	0	Type: `C' = call, `R' = response, `Q' = query, `A' = answer
X	1	A command code letter.
X	2	A space ` '
X	3-8	A 6-digit decimal command sequence number.
X	9	A space ` '
X	10-12	A 3-digit decimal argument list length.
X	13	A terminating 0 (NUL) or newline byte.
X.fi
X.PP
XIn a call, the sequence number is 1 for the first command issued and
Xincreases by one in each following command.
XIn a response, the sequence number field is the number of the command to which
Xthe response pertains.  The newsreader is not required to provide meaningful
Xsequence numbers, but whatever number the newsreader sends will be used in
Xall filter responses to that command.
X.PP
XIf a numeric field is smaller than the full field width,
Xit should be either left justified and
Xspace-filled to the right or 0 filled from the left. The latter form is
Xrecommended and is the one shown in this document's examples.
X.PP
XThe \fIargument list\fR (if any) of a message is interpreted as a
Xsequence of NUL-separated character strings. The
Xlength field in the header must count the terminating zero byte found at
Xthe end of the last argument.  The length of
Xthe argument list is limited to 256 characters counting the NUL bytes.
X.P
XThe \fItext section\fR (in the non-implemented pipe mode) may consist of
Xeither (a) an RFC-822 message
Xheader terminated by a blank line (two consecutive newlines), or (b)
XASCII data.  Both are presented in a special packet format described below.
X.SH STARTUP
XWhile it is possible that this protocol may be implemented using other
Xforms of interprocess communication, this first version of the protocol
Xis expected to be implemented using two nameless pipes.   News filter
Xprograms will take their commands from the master newsreading program
Xby reading from the standard input.  They will give their responses and
Xqueries to the standard output.
X.PP
XA typical newsreader will fork a child news filter process, create pipes
Xto talk to the standard input of the child and read from the standard
Xoutput of the child, and then execute a news filter executable program.
X.PP
XWhile a newsreader may use any means to decide the location of the
Xnews filter program it executes, the standard name is
X.B nclip.
XThe nclip program should be found in the same directory the newsreader
Xuses to keep user files, such as the
X.I .newsrc
Xfile.  The newsreader may also search the directories listed in the
Xuser's PATH environment variable for this executable.
X.PP
XWhen the filter program is executed, it should be executed with the
Xfollowing argument:
X.TP 0
Xmode=pipe
X.PP
XOptionally, if the newsreader has a directory in which it places
Xuser files, it should pass the name of that directory in a second argument
Xof the form:
X.TP 0
Xdot=<dirname>
X.SS INITIAL SEQUENCE
XWhen a news filter starts up operation, it should immediately send
Xa ``response'' with the OK message (see below) to its standard output.  This
Xis not a response to any command, just a response to being executed.
XThis will indicate that the news filter program has started correctly.
X.PP
XNewsreading programs which start up a news filter and do not get
Xthis immediate initial response should assume the filter has failed
Xto start, and act as though it is not present.
X.PP
XIf the OK message is detected, the newsreader should then send a
XVersion command to the news filter and await a Version response.  If
Xall goes well, general operation may then continue.
X.PP
XNews filter programs should be sure to handle signals properly.  For
Xexample, a filter program should probably ignore INT (break) signals, as
Xit is not talking to a terminal but will still be in the newsreader's
Xprocess group (on Unix).
X.PP
XAt the end of a session, the newsreader should send the Quit command
Xto the filter, await an OK, and then terminate or go on with the knowledge
Xthe news filter program is not in operation.
X
X.SH DIALOGUE SPECIFICATIONS
XIn the following specifications, the form of a message is given 
Xas a pseudo-BNF listing the code byte of the header and the meaning of the
Xarguments following. Required and computed format elements such as the leading
Xtype byte, the embedded spaces, the sequence number, the argument list length
Xfield, and various NUL separators should be understood from the request format
Xdescription above.
X.PP
XEach command or response specification is followed by an example. The first
Xline of each example shows a sample header of the given type, and the second
Xline an argument list.
X.SS The `Start' Dialogue
XThe `start dialogue' consists of a single transaction. The reader sends a
X`V' (Version) command and expects a `V' (Version) response. These are defined
Xas follows:
X.TP 0
XCOMMAND: V <version-string>
X.TP 0
X	CV 000001 005\\0
X	V100\\0
X.PP
XThis command can be sent to a newsfilter to establish the newsfilter
Xlanguage protocol understood by both programs.  The newsfilter
Xprogram will respond with a version line of its own, including
Xthe list of valid commands it understands and the list of responses it
Xcan give back.  The newsreader should only send those commands in
Xthe list given; others will produce an `error' response.  All newsfilters must
Xaccept the set of commands listed in this document -- the specification for
XV100 of the newsfilter interface language.
X.TP 0
XRESPONSE: V <version> <commands> <responses> <plang> <pversion>
X.TP 0
X	RV 000001 034\\0
X	V100\\0ABHNPQV\\0ABEHORV\\0newsclip\\0100\\0
X.PP
XThe <version> argument is a Version number for the command
Xlanguage understood by the newsfiltering program.  This language is
Xversion V100.  Later releases will have a higher number.
X.PP
XThe <commands> arg is the set of command codes the newsfilter understands.
XThe <responses> arg is the set of responses that it knows to send back.
X.PP
XThe <plang> argument is the name of the filter language ('P' commands) that
Xthe newsfilter understands, and the <pversion> number is a version number.
XIf the newsfilter does not understand any language for P commands, it should
Xuse the name `NULL' and a version number of 0.
X.PP
XIf the newsreader gets an `error' response to this message, it should assume
Xthat the filter is present but cannot handle the language specified, or has
Xfailed to initialize properly.  The reader may attempt different Version
Xcommands, or may decide to send a Quit command.
X.PP
XDefined names currently are:
X		NULL
X		newsclip
X		rnkill
X.PP
XNames can be registered via email to newsfilters@looking.on.ca
X.SS The `End' Dialogue
XThe \fIend dialogue\fR consists of a single transaction; the reader sends
Xa `Q' (Quit) command and expects an `O' (Ok) response.
X.TP
XCOMMAND: Q
X.TP
X	CQ 000236 000\\0
X.PP
XThe newsfilter program should terminate.  A response of `Ok' is
Xexpected, after which the pipes will close.
X.PP
XClosing the command pipe, causing EOF for the newsfilter, should
Xalso cause the newsfilter to terminate.  If a newsreader detects EOF
Xon the answer pipe, it should assume the newsfilter has terminated and
Xact accordingly, possibly giving an error message to the user.
X.TP 0
XRESPONSE: O
X.TP 0
X	RO 000236 000\\0
X.PP
XThis response confirms that the newsfilter is exiting gracefully.
X.SS The `Program' Dialogue
X.PP
XThe \fIprogram dialogue\fR begins with a `P' (Program) command and ends with
Xone of the responses `O' (Ok) or `E' (Error).
X.TP 0
XCOMMAND: P <command-string>
X.TP 0
X	CP 012321 016\\0
X	kill From: Eric\\0
X.PP
XThe first arg passes free format commands to the newsfilter.  Normally these
Xwill be things to add to the newsfilter's "kill files" or other such
Xcommands.  The format of the commands is entirely up to the newsfilter.
X.P
XThe above command might be a request to kill all articles that include the string
X"Eric" in their From line.  It would be generated by a newsreader that
Xknew how to translate user requests into commands to this particular
Xnewsfilter.
X.PP
XThe newsfilter should interpret the command.  If it is a valid command, it
Xshould execute it and issue an OK response.  If it is not a valid command,
Xit should issue an Error response.  The newsreader may decide to issue
Xan error message because of the error response, or do further
Xanalysis of the command.
X.TP 0
XRESPONSE: O
X.TP 0
X	RO 012321 000\\0
X.PP
XThis response confirms that the argument was accepted as a valid command
Xby the newsfilter program.
X.TP 0
XRESPONSE: E <error-message>
X.TP 0
X	RE 012321 017\\0
X	No such command!\\0
X.PP
XThis response tells the reader the argument was rejected as an invalid command
Xby the newsfilter program.
X.SS The `Newsgroup' Dialogue 
XThe \fINewsgroup dialogue\fR consists of a single transaction; the reader sends
Xan `N' (Newsgroup) command, and expects an Accept, Reject or Ok response.
X.TP 0
XCOMMAND: N <newsgroup>
X.TP 0
X	CN 000005 012\\0
X	news.groups\\0
X.PP
XThis command asks the newsfilter for general information on
Xa newsgroup.  Responses can be `A' (Accept), which indicates that
Xall articles in this group should be accepted without consulting the
Xnewsfilter, `R' (Reject) which means that all articles should be rejected
Xwithout consulting the newsfilter or `O' (Ok - Consult), which means that
Xarticles should be fed to the newsfilter for examination.
X.PP
XNote that even in the case of an `A' or `R', articles in that group
Xmay still be sent for examination.  It is just less efficient to do so.
X.TP 0
XRESPONSE: A <score>
X.TP 0
X	RA 000005 003\\0
X	22\\0
X.PP
XThis response indicates that the article should be accepted. The single
Xargument is an `interest score' computed by the newsfilter; it may be omitted
Xto indicate a value of 1.
X.TP 0
XRESPONSE: R <score>
X.TP 0
X	RR 000005 003\\0
X	-2\\0
X.PP
XThis response indicates that the article should be rejected. The single
Xargument is an `interest score' computed by the newsfilter; it may be omitted
Xto indicate a value of -1.
X.PP
XBy convention, the `R' response carries a zero or negative score and the `A'
Xresponse a positive one.
X.PP
XThe `O' response is as documented for the `Q' (Quit) command above.
X.SS The `Article' Dialogue
XThe \fIarticle dialogue\fR begins with an 'A' (Article) command and ends with
Xone of the responses 'A' (Accept) or `R' (Reject). There may be one or more
Xtransactions in this dialogue.
X.TP 0
XCOMMAND: A <newsgroup> <number> <mode> [<filename>]
X.TP 0
X	CA 000006 048\\0
X	news.groups\\034\\0R\\0/usr/spool/news/news/groups/34\\0
X.PP
XThe first two arguments are a normal newsgroup and article-number pair; if
Xthe newsfilter
Xcan deduce a final interest score from these, it will do so and return accept
Xor reject immediately. Otherwise, the newsfilter can return article information
Xrequests to see portions of the article; see the following description of
Xarticle information exchange. The <filename> argument, if present, is used
Xin resolving the article information request.  The <mode> argument
Xindicates whether the article is present in the file, or must be
Xrequested.
X.PP
XText may be passed down to the newsfilter in one of two modes; \fIpipe mode\fR
Xor \fIfile mode\fR.  The mode is selected by the single-character <mode>
Xargument.  File modes require a file name
Xargument on the reader command that triggered the article information
Xrequest.  For pipe mode, no file name is given and a mode character of
X'P' is provided.  The two file modes use mode characters of
X'F' (full) and 'R' (request).
X.PP
XPipe mode is currently not
Ximplemented in any of the news filtering or newsreading programs using
Xthis protocol.  It is defined for future expansion.
X.PP
XIn \fIpipe mode\fR article portions are passed down in the text sections of
XHeader and Body replies from the newsreader in accordance with text query 
Xsequences started by the newsfilter. Text query sequence protocol is specified
Xbelow.
X.PP
XIn \fIfile mode\fR the article information is passed in the file named.
XThis file may be either the permanent location of the article, or a tempfile.
X.PP
XIn the former case, it is likely (but not necessary) that the 'F' (full)
Xfile mode will be used.  In this case, the entire article is already present
Xin the file, and no further queries should be issued by the newsfilter.
XThe only acceptable responses to a full file mode Article command are
XAccept, Reject and Error.
X.PP
XIn the latter case of request mode, the newsfilter should issue text queries
Xbefore attempting
Xto read first the header, and later the body of the article.  After
Xissuing such queries, the filter should wait for a response from the
Xnewsreader before reading into the file.
X.PP
XA text query sequence consists of a series of 'H' (Header) and 'B' (Body)
Xrequests sent by the newsfilter to the newsreader, and corresponding responses
Xby the newsreader.
X.TP 0
XQUERY: H
X.TP 0
X	QH 000340 000\\0
X.PP
XThis query requests the RFC-822 header of the article selected by a previous
X`A' command, or of the article selected by the newsreader at the time of
Xissuance of a 'P' command.
X.TP 0
XANSWER: H [<size> [<asize>]]
X.TP 0
X	AH 000340 004\\0
X	364\\0
X.PP
XThis answer signifies that the newsreader has header data ready in response
Xto a previous 'H' query. The optional <size> argument specifies the total
Xlength of the header.  The body of the article may exist beyond the header.
XFilters should not assume they will read EOF at the end of a header.
X.PP
XIn pipe mode, this answer is followed immediately by a text section in RFC-822
Xformat (see above), using message packet format. 
XMessage packets consist of a length byte, followed
Xby 0 to 255 text data bytes.  A 0 length packet indicates the end of the
Xheader, which must be preceded by a blank line.
X.PP
XThe optional <asize> is the size of the entire article, but only if the
Xnewsreader has it handy.
X.TP 0
XQUERY: B [<size>]
X.TP 0
X	QB 000341 004\\0
X	125\\0
X.PP
XThis query asks the reader to send down all or a portion of the body of the
Xcurrent article.   The filter may optionally communicate the most it wants
Xof the article with the <size> argument.  If this is present, the newsreader
Xneed not transmit or place more than <size> bytes of the body.  The
Xnewsreader is still always free to send the entire body -- this is merely
Xan optimization.  If the <size> argument is not present, the entire body
Xmust be made available.
X.TP 0
XANSWER: B [<size>]
X.TP 0
X	AB 000341 004\\0
X	125\\0
X.PP
XThe 'B' response signifies that the newsreader has text data ready for the
Xnewsfilter.  The argument optionally gives the length of the data in bytes. This
Xlength may be smaller or larger than the requested length.  It will only
Xbe smaller if the article body itself is smaller than the requested length.
X.PP
XNote that a newsreader can use request mode even if it always has
Xcomplete article files ready for the filter.  It should merely respond
Xto queries immediately, doing nothing.  The 'F' (full) mode simply allows
Xa reader to be sure it will never receive queries.  This allows very
Xsimple reader implementations of this protocol.
X.PP
XNNTP readers and other readers that do not have access to single
Xarticle files should use the request mode, building a temporary file
Xfor the filter as requested.
X.PP
XIn pipe mode, this answer is followed immediately by a text section in
Xmessage packet format.   Message packets consist of a length byte, followed
Xby 0 to 255 text data bytes.  A 0 length packet indicates EOF.
X.SH NOTES
XThe protocol is designed for use over the most primitive IPC common on UNIX,
Xa pair of nameless pipes. Some older UNIXes (V7 in particular) feature pipe
Ximplementations that behave rather badly (as in, cause a lockup or sudden
Xprocess death) if reads and writes are not carefully synchronized. Thus the
Xrigid alternation of fixed-size with variable-size transmissions, and the
Xcare in specifying lengths of variable parts in fixed-part blocks.
X.PP
XUnder a more forgiving IPC implementation (such as System V message queues),
Xthe fixed and variable-length parts might be sent in one transmission; this
Xis an implementation detail left up to service libraries.
X.PP
XThe all-ASCII format avoids potential alignment problems.
X.PP
XIt is expected that the protocol service libraries will automatically choose
Xpipe or file mode for text query sequence, depending on whether the calling
Xnewsreader browses a file hierarchy or talks to some sort of network daemon.
X.PP
XThe 'F' (full) file mode is the simplest mode, intended for use on
Xsystems where the article files reside in normal format on the user's
Xmachine.  A newsreader can be adapted to this mode of operation with
Xminimal changes.
X.SH AUTHORS
XThis protocol was developed by Brad Templeton and Eric Raymond.
END_OF_FILE
if test 18501 -ne `wc -c <'patch/filter.man'`; then
    echo shar: \"'patch/filter.man'\" unpacked with wrong size!
fi
# end of 'patch/filter.man'
fi
if test -f 'scanbody.c' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'scanbody.c'\"
else
echo shar: Extracting \"'scanbody.c'\" \(19691 characters\)
sed "s/^X//" >'scanbody.c' <<'END_OF_FILE'
X/*
X * scanbody.c
X *
X * Code to perform the scanning of the body of news articles, searching for
X * precompiled regular expressions, and collecting facts about the article,
X * such as number of lines in the body, number of quoted lines, size of
X * the signature, and so on.
X *
X * The main body of the article is read into a chunk of temporary memory,
X * in a completely raw, unformatted manner.  This chunk of text is then
X * massaged according to the settings of some global flags, and data
X * structures indicating lines, paragraphs, and specific areas of the
X * article are constructed.
X */
X
X /*
X  * Newsclip(TM) Library Source Code.
X  * Copyright 1989 Looking Glass Software Limited.  All Rights Reserved.
X  * Unless otherwise licenced, the only authorized use of this source
X  * code is compilation into a binary of the newsclip library for the
X  * use of licenced Newsclip customers.  Minor source code modifications
X  * are allowed.
X  * Use of this code for a short term evaluation of the product, as defined
X  * in the associated file, 'Licence', is permitted.
X  */
X
X#include "common.h"
X#include "body.h"
X#include "rei.h"
X
X#define talloc		temp_alloc	/* Some convenient abbreviations. */
X#define palloc		perm_alloc
X
Xextern char *temp_alloc AC(( int ));
Xextern char *perm_alloc AC(( int ));
Xextern FILE *get_body_desc AC(( int ));
X
Xextern int preserve_case;	/* If set: do not map text to lower case */
Xextern int white_compress;	/* If set: strip leading/trailing whitespace */
Xextern int paragraph_scan;	/* If set: treat paragraphs as the basic unit */
Xextern char *TREBuff;		/* Temporary buffer for compiled temp REs */
X
X/*
X * We define the following terms:
X * 
X * line -- a line as it was originally typed in the article, and as such,
X * 	terminated with a newline.
X * 
X * area -- a particular subset of the article lines.  The recognized areas
X * 	currently are: 
X * 		"signature"
X * 		"text"
X *		"included" (text lines starting with "include_prefix")
X * 		"newtext" (original text only -- not include-prefixed)
X * 		"body" (text and signature)
X * 
X * unit -- an ASCIIZ string of text, with newlines removed.  A unit is the
X * 	basic search unit for R.E.s, and currently can be either
X * 		"lines" -- corresponding to the true lines of the article area,
X *  	 or
X * 		"paragraphs" -- multiple lines joined without newlines
X *		 according to the rules
X * 		   i) New paragraphs commence at the beginning of a new area.
X * 		  ii) New paragraphs commence after an empty line.
X * 		 iii) New paragraphs commence when the indentation of the
X * 			current line exceeds the indentation level of the
X * 			previous line.
X * 
X * Basically, we look through the article a byte at a time, looking for
X * areas.  Once an area is identified, then the area is divided up into
X * units.  Hopefully, all this can be done with a minimum of movement of
X * bytes and characters.
X * 
X */ 
X
Xarea_type *Article;		/* Ptr to area-formatted version of article */
Xarea_type *RawText;		/* Ptr to raw text version of article */
X
Xlong ArticleStats[2][AS_ARR_SIZE]; /* Storage for various article statistics. */
X
X/* A lot of static variables are used to maintain information about
X * the current state of reading of the article body.  These permit the
X * article body to be read in an incremental manner. */
X
Xstatic int	FirstLine;	/* First complete line of article parsed */
Xstatic int	LastLine;	/* Last complete line of article parsed */
Xstatic int	PfxScan;	/* Is line identification to be done? */
Xstatic int	IWhiteCompress;	/* Internal version of white_compress flag */
Xstatic rxp_type InclRxp = (rxp_type) NULL; /* Compiled RE to id incl. lines */
Xstatic rxp_type SignRxp = (rxp_type) NULL; /* Compiled RE to id sig. lines */
X
X/* Variables used to store parsing status information over buffer breaks. */
X
Xstatic area_type *Ap;		/* Save: current area. */
Xstatic int	  Lind;		/* Save: indent level of last complete line. */
Xstatic int	  Ltyp;		/* Save: type of the last complete line. */
Xstatic u_list	 *Lul;		/* Save: last u_list of current area. */
Xstatic char	 *Bptr;		/* Save: pointer into current buffer. */
Xstatic unsigned int Blen;	/* Save: remaining bytes in current buffer. */
Xstatic int 	InSignature;	/* Are we in the signature of the article? */
Xstatic int	ArticleParsed;	/* Indicates if article has been parsed. */
Xstatic int	StatsDone;	/* Indicates if all article stats are done. */
Xstatic char	LineSplit;	/* Was long line split? (warning ctl only) */
X
Xstatic FILE *Fptr;		/* File pointer for current article. */
Xint ArticleEoF;			/* Article end-of-file indicator. */
X
Xstatic u_list 	   *flush_lines AC(( area_type *, u_list *, char **, int ));
Xstatic unsigned int block_read AC(( char ** ));
Xstatic char 	   *copy_line AC(( char *, char * ));
X
X/* prepare_body() is called once per article, and resets the
X * variables of interest to the article body parsing routines. */
X
Xvoid
Xprepare_body( io_mode )
Xint io_mode;
X{
X	FirstLine = LastLine = Blen = Lind = 0;
X	Ltyp = LT_NONE;
X	LineSplit = StatsDone = ArticleParsed = InSignature = ArticleEoF = 0;
X	Fptr = (FILE *) NULL;
X
X	Ap = Article = (area_type *) NULL;
X	RawText = (area_type *) talloc( sizeof(area_type) );
X	zero( RawText, sizeof(area_type) );
X	RawText->txt_typ = LT_BODY;
X	Lul = (u_list *) NULL;
X
X	TREBuff = (char *) NULL;
X
X	zero( ArticleStats, sizeof(ArticleStats) );
X
X	/* Set the internal version of the whitespace compression flag
X	 * on the basis of whether or not paragraph scanning mode is on. */
X
X	IWhiteCompress = paragraph_scan ? 1 : white_compress;
X
X	PfxScan = InclRxp || SignRxp;
X}
X
X/* set_include_prefix() sets the include prefix, used for determining which
X * lines of the article are included, to the given string argument. */
X
Xvoid
Xset_include_prefix( user_pfx )
Xchar *user_pfx;
X{
X	char *inc_pfx;
X
X	if( InclRxp )
X		/* Free old compiled version of the include prefix string */
X		perm_free( InclRxp );
X
X	if( '^' == *user_pfx ) {
X		inc_pfx = user_pfx;
X		}
X	else {
X		/* Need to make a copy of the user's string in order
X		 * to insert the left-end anchor character. */
X		inc_pfx = talloc( strlen( user_pfx ) + 2 );
X		inc_pfx[0] = '^';
X		strcpy( inc_pfx + 1, user_pfx );
X		}
X
X	InclRxp = *user_pfx ? REG_COMP_P( inc_pfx ) : (rxp_type) NULL;
X
X	/* Update the prefix scanning flag to indicate whether or not
X	 * scanning is to be done for included or signature lines. */
X
X	PfxScan = InclRxp || SignRxp;
X}
X
X/* set_signature_start() sets the signature pattern, used for determining where
X * the signature of the article starts, to the given string argument. */
X
Xvoid
Xset_signature_start( sig_strt )
Xchar *sig_strt;
X{
X	if( SignRxp )
X		/* Free old compiled version of the signature start pattern */
X		perm_free( SignRxp );
X
X	SignRxp = *sig_strt ? REG_COMP_P( sig_strt ) : (rxp_type) NULL;
X
X	/* Update the prefix scanning flag to indicate whether or not
X	 * scanning is to be done for included or signature lines. */
X
X	PfxScan = InclRxp || SignRxp;
X}
X
X/* read_body() performs the first-level analysis of the article body, by
X * reading the article and counting bytes and lines.  Lines are stored
X * individually as ASCIIZ strings in preparation for any necessary
X * higher-level processing, such as article section parsing (handled by
X * parse_body() below). */
X
Xvoid
Xread_body( start, end )
Xunsigned int start, end;
X{
X	register char *nbptr;	/* Updated pointer into buffer */
X	register int nblen;	/* Updated size of buffer */
X	char *lps;		/* Ptr to start of the current line. */
X	char *lpe;		/* Ptr to the end of the current line. */
X	int len;		/* Length of the current line. */
X	char *lptr[ULB_SIZE];	/* Temporary storage for located line ptrs. */
X	short lidx = 0;		/* Index into lptr array. */
X
X	if( !Fptr ) {
X		/* If incremental reading is ever implemented, the
X		 * argument to get_body_desc() will need to be computed
X		 * intelligently, probably based somehow on end. */
X		Fptr = get_body_desc( MAXINT );
X		}
X
X	while( end > LastLine ) {
X
X		if( !Blen ) {
X			/* The current block has been completely scanned;
X			 * if EoF hasn't been reached, attempt to read
X			 * another block.  If nothing further is read,
X			 * then EoF has been encountered, and the reading
X			 * process is complete. */
X
X			if( ArticleEoF || !(Blen = block_read( &Bptr )) ) {
X				LastLine = MAXINT;
X				break;
X				}
X			}
X
X		lps = nbptr = Bptr;
X		if( preserve_case ) {
X			for( nblen = Blen; nblen-- && *nbptr++ != '\n'; )
X				;
X			}
X		else {
X			for( nblen = Blen; nblen-- && *nbptr != '\n'; nbptr++ )
X				*nbptr = lowcase( *nbptr );
X			nbptr++;
X			}
X
X		if( nblen < 0 ) {
X			/* End of buffer reached; it is necessary to read
X			 * another block from the article and continue
X			 * the search for the end of this line. */
X			register char *tptr;		
X			int tlen, spl;
X			char *addressptr;
X
X			if( !ArticleEoF ) {
X				nblen = block_read( &addressptr );
X				nbptr = addressptr;
X				}
X			else {
X				/* Hmm.  End of file has been encountered, but
X				 * the end of the current line has not.  Handle
X				 * this by just faking in the final newline. */
X				warning(2, "premature end-of-file encountered");
X				nblen = 1;
X				nbptr = "\n";
X				}
X
X			tlen = 0;
X			if( preserve_case ) {
X				for( tptr = nbptr; tlen++ <= nblen &&
X							 *tptr++ != '\n'; )
X					;
X				}
X			else {
X				for( tptr = nbptr; tlen++ <= nblen &&
X						  *tptr != '\n'; tptr++ )
X					*tptr = lowcase( *tptr );
X				tptr++;
X				}
X
X			if( tlen > nblen ) {
X				/* Hmm.  It appears that the current line
X				 * spans more than one full buffer.  Sorry,
X				 * but the line will have to be split. */
X				if( !LineSplit ) {
X					warning( 2,
X					 "line(s) too long -- line(s) split" );
X					LineSplit = 1;
X					}
X				
X				tlen = nblen;	/* Truncate line length. */
X				spl = 1;	/* Byte for added "newline". */
X				}
X			else {
X				spl = 0;	/* Extra byte not required. */
X				}
X
X			lps = talloc( (Blen + tlen + spl)*sizeof(char) );
X			memcpy( lps, Bptr, Blen );
X			memcpy( lps + Blen, nbptr, tlen );
X			len = Blen + tlen + spl - 1;
X			lpe = lps + len;
X			Bptr = tptr;
X			Blen = nblen - tlen;
X			}
X		else {
X			Bptr = nbptr;
X			Blen = nblen;
X			lpe = Bptr - 1;
X			len = 1 + (lpe - lps);
X			}
X
X		LastLine++;		/* We've found one more line... */
X		ArticleStats[ID_LINES][LT_BODY]++;
X		ArticleStats[ID_BYTES][LT_BODY] += len;
X
X		/* Insert the line into the current area. */
X		
X		lptr[lidx++] = lps;  /* Stash ptr to start of line. */
X		*lpe = '\0';	     /* Stomp null over end newline. */
X		RawText->size += len; /* Add line length to area size. */
X
X		if( lidx >= ULB_SIZE ) {
X			/* Line buffer is full; allocate "permanent"
X			 * storage and copy ptrs to the new buffer. */
X
X			Lul = flush_lines( RawText, Lul, &lptr[0], lidx );
X			lidx = 0;
X			}
X		}
X
X	if( lidx ) 
X		Lul = flush_lines( RawText, Lul, &lptr[0], lidx );
X}
X
X/* parse_body() divides the [already-read] article body into its component
X * subsections of included text, new text, and signature text, and updates
X * statistics on the various areas.  The first level of paragraphing is also
X * performed here -- lines are associated into paragraphs. */
X
Xvoid
Xparse_body( start, end )
Xunsigned int start, end;
X{
X	char *lps;
X	int ind, len, j;
X	char *lptr[ULB_SIZE];	/* Temporary storage for located line ptrs. */
X	short lidx = 0;		/* Index into lptr array. */
X	u_list *ul;
X	int typ;
X	area_type *nap;
X
X	if( ArticleParsed )
X		return;
X
X	if( !ArticleEoF )
X		/* Ensure that the entire article body has been read. */
X		read_body( 1, MAXINT );
X
X	for( ul = RawText->list; ul; ul = ul->next ) 
X	    for( j = 0; j < ul->size; j++ ) {
X
X		typ = IDLine( ul->u_txt[j], &lps, &ind, &len );
X
X		if( InSignature ) {
X			/* If "signature_start" has already been seen, then
X			 * all subsequent lines are really signature lines. */
X			typ = LT_SIGNATURE;
X			}
X		else if( typ == LT_SIGNATURE ) {
X			/* The signature line has just been seen.  Set the
X			 * flag to indicate that all subsequent lines are
X			 * part of the article signature, and set PfxScan
X			 * to zero so that no further line identification
X			 * will be attempted. */
X			PfxScan = 0;
X			InSignature++;
X			}
X
X		ArticleStats[ID_LINES][typ]++;
X		ArticleStats[ID_BYTES][typ] += 1 + len;
X
X		if( (typ != Ltyp) || (paragraph_scan && (ind > Lind)) ) {
X			/* A new paragraph or text area has been encountered. */
X
X			if( Ltyp != LT_NONE && lidx ) {
X				/* Flush accumulated lines into old area. */
X				(void) flush_lines( Ap, Lul, &lptr[0], lidx );
X				lidx = 0;
X				}
X
X			Lul = (u_list *) NULL;
X
X			if( !Ap || Ap->list ) {
X				/* Allocate new area/paragraph encountered, and
X				 * link it onto the list of article areas. */
X				nap = (area_type *) talloc( sizeof(area_type) );
X				if( Article )
X					Ap = Ap->next = nap;
X				else
X					Ap = Article = nap;
X				}
X
X			zero( Ap, sizeof(area_type) );
X			Ltyp = Ap->txt_typ = typ;
X			}
X
X		if( len ) {
X			/* Insert the line into the current area, if it
X			 * isn't an empty line. */
X		
X			lptr[lidx++] = lps;  /* Stash ptr to start of line. */
X			Ap->size += len + 1; /* Add line length to area size. */
X			}
X		else {
X			/* Set indentation to ensure next
X			 * line will start a new paragraph. */
X			ind = -1;
X			}
X
X		Lind = ind;		/* Save the level of line indent. */
X
X		if( lidx >= ULB_SIZE ) {
X			/* Line buffer is full; allocate "permanent"
X			 * storage and copy ptrs to the new buffer. */
X
X			Lul = flush_lines( Ap, Lul, &lptr[0], lidx );
X			lidx = 0;
X			}
X	    }
X
X	if( lidx ) 
X		Lul = flush_lines( Ap, Lul, &lptr[0], lidx );
X
X	ArticleParsed = TRUE;
X}
X
X/* init_stats() ensures that the article statistics table is up-to-date,
X * by performing whatever level of analysis is required to obtain the
X * statistics for the specified article region. */
X
Xvoid
Xinit_stats( statid )
Xint statid;
X{
X	if( statid == LT_BODY )
X		read_body( 1, MAXINT );
X	else if( !StatsDone ) {
X		if( !ArticleParsed )
X			parse_body( 1, MAXINT );
X		ArticleStats[ID_LINES][LT_TEXT] = 
X			ArticleStats[ID_LINES][LT_NEWTEXT] +
X			ArticleStats[ID_LINES][LT_INCLUDED];
X		ArticleStats[ID_BYTES][LT_TEXT] = 
X			ArticleStats[ID_BYTES][LT_NEWTEXT] +
X			ArticleStats[ID_BYTES][LT_INCLUDED];
X		StatsDone++;
X		}
X}
X
X/* IDLine() "parses" the next line starting from rptr, which points to
X * the line to be parsed, as an ASCIIZ string.  Pointers to the actual
X * start of the string (white-space and/or include_prefix trimmed), the
X * level of indentation and the line length are returned; the "type" of
X * the line -- included, signature or newtext -- is returned explicitly. */
X
Xstatic int
XIDLine( rptr, start, indent, length )
Xchar *rptr;				/* Start of unidentified text line */
Xchar **start;	 			/* Returned ptr to line start. */
Xint *indent, *length;			/* Returned line indent, length */
X{
X	register char *ptr = rptr;	/* Line scanning pointer */
X	int id = 0;			/* Level of line indentation */
X	int ltype;			/* Area to which the line belongs */
X	int llen = 0;			/* Line length computed. */
X	char *eptr;			/* Ptr to last character in line. */
X
X	/* The first task is to determine the level
X	 * of indentation of this particular line. */
X
X	while( *ptr ) {
X		if( *ptr == ' ' )
X			id++;
X		else if( *ptr == '\t' )
X			id = (id + 8) - (id % 8);  /* Advance to next tab stop. */
X		else if( *ptr == '\f' ) {
X			id = MAXINT;
X			break;
X			}
X		else
X			break;
X		ptr++;
X		}
X
X	/* Indicate where the line starts (either stripped or actual). */
X	*start = IWhiteCompress ? ptr : rptr;
X
X	/* The next task is to identify the area with which the line is
X	 * associated.  If it is known that both of the include_prefix and
X	 * signature_start strings are empty, then the line type is
X	 * automatically assigned as NEWTEXT. */
X
X	if( PfxScan ) {
X		/* Look for a match on "include_prefix" */
X	
X		/* Search starts after whitespace is skipped, always. */
X		if( InclRxp && REG_EXEC( InclRxp, ptr ) ) {
X			ltype = LT_INCLUDED;
X			if( paragraph_scan ) {
X				*start = InclRxp->endp[0];
X				}
X			}
X		/* Search starts at true beginning of line, always. */
X		else if( SignRxp && REG_EXEC( SignRxp, rptr ) )
X			ltype = LT_SIGNATURE;
X		else
X			ltype = LT_NEWTEXT;
X		}
X	else {
X		ltype = LT_NEWTEXT;
X		}
X
X
X	llen = strlen( *start );
X	if( IWhiteCompress && llen ) {
X		/* If IWhiteCompress is set, then ensure that ptr is updated
X		 * to point at the end of the line in preparation for the
X		 * trimming of any trailing whitespace from the line. */
X		while( *ptr )
X			ptr++;
X		eptr = ptr;
X		/* Scan back over any end-of-line whitespace */
X		for( ; llen && isspace(*(eptr-1)); llen--, eptr-- )
X			;
X		}
X
X	/* Line successfully scanned... update the return
X	 * values and explicitly return the line type. */
X
X	*length = llen;
X	*indent = id;
X
X	return( ltype );
X}
X
X/* flush_lines() just copies accumulated pointers into "permanent" storage. */
X
Xstatic
Xu_list *
Xflush_lines( ap, lul, lptr, size )
Xarea_type *ap;				/* Area to which the lines belong. */
Xu_list *lul;				/* Last unit belonging to area. */
Xchar **lptr;				/* Pointer to the buffered lines. */
Xint size;				/* Number of lines in buffer */
X{
X	u_list *ul;
X
X	if( !size ) {
X		/* Return immediately if there are no buffered lines. */
X		return( lul );
X		}
X
X	/* Fetch memory to hold the line list, and fill in the fields. */
X	ul = (u_list *) talloc( sizeof(u_list) + (size-1)*sizeof(char *) );
X	ul->size = size;
X	ul->next = (u_list *) NULL;
X
X	/* Copy the buffered pointers to the allocated memory. */
X	memcpy( &ul->u_txt[0], lptr, size*sizeof(char *) );
X	if( lul )
X		lul->next = ul;	/* Extend list of lines */
X	else
X		ap->list = ul;	/* First list of lines */
X
X	return( ul );
X}
X
X/* block_read() obtains a block of article text at a time, for analysis. */
X
Xstatic
Xunsigned int
Xblock_read( ptr )
Xchar **ptr;
X{
X	unsigned int len;
X
X	*ptr = talloc( BLOCK_SIZE*sizeof(char) );
X
X	len = fread( *ptr, sizeof(char), BLOCK_SIZE, Fptr );
X
X	/* A short read may be a genuine error rather than end-of-file. */
X	if( len < BLOCK_SIZE && ferror( Fptr ) )
X		error( "file error during read" );
X
X	if( len < BLOCK_SIZE )
X		ArticleEoF++;
X
X	return( len );
X}
X
X/* paragraphize() copies the lines of the given area into a consecutive
X * ASCIIZ string, suitable for regular expression searches in paragraph
X * scanning mode. */
X
Xvoid
Xparagraphize( ap )
Xarea_type *ap;
X{
X	register char *wptr;
X	register u_list *ul;
X	register int l;
X
X	wptr = ap->para = talloc( (ap->size + 1)*sizeof(char) );
X	for( ul = ap->list; ul; ul = ul->next ) 
X		for( l = 0; l < ul->size; l++ )
X			wptr = copy_line( wptr, ul->u_txt[l] );
X
X	if( ap->para != wptr )
X		/* Remove the unnecessary trailing blank. */
X		wptr--;
X
X	*wptr = '\0';
X}
X
X/* copy_line() copies characters starting from rptr to wptr, compressing
X * internal whitespace regardless of the setting of the IWhiteCompress
X * flag.  It is known that having paragraph_scan set implies that
X * IWhiteCompress is also set. */
X
Xstatic char *
Xcopy_line( wptr, rptr )
Xregister char *wptr;		/* Ptr to target memory for the line copy. */
Xregister char *rptr;		/* Ptr to the line to copy. */
X{
X	char *origptr = wptr;
X
X	while( *rptr && isspace(*rptr) )
X		rptr++;
X
X	while( *rptr ) {
X		while( *rptr && !isspace(*rptr) )
X			*wptr++ = *rptr++;
X		while( *rptr && isspace( *rptr ) )
X			rptr++;
X		if( *rptr ) {
X			*wptr++ = ' ';
X			}
X		}
X
X	if( origptr != wptr ) {
X		/* Add a trailing space only if we moved the write pointer. */
X		*wptr++ = ' ';
X		}
X
X	return( wptr );
X}
X
X#ifdef DEBUG
X
Xchar *AreaNames[] = {
X	"LT_NONE",
X	"LT_SIGNATURE",
X	"LT_INCLUDED",
X	"** illegal **",
X	"LT_NEWTEXT",
X	"** illegal **",
X	"** illegal **",
X	"LT_BODY"
X	};
X
Xvoid
Xdump_body()
X{
X	area_type *ap;
X	u_list *ul;
X	int i = 1, j;
X
X	for( ap = Article ? Article : RawText; ap; ap = ap->next, i++ ) {
X		printf("{Paragraph %d (type %s)}\n",i,AreaNames[ap->txt_typ]);
X		if( !paragraph_scan ) {
X			for( ul = ap->list; ul; ul = ul->next ) 
X				for( j = 0; j < ul->size; j++ )
X					printf( ">%s<\n", ul->u_txt[j] );
X					/* puts( ul->u_txt[j] ); */
X			}
X		else {
X			if( !ap->para )
X				paragraphize( ap );
X			printf( ">%s<\n", ap->para );
X			/* puts( ap->para ); */
X			}
X		}
X}
X
X#endif /*DEBUG*/
END_OF_FILE
if test 19691 -ne `wc -c <'scanbody.c'`; then
    echo shar: \"'scanbody.c'\" unpacked with wrong size!
fi
# end of 'scanbody.c'
fi
echo shar: End of archive 9 \(of 15\).
cp /dev/null ark9isdone
MISSING=""
for I in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ; do
    if test ! -f ark${I}isdone ; then
	MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have unpacked all 15 archives.
    rm -f ark[1-9]isdone ark[1-9][0-9]isdone
else
    echo You still need to unpack the following archives:
    echo "        " ${MISSING}
fi
##  End of shell archive.
exit 0