brad@looking.ON.CA (Brad Templeton) (12/20/89)
Posting-number: Volume 9, Issue 78 Submitted-by: brad@looking.ON.CA (Brad Templeton) Archive-name: newsclip/part09 #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of archive 9 (of 15)." # Contents: patch/filter.man scanbody.c # Wrapped by allbery@uunet on Tue Dec 19 20:10:02 1989 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'patch/filter.man' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'patch/filter.man'\" else echo shar: Extracting \"'patch/filter.man'\" \(18501 characters\) sed "s/^X//" >'patch/filter.man' <<'END_OF_FILE' X.if n .ds La ' X.if n .ds Ra ' X.if t .ds La ` X.if t .ds Ra ' X.if n .ds Lq " X.if n .ds Rq " X.if t .ds Lq `` X.if t .ds Rq '' X.de Ch X\\$3\\*(Lq\\$1\\*(Rq\\$2 X.. X.TH FILTER 5 "May 10, 1989" X.ds ]W News Filter X.SH NAME Xfilter \- reader/newsfilter communications protocol. X.SH SYNOPSIS XThis document describes version 1.00 of the protocol used to implement Xnewsreader communication with newsfilter processes. The intent is to support Xconstruction of newsfilters that can communicate with any standard reader. X.SH DESCRIPTION XDuring initialization, newsreaders may attempt to establish communication Xwith a newsfilter process. If it succeeds, information on each article will Xbe passed to the newsfilter using a protocol described in this document; Xthe newsfilter will pass back an `interest score' which the reader may use Xto determine whether and how the article should be presented. X.PP XThe news sources distribution provides C libraries for both the `top' (reader) Xend, and the `bottom' (filter) end. 
X.SH THEORY OF OPERATION XProtocol execution may be thought of as a series of \fItransactions\fR; in each Xtransaction, the newsreader sends a command down to the newsfilter, and Xblocks waiting for a response. Optionally, the reader may time out if a response Xis not received within some maximum time. X.PP XSome transactions group into \fIdialogues\fR; these are logical sequences Xof transactions which share state (i.e. affect shared data in the protocol Xservice routines). X.PP XA protocol session consists of a \fIstart dialogue\fR, followed by any Xnumber of \fIcommand dialogues\fR, terminated by an \fIend dialogue\fR. XThere are presently two kinds of command dialogue; X\fInewsgroup dialogues\fR and \fIarticle dialogues\fR. Specifications for Xeach of these are given below. X.SH COMMAND AND RESPONSE FORMAT XAll messages start with a fixed-size \fIheader\fR. Some messages Xmay be followed by an \fIargument list\fR, the length of which is provided as Xpart of the header. In a few cases, the argument list may be followed by an Xadditional variable-length \fItext section\fR. X.PP XHere is the format of a header: X.nf X 0 Type: `C' = call, `R' = response, `Q' = query, `A' = answer X 1 A command code letter. X 2 A space ` ' X 3-8 A 6-digit decimal command sequence number. X 9 A space ` ' X 10-12 A 3-digit decimal argument list length. X 13 A terminating 0 (NUL) or newline byte. X.fi X.PP XIn a call, the sequence number is 1 for the first command issued and Xincreases by one in each following command. XIn a response, the sequence number field is the number of the command to which Xthe response pertains. The newsreader is not required to provide meaningful Xsequence numbers, but whatever number the newsreader sends will be used in Xall filter responses to that command. X.PP XIf a numeric field is smaller than the full field width, Xit should be either left justified and Xspace-filled to the right or 0 filled from the left. 
The latter form is Xrecommended and is the one shown in this document's examples. X.PP XThe \fIargument list\fR (if any) of a message is interpreted as a Xsequence of NUL-separated character strings. The Xlength field in the header must count the terminating zero byte found at Xthe end of the last argument. The length of Xthe argument list is limited to 256 characters counting the NUL bytes. X.P XThe \fItext section\fR (in the non-implemented pipe mode) may consist of Xeither (a) an RFC-822 message Xheader terminated by two carriage returns (a blank line), or (b) XASCII data. Both are presented in a special packet format described below. X.SH STARTUP XWhile it is possible that this protocol may be implemented using other Xforms of interprocess communication, this first version of the protocol Xis expected to be implemented using two nameless pipes. News filter Xprograms will take their commands from the master newsreading program Xby reading from the standard input. They will give their responses and Xqueries to the standard output. X.PP XA typical newsreader will fork a child news filter process, create pipes Xto talk to the standard input of the child and read from the standard Xoutput of the child, and then execute a news filter executable program. X.PP XWhile a newsreader may use any means to decide the location of the Xnews filter program it executes, the standard name is X.B nclip. XThe nclip program should be found in the same directory the newsreader Xuses to keep user files, such as the X.B .newsrc Xfile. The newsreader may also search the directories listed in the Xuser's PATH environment variable for this executable. 
X.PP XWhen the filter program is executed, it should be executed with the Xfollowing argument: X.TP 0 Xmode=pipe X.PP XOptionally, if the newsreader has a directory in which it places Xuser files, it should pass the name of that directory in a second argument Xof the form: X.TP 0 Xdot=<dirname> X.SS INITIAL SEQUENCE XWhen a news filter starts up operation, it should immediately send Xa ``response'' with the OK message (see below) to its standard output. This Xis not a response to any command, just a response to being executed. XThis will indicate that the news filter program has started correctly. X.PP XNewsreading programs which start up a news filter and do not get Xthis immediate initial response should assume the filter has failed Xto start, and act as though it is not present. X.PP XIf the OK message is detected, the newsreader should then send a XVersion command to the news filter and await a Version response. If Xall goes well, general operation may then continue. X.PP XNews filter programs should be sure to handle signals properly. For Xexample, a filter program should probably ignore INT (break) signals, as Xit is not talking to a terminal but will still be in the newsreader's Xprocess group (on Unix). X.PP XAt the end of a session, the newsreader should send the Quit command Xto the filter, await an OK, and then terminate or go on with the knowledge Xthe news filter program is not in operation. X X.SH DIALOGUE SPECIFICATIONS XIn the following specifications, the form of a message is given Xas a pseudo-BNF listing the code byte of the header and the meaning of the Xarguments following. Required and computed format elements such as the leading Xtype byte, the embedded spaces, the sequence number, the argument list length Xfield, and various NUL separators should be understood from the request format Xdescription above. X.PP XEach command or response specification is followed by an example. 
The first Xline of each example shows a sample header of the given type, and the second Xline an argument list. X.SS The `Start' Dialogue XThe `start dialogue' consists of a single transaction. The reader sends a X`V' (Version) command and expects a `V' (Version) response. These are defined Xas follows: X.TP 0 XCOMMAND: V <version-string> X.TP 0 X CV 000001 005\\0 X V100\\0 X.PP XThis command can be sent to a newsfilter to establish the newsfilter Xlanguage protocol understood by both programs. The newsfilter Xprogram will respond with a version line of its own, including Xthe list of valid commands it understands and the list of responses it Xcan give back. The newsreader should only send those commands in Xthe list given; others will produce an `error' response. All newsfilters must Xaccept the set of commands listed in this document -- the specification for XV100 of the newsfilter interface language. X.TP 0 XRESPONSE: V <version> <commands> <responses> <plang> <pversion> X.TP 0 X RV 000001 034\\0 X V100\\0ABHNPQV\\0ABEHORV\\0newsclip\\0100\\0 X.PP XThe <version> argument is a Version number for the command Xlanguage understood by the newsfiltering program. This language is Xversion V100. Later releases will have a higher number. X.PP XThe <commands> argument is the set of command codes the newsfilter understands. XThe <responses> argument is the set of responses that it knows to send back. X.PP XThe <plang> argument is the name of the filter language ('P' commands) that Xthe newsfilter understands, and <pversion> is the version number of that language. XIf the newsfilter does not understand any language for P commands, it should Xuse the name `NULL' and a version number of 0. X.PP XIf the newsreader gets an `error' response to this message, it should assume Xthat the filter is present but cannot handle the language specified, or has Xfailed to initialize properly. The reader may attempt different Version Xcommands, or may decide to send a Quit command. 
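As an illustration only (this is not part of the posted newsclip sources), the fixed-size header described under COMMAND AND RESPONSE FORMAT might be built roughly as in the C sketch below. The names format_header, arglist_len and HDR_LEN are hypothetical, the zero-filled form of the numeric fields is assumed, and the header is assumed to end in a NUL rather than a newline.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical size: 13 header bytes plus the terminating NUL. */
#define HDR_LEN 14

/* Count the bytes of an argument list, including one NUL per argument,
 * as the header's length field requires. */
static int arglist_len(const char **args, int nargs)
{
    int i, total = 0;

    for (i = 0; i < nargs; i++)
        total += (int) strlen(args[i]) + 1;
    return total;
}

/* Write a zero-filled header such as "CV 000001 005" into buf.
 * Returns 0 on success, or -1 if a field would not fit its width
 * (6 digits for the sequence number, 256 bytes of arguments). */
static int format_header(char *buf, char type, char code, long seq, int arglen)
{
    if (seq < 0 || seq > 999999L || arglen < 0 || arglen > 256)
        return -1;
    snprintf(buf, HDR_LEN, "%c%c %06ld %03d", type, code, seq, arglen);
    return 0;
}
```

Under these assumptions, a reader opening a session would call format_header(buf, 'C', 'V', 1L, 5) for the one-argument list V100, write the 14 header bytes, and then write the 5 argument bytes.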
X.PP XDefined names currently are: X NULL X newsclip X rnkill X.PP XNames can be registered via email to newsfilters@looking.on.ca X.SS The `End' Dialogue XThe \fIend dialogue\fR consists of a single transaction; the reader sends Xa `Q' (Quit) command and expects an `O' (Ok) response. X.TP XCOMMAND: Q X.TP X CQ 000236 000\\0 X.PP XThe newsfilter program should terminate. A response of `Ok' is Xexpected, after which the pipes will close. X.PP XClosing the command pipe, causing EOF for the newsfilter, should Xalso cause the newsfilter to terminate. If a newsreader detects EOF Xon the answer pipe, it should assume the newsfilter has terminated and Xact accordingly, possibly giving an error message to the user. X.TP 0 XRESPONSE: O X.TP 0 X RO 000236 000\\0 X.PP XThis response confirms that the newsfilter is exiting gracefully. X.SS The `Program' Dialogue X.PP XThe \fIprogram dialogue\fR begins with a `P' (Program) command and ends with Xone of the responses `O' (Ok) or `E' (Error). X.TP 0 XCOMMAND: P <command-string> X.TP 0 X CP 012321 016\\0 X kill From: Eric\\0 X.PP XThe first argument passes free format commands to the newsfilter. Normally these Xwill be things to add to the newsfilter's "kill files" or other such Xcommands. The format of the commands is entirely up to the newsfilter. X.P XThe above command might be a request to kill all articles that include the string X"Eric" in their From line. It would be generated by a newsreader that Xknew how to translate user requests into commands to this particular Xnewsfilter. X.PP XThe newsfilter should interpret the command. If it is a valid command, it Xshould execute it and issue an OK response. If it is not a valid command, Xit should issue an Error response. The newsreader may decide to issue Xan error message because of the error response, or do further Xanalysis of the command. 
X.TP 0 XRESPONSE: O X.TP 0 X RO 012321 000\\0 X.PP XThis response confirms that the argument was accepted as a valid command Xby the newsfilter program. X.TP 0 XRESPONSE: E <error-message> X.TP 0 X RE 012321 017\\0 X No such command!\\0 X.PP XThis response tells the reader the argument was rejected as an invalid command Xby the newsfilter program. X.SS The `Newsgroup' Dialogue XThe \fINewsgroup dialogue\fR consists of a single transaction; the reader sends Xan `N' (Newsgroup) command, and expects an Accept, Reject or Ok response. X.TP 0 XCOMMAND: N <newsgroup> X.TP 0 X CN 000005 012\\0 X news.groups\\0 X.PP XThis command asks the newsfilter for general information on Xa newsgroup. Responses can be `A' (Accept), which indicates that Xall articles in this group should be accepted without consulting the Xnewsfilter, `R' (Reject), which means that all articles should be rejected Xwithout consulting the newsfilter, or `O' (Ok - Consult), which means that Xarticles should be fed to the newsfilter for examination. X.PP XNote that even in the case of an `A' or `R', articles in that group Xmay still be sent for examination. It is just less efficient to do so. X.TP 0 XRESPONSE: A <score> X.TP 0 X RA 000005 003\\0 X 22\\0 X.PP XThis response indicates that the article should be accepted. The single Xargument is an `interest score' computed by the newsfilter; it may be omitted Xto indicate a value of 1. X.TP 0 XRESPONSE: R <score> X.TP 0 X RR 000005 003\\0 X -2\\0 X.PP XThis response indicates that the article should be rejected. The single Xargument is an `interest score' computed by the newsfilter; it may be omitted Xto indicate a value of -1. X.PP XBy convention, the `R' response carries a zero or negative score and the `A' Xresponse a positive one. X.PP XThe `O' response is as documented for the `Q' (Quit) command above. X.SS The `Article' Dialogue XThe \fIarticle dialogue\fR begins with an `A' (Article) command and ends with Xone of the responses `A' (Accept) or `R' (Reject). 
There may be one or more Xtransactions in this dialogue. X.TP 0 XCOMMAND: A <newsgroup> <number> <mode> [<filename>] X.TP 0 X CA 000006 048\\0 X news.groups\\034\\0R\\0/usr/spool/news/news/groups/34\\0 X.PP XThe first two arguments are a normal newsgroup and article-number pair; if Xthe newsfilter Xcan deduce a final interest score from these, it will do so and return accept Xor reject immediately. Otherwise, the newsfilter can return article information Xrequests to see portions of the article; see the following description of Xarticle information exchange. The <filename> argument, if present, is used Xin resolving the article information request. The <mode> argument Xindicates whether the article is present in the file, or must be Xrequested. X.PP XText may be passed down to the newsfilter in one of two modes: \fIpipe mode\fR Xor \fIfile mode\fR. The mode is selected by the single-character <mode> Xargument. File modes require a file name Xargument on the reader command that triggered the article information Xrequest. For pipe mode, no file name is given and a mode character of X'P' is provided. The two file modes use mode characters of X'F' (full) and 'R' (request). X.PP XPipe mode is currently not Ximplemented in any of the news filtering or newsreading programs using Xthis protocol. It is defined for future expansion. X.PP XIn \fIpipe mode\fR article portions are passed down in the text sections of XHeader and Body replies from the newsreader in accordance with text query Xsequences started by the newsfilter. Text query sequence protocol is specified Xbelow. X.PP XIn \fIfile mode\fR the article information is passed in the file named. XThis file may be either the permanent location of the article, or a tempfile. X.PP XIn the former case, it is likely (but not necessary) that the 'F' (full) Xfile mode will be used. In this case, the entire article is already present Xin the file, and no further queries should be issued by the newsfilter. 
XThe only acceptable responses to a full file mode Article command are XAccept, Reject and Error. X.PP XIn the latter case of request mode, the newsfilter should issue text queries Xbefore attempting Xto read first the header, and later the body of the article. After Xissuing such queries, the filter should wait for a response from the Xnewsreader before reading into the file. X.PP XA text query sequence consists of a series of 'H' (Header) and 'B' (Body) Xrequests sent by the newsfilter to the newsreader, and corresponding responses Xby the newsreader. X.TP 0 XQUERY: H X.TP 0 X QH 000340 000\\0 X.PP XThis query requests the RFC-822 header of the article selected by a previous X`A' command, or of the article selected by the newsreader at the time of Xissuance of a 'P' command. X.TP 0 XANSWER: H [<size> [<asize>]] X.TP 0 X AH 000340 004\\0 X 364\\0 X.PP XThis answer signifies that the newsreader has header data ready in response Xto a previous 'H' query. The optional <size> argument specifies the total Xlength of the header. The body of the article may exist beyond the header. XFilters should not assume they will read EOF at the end of a header. X.PP XIn pipe mode, this answer is followed immediately by a text section in RFC-822 Xformat (see above), using message packet format. XMessage packets consist of a length byte, followed Xby 0 to 255 text data bytes. A 0 length packet indicates the end of the Xheader, which must be preceded by a blank line. X.PP XThe optional <asize> is the size of the entire article, but only if the Xnewsreader has it handy. X.TP 0 XQUERY: B [<size>] X.TP 0 X QB 000341 004\\0 X 125\\0 X.PP XThis query asks the reader to send down all or a portion of the body of the Xcurrent article. The filter may optionally communicate the most it wants Xof the article with the <size> argument. If this is present, the newsreader Xneed not transmit or place more than <size> bytes of the body. 
The Xnewsreader is still always free to send the entire body -- this is merely Xan optimization. If the <size> argument is not present, the entire body Xmust be made available. X.TP 0 XANSWER: B [<size>] X.TP 0 X AB 000341 004\\0 X 125\\0 X.PP XThe 'B' response signifies that the newsreader has text data ready for the Xnewsfilter. The argument optionally gives the length of the data in bytes. This Xlength may be smaller or larger than the requested length. It will only Xbe smaller if the article body itself is smaller than the requested length. X.PP XNote that a newsreader can use request mode even if it always has Xcomplete article files ready for the filter. It should merely respond Xto queries immediately, doing nothing. The 'F' (full) mode simply allows Xa reader to be sure it will never receive queries. This allows very Xsimple reader implementations of this protocol. X.PP XNNTP readers and other readers that do not have access to single Xarticle files should use the request mode, building a temporary file Xfor the filter as requested. X.PP XIn pipe mode, this answer is followed immediately by a text section in Xmessage packet format. Message packets consist of a length byte, followed Xby 0 to 255 text data bytes. A 0 length packet indicates EOF. X.SH NOTES XThe protocol is designed for use over the most primitive IPC common on UNIX, Xa pair of nameless pipes. Some older UNIXes (V7 in particular) feature pipe Ximplementations that behave rather badly (as in, cause a lockup or sudden Xprocess death) if reads and writes are not carefully synchronized. Thus the Xrigid alternation of fixed-size with variable-size transmissions, and the Xcare in specifying lengths of variable parts in fixed-part blocks. X.PP XUnder a more forgiving IPC implementation (such as System V message queues), Xthe fixed and variable-length parts might be sent in one transmission; this Xis an implementation detail left up to service libraries. 
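For illustration, the pipe-mode message packet format described above (a length byte, then 0 to 255 data bytes, with a zero-length packet marking the end) could be produced along the lines of the C sketch below. Pipe mode is documented here as unimplemented, so this is only a reading of the text: the function name packetize, the PKT_MAX constant, and the choice to encode into a caller-supplied buffer are all assumptions, not code from the newsclip distribution.

```c
#include <stddef.h>
#include <string.h>

/* Largest payload a one-byte length field can describe. */
#define PKT_MAX 255

/* Encode len bytes of data into out as a run of message packets,
 * followed by the zero-length packet that marks the end.
 * Returns the number of bytes written, or 0 if out is too small. */
static size_t packetize(const unsigned char *data, size_t len,
                        unsigned char *out, size_t outcap)
{
    size_t used = 0;

    while (len > 0) {
        size_t chunk = len < PKT_MAX ? len : PKT_MAX;

        /* Need the length byte, the payload, and room for the terminator. */
        if (outcap - used < chunk + 2)
            return 0;
        out[used++] = (unsigned char) chunk;
        memcpy(out + used, data, chunk);
        used += chunk;
        data += chunk;
        len -= chunk;
    }

    if (outcap - used < 1)
        return 0;
    out[used++] = 0;            /* zero-length packet: end of text */
    return used;
}
```

Encoding the five bytes of "hello" this way yields seven bytes: the length byte 5, the text, and the closing zero. A real service library would more likely write each packet straight to the pipe; the buffer form is used here only to keep the sketch easy to check.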
X.PP XThe all-ASCII format avoids potential alignment problems. X.PP XIt is expected that the protocol service libraries will automatically choose Xpipe or file mode for text query sequences, depending on whether the calling Xnewsreader browses a file hierarchy or talks to some sort of network daemon. X.PP XThe 'F' (full) file mode is the simplest mode, intended for use on Xsystems where the article files reside in normal format on the user's Xmachine. A newsreader can be adapted to this mode of operation with Xminimal changes. X.SH AUTHORS XThis protocol was developed by Brad Templeton and Eric Raymond. END_OF_FILE if test 18501 -ne `wc -c <'patch/filter.man'`; then echo shar: \"'patch/filter.man'\" unpacked with wrong size! fi # end of 'patch/filter.man' fi if test -f 'scanbody.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'scanbody.c'\" else echo shar: Extracting \"'scanbody.c'\" \(19691 characters\) sed "s/^X//" >'scanbody.c' <<'END_OF_FILE' X/* X * scanbody.c X * X * Code to perform the scanning of the body of news articles, searching for X * precompiled regular expressions, and collecting facts about the article, X * such as number of lines in the body, number of quoted lines, size of X * the signature, and so on. X * X * The main body of the article is read into a chunk of temporary memory, X * in a completely raw, unformatted manner. This chunk of text is then X * massaged according to the settings of some global flags, and data X * structures indicating lines, paragraphs, and specific areas of the X * article are constructed. X */ X X /* X * Newsclip(TM) Library Source Code. X * Copyright 1989 Looking Glass Software Limited. All Rights Reserved. X * Unless otherwise licenced, the only authorized use of this source X * code is compilation into a binary of the newsclip library for the X * use of licenced Newsclip customers. Minor source code modifications X * are allowed. 
X * Use of this code for a short term evaluation of the product, as defined X * in the associated file, 'Licence', is permitted. X */ X X#include "common.h" X#include "body.h" X#include "rei.h" X X#define talloc temp_alloc /* Some convenient abbreviations. */ X#define palloc perm_alloc X Xextern char *temp_alloc AC(( int )); Xextern char *perm_alloc AC(( int )); Xextern FILE *get_body_desc AC(( int )); X Xextern int preserve_case; /* If clear: map all text to lower case */ Xextern int white_compress; /* If set: strip leading/trailing whitespace */ Xextern int paragraph_scan; /* If set: treat paragraphs as the basic unit */ Xextern char *TREBuff; /* Temporary buffer for compiled temp REs */ X X/* X * We define the following terms: X * X * line -- a line as it was originally typed in the article, and as such, X * terminated with a newline. X * X * area -- a particular subset of the article lines. The recognized areas X * currently are: X * "signature" X * "text" X * "included" (text lines starting with "include_prefix") X * "newtext" (original text only -- not include-prefixed) X * "body" (text and signature) X * X * unit -- an ASCIIZ string of text, with newlines removed. A unit is the X * basic search unit for R.E.s, and currently can be either X * "lines" -- corresponding to the true lines of the article area, X * or X * "paragraphs" -- multiple lines joined without newlines X * according to the rules X * i) New paragraphs commence at the beginning of a new area. X * ii) New paragraphs commence after an empty line. X * iii) New paragraphs commence when the indentation of the X * current line exceeds the indentation level of the X * previous line. X * X * Basically, we look through the article a byte at a time, looking for X * areas. Once an area is identified, then the area is divided up into X * units. Hopefully, all this can be done with a minimum of movement of X * bytes and characters. 
X * X */ X Xarea_type *Article; /* Ptr to area-formatted version of article */ Xarea_type *RawText; /* Ptr to raw text version of article */ X Xlong ArticleStats[2][AS_ARR_SIZE]; /* Storage for various article statistics. */ X X/* A lot of static variables are used to maintain information about X * the current state of reading of the article body. These permit the X * article body to be read in an incremental manner. */ X Xstatic int FirstLine; /* First complete line of article parsed */ Xstatic int LastLine; /* Last complete line of article parsed */ Xstatic int PfxScan; /* Is line identification to be done? */ Xstatic int IWhiteCompress; /* Internal version of white_compress flag */ Xstatic rxp_type InclRxp = (rxp_type) NULL; /* Compiled RE to id incl. lines */ Xstatic rxp_type SignRxp = (rxp_type) NULL; /* Compiled RE to id sig. lines */ X X/* Variables used to store parsing status information over buffer breaks. */ X Xstatic area_type *Ap; /* Save: current area. */ Xstatic int Lind; /* Save: indent level of last complete line. */ Xstatic int Ltyp; /* Save: type of the last complete line. */ Xstatic u_list *Lul; /* Save: last u_list of current area. */ Xstatic char *Bptr; /* Save: pointer into current buffer. */ Xstatic unsigned int Blen; /* Save: remaining bytes in current buffer. */ Xstatic int InSignature; /* Are we in the signature of the article? */ Xstatic int ArticleParsed; /* Indicates if article has been parsed. */ Xstatic int StatsDone; /* Indicates if all article stats are done. */ Xstatic char LineSplit; /* Was long line split? (warning ctl only) */ X Xstatic FILE *Fptr; /* File pointer for current article. */ Xint ArticleEoF; /* Article end-of-file indicator. 
*/ X Xstatic u_list *flush_lines AC(( area_type *, u_list *, char **, int )); Xstatic unsigned int block_read AC(( char ** )); Xstatic char *copy_line AC(( char *, char * )); Xstatic int IDLine AC(( char *, char **, int *, int * )); X X/* prepare_body() is called once per article, and resets the X * variables of interest to the article body parsing routines. */ X Xvoid Xprepare_body( io_mode ) Xint io_mode; X{ X FirstLine = LastLine = Blen = Lind = 0; X Ltyp = LT_NONE; X LineSplit = StatsDone = ArticleParsed = InSignature = ArticleEoF = 0; X Fptr = (FILE *) NULL; X X Ap = Article = (area_type *) NULL; X RawText = (area_type *) talloc( sizeof(area_type) ); X zero( RawText, sizeof(area_type) ); X RawText->txt_typ = LT_BODY; X Lul = (u_list *) NULL; X X TREBuff = (char *) NULL; X X zero( ArticleStats, sizeof(ArticleStats) ); X X /* Set the internal version of the whitespace compression flag X * on the basis of whether or not paragraph scanning mode is on. */ X X IWhiteCompress = paragraph_scan ? 1 : white_compress; X X PfxScan = InclRxp || SignRxp; X} X X/* set_include_prefix() sets the include prefix, used for determining which X * lines of the article are included, to the given string argument. */ X Xvoid Xset_include_prefix( user_pfx ) Xchar *user_pfx; X{ X char *inc_pfx; X X if( InclRxp ) X /* Free old compiled version of the include prefix string */ X perm_free( InclRxp ); X X if( '^' == *user_pfx ) { X inc_pfx = user_pfx; X } X else { X /* Need to make a copy of the user's string in order X * to insert the left-end anchor character. The size X * allows for the anchor and the terminating NUL. */ X inc_pfx = talloc( strlen( user_pfx ) + 2 ); X inc_pfx[0] = '^'; X strcpy( inc_pfx + 1, user_pfx ); X } X X /* An empty user string disables include-prefix matching. */ X InclRxp = *user_pfx ? REG_COMP_P( inc_pfx ) : (rxp_type) NULL; X X /* Update the prefix scanning flag to indicate whether or not X * scanning is to be done for included or signature lines. */ X X PfxScan = InclRxp || SignRxp; X} X X/* set_signature_start() sets the signature pattern, used for determining where X * the signature of the article starts, to the given string argument. 
*/ X Xvoid Xset_signature_start( sig_strt ) Xchar *sig_strt; X{ X if( SignRxp ) X /* Free old compiled version of the signature start pattern */ X perm_free( SignRxp ); X X SignRxp = *sig_strt ? REG_COMP_P( sig_strt ) : (rxp_type) NULL; X X /* Update the prefix scanning flag to indicate whether or not X * scanning is to be done for included or signature lines. */ X X PfxScan = InclRxp || SignRxp; X} X X/* read_body() performs the first-level analysis of the article body, by X * reading the article and counting bytes and lines. Lines are stored X * individually as ASCIIZ strings in preparation for any necessary X * higher-level processing, such as article section parsing (handled by X * parse_body() below). */ X Xvoid Xread_body( start, end ) Xunsigned int start, end; X{ X register char *nbptr; /* Updated pointer into buffer */ X register int nblen; /* Updated size of buffer */ X char *lps; /* Ptr to start of the current line. */ X char *lpe; /* Ptr to the end of the current line. */ X int len; /* Length of the current line. */ X char *lptr[ULB_SIZE]; /* Temporary storage for located line ptrs. */ X short lidx = 0; /* Index into lptr array. */ X X if( !Fptr ) { X /* If incremental reading is ever implemented, the X * argument to get_body_desc() will need to be computed X * intelligently, probably based somehow on end. */ X Fptr = get_body_desc( MAXINT ); X } X X while( end > LastLine ) { X X if( !Blen ) { X /* The current block has been completely scanned; X * if EoF hasn't been reached, attempt to read X * another block. If nothing further is read, X * then EoF has been encountered, and the reading X * process is complete. 
*/ X X if( ArticleEoF || !(Blen = block_read( &Bptr )) ) { X LastLine = MAXINT; X break; X } X } X X lps = nbptr = Bptr; X if( preserve_case ) { X for( nblen = Blen; nblen-- && *nbptr++ != '\n'; ) X ; X } X else { X for( nblen = Blen; nblen-- && *nbptr != '\n'; nbptr++ ) X *nbptr = lowcase( *nbptr ); X nbptr++; X } X X if( nblen < 0 ) { X /* End of buffer reached; it is necessary to read X * another block from the article and continue X * the search for the end of this line. */ X register char *tptr; X int tlen, spl; X char *addressptr; X X if( !ArticleEoF ) { X nblen = block_read( &addressptr ); X nbptr = addressptr; X } X else { X /* Hmm. End of file has been encountered, but X * the end of the current line has not. Handle X * this by just faking in the final newline. */ X warning(2, "premature end-of-file encountered"); X nblen = 1; X nbptr = "\n"; X } X X tlen = 0; X if( preserve_case ) { X for( tptr = nbptr; tlen++ <= nblen && X *tptr++ != '\n'; ) X ; X } X else { X for( tptr = nbptr; tlen++ <= nblen && X *tptr != '\n'; tptr++ ) X *tptr = lowcase( *tptr ); X tptr++; X } X X if( tlen > nblen ) { X /* Hmm. It appears that the current line X * spans more than one full buffer. Sorry, X * but the line will have to be split. */ X if( !LineSplit ) { X warning( 2, X "line(s) too long -- line(s) split" ); X LineSplit = 1; X } X X tlen = nblen; /* Truncate line length. */ X spl = 1; /* Byte for added "newline". */ X } X else { X spl = 0; /* Extra byte not required. */ X } X X lps = talloc( (Blen + tlen + spl)*sizeof(char) ); X memcpy( lps, Bptr, Blen ); X memcpy( lps + Blen, nbptr, tlen ); X len = Blen + tlen + spl - 1; X lpe = lps + len; X Bptr = tptr; X Blen = nblen - tlen; X } X else { X Bptr = nbptr; X Blen = nblen; X lpe = Bptr - 1; X len = 1 + (lpe - lps); X } X X LastLine++; /* We've found one more line... */ X ArticleStats[ID_LINES][LT_BODY]++; X ArticleStats[ID_BYTES][LT_BODY] += len; X X /* Insert the line into the current area. 
*/ X X lptr[lidx++] = lps; /* Stash ptr to start of line. */ X *lpe = '\0'; /* Stomp null over end newline. */ X RawText->size += len; /* Add line length to area size. */ X X if( lidx >= ULB_SIZE ) { X /* Line buffer is full; allocate "permanent" X * storage and copy ptrs to the new buffer. */ X X Lul = flush_lines( RawText, Lul, &lptr[0], lidx ); X lidx = 0; X } X } X X if( lidx ) X Lul = flush_lines( RawText, Lul, &lptr[0], lidx ); X} X X/* parse_body() divides the [already-read] article body into its component X * subsections of included text, new text, and signature text, and updates X * statistics on the various areas. The first level of paragraphing is also X * performed here -- lines are associated into paragraphs. */ X Xvoid Xparse_body( start, end ) Xunsigned int start, end; X{ X char *lps; X int ind, len, j; X char *lptr[ULB_SIZE]; /* Temporary storage for located line ptrs. */ X short lidx = 0; /* Index into lptr array. */ X u_list *ul; X int typ; X area_type *nap; X X if( ArticleParsed ) X return; X X if( !ArticleEoF ) X /* Ensure that the entire article body has been read. */ X read_body( 1, MAXINT ); X X for( ul = RawText->list; ul; ul = ul->next ) X for( j = 0; j < ul->size; j++ ) { X X typ = IDLine( ul->u_txt[j], &lps, &ind, &len ); X X if( InSignature ) { X /* If "signature_start" has already been seen, then X * all subsequent lines are really signature lines. */ X typ = LT_SIGNATURE; X } X else if( typ == LT_SIGNATURE ) { X /* The signature line has just been seen. Set the X * flag to indicate that all subsequent lines are X * part of the article signature, and set PfxScan X * to zero so that no further line identification X * will be attempted. */ X PfxScan = 0; X InSignature++; X } X X ArticleStats[ID_LINES][typ]++; X ArticleStats[ID_BYTES][typ] += 1 + len; X X if( (typ != Ltyp) || (paragraph_scan && (ind > Lind)) ) { X /* A new paragraph or text area has been encountered. 
*/
X
X                if( Ltyp != LT_NONE && lidx ) {
X                    /* Flush accumulated lines into old area. */
X                    (void) flush_lines( Ap, Lul, &lptr[0], lidx );
X                    lidx = 0;
X                }
X
X                Lul = (u_list *) NULL;
X
X                if( !Ap || Ap->list ) {
X                    /* Allocate new area/paragraph encountered, and
X                     * link it onto the list of article areas. */
X                    nap = (area_type *) talloc( sizeof(area_type) );
X                    if( Article )
X                        Ap = Ap->next = nap;
X                    else
X                        Ap = Article = nap;
X                }
X
X                zero( Ap, sizeof(area_type) );
X                Ltyp = Ap->txt_typ = typ;
X            }
X
X            if( len ) {
X                /* Insert the line into the current area, if it
X                 * isn't an empty line. */
X
X                lptr[lidx++] = lps;   /* Stash ptr to start of line. */
X                Ap->size += len + 1;  /* Add line length to area size. */
X            }
X            else {
X                /* Set indentation to ensure next
X                 * line will start a new paragraph. */
X                ind = -1;
X            }
X
X            Lind = ind;  /* Save the level of line indent. */
X
X            if( lidx >= ULB_SIZE ) {
X                /* Line buffer is full; allocate "permanent"
X                 * storage and copy ptrs to the new buffer. */
X
X                Lul = flush_lines( Ap, Lul, &lptr[0], lidx );
X                lidx = 0;
X            }
X        }
X
X    if( lidx )
X        Lul = flush_lines( Ap, Lul, &lptr[0], lidx );
X
X    ArticleParsed = TRUE;
X}
X
X/* init_stats() ensures that the article statistics table is up-to-date,
X * by performing whatever level of analysis is required to obtain the
X * statistics for the specified article region. */
X
Xvoid
Xinit_stats( statid )
Xint statid;
X{
X    if( statid == LT_BODY )
X        read_body( 1, MAXINT );
X    else if( !StatsDone ) {
X        if( !ArticleParsed )
X            parse_body( 1, MAXINT );
X        ArticleStats[ID_LINES][LT_TEXT] =
X            ArticleStats[ID_LINES][LT_NEWTEXT] +
X            ArticleStats[ID_LINES][LT_INCLUDED];
X        ArticleStats[ID_BYTES][LT_TEXT] =
X            ArticleStats[ID_BYTES][LT_NEWTEXT] +
X            ArticleStats[ID_BYTES][LT_INCLUDED];
X        StatsDone++;
X    }
X}
X
X/* IDLine() "parses" the next line starting from rptr, which points to
X * the line to be parsed, as an ASCIIZ string.
Pointers to the actual
X * start of the string (white-space and/or include_prefix trimmed), the
X * level of indentation and the line length are returned; the "type" of
X * the line -- included, signature or newtext -- is returned explicitly. */
X
Xstatic int
XIDLine( rptr, start, indent, length )
Xchar *rptr;            /* Start of unidentified text line */
Xchar **start;          /* Returned ptr to line start. */
Xint *indent, *length;  /* Returned line indent, length */
X{
X    register char *ptr = rptr;  /* Line scanning pointer */
X    int id = 0;                 /* Level of line indentation */
X    int ltype;                  /* Area to which the line belongs */
X    int llen = 0;               /* Line length computed. */
X    char *eptr;                 /* Ptr to last character in line. */
X
X    /* The first task is to determine the level
X     * of indentation of this particular line. */
X
X    while( *ptr ) {
X        if( *ptr == ' ' )
X            id++;
X        else if( *ptr == '\t' )
X            id = (id + 8) - (id % 8);  /* Advance to next tab stop. */
X        else if( *ptr == '\f' ) {
X            id = MAXINT;
X            break;
X        }
X        else
X            break;
X        ptr++;
X    }
X
X    /* Indicate where the line starts (either stripped or actual). */
X    *start = IWhiteCompress ? ptr : rptr;
X
X    /* The next task is to identify the area with which the line is
X     * associated. If it is known that both of the include_prefix and
X     * signature_start strings are empty, then the line type is
X     * automatically assigned as NEWTEXT. */
X
X    if( PfxScan ) {
X        /* Look for a match on "include_prefix" */
X
X        /* Search starts after whitespace is skipped, always. */
X        if( InclRxp && REG_EXEC( InclRxp, ptr ) ) {
X            ltype = LT_INCLUDED;
X            if( paragraph_scan ) {
X                *start = InclRxp->endp[0];
X            }
X        }
X        /* Search starts at true beginning of line, always.
*/
X        else if( SignRxp && REG_EXEC( SignRxp, rptr ) )
X            ltype = LT_SIGNATURE;
X        else
X            ltype = LT_NEWTEXT;
X    }
X    else {
X        ltype = LT_NEWTEXT;
X    }
X
X
X    llen = strlen( *start );
X    if( IWhiteCompress && llen ) {
X        /* If IWhiteCompress is set, then ensure that ptr is updated
X         * to point at the end of the line in preparation for the
X         * trimming of any trailing whitespace from the line. */
X        while( *ptr )
X            ptr++;
X        eptr = ptr;
X        /* Scan back over any end-of-line whitespace */
X        for( ; llen && isspace(*(eptr-1)); llen--, eptr-- )
X            ;
X    }
X
X    /* Line successfully scanned... update the return
X     * values and explicitly return the line type. */
X
X    *length = llen;
X    *indent = id;
X
X    return( ltype );
X}
X
X/* flush_lines() just copies accumulated pointers into "permanent" storage. */
X
Xstatic
Xu_list *
Xflush_lines( ap, lul, lptr, size )
Xarea_type *ap;  /* Area to which the lines belong. */
Xu_list *lul;    /* Last unit belonging to area. */
Xchar **lptr;    /* Pointer to the buffered lines. */
Xint size;       /* Number of lines in buffer */
X{
X    u_list *ul;
X
X    if( !size ) {
X        /* Return immediately if there are no buffered lines. */
X        return( lul );
X    }
X
X    /* Fetch memory to hold the line list, and fill in the fields. */
X    ul = (u_list *) talloc( sizeof(u_list) + (size-1)*sizeof(char *) );
X    ul->size = size;
X    ul->next = (u_list *) NULL;
X
X    /* Copy the buffered pointers to the allocated memory. */
X    memcpy( &ul->u_txt[0], lptr, size*sizeof(char *) );
X    if( lul )
X        lul->next = ul;  /* Extend list of lines */
X    else
X        ap->list = ul;   /* First list of lines */
X
X    return( ul );
X}
X
X/* block_read() obtains a block of article text at a time, for analysis.
*/
X
Xstatic
Xunsigned int
Xblock_read( ptr )
Xchar **ptr;
X{
X    unsigned int len;
X
X    *ptr = talloc( BLOCK_SIZE*sizeof(char) );
X
X    if( !(len = fread( *ptr, sizeof(char), BLOCK_SIZE, Fptr ))
X                && ferror( Fptr ) )
X        error( "file error during read" );
X
X    if( len < BLOCK_SIZE )
X        ArticleEoF++;
X
X    return( len );
X}
X
X/* paragraphize() copies the lines of the given area into a consecutive
X * ASCIIZ string, suitable for regular expression searches in paragraph
X * scanning mode. */
X
Xvoid
Xparagraphize( ap )
Xarea_type *ap;
X{
X    register char *wptr;
X    register u_list *ul;
X    register int l;
X
X    wptr = ap->para = talloc( (ap->size + 1)*sizeof(char) );
X    for( ul = ap->list; ul; ul = ul->next )
X        for( l = 0; l < ul->size; l++ )
X            wptr = copy_line( wptr, ul->u_txt[l] );
X
X    if( ap->para != wptr )
X        /* Remove the unnecessary trailing blank. */
X        wptr--;
X
X    *wptr = '\0';
X}
X
X/* copy_line() copies characters starting from rptr to wptr, compressing
X * internal whitespace regardless of the setting of the IWhiteCompress
X * flag. It is known that having paragraph_scan set implies that
X * IWhiteCompress is also set. */
X
Xstatic char *
Xcopy_line( wptr, rptr )
Xregister char *wptr;  /* Ptr to target memory for the line copy. */
Xregister char *rptr;  /* Ptr to the line to copy. */
X{
X    char *origptr = wptr;
X
X    while( *rptr && isspace(*rptr) )
X        rptr++;
X
X    while( *rptr ) {
X        while( *rptr && !isspace(*rptr) )
X            *wptr++ = *rptr++;
X        while( *rptr && isspace( *rptr ) )
X            rptr++;
X        if( *rptr ) {
X            *wptr++ = ' ';
X        }
X    }
X
X    if( origptr != wptr ) {
X        /* Add a trailing space only if we moved the write pointer. */
X        *wptr++ = ' ';
X    }
X
X    return( wptr );
X}
X
X#ifdef DEBUG
X
Xchar *AreaNames[] = {
X    "LT_NONE",
X    "LT_SIGNATURE",
X    "LT_INCLUDED",
X    "** illegal **",
X    "LT_NEWTEXT",
X    "** illegal **",
X    "** illegal **",
X    "LT_BODY"
X    };
X
Xvoid
Xdump_body()
X{
X    area_type *ap;
X    u_list *ul;
X    int i = 1, j;
X
X    for( ap = Article ?
Article : RawText; ap; ap = ap->next, i++ ) {
X        printf("{Paragraph %d (type %s)}\n",i,AreaNames[ap->txt_typ]);
X        if( !paragraph_scan ) {
X            for( ul = ap->list; ul; ul = ul->next )
X                for( j = 0; j < ul->size; j++ )
X                    printf( ">%s<\n", ul->u_txt[j] );
X                    /* puts( ul->u_txt[j] ); */
X        }
X        else {
X            if( !ap->para )
X                paragraphize( ap );
X            printf( ">%s<\n", ap->para );
X            /* puts( ap->para ); */
X        }
X    }
X}
X
X#endif /*DEBUG*/
END_OF_FILE
if test 19691 -ne `wc -c <'scanbody.c'`; then
    echo shar: \"'scanbody.c'\" unpacked with wrong size!
fi
# end of 'scanbody.c'
fi
echo shar: End of archive 9 \(of 15\).
cp /dev/null ark9isdone
MISSING=""
for I in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ; do
    if test ! -f ark${I}isdone ; then
        MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have unpacked all 15 archives.
    rm -f ark[1-9]isdone ark[1-9][0-9]isdone
else
    echo You still need to unpack the following archives:
    echo " " ${MISSING}
fi
## End of shell archive.
exit 0
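
Editorial note on scanbody.c: copy_line() and paragraphize() together turn
the lines of an area into one string with whitespace runs collapsed to single
blanks. The sketch below shows the same idea in isolation. It is not code from
the archive; squeeze_line() and join_lines() are hypothetical names, and the
caller is assumed to supply a destination buffer at least as large as the sum
of strlen(line) + 1 over all lines.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not in scanbody.c): copy src to dst, dropping
 * leading and trailing whitespace and collapsing each internal run of
 * whitespace to one blank.  Returns a pointer to the terminating NUL. */
char *
squeeze_line( char *dst, const char *src )
{
    char *start = dst;
    int pending = 0;    /* A blank is owed before the next word. */

    for( ; *src; src++ ) {
        if( isspace( (unsigned char) *src ) ) {
            if( dst != start )
                pending = 1;        /* Ignore leading whitespace. */
        }
        else {
            if( pending ) {
                *dst++ = ' ';
                pending = 0;
            }
            *dst++ = *src;
        }
    }

    *dst = '\0';        /* Trailing whitespace is never written. */
    return( dst );
}

/* Hypothetical helper: join n lines into one paragraph string, with
 * one blank between non-empty lines.  Returns the paragraph length. */
size_t
join_lines( char *dst, const char *const *lines, int n )
{
    char *w = dst;
    int i;

    for( i = 0; i < n; i++ ) {
        char *e = squeeze_line( w, lines[i] );
        if( e != w ) {
            /* Line was not empty; add a separator after it. */
            *e++ = ' ';
            w = e;
        }
    }

    if( w != dst )
        w--;            /* Remove the unnecessary trailing blank. */

    *w = '\0';
    return( (size_t) (w - dst) );
}
```

Called with the three lines "  The quick ", "brown\tfox" and "   ", join_lines()
leaves "The quick brown fox" in the buffer. Writing the separator per line and
trimming one blank at the end mirrors the approach in paragraphize(), which
keeps the per-line copy routine free of any knowledge about whether it is
handling the last line.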