[comp.text] concordance software in C

tut%cairo@Sun.COM (Bill "Bill" Tuthill) (04/22/88)

In the last week I've received a flood of requests (well, actually only
two) for the Hum concordance package.  I replied to both requests, but
my messages were bounced back.  So here are two programs (with man page)
to produce a keyword-in-context concordance with this command:

	% kwic filename | sort | format
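
For a feel of what each line of kwic output looks like before it reaches sort, the record layout can be sketched roughly as below. This is an illustration only, not code from the archive: kwic_entry(), show(), put() and struct buf are made-up names, the text is held in memory instead of being read with seeks, the context is taken as `half' characters on each side of the start of the keyword, and backspaces, the id field and page numbers are ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Bounded output buffer (hypothetical, for this sketch only). */
struct buf { char *p; size_t n, cap; };

/* Append one character, always leaving room for the terminator. */
static void put(struct buf *b, char c)
{
    if (b->n + 1 < b->cap)
        b->p[b->n++] = c;
}

/* Display mapping described in the man page: newline -> '/', tab -> ' '. */
static char show(char c)
{
    if (c == '\n') return '/';
    if (c == '\t') return ' ';
    return c;
}

/*
 * Hypothetical helper: build one entry for the word at
 * text[start .. start+len-1].  Assumed layout:
 *   keyword (kwlen wide, lower case) | line number (6 wide) | left | right
 */
static void kwic_entry(char *out, size_t cap, const char *text,
                       size_t start, size_t len, long lineno,
                       size_t kwlen, size_t half)
{
    struct buf b = { out, 0, cap };
    char num[32];
    size_t i, textlen = strlen(text);

    assert(start + len <= textlen);

    /* keyword field: lower-cased, blank-padded or truncated to kwlen */
    for (i = 0; i < kwlen; i++)
        put(&b, i < len ? (char)tolower((unsigned char)text[start + i]) : ' ');

    /* line number field, bracketed by vertical bars */
    snprintf(num, sizeof num, "|%6ld|", lineno);
    for (i = 0; num[i] != '\0'; i++)
        put(&b, num[i]);

    /* left context: the half characters before the word, blank-padded */
    for (i = 0; i < half; i++)
        put(&b, start + i >= half ? show(text[start + i - half]) : ' ');
    put(&b, '|');

    /* right context: starts with the keyword itself */
    for (i = start; i < textlen && i < start + half; i++)
        put(&b, show(text[i]));

    if (cap > 0)
        out[b.n] = '\0';
}
```

Called as kwic_entry(out, sizeof out, "The quick\nbrown fox", 4, 5, 1L, 8, 6), this fills out with `quick   |     1|  The |quick/`. Because the keyword field comes first and has a fixed width, a plain sort groups identical keywords together, which is what format relies on.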

---------------------------- cut here ----------------------------
echo Extracting kwic.1 1>&2
sed 's/^X//' > kwic.1 <<\!!!EOF!!!
X.TH KWIC HUM (rev3.7)
X.ds ]W UC Berkeley
X.SH NAME
Xkwic \- key word in context concordance
X.SH SYNOPSIS
X.nf
X\fBkwic\fP  [ \fB\-k\fIn\fP \-m \-w\fIS\fP \-f\fIn\fP \-r \-l\fIn\fP \-p\fIn\fP \-i\fIc\fP \-c\fIn\fP \-d\fIF\fR + \- ]  filename ...
X\-kn: keyword is n characters long (defaults to 15)
X\-m : keywords not mapped from upper to lower case
X\-wS: write string S onto id field (use quotes around blanks)
X\-fn: filename (up to n characters) written onto id field
X\-r : reset linenumber to 1 at beginning of every file
X\-ln: line numbering begins with line n (instead of 1)
X\-pn: page numbering begins with page n (instead of 1)
X\-ic: page incrementer is character c (defaults to =)
X\-cn: context is n characters long (defaults to 50)
X\-dF: define punctuation set according to file F
X\(pl : the + character indicates cedilla or umlaut
X\(mi : read text from standard input (terminal or pipe)
X.fi
X.SH DESCRIPTION
X\fIKwic\fP is a text concordance program,
Xgenerally for use with prose,
Xalthough it is often used for poetry.
XNormally, it prints a left-hand keyword,
Xa 6 digit linenumber or 6 place pagenumber
X(depending on how you want to label your text),
Xand a context of 50 characters, centered around the keyword.
XWords are separated at their natural boundaries,
Xand adjustment is made for backspaces.
XNewline characters are printed as "/",
Xand tabs are printed as a single blank.
XIf you want to have a space after the newline "/",
Xuse the pad option of \fItprep\fP to insert a space
Xat the beginning of each line in your text.
XThe following characters are considered to be
Xpunctuation marks:  ,.;:-"?!()[]{}  but all other
Xnon-alphabetic characters can be part of a word.
XThese punctuation characters can be changed.
X.PP
XBy default, only the first 15 characters
Xof the keyword are printed, followed by a vertical bar;
Xlonger keywords are truncated.
XIf you want more or less than 15 characters in the keyword,
Xuse the \-k option to lengthen or shorten it.
XTo find the longest word in your text,
Xuse the \fImaxwd\fP program, and set \-k accordingly.
XKeywords are mapped to lower case to ease the logistics of sorting,
Xunless the \-m option is specified.
X.PP
XThe \-w argument allows you to write an id field
X(such as the name of an author or work) after the keyword.
XIf you want to include any blanks,
Xenclose the entire string in quotes: \-w"Prose Edda".
XThe \-f argument allows you to write the current filename,
Xup to a number of characters you specify.
XIf the filename is shorter, it will be blank-padded,
Xand if it is longer, it will be truncated.
X.PP
XIf the program encounters the character "=",
Xwhich, by default, indicates pagination,
Xit will count pages as well as line numbers.
XLine numbers will print as: ``\ 12469'',
Xwhile page numbers will print as: ``178,12''.
XIf you are concording a series of short poems,
Xeach starting with line 1, type them into separate files,
Xand use the \-r option to reset the linenumber to 1
Xat the beginning of each new file.
XIf you resume concording in the middle of your text,
Xyou can set the line number with the \-l option,
Xor the page number with the \-p option.  
XIf you want to indicate pagination,
Xmake sure that you begin your text with ``=1'',
Xon a line of its own, to indicate the first page.
XWhen a new chapter starts at the top of the page,
Xbe sure to set \-p to the previous page.
XThe page indicator can be changed with the \-i option;
X\-i% will change it to a percent sign, for instance.
X.PP
XIf you are sending output to the lineprinter,
Xthe context width can be increased with the \-c argument;
X\-c110, for instance, will give you about 55 characters
Xon either side of the keyword in context.
XNote that the lineprinter can print only 132 characters per line,
Xso add up your field widths carefully.
X.PP
XIf you are working with a foreign language,
Xand need to use normal punctuation marks as diacritical marks,
Xyou can change the default punctuation set with the \-d option.
XJust type the punctuation marks you want into a file,
Xon a single line with no embedded spaces,
Xand specify the filename after the \-d in your command line.
XIf you have cedillas or umlauts, you can represent them
Xas a `+' character after the accented letter.
XUse the `+' option of \fIkwic\fP, and filter your output through
Xeither the \fIcedilla\fP or \fIumlaut\fP program.
X.PP
XAfter generating the concordance,
Xit should be alphabetized using the Unix \fIsort\fP program.
XKeywords should be grouped and counted with the \fIformat\fP program,
Xand the final results can be sent to the lineprinter.
XHere is a typical program sequence for generating a concordance:
X.nf
X % kwic \-c110 chapter* | sort | format | lpr
X.fi
XUsually, it is better to send the results of FORMAT
Xto a file, where they can be examined and edited,
Xbefore sending the file to the lineprinter.
X.SH FILES
XA temporary file, /tmp/KwicXXXXX,
Xis created if \fIkwic\fP has to work with standard input,
Xbecause seeking can only be done with files.
X.SH "SEE ALSO"
Xformat(hum), kwal(hum), maxwd(hum), tprep(hum), sort(1)
X.SH LIMITATIONS
XWords cannot be longer than 512 characters,
Xnor can the first half of the context.
XLinenumbers cannot exceed 999999 and pagenumbers 
Xcannot exceed 999,99 without skewing the output format.
XMost lineprinters will not print entries longer than 132 characters,
Xand the CAT/4 typesetter cannot handle lines longer than 7.54 inches.
X.SH AUTHOR
XBill Tuthill
X.SH BUGS
XIf there are lots of backspaces in the text,
Xthe context width is somewhat shortened.
XUsing a wheel-like data structure might be more efficient
Xthan using disk seeks and reads to output the contexts.
!!!EOF!!!
if [ "`wc -c kwic.1`" != "    5585 kwic.1" ]
			then	
				echo \07WARNING kwic.1: extraction error
			fi
echo Extracting kwic.c 1>&2
sed 's/^X//' > kwic.c <<\!!!EOF!!!
X# include <stdio.h>				/* kwic.c (rev3.7) */
X# include <ctype.h>
X# include <signal.h>
X
Xusage()			/* print usage and synopsis of options */
X{
X	puts("Key Word In Context concordance program\t\t(rev3.7)");
X	puts("Usage: kwic [-kn -m -wS -fn -r -ln -pn -ic -cn -dF + -] filename(s)");
X	puts("-kn: keyword is n characters long (defaults to 15)");
X	puts("-m : keywords not mapped from upper to lower case");
X	puts("-wS: write string S onto id field (use quotes around blanks)");
X	puts("-fn: filename (up to n characters) written onto id field");
X	puts("-r : reset linenumber to 1 at beginning of every file");
X	puts("-ln: line numbering begins with line n (instead of 1)");
X	puts("-pn: page numbering begins with page n (instead of 1)");
X	puts("-ic: page incrementer is character c (defaults to =)");
X	puts("-cn: context is n characters long (defaults to 50)");
X	puts("-dF: define punctuation set according to file F");
X	puts("+  : the + character indicates cedilla or umlaut");
X	puts("-  : read text from standard input (terminal or pipe)");
X	puts("Kwic will linenumber until first page indicator.");
X	exit(1);
X}
X
Xint kwlen = 15;		/* lefthand keyword length */
Xint kwmap = 1;		/* toggle mapping of keyword to lcase */
Xint wrtid = 0;		/* toggle write onto id field */
Xchar *idfld;		/* string to write onto id field */
Xint wrtfnm = 0;		/* toggle writing of filename */
Xlong lineno = 1;	/* count line numbers */
Xint resetno = 0;	/* reset lineno for new file */
Xchar pgincr = '=';	/* page incrementing character */
Xint pageno = 0;		/* toggle and count page numbers */
Xint cntxt = 50;		/* context width around keyword */
Xint plusm = 0;		/* toggle plus sign as accent mark */
Xchar punct[BUFSIZ/4] 	= ",.;:-?!\"()[]{}" ;
Xchar zerowid[BUFSIZ/4]	= "+" ;
X
Xmain(argc, argv)	/* Key Word In Context concordance program */
Xint argc;
Xchar *argv[];
X{
X	FILE *fp, *fopen();
X	int i;
X
X	if (argc == 1)
X		usage();
X
X	for (i = 1; i < argc; i++)
X	{
X		if (*argv[i] == '-')
X			getflag(argv[i]);
X		else if (*argv[i] == '+')
X			plusm = 1;
X		else if ((fp = fopen(argv[i], "r")) != NULL)
X		{
X			kwic(fp, argv[i]);
X			fclose(fp);
X			if (resetno)
X				lineno = 1;
X		}
X		else
X		{
X			fprintf(stderr,
X			"Kwic cannot access the file: %s\n", argv[i]);
X			continue;
X		}
X	}
X	exit(0);
X}
X
Xgetflag(f)		/* parses command line to set options */
Xchar *f;
X{
X	char *pfile;
X	long atol();
X
X	f++;
X	switch(*f++)
X	{
X		case 'k':
X			kwlen = atoi(f);
X			break;
X		case 'm':
X			kwmap = 0;
X			break;
X		case 'w':
X			wrtid = 1;
X			idfld = f;
X			break;
X		case 'f':
X			wrtfnm = atoi(f);
X			break;
X		case 'r':
X			resetno = 1;
X			break;
X		case 'l':
X			lineno = atol(f);
X			break;
X		case 'p':
X			pageno = atoi(f);
X			break;
X		case 'i':
X			pgincr = *f;
X			break;
X		case 'c':
X			cntxt = atoi(f);
X			break;
X		case 'd':
X			pfile = f;
X			getpunct(pfile);
X			break;
X		case '\0':
X			rd_stdin();
X			break;
X		default:
X			fprintf(stderr,
X			"Invalid kwic flag: -%s\n", --f);
X			exit(1);
X			break;
X	}
X}
X
Xgetpunct(pfile)		/* read user's punctuation from pfile */
Xchar *pfile;
X{
X	FILE *pfp, *fopen();
X	char s[BUFSIZ/4], *strcpy();
X
X	if ((pfp = fopen(pfile, "r")) == NULL)
X	{
X		fprintf(stderr,
X		"Kwic cannot access Punctfile: %s\n", pfile);
X		exit(1);
X	}
X	if (fgets(s, BUFSIZ/4, pfp))
X	{
X		strcpy(punct, s);
X		punct[strlen(punct)-1] = '\0';
X	}
X	if (fgets(s, BUFSIZ/4, pfp))
X	{
X		strcpy(zerowid, s);
X		zerowid[strlen(zerowid)-1] = '\0';
X		plusm = 1;
X	}
X}
X
Xchar *tempfile;		/* file storage for seeking stdin */
X
Xrd_stdin()		/* create tempfile with standard input */
X{
X	FILE *tfp, *fopen();
X	char s[BUFSIZ], *mktemp();
X	int catch();
X
X	tempfile = "/tmp/KwicXXXXX";
X	mktemp(tempfile);
X	tfp = fopen(tempfile, "w");
X	if (signal(SIGINT, SIG_IGN) != SIG_IGN)
X		signal(SIGINT, catch);
X	while (fgets(s, BUFSIZ, stdin))
X		fputs(s, tfp);
X	fclose(tfp);
X	tfp = fopen(tempfile, "r");
X	kwic(tfp, "Stdin");
X	unlink(tempfile);
X}
X
Xcatch()			/* remove tempfile in case of interrupt */
X{
X	unlink(tempfile);
X	fprintf(stderr, "\nInterrupt\n");
X	exit(1);
X}
X
Xkwic(fp, fname)		/* prints kwic entry for each word */
XFILE *fp;
Xchar *fname;
X{
X	char word[BUFSIZ], str[BUFSIZ];
X	long pos, ftell();
X	register int wdlen, i;
X
X	while (wdlen = getword(word, fp))
X	{
X		pr_keywd(word);		/* lefthand keyword */
X		pr_idfld(fname);	/* identification field */
X
X		pos = ftell(fp);		/* where in text? */
X		if (pos < cntxt/2 + wdlen + 2)	/* if at beginning */
X		{
X			fseek(fp, -(pos), 1);
X			for (i = 0; i < cntxt/2 - pos + wdlen + 2; i++)
X				putchar(' ');
X			load_str(str, (int)(pos - wdlen - 1), 1, fp);
X		}
X		else			/* if past beginning of text */
X		{
X			fseek(fp, (long) -(cntxt/2 + wdlen + 2), 1);
X			load_str(str, (cntxt/2 + 1), 0, fp);
X		}
X		printf("%s|", str);	/* print out first half */
X
X		end_entry(fp);		/* print out second half */
X
X		fseek(fp, pos, 0);	/* back to original location */
X		putchar('\n');		/* end concordance entry */
X	}
X}
X
Xpr_keywd(word)		/* print keyword, adjusting for backspaces */
Xchar word[];
X{
X	char *cp, *index();
X	register int i, kwbsno = 0;
X
X	cp = word;
X
X	for (i = 0; i < kwlen; i++)
X	{
X		if (*cp)
X		{
X			if (*cp == '\b')
X				kwbsno += 2;
X			if (plusm && index(zerowid, *cp))
X				kwbsno++;
X			putchar(*cp++);
X		}
X		else
X			putchar(' ');
X	}
X	while (kwbsno-- > 0)
X		putchar(' ');
X	putchar('|');
X}
X
X
Xpr_idfld(fname)		/* print specified id fields and numbering */
Xchar *fname;
X{
X	char *wfile;
X	int i;
X
X	if (wrtid)
X		printf("%s ", idfld);
X	if (wrtfnm)
X	{
X		wfile = fname;
X		for (i = 0; i < wrtfnm; i++)
X			if (*wfile)
X				putchar(*wfile++);
X			else
X				putchar(' ');
X		putchar(' ');
X	}
X	if (pageno)
X		printf("%3d,%2ld|", pageno, lineno);
X	else
X		printf("%6ld|", lineno);
X}
X
Xload_str(str, max, prnt, fp)	/* load first half onto string */
Xchar str[];
Xint max, prnt;
XFILE *fp;
X{
X	int i, bsno = 0;
X	int c;
X
X	for (i = 0; i < max; i++)
X	{
X		c = getc(fp);
X		if (!prnt)
X		{
X			str[i] = ' ';
X			prnt = isspace(c);
X		}
X		else	/* if (prnt) */
X		{
X			if (c == '\n')
X				str[i] = '/';
X			else if (c == '\t')
X				str[i] = ' ';
X			else
X				str[i] = c;
X
X			if (c == '\b')
X				bsno += 2;
X			if (plusm && index(zerowid, c))
X				bsno++;
X		}
X	}
X	str[i] = '\0';
X	while (bsno-- > 0)	/* adjust for backspaces */
X		putchar(' ');
X}
X
Xend_entry(fp)		/* read and write to end of entry */
XFILE *fp;
X{
X	register int i, c;
X
X	for (i = 0; i < cntxt/2 - 2; i++)
X	{
X		c = getc(fp);
X		if (c == '\n')
X			c = '/';
X		if (c == '\t')
X			c = ' ';
X		if (c == EOF)
X			return;
X		putchar(c);
X	}
X	while (!isspace(c = getc(fp)) && c != EOF)
X		putchar(c);
X}
X
Xgetword(word, fp)	/* drives program through text word by word */
Xchar word[];
XFILE *fp;
X{
X	static char nl = 0;
X	register int wlen = 1;
X	register int c;		/* int, so EOF differs from every char */
X
X	if (nl)		/* increments lineno at beginning of line */
X	{
X		nl = 0;
X		lineno++;
X	}
X	while ((c = getc(fp)) != EOF && c && isskip(c))
X		if (c == '\n')	/* skip to text */
X			lineno++;
X
X	if (c == pgincr)
X	{
X		pageno++;	/* pgincr must begin word */
X		lineno = 0;
X	}
X	if (c == EOF)
X		return(0);
X	if (kwmap && isupper(c))
X		c = tolower(c);
X	*word = c;
X
X	while ((c = getc(fp)) != EOF && c && !isskip(c))
X	{				/* get next word */
X		if (kwmap && isupper(c))
X			c = tolower(c);
X		*++word = c;
X		wlen++;
X	}
X	if (c == '\n')	/* set nl at end of line */
X		nl = 1;
X	*++word = '\0';
X
X	return(wlen);
X}
X
Xisskip(c)		/* function to evaluate punctuation */
Xchar c;
X{
X	char *ptr;
X
X	if (isspace(c))
X		return(1);
X	for (ptr = punct; *ptr != c && *ptr != '\0'; ptr++)
X		;
X	if (*ptr == '\0')
X		return(0);
X	else
X		return(1);
X}
!!!EOF!!!
if [ "`wc -c kwic.c`" != "    7396 kwic.c" ]
			then	
				echo \07WARNING kwic.c: extraction error
			fi
echo Extracting format.c 1>&2
sed 's/^X//' > format.c <<\!!!EOF!!!
X# include <stdio.h>			/* format.c (rev3.7) */
X# include <ctype.h>
X# include <signal.h>
X
Xchar *tempfile;		/* to store overflow while counting */
Xint nomap = 0;		/* toggle for mapping keyword to lcase */
Xint nocnt = 0;		/* toggle for counting keyword */
Xint nokwd = 0;		/* toggle for suppressing keyword */
X
Xusage()			/* print proper usage and exit */
X{
X	puts("Usage: format [-mck] [filename(s)]\t\t(rev3.7)");
X	puts("-m: keywords not mapped from lower to upper case");
X	puts("-c: suppress counting of keyword frequency");
X	puts("-k: entirely suppress printing of keyword");
X	exit(1);
X}
X
Xmain(argc, argv)	/* make keyword headings with count */
Xint argc;
Xchar *argv[];
X{
X	FILE *fopen(), *fp;
X	int i, j, onintr();
X	char *mktemp();
X
X	if (signal(SIGINT, SIG_IGN) != SIG_IGN)
X		signal(SIGINT, onintr);
X
X	tempfile = "/tmp/FmtXXXXX";
X	mktemp(tempfile);
X
X	for (i = 1; argv[i] && *argv[i] == '-'; i++)
X	{
X		for (j = 1; argv[i][j] != '\0'; j++)
X		{
X			if (argv[i][j] == 'm')
X				nomap = 1;
X			else if (argv[i][j] == 'c')
X				nocnt = 1;
X			else if (argv[i][j] == 'k')
X				nokwd = 1;
X			else  /* bad option */
X			{
X				fprintf(stderr,
X				"Illegal format flag: -%c\n", argv[i][j]);
X				usage();
X			}
X		}
X	}
X	if (i == argc)
X	{
X		if (nokwd)
X			rmkwds(stdin);
X		else if (nocnt)
X			ffmt(stdin);
X		else
X			format(stdin);
X	}
X	for (; i < argc; i++)
X	{
X		if ((fp = fopen(argv[i], "r")) != NULL)
X		{
X			if (nokwd)
X				rmkwds(fp);
X			else if (nocnt)
X				ffmt(fp);
X			else
X				format(fp);
X			fclose(fp);
X		}
X		else  /* attempt to open file failed */
X		{
X			fprintf(stderr,
X			"Format cannot access the file: %s\n", argv[i]);
X			continue;
X		}
X	}
X	unlink(tempfile);
X	exit(0);
X}
X
Xchar buff[BUFSIZ*8];	/* tempfile buffer for storing contexts */
Xint bufflen;		/* total length of contexts in buffer */
Xint fulltf = 0;		/* does the tempfile contain something? */
XFILE *tf = NULL;	/* file pointer for tempfile routines */
X
Xformat(fp)	  	/* print keyword and count only if different */
XFILE *fp;
X{
X	char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
X	char *sp, *kwp, *cxp, *strcpy();
X	int kwfreq = 0;
X
X	strcpy(okw,"~~~~~");	/* make sure 1st keyword is printed */
X
X	while (fgets(s, BUFSIZ, fp))
X	{
X		for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
X		{
X			if (!nomap && islower(*sp))
X				*kwp = toupper(*sp);
X			else
X				*kwp = *sp;
X		}
X		*kwp = '\0';
X
X		for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
X		{
X			if (*sp == '|') {
X				*cxp = ' '; *++cxp = ' '; *++cxp = ' ';
X			} else
X				*cxp = *sp;
X		}
X		*cxp = '\n';
X		*++cxp = '\0';
X
X		if (strcmp(nkw, okw) != 0)  /* kwds different */
X		{
X			if (kwfreq != 0)
X			{
X				getbuff(kwfreq);
X				putchar('\n');
X			}
X			*buff = '\0';
X			bufflen = 0;
X			fputs(nkw, stdout);
X			putbuff(cntxt);
X			kwfreq = 1;
X		}
X		else  /* if keywords are the same */
X		{
X			putbuff(cntxt);
X			kwfreq++;
X		}
X		strcpy(okw, nkw);
X	}
X	getbuff(kwfreq);
X}
X
Xputbuff(cntxt)		/* cache routine to buffer tempfile */
Xchar cntxt[];
X{
X	char *strcat();
X
X	if (!fulltf)
X	{
X		bufflen += strlen(cntxt);
X		if (bufflen < BUFSIZ*8)
X			strcat(buff, cntxt);
X		else {
X			fulltf = 1;
X			if ((tf = fopen(tempfile, "w")) == NULL)
X			{
X				perror(tempfile);
X				exit(1);
X			}
X			fputs(buff, tf);
X			fputs(cntxt, tf);	/* the entry that overflowed */
X			*buff = '\0';
X			bufflen = 0;
X		}
X	}
X	else  /* fulltf */
X		fputs(cntxt, tf);
X}
X
Xgetbuff(kwfreq)		/* print frequency and context buffer */
Xint kwfreq;
X{
X	char str[BUFSIZ];
X
X	printf("(%d)\n", kwfreq);
X	if (!fulltf)
X		fputs(buff, stdout);
X	else
X	{
X		fclose(tf);
X		if ((tf = fopen(tempfile, "r")) == NULL)
X			perror(tempfile);
X		while (fgets(str, BUFSIZ, tf))
X			fputs(str, stdout);
X		fclose(tf);
X		fulltf = 0;
X	}
X}
X
Xint onintr()		/* remove tempfile in case of interrupt */
X{
X	fprintf(stderr, "\nInterrupt\n");
X	unlink(tempfile);
X	exit(1);
X}
X
Xffmt(fp)	  	/* if different, print keyword without count */
XFILE *fp;
X{
X	char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
X	char *sp, *kwp, *cxp, *strcpy();
X
X	strcpy(okw,"~~~~~");	/* make sure 1st keyword is printed */
X	while (fgets(s, BUFSIZ, fp))
X	{
X		for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
X		{
X			if (!nomap && islower(*sp))
X				*kwp = toupper(*sp);
X			else
X				*kwp = *sp;
X		}
X		*kwp = '\0';
X
X		for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
X		{
X			if (*sp == '|') {
X				*cxp = ' '; *++cxp = ' '; *++cxp = ' ';
X			} else
X				*cxp = *sp;
X		}
X		*cxp = '\n';
X		*++cxp = '\0';
X
X		if (strcmp(nkw, okw) != 0)  /* kwds different */
X			printf("\n%s\n %s", nkw, cntxt);
X		else  /* if keywords are the same */
X			printf(" %s", cntxt);
X		strcpy(okw, nkw);
X	}
X}
X
Xrmkwds(fp)		/* completely suppress printing of keyword */
XFILE *fp;
X{
X	char s[BUFSIZ], *sp;
X
X	while (fgets(s, BUFSIZ, fp))
X	{
X		for (sp = s; *sp && *sp != '|'; sp++)
X			;
X		for (; *sp; sp++)
X		{
X			if (*sp == '|')
X				printf("   ");
X			else
X				putchar(*sp);
X		}
X	}
X}
!!!EOF!!!
if [ "`wc -c format.c`" != "    4765 format.c" ]
			then	
				echo \07WARNING format.c: extraction error
			fi