[comp.lang.c] Silly Question?

athos@apple.com (Rick Eames) (10/13/89)

Okay, here it is:  I am writing a program which takes a text file and 
outputs a concordance of the words in the file.  I have it working fine; 
however, I have problems with contractions (e.g. can't).  My question is 
this: does anyone have any good ideas for filtering every punctuation mark 
except contraction apostrophes?  I know how to do this in Pascal, but not 
in C.  Even in Pascal, however, it is still kind of kludgy....

Rick Eames

rice@dg-rtp.dg.com (Brian Rice) (10/18/89)

In article <4726@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>Okay, here it is:  I am writing a program which takes a text file and 
>outputs a concordance of the words in the file.  I have it working fine; 
>however, I have problems with contractions (e.g. can't).  My question is 
>this: does anyone have any good ideas for filtering every punctuation mark 
>except contraction apostrophes?  

Below is a function fgetbaseword() which may help; it strips both 
punctuation and contraction or possessive suffixes.  For instance, it 
will work its way through 
	Hey, Joe, isn't that Sally's "pinochle deck"?!
by giving back, one word per call (through a char[] buffer you provide), 
	Hey Joe is that Sally pinochle deck

It's only 116 rather sparse lines of C code, so I felt it would be O.K. 
to post it here.

Brian Rice   rice@dg-rtp.dg.com   (919) 248-6328
DG/UX Product Assurance Engineering
Data General Corp., Research Triangle Park, N.C.
"My other car is an AViiON."
--------------------clip here-------------------
#include <stdio.h>
#include <string.h>
#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif
#ifndef BOOLEAN
#define BOOLEAN char
#endif
#define WORD_SPLITTERS " \n\t.,;:^?/!@#$%%^&*()_-=+<>{}[]\\~|`\""
/* I included two %'s so that one can printf WORD_SPLITTERS without
   getting tricked. */

int fgetbaseword(fp, s, lim)
FILE *fp;
char *s;
int lim;
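/* fgetbaseword() reads the input stream fp and puts the next word, minus
   any contraction or possessive suffix attached to it, into s, reading
   fewer than lim characters.  It returns the length of the word stored.
   A return value of 0 does not by itself mean end of input, so callers
   should check feof(fp) as well.

   Note that it will not behave properly for words like "O'Shaughnessy",
   and words whose stem ends in 'n' get clipped: "ain't" comes back as
   "ai" and "can't" as "ca".  Punctuation is filtered out.  The n't
   check accepts either case ("N'T" as well as "n't"); the letters
   stored in s keep their original case.

   This function is an example of a finite-state machine, although
   not necessarily an efficient one.  If you don't understand the
   code and you don't know what an F.S.M. is, it might help to find
   out. 

   fgetbaseword() has as its spiritual ancestor getline(), from page 67 
   of K&R-1.  All hail. */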
{
	int c,c2,i;
	char *end_of_word;
	BOOLEAN in_word;
	BOOLEAN ignore_text;
	BOOLEAN maybe_nt;  /* n't is a hard one to deal with, so
	                      we give our finite-state machine a
	                      special state for it */

	i = 0;
	end_of_word = NULL;
	in_word = TRUE;
	ignore_text = FALSE;
	maybe_nt = FALSE;

	while (--lim > 0 && (c = getc(fp)) != EOF) {
		if (strchr(WORD_SPLITTERS,c)) {
			if (in_word) {
				if (!ignore_text) {
					end_of_word = s+i;
				}
				in_word = FALSE;
			}
			ignore_text = FALSE;
			continue;
		}
		if (c == '\'') {
			if (in_word) {
				if (maybe_nt) {
					in_word = FALSE;
					if ((c2=getc(fp)) != 't' &&
				  	    c2 != 'T') {
						end_of_word = s+i;
					} 
					if (c2 == EOF) {
						break;
					}
				} else {
					end_of_word = s+i;
					ignore_text = TRUE;
				}
			}
			continue;
		}
						
		if (c == 'n' || c == 'N') {
			if (in_word) {
				end_of_word = s+i;
				s[i++] = c;
				maybe_nt = TRUE;
				continue;
			} else {
				ungetc(c,fp);
				break;
			}
		}
				
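		if (in_word) {
			if (!ignore_text) {
				s[i++] = c;
				maybe_nt = FALSE;
				/* The word continues, so drop any provisional
				   mark left by an earlier 'n'; otherwise a word
				   that runs into EOF would be cut off there. */
				end_of_word = NULL;
			}
			continue;
		} else {
			if (i == 0) {
				/* First letter after leading punctuation:
				   forget the stale mark at s+0. */
				end_of_word = NULL;
				s[i++] = c;
				in_word = TRUE;
			} else {
				ungetc(c,fp);
				break;
			}
		}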
	}

	if (end_of_word == NULL) {
		s[i] = '\0';
		return i;
	} else {
		*end_of_word = '\0';
		return (end_of_word - s);
	}

}
/* Brian Rice, 1989
   This code is in the public domain.  Everyone may
   copy, use, and modify it at will. */
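
For the case the original question describes, where contraction
apostrophes are kept rather than stripped, a minimal sketch might look
like the following.  It is not the method used in fgetbaseword() above:
the name next_word(), the three states, and the rule "keep an apostrophe
only when it has a letter on both sides" are all assumptions made for
illustration.  Words longer than the buffer are simply truncated.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* States of a small explicit finite-state machine. */
enum state { BETWEEN, IN_WORD, PENDING_APOS };

/* next_word() is a hypothetical helper, not part of the posted code.
   It reads the next word from fp into buf (capacity cap, including
   the terminating '\0') and returns its length, or 0 at end of input.
   An apostrophe is kept only if letters appear on both sides of it,
   so "can't" stays "can't" while quotes and trailing apostrophes are
   dropped. */
size_t next_word(FILE *fp, char *buf, size_t cap)
{
	enum state st = BETWEEN;
	size_t n = 0;
	int c;

	assert(fp != NULL && buf != NULL && cap > 0);

	while ((c = getc(fp)) != EOF) {
		if (isalpha(c)) {
			/* A letter after a pending apostrophe confirms
			   that the apostrophe was inside the word. */
			if (st == PENDING_APOS && n + 1 < cap)
				buf[n++] = '\'';
			if (n + 1 < cap)
				buf[n++] = (char)c;
			st = IN_WORD;
		} else if (c == '\'' && st == IN_WORD) {
			/* Might be a contraction; decide on the next
			   character. */
			st = PENDING_APOS;
		} else if (st != BETWEEN) {
			/* Any other character ends the current word. */
			break;
		}
		/* Otherwise we are between words: skip the character. */
	}
	buf[n] = '\0';
	return n;
}
```

Since a word is never empty, a caller can loop with
while (next_word(fp, buf, sizeof buf) > 0) and treat 0 as end of input.
Run over the sample sentence above, this sketch yields
Hey Joe isn't that Sally's pinochle deck.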