athos@apple.com (Rick Eames) (10/13/89)
Okay, here it is: I am writing a program which takes a text file and outputs a concordance of the words in the file. I have it working fine; however, I have problems with contractions (e.g. can't). My question is this: does anyone have any good ideas for filtering every punctuation mark except contraction apostrophes? I know how to do this in Pascal, but not C. Even in Pascal, however, it is still kind of kludgy....

Rick Eames
rice@dg-rtp.dg.com (Brian Rice) (10/18/89)
In article <4726@internal.Apple.COM> athos@apple.com (Rick Eames) writes:
>Okay, here it is: I am writing a program which takes a text file and
>outputs a concordance of the words in the file. I have it working fine;
>however, I have problems with contractions (e.g. can't). My question is
>this: does anyone have any good ideas for filtering every punctuation mark
>except contraction apostrophes?

Below is a function fgetbaseword() which may help; it filters out both punctuation and contractions. For instance, it will work its way through

    Hey, Joe, isn't that Sally's "pinochle deck"?!

by giving back, one word per call (through a char[] buffer one provides it),

    Hey  Joe  is  that  Sally  pinochle  deck

It's only a hundred-odd rather sparse lines of C code, so I felt it would be O.K. to post it here.

Brian Rice   rice@dg-rtp.dg.com   (919) 248-6328
DG/UX Product Assurance Engineering
Data General Corp., Research Triangle Park, N.C.
"My other car is an AViiON."
--------------------clip here-------------------
#include <stdio.h>
#include <string.h>

#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif

#ifndef BOOLEAN
#define BOOLEAN char
#endif

#define WORD_SPLITTERS " \n\t.,;:^?/!@#$%%^&*()_-=+<>{}[]\\~|`\""
/* I included two %'s so that one can printf WORD_SPLITTERS
   without getting tricked. */

int fgetbaseword(FILE *fp, char *s, int lim)
/*
   fgetbaseword() reads the input stream fp and puts the next word,
   minus any contraction it may have appended to it, into s, a
   buffer of lim characters.  It returns the length of that word
   (0 if no word was found).  (Note that it will not behave
   properly for words like "O'Shaughnessy", and "ain't" will trick
   it into reporting "ai".)  Punctuation is filtered out.
   fgetbaseword() is case-insensitive: "N'T" is handled like "n't".
   The case of the word itself is left alone.

   This function is an example of a finite-state machine, although
   not necessarily an efficient one.  If you don't understand the
   code and you don't know what a F.S.M. is, it might help to
   find out.

   fgetbaseword() has as its spiritual ancestor getline(), from
   page 67 of K&R-1.  All hail.
*/
{
    int c, c2, i;
    char *end_of_word;
    BOOLEAN in_word;
    BOOLEAN ignore_text;
    BOOLEAN maybe_nt;   /* n't is a hard one to deal with, so we give
                           our finite-state machine a special state
                           for it */

    i = 0;
    end_of_word = NULL;
    in_word = TRUE;
    ignore_text = FALSE;
    maybe_nt = FALSE;

    while (--lim > 0 && (c = getc(fp)) != EOF) {
        /* A splitter ends the current word. */
        if (strchr(WORD_SPLITTERS, c)) {
            if (in_word) {
                if (!ignore_text) {
                    end_of_word = s + i;
                }
                in_word = FALSE;
            }
            ignore_text = FALSE;
            continue;
        }

        /* An apostrophe starts a contraction to be dropped. */
        if (c == '\'') {
            if (in_word) {
                if (maybe_nt) {
                    /* Keep the 'n' unless a 't' follows. */
                    in_word = FALSE;
                    if ((c2 = getc(fp)) != 't' && c2 != 'T') {
                        end_of_word = s + i;
                    }
                    if (c2 == EOF) {
                        break;
                    }
                } else {
                    end_of_word = s + i;
                    ignore_text = TRUE;
                }
            }
            continue;
        }

        /* An 'n' may begin "n't": mark the word end before it. */
        if (c == 'n' || c == 'N') {
            if (in_word || i == 0) {   /* i == 0: word starts here */
                in_word = TRUE;
                end_of_word = s + i;
                s[i++] = c;
                maybe_nt = TRUE;
                continue;
            } else {
                ungetc(c, fp);
                break;
            }
        }

        if (in_word) {
            if (!ignore_text) {
                s[i++] = c;
                maybe_nt = FALSE;
            }
            continue;
        } else {
            if (i == 0) {
                s[i++] = c;
                in_word = TRUE;
            } else {
                /* First character of the next word: push it back. */
                ungetc(c, fp);
                break;
            }
        }
    }

    /* Treat end of input like a splitter, so that a word cut off by
       EOF is not truncated at a tentative "n't" mark. */
    if (in_word && !ignore_text) {
        end_of_word = s + i;
    }

    if (end_of_word == NULL) {
        s[i] = '\0';
        return i;
    } else {
        *end_of_word = '\0';
        return (int)(end_of_word - s);
    }
}

/* Brian Rice, 1989
   This code is in the public domain.  Everyone may copy, use, and
   modify it at will. */
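
fgetbaseword() drops the contraction along with the punctuation. For the question as originally asked (keeping the apostrophe in words like can't), the same finite-state idea can be sketched with explicit states. This is only a sketch and not part of the posted code: the names next_word() and put(), the three state names, and the choice to count letters and digits (via isalnum()) as word characters are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical names throughout; none of this comes from the post. */
enum state { BETWEEN_WORDS, IN_WORD, AFTER_APOSTROPHE };

/* Append c to buf if there is still room for it and a final '\0'. */
static void put(char *buf, size_t cap, size_t *n, int c)
{
    if (*n + 1 < cap) {
        buf[(*n)++] = (char)c;
    }
}

/* Read the next word from fp into buf (cap bytes, '\0' included).
   Assumption: letters and digits are word characters.  An apostrophe
   is kept only when a word character stands on both sides of it.
   Returns the word's length, or 0 when no further word is found. */
size_t next_word(FILE *fp, char *buf, size_t cap)
{
    enum state st = BETWEEN_WORDS;
    size_t n = 0;
    int c;

    if (cap == 0) {
        return 0;                       /* no room even for the '\0' */
    }
    while ((c = getc(fp)) != EOF) {
        int is_word_char = isalnum((unsigned char)c);

        if (st == BETWEEN_WORDS) {
            if (is_word_char) {         /* a word starts here */
                put(buf, cap, &n, c);
                st = IN_WORD;
            }                           /* otherwise skip punctuation */
        } else if (st == IN_WORD) {
            if (is_word_char) {
                put(buf, cap, &n, c);
            } else if (c == '\'') {
                st = AFTER_APOSTROPHE;  /* decide once the next char is seen */
            } else {
                break;                  /* any other character ends the word */
            }
        } else {                        /* AFTER_APOSTROPHE */
            if (!is_word_char) {
                break;                  /* trailing apostrophe: drop it */
            }
            put(buf, cap, &n, '\'');    /* interior apostrophe: keep it */
            put(buf, cap, &n, c);
            st = IN_WORD;
        }
    }
    buf[n] = '\0';
    return n;
}
```

Called repeatedly on the sample sentence from the post, this yields Hey, Joe, isn't, that, Sally's, pinochle, deck, and then returns 0. An apostrophe used as a quotation mark at the start or end of a word is dropped, since it is not flanked by word characters on both sides.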