[comp.lang.icon] snipping out sentences from text files

goer@SOPHIST.UCHICAGO.EDU (Richard Goerwitz) (08/14/90)

Recently I have had to do some analysis on my prose (my sentences tend
to be too long, to contain too many polysyllabic words, and to have too
many subordinate clauses).  To do this I wrote a short Icon program.
One procedure I needed was a sentence-finder, which snips sentences out
of a given ASCII file.  This seemed to be of general interest, so I'm
posting it here.

My ulterior motive is that I'll bet a lot of literary types are on
this mailing list, and could improve the algorithm substantially.
I'd really like to see any modifications anyone makes to this code.

-Richard

------------------------------ cut here ------------------------------

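procedure main(a)

    local intext

    # Read the file named by the first argument if there is one and it
    # can be opened; otherwise read standard input.
    intext := open(\a[1]) | &input
    # Write each sentence, preceded by a blank line.
    every write("\n",sentence(intext))

end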


############################################################################
#
#	Name:	 sentence.icn
#
#	Title:	 A simple sentence-finder
#
#	Author:	 Richard Goerwitz
#
#	Version: (just a preliminary hack)
#
############################################################################
#  
#     This procedure reads an ASCII file (or an n/troff file; lines that
#  begin with a period are skipped) and suspends the sentences contained
#  in it.  A sentence is defined as any string of characters ending in a
#  punctuation group that contains an exclamation point, period, or
#  question mark and is followed either by two spaces, or by one space,
#  an optional string of sentence-initial punctuation, and a capital
#  letter or digit.  If I could count on everyone having a
#  /usr/dict/words file, I'd require that the capital letter begin a
#  word which maps to a lowercase entry in /usr/dict/words (so as to
#  avoid calling cases like "R. Goerwitz" a sentence break).  Since not
#  everyone has such a wordlist, I'll rest content to leave the
#  algorithm (imperfect) as it is.  When a sentence ends at the end of a
#  line, a space is appended to that line, followed by the next line
#  from the input file.  This puts the sentence break into a form the
#  algorithm can recognize.
#
############################################################################
#
#  Requires: coexpressions
#
#  See also: segment.icn (if you wish to tokenize)
#
############################################################################


procedure sentence(intext)

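    local sentence, get_line, line, tmp_s, end_part
    static inits, punct
    initial {
	# Characters that may begin a sentence.
	inits := &ucase ++ &digits
	# Characters that may appear in a sentence-final punctuation group.
	punct := '."\'!?)]'
    }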
    sentence := ""
    get_line := create read_line(intext)

    while line := @get_line do {

	# Go on until you can't find any more sentence-endings in line,
	# then break and get another line.
	repeat {

	    # Scan for a sentence break somewhere in line.
	    line ? {

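		# Ugly, but it works.  Look for a punctuation group that
		# contains a period, exclamation point, or question mark
		# and is followed by a space and then either another space
		# or a word beginning with a capital letter or digit
		# (possibly after opening quotes or parentheses).  The
		# opening punctuation is only tested, not tabbed past, so
		# that it stays with the next sentence.  If the group ends
		# the line, we don't have enough context, so append the
		# next line from intext to line & scan again.
		if tmp_s := tab(upto(punct)) &
		    upto('!?.', end_part := tab(many(punct))) &
		    not (pos(-1), line ||:= @get_line, next) &
		    =" " & (=" " | any(inits, &subject, many('\'"(') | &pos))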
		then {
		    suspend sentence || tmp_s || end_part
		    tab(many(' '))
		    line := tab(0)
		    sentence := ""
		    next
		}
		else break
	    }
	}

	# Otherwise just tack line onto sentence & try again.
	sentence ||:= line
    }

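    # Whatever is left when the input runs out is the last sentence.
    # Drop the trailing space added by read_line, and fail rather than
    # return an empty string if the input held no text.
    return "" ~== trim(sentence)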

end
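

############################################################################
#
#     A minimal sketch of the break test described in the header above,
#  applied to one string rather than to a file.  The name first_break is
#  hypothetical, and nothing else in this file calls it.  It returns the
#  position just past the first punctuation group that counts as a
#  sentence break, and fails if s contains no such break.  For example,
#  first_break("It ran.  Then it stopped.") returns 8, so that s[1:8] is
#  "It ran.".  It also shows the imperfection noted above, since
#  first_break("R. Goerwitz") returns 3.
#
############################################################################

procedure first_break(s)

    local group, p
    static inits, punct
    initial {
	inits := &ucase ++ &digits
	punct := '."\'!?)]'
    }

    s ? {
	while tab(upto(punct)) do {
	    # Span the whole punctuation group; remember where it ends.
	    group := tab(many(punct))
	    p := &pos
	    # The group must contain a period, exclamation point, or
	    # question mark, and be followed by two spaces, or by a
	    # space, optional opening punctuation, and a capital or digit.
	    if upto('!?.', group) & =" " &
		(=" " | any(inits, &subject, many('\'"(') | &pos))
	    then return p
	    # Otherwise the scanning position is back at p; keep looking.
	}
    }
    fail

end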




procedure read_line(intext)

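    local line, new_line

    # Strip trailing blanks and tabs from each input line.
    while line := trim(!intext,'\t ') do {
	line ? {
	    # Skip n/troff requests (lines beginning with a period).
	    match(".") & next
	    # Skip leading blanks, tabs, and #s; ignore empty lines.
	    tab(many('\t #'))
	    pos(0) & next
	    # A line ending in a hyphen is left as it is, so that the
	    # next line joins it directly; any other line gets a
	    # trailing space.
	    new_line := tab(-1) || (="-" | (move(1) || " "))
	}
	suspend new_line
    }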

end
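

############################################################################
#
#     A usage sketch, assuming the kind of analysis mentioned in the
#  posting comes down to measuring sentence length in words.  The name
#  sentence_lengths is hypothetical, and so is the definition of a word
#  used here (a run of letters, apostrophes, and hyphens that begins
#  with a letter).  It writes each sentence produced by sentence(),
#  preceded by its word count.  To try it, call sentence_lengths(intext)
#  from main in place of the every-write line.
#
############################################################################

procedure sentence_lengths(intext)

    local s, n
    static wordchars
    initial wordchars := &letters ++ '\'-'

    every s := sentence(intext) do {
	n := 0
	s ? {
	    # Find the start of each word, then span the rest of it.
	    while tab(upto(&letters)) do {
		tab(many(wordchars))
		n +:= 1
	    }
	}
	write(right(n, 5), "  ", s)
    }

end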