[net.math.stat] Positive thinking

gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)

In the 489 text-lines in the net.jobs files currently on our
machine, there are only 2 occurrences of the word "not"!  Here
is a comparison with results for some other net groups:

		     LINES   LINES
GROUP		      WITH	OF
		     "not"    TEXT    RATIO
net.flame		 84   2394    3.51%
net.general		  8    672    1.19%
net.jobs		  2    489    0.41%
net.math		 10    426    2.34%
net.news		 83   2345    3.54%
net.nlang		101   2168    4.65%
net.unix-wizards	136   3633    3.77%

     What started all this was that I was posting a job announcement
on someone's behalf, and I wanted to see what phrase was used on
the net corresponding to the local "Please do not reply to this
account".  So I went into the directory with the net.jobs files, did
"grep not *", and was surprised to find only two occurrences (neither
of which was what I wanted).  To get some good comparisons, I eventually
set up the alias

alias not 'sed -n -f file1 /usr/spool/news/net/\!^/[1-9]*[0-9]|sed -n -f file2|sed -n -f file3'

(file1, file2 and file3 are given below).
Then the command "not groupname" gave me my raw statistics.  In the
alias, the file-listing .../[1-9]*[0-9] is to exclude files like
uucp and 4bsd under "bugs", which are actually subdirectories and
confuse sed.  The sed script file1 removes all header material (lines
beginning Xxxxx:) and all empty lines.  The script file2 prints only
the lines containing the word "not", and ends by printing the total
number of lines it has seen.  Finally, file3 prints the number of lines
it has seen, followed by the last of them, which is the total from file2.
The first number is one more than the number of lines containing "not",
since file3 also sees the count line from the previous script.  So one
subtracts 1 from the first of the two numbers and divides by the second.
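
The three sed passes described above could probably be collapsed into
one filter.  What follows is only a sketch, not what I actually ran:
the function name not_ratio is made up, the header pattern is the
"afterthought" version noted in file1, and awk stands in for file2 and
file3 so that the subtraction and the division are done for you.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original alias.
# Reads article text on standard input and prints three fields:
#   lines-with-"not"  lines-of-text  percentage
not_ratio() {
    # Same job as file1: drop header lines (Xxxxx:) and empty lines.
    # Uses the afterthought pattern, so it is an assumption that this
    # matches the counts obtained with the original pattern.
    sed -e '/^[A-Z][^ ]*:/d' -e '/^$/d' |
    # Same test as file2: pad the line, then look for the word "not".
    awk '{ line = " " $0 " " }
         line ~ /[^A-Za-z]not[^a-z]/ { hits++ }
         END { printf "%d %d %.2f%%\n", hits, NR, (NR ? 100 * hits / NR : 0) }'
}
```

Assuming the same spool layout as in the alias, something like
"cat /usr/spool/news/net/jobs/[1-9]*[0-9] | not_ratio" would give the
figures for one group directly.
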
     The method is primitive -- one should really count words rather
than lines, since files with short lines will naturally yield a lower
frequency of any word, and on the other hand multiple occurrences in
one line are not counted.  It is curious that in net.general, which had
the second lowest ratio, more than half the "not"s came from one
file, the MT XINU bug report announcement with its disclaimers.
Without that it might approach net.jobs in sparsity of "not"s.  I'm
really surprised at the low frequency of this very basic word.  I'm not
going to pursue this any further -- newsstats@seismo, want to start a
new project?  (Featuring a different word each month?)
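
For anyone who does take this up, counting words rather than lines
might look roughly like the following.  This is a sketch under my own
assumptions: the name not_word_ratio is hypothetical, a "word" is taken
to be any run of letters and apostrophes, and header stripping is left
to whatever feeds the function.

```shell
#!/bin/sh
# Hypothetical helper: word-based count of "not" on standard input.
# Prints two fields: occurrences-of-"not"  total-words
not_word_ratio() {
    # Turn every run of characters other than letters and apostrophes
    # into a single newline, so each word sits on its own line.
    tr -cs "A-Za-z'" '\n' |
    # Count non-empty lines (words) and those that are exactly "not".
    awk 'NF { total++ }
         $0 == "not" { hits++ }
         END { printf "%d %d\n", hits, total }'
}
```

Line length no longer matters with this approach, and two occurrences
in one line are counted twice.  Contractions such as "don't" stay whole
and so are not counted as "not".
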
     Here are the sed scripts used:

::::::::::::::
# file1
/^[A-Z][a-z]*:/d
# afterthought -- this should have been /^[A-Z][^ ]*:/d
/^$/d
p
::::::::::::::
# file2
s/.*/ & /
# above command pads all lines, so next won't miss "not" at either end
/[^A-Za-z]not[^a-z]/p
$=
::::::::::::::
# file3
$=
$p
::::::::::::::
			George Bergman
			Math, UC Berkeley 94720 USA
			...!ucbvax!gbergman%cartan

gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)

Two points just occurred to me re my preceding note on occurrences
of the word "not".  First, one often has "n't" instead.
So I did a similar search (removing the restriction that it be
preceded by a nonalphabetic character, of course) for that sequence
in a few cases:
	GROUP		"not"	"n't"	LINES
	jobs		   2       4	  489
	general		   8       5	  672
	flame		  84     109	 2394
(I draw no conclusions.)  Second, if I wanted to do this seriously
I would try to exclude occurrences in signatures.  Before I had
the alias in final form, when I was looking at grep outputs and
counting lines, I saw a number of "[I am not a stranger ...]"
signatures.  I suppose one should just look for recurring lines of this
sort and have the sed script delete them specifically.
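
Both points could be sketched as small filters.  These are
illustrations only: the names strip_sigs and count_nt are made up, and
the signature pattern uses just the visible part of the line quoted
above, so a real run would need one delete command per recurring
signature actually found.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original scripts.

# Delete known signature lines before counting.  Only the visible part
# of the quoted signature is used here; add one -e expression per
# recurring signature line.
strip_sigs() {
    sed -e '/\[I am not a stranger/d'
}

# Count lines containing "n't" not followed by a lowercase letter.
# Lines are padded first, as in file2, so a match at the end of a line
# is not missed.
count_nt() {
    sed -n -e 's/.*/ & /' -e "/n't[^a-z]/p" | wc -l | tr -d ' '
}
```

One would run the text through strip_sigs first and then through
count_nt (or through the "not" count) to get figures free of those
signatures.
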

			George Bergman
			Math, UC Berkeley 94720 USA
			...!ucbvax!gbergman%cartan