gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)
In the 489 text-lines in the net.jobs files currently on our
machine, there are only 2 occurrences of the word "not"! Here
is a comparison with results for some other net groups:
GROUP              LINES WITH "not"   LINES OF TEXT   RATIO
net.flame                        84            2394   3.51%
net.general                       8             672   1.19%
net.jobs                          2             489   0.41%
net.math                         10             426   2.34%
net.news                         83            2345   3.54%
net.nlang                       101            2168   4.65%
net.unix-wizards                136            3633   3.77%
What started all this was that I was posting a job announcement
on someone's behalf, and I wanted to see what phrase was used on
the net corresponding to the local "Please do not reply to this
account". So I went into the directory with the net.jobs files, did
"grep not *", and was surprised to find only two occurrences (neither
of which was what I wanted). To get some good comparisons, I eventually
set up the alias
alias not 'sed -n -f file1 /usr/spool/news/net/\!^/[1-9]*[0-9]|sed -n -f file2|sed -n -f file3'
(file1, file2, and file3 are given below).
Then the command "not groupname" gave me my raw statistics. In the
alias, the file-listing .../[1-9]*[0-9] is to exclude files like
uucp and 4bsd under "bugs", which are actually subdirectories and
confuse sed. The sed-script file1 removes all header-material (lines
beginning Xxxxx:) and all empty lines, the sed-script file2 prints only
lines containing the word "not", but ends by giving the total number
of lines it has seen, and file3 prints the last line it sees and the
number of lines it has seen. This number will be one more than the
number of lines containing "not", since it also sees the count-line from
the previous script. So one subtracts 1 from the first of
the two numbers one gets, and takes the ratio.
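As a rough illustration of that arithmetic, here is a minimal Python sketch, not the method actually used. It reuses the regular expressions from file1 and file2; the function names and the sample lines are made up for the example. For net.jobs the alias would presumably have reported 3 and 489, which gives the 0.41% in the table.

```python
import re

HEADER = re.compile(r"^[A-Z][a-z]*:")         # same pattern as file1
NOT_WORD = re.compile(r"[^A-Za-z]not[^a-z]")  # same pattern as file2


def count_not_lines(lines):
    """Return (lines containing "not", total text lines).

    Hypothetical helper: drops header and empty lines as file1 does,
    then pads each line with spaces as file2 does before matching.
    """
    text = [ln for ln in lines if ln and not HEADER.match(ln)]
    hits = sum(1 for ln in text if NOT_WORD.search(" " + ln + " "))
    return hits, len(text)


def ratio_from_raw(first, second):
    """Turn the two raw numbers printed by the alias into a percentage.

    The first number is one too high because it also counts the
    count-line emitted by file2, so 1 is subtracted before dividing.
    """
    return 100.0 * (first - 1) / second


# Made-up sample article, for illustration only.
sample = [
    "Subject: opening",       # header line, dropped
    "",                       # empty line, dropped
    "Do not reply here.",     # counted
    "Nothing else to note.",  # "Nothing" and "note" do not match
    "Send mail, not news.",   # counted
]

hits, total = count_not_lines(sample)
print(hits, total)
print(round(ratio_from_raw(3, 489), 2))
```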
The method is primitive -- one should really count words rather
than lines, since files with short lines will naturally yield a lower
frequency of any word, and on the other hand multiple occurrences in
one line are not counted. It is curious that in net.general, which had
the second lowest ratio, more than half the "not"s came from one
file, the MT XINU bug report announcement with its disclaimers.
Without that it might approach net.jobs in sparsity of "not"s. I'm
really surprised at the low frequency of this very basic word. I'm not
going to pursue this any further -- newsstats@seismo, want to start a
new project? (Featuring a different word each month?)
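A word-based count along the lines suggested above might look like the following Python sketch. It is only an illustration under assumptions: the function name and the sample text are invented, a "word" is taken to be a run of letters and apostrophes, and the match is case-insensitive (unlike the sed pattern in file2).

```python
import re

# Assumption: a word is a run of letters and apostrophes, so "don't"
# stays one word and is not mistaken for an occurrence of "not".
WORD = re.compile(r"[A-Za-z']+")


def word_frequency(lines, target="not"):
    """Return (occurrences of target, total words), counted per word."""
    words = [w.lower() for ln in lines for w in WORD.findall(ln)]
    return words.count(target), len(words)


# Made-up sample text, for illustration only.
sample = ["It is not that I do not care,", "but not now."]

hits, total = word_frequency(sample)
print(hits, total)
```

On this sample a line count would report 2 lines out of 2, while the word count reports 3 occurrences out of 11 words, so line length no longer affects the ratio and repeated occurrences in one line are all counted.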
Here are the sed scripts used:
::::::::::::::
# file1
/^[A-Z][a-z]*:/d
# afterthought -- this should have been /^[A-Z][^ ]*:/d
/^$/d
p
::::::::::::::
# file2
s/.*/ & /
# above command pads all lines, so next won't miss "not" at either end
/[^A-Za-z]not[^a-z]/p
$=
::::::::::::::
# file3
$=
$p
::::::::::::::
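To see what the afterthought in file1 changes, here is a small Python sketch comparing the two header patterns. The sample lines are invented, and the hyphenated header names are assumed to be typical of news article headers; for patterns this simple, Python's re module behaves the same way as sed.

```python
import re

ORIGINAL = re.compile(r"^[A-Z][a-z]*:")  # pattern actually used in file1
REVISED = re.compile(r"^[A-Z][^ ]*:")    # pattern from the afterthought


def is_header(pattern, line):
    """Return True if the pattern would make sed delete this line."""
    return pattern.match(line) is not None


# Made-up sample lines, for illustration only.
samples = [
    "Subject: job opening",
    "Message-ID: <123@example>",
    "Reply-To: someone",
    "Plain text line",
]

for line in samples:
    print(is_header(ORIGINAL, line), is_header(REVISED, line), line)
```

The original pattern misses header names containing a hyphen or a second capital letter, so such lines would have been counted as text; the revised pattern catches them.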
George Bergman
Math, UC Berkeley 94720 USA
...!ucbvax!gbergman%cartan
gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)
Two points just occurred to me re my preceding note on occurrences
of the word "not". First, one often has "n't" instead. So I did a
similar search (removing the restriction that it be preceded by a
nonalphabetic character, of course) for that sequence in a few cases:
GROUP      "not"   "n't"   LINES
jobs           2       4     489
general        8       5     672
flame         84     109    2394
(I draw no conclusions.)
Second, if I wanted to do this seriously I would try to exclude
occurrences in signatures. Before I had the alias in final form, when
I was looking at grep outputs and counting lines, I saw a number of
"[I am not a stranger ...]" signatures. I suppose one should just look
for recurring lines of this sort and have the sed script delete them
specifically.
George Bergman
Math, UC Berkeley 94720 USA
...!ucbvax!gbergman%cartan