gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)
In the 489 text lines in the net.jobs files currently on our machine, there are only 2 occurrences of the word "not"! Here is a comparison with results for some other net groups:

                     LINES    LINES
                     WITH     OF
GROUP                "not"    TEXT

net.flame              84     2394    3.51%
net.general             8      672    1.19%
net.jobs                2      489    0.41%
net.math               10      426    2.34%
net.news               83     2345    3.54%
net.nlang             101     2168    4.65%
net.unix-wizards      136     3633    3.77%

What started all this was that I was posting a job announcement on someone's behalf, and I wanted to see what phrase was used on the net corresponding to the local "Please do not reply to this account". So I went into the directory with the net.jobs files, did "grep not *", and was surprised to find only two occurrences (neither of which was what I wanted).

To get some good comparisons, I eventually set up the alias

    alias not 'sed -n -f file1 /usr/spool/news/net/\!^/[1-9]*[0-9]|sed -n -f file2|sed -n -f file3'

(file1 through file3 are given below). Then the command "not groupname" gave me my raw statistics.

In the alias, the file-listing .../[1-9]*[0-9] is to exclude files like uucp and 4bsd under "bugs", which are actually subdirectories and confuse sed. The sed-script file1 removes all header material (lines beginning Xxxxx:) and all empty lines; file2 prints only lines containing the word "not", but ends by giving the total number of lines it has seen; and file3 prints the number of lines it has seen, followed by the last line it sees. The first of these two numbers will be one more than the number of lines containing "not", since file3 also sees the count line from the previous script. So one subtracts 1 from the first of the two numbers one gets, and takes its ratio to the second.

The method is primitive -- one should really count words rather than lines, since files with short lines will naturally yield a lower frequency of any word, and on the other hand multiple occurrences in one line are not counted.
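The final arithmetic, and the word-based count suggested as a fairer measure, might be scripted roughly as follows. This is only a sketch: the function names not_percent and word_percent and their output layouts are invented for illustration, awk and tr stand in for hand calculation, and word_percent assumes the headers have already been stripped (for example by file1).

```shell
#!/bin/sh
# Sketch only: function names and output layouts are illustrative,
# not taken from the post.

# not_percent: read the two numbers file3 prints (lines seen by file3,
# then total text lines) and print "hits total percent".
not_percent() {
    awk 'NR == 1 { hits = $1 - 1 }    # discount the count line emitted by file2
         NR == 2 { total = $1 }
         END {
             if (total > 0)
                 printf "%d %d %.2f%%\n", hits, total, 100 * hits / total
         }'
}

# word_percent: count the word "not" per word instead of per line.
# Expects text with headers already removed (for example by file1).
# Like file2, it matches lowercase "not" only. Apostrophes split words,
# so "isn't" is counted as two words.
word_percent() {
    tr -cs 'A-Za-z' '\n' |            # one alphabetic word per line
    awk 'NF          { total++ }
         $0 == "not" { hits++ }
         END {
             if (total > 0)
                 printf "%d %d %.2f%%\n", hits, total, 100 * hits / total
         }'
}

# The net.jobs figures: file3 would print 3, then 489.
printf '3\n489\n' | not_percent
```

With the net.jobs numbers this prints the 2 hits, the 489 text lines, and 0.41%. Something like "sed -n -f file1 articles | word_percent" would give the per-word figure, where "articles" stands for whatever files are being examined.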
It is curious that in net.general, which had the second lowest ratio, more than half the "not"s came from one file, the MT XINU bug report announcement with its disclaimers. Without that it might approach net.jobs in sparsity of "not"s.

I'm really surprised at the low frequency of this very basic word. I'm not going to pursue this any further -- newsstats@seismo, want to start a new project? (Featuring a different word each month?)

Here are the sed scripts used:

::::::::::::::
# file1
/^[A-Z][a-z]*:/d
# afterthought -- this should have been /^[A-Z][^ ]*:/d
/^$/d
p
::::::::::::::
# file2
s/.*/ & /
# above command pads all lines, so next won't miss "not" at either end
/[^A-Za-z]not[^a-z]/p
$=
::::::::::::::
# file3
$=
$p
::::::::::::::

George Bergman
Math, UC Berkeley 94720 USA
...!ucbvax!gbergman%cartan
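The three scripts can also be collapsed into a single pass. The following is a sketch, not the method that produced the figures in the post: count_not is a made-up name, awk replaces the chain of sed calls, and the header test uses the corrected pattern from the afterthought comment, so its counts could differ slightly from those reported.

```shell
#!/bin/sh
# count_not: one-pass sketch of what file1, file2 and file3 compute together.
# Reads article text on standard input and prints
#   lines-with-"not"  text-lines  percent
# The name and output layout are illustrative, not from the post.
count_not() {
    awk '
        /^[A-Z][^ ]*:/ { next }          # header line such as "Subject: ..."
        /^$/           { next }          # empty line
        {
            total++
            line = " " $0 " "            # pad both ends, as file2 does
            if (line ~ /[^A-Za-z]not[^a-z]/) hits++
        }
        END {
            if (total > 0)
                printf "%d %d %.2f%%\n", hits, total, 100 * hits / total
        }'
}
```

Assuming the spool layout named in the alias, "cat /usr/spool/news/net/jobs/[1-9]*[0-9] | count_not" would report on net.jobs, with no need to subtract 1 by hand.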
gbergman@ucbtopaz.CC.Berkeley.ARPA (11/14/84)
Two points just occurred to me re my preceding note on occurrences of the word "not".

First, one often has "n't" instead. So I did a similar search (removing the restriction that it be preceded by a nonalphabetic character, of course) for that sequence in a few cases:

GROUP      "not"   "n't"   LINES

jobs          2       4     489
general       8       5     672
flame        84     109    2394

(I draw no conclusions.)

Second, if I wanted to do this seriously I would try to exclude occurrences in signatures. Before I had the alias in final form, when I was looking at grep outputs and counting lines, I saw a number of "[I am not a stranger ...]" signatures. I suppose one should just look for recurring lines of this sort and have the sed script delete them specifically.

George Bergman
Math, UC Berkeley 94720 USA
...!ucbvax!gbergman%cartan
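Both points could be combined in one small filter. What follows is a sketch under stated assumptions: count_nt and SIG_TEXT are hypothetical names, the signature is matched as a fixed string using only the part of it quoted in the note, and "n't" is matched anywhere in a line, which is simpler than the padded pattern in file2.

```shell
#!/bin/sh
# SIG_TEXT: hypothetical fixed string identifying a recurring signature line.
# Only the portion quoted in the note is used; the rest is unknown.
SIG_TEXT='[I am not a stranger'

# count_nt: print the number of text lines containing "n't", then the
# number of text lines, after dropping headers, empty lines and any
# line containing SIG_TEXT.
count_nt() {
    sed -e '/^[A-Z][^ ]*:/d' -e '/^$/d' |    # header and empty lines, as in file1
    grep -v -F -e "$SIG_TEXT" |              # delete the known signature line
    awk -v pat="n't" '
        $0 ~ pat { hits++ }
        END      { printf "%d %d\n", hits, NR }'
}
```

More signatures could be handled by putting one fixed string per line in a file and passing it to grep with -f in place of -e.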