[comp.theory.info-retrieval] IRList Digest V4 #14

FOXEA@VTVAX3.BITNET (02/29/88)

IRList Digest           Sunday, 28 February 1988      Volume 4 : Issue 14

Today's Topics:
   Source - Discussion and UNIX code for co-term term relations

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date:  8 Feb 88 10:10 +0100
From: wyle%solaris.uucp@relay.cs.net
Subject: co-term term relations code attached


Hi Ed!

Sorry I haven't contributed anything to your digest for such a long
time.  Care and feeding of the machines here has kept me too busy for
many things...

Attached is a Unix shell-script which calculates the cosine co-term
relationship defined in "Co-Word Search:  A System for Information
Retrieval," Bertrand Michelet and W.A. Turner, Journal of Information
Science, vol. 11 (1985), pp. 173-181.  The relationship is

Aij = Cij * Cij / (Ci * Cj)

where Cij is the number of sentences in the collection in which words i
and j co-occur (each sentence counted at most once), Ci is the number of
sentences in which word i occurs, and Cj is the corresponding count for
word j.
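
As a made-up example: if word i occurs in 10 sentences, word j occurs in
5 sentences, and the two occur together in 4 sentences, then
Aij = (4 * 4) / (10 * 5) = 0.32.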

A separate shell script that filters out the 150 most common English
words is also in the shell archive.  Running these scripts against the
CACM document test collection (2.2 megabytes) takes about half an hour
on a normally loaded Sun-3/280 and generates a 6-megabyte report file.
The object was a quick-and-dirty solution, not an efficient one.

I helped teach a course in IR this semester and have some other
interesting "example solutions" to assignments in statistical text
analysis, all written in Unix shell-script or Modula-2.  If anyone is
interested, send me e-mail (wyle%ifi.ethz.ch@relay.cs.net).  Please let
me know if other such courses taught on Unix machines exist, and let's
share experiences about the exercises.

I am curious about others who are using Unix shell scripts and commands
for IR text analysis.  There is a wealth of text-filtering utilities
built into Unix, including awk, sed, and tr, as well as obscure (?)
concordance systems like egrep and ptx.  There are also public-domain
packages such as the humanities (hum) system by Dr. Tuthill, SMART by
Chris Buckley & Dr. Salton, etc.
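
As a small illustration of what I mean (the file name is hypothetical,
and this pipeline is not part of the attached archive), a rough
word-frequency list can be produced with the standard tools alone:

cat sample.txt |
tr -cs A-Za-z '\012' |          # put each word on its own line
tr A-Z a-z |                    # fold upper case to lower case
sort | uniq -c |                # count each distinct word
sort -rn | sed 25q              # show the 25 most frequent words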

[Note: When I started the project to redo SMART for UNIX at Cornell in
1980, we began with a few programs and a bunch of UNIX tools all
pieced together.  Gradually, to make things more efficient, the UNIX
tools and the use of Ingres were eliminated.  But for teaching purposes,
it is a good approach.  By the way, there is an effort at U. Chicago,
with Scott Deerwester and others, using UNIX to build tools for text
processing and to help scholars working with text collections. - Ed.]
* * *

Ed, you really should wail on your readers to contribute more.  I am a
member of a different mailing list whose moderator periodically
sends out "Contribute, you leeches!!  You know who you are!" messages
whenever the quality and volume drop.

[Note: Thanks for the encouragement - readers take heed! - Ed.]

* * *

Enjoy the co-word cosine code!  If anyone hacks up improvements,
variations, etc., please let me know!
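
To use it: save everything below the "cut here" line in a file (any name
will do; coword.shar, say, is just an example), feed it to sh to create
p1 and sl, edit the path constants at the top of p1 for your own system,
and then run p1:

   sh coword.shar       # unpacks p1 and sl into the current directory
   ./p1                 # after editing its constants; writes ./report.cacm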

--------------------- cut here ---------------------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#       p1
#       sl
# This archive created: Mon Feb  8 09:25:08 1988
export PATH; PATH=/bin:/usr/bin:$PATH
if test -f 'p1'
then
        echo shar: "will not over-write existing file 'p1'"
else
cat << \SHAR_EOF > 'p1'
#!/bin/sh
#
# Term dependency calculation script  M F Wyle 21.01.88
#

#
# Define some constants.  Be sure to change these per system!
#

# debug
set -x

FILE=/usr/frei/ir/ir/data/testcoll/cacm/docs
SRTD=/usr/frei/ir/wyle
STOPLIST=./sl
TEMPFILE=./tmp.fil
REPORTFILE=./report.cacm
COMBOFILE=./word.combos
COMBOFIL1=./word.combo.j1
FREQFILE=./word.freq

# First calculate the occurrence of each individual word in all sentences:
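# (Duplicates within a sentence are removed first, so each word counts at
#  most once per sentence, matching the definition of Ci above.)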

cat $FILE |
tr -cs \.\?\!A-Za-z ' ' |       # map runs of other chars to a single space
tr '\.\?\!' '\012' |            # separate sentences onto new lines
tr A-Z a-z |                    # convert upper to lower case
$STOPLIST > $TEMPFILE           # eliminate words in stop-list

cat $TEMPFILE |

# Eliminate duplicate words from sentence and print each word:

awk '
{
  for (i = 1; i <= NF; i++) dup[i] = 0
  for (i = 1; i < NF; i++)
  {
    for (j = i+1; j <= NF ; j++)
    {
      if ($i == $j) dup[i] = 1
    }
  }
  for (i = 1; i <= NF ; i++)
    if (dup[i] == 0) print $i
}' |
sort |                                   # sort words in alphabetical order
uniq -c |                                # count occurrence frequency
awk '                                    # print fields in reverse order
{
  print $2 " " $1                        # print word, then occurrence count
}' > $FREQFILE                           # save in a file


########################################################################
#
# Now calculate all word combinations:
#
########################################################################

cat $TEMPFILE |
awk '                        # remove all duplicate words from sentence:
                             # and then print all word pairs from sentence:
{
  for (i = 1; i <= NF; i++) dup[i] = 0
  out = ""

  for (i = 1; i < NF; i++)
  {
    for (j = i+1; j <= NF ; j++)
    {
      if ($i == $j) dup[i] = 1
    }
  }
  for (i = 1; i < NF ; i++)
  {
    for (j = i+1; j <= NF ; j++)
      if ( (dup[i] == 0) && (dup[j] == 0) ) print $i "  " $j
  }
}' |
sort |
uniq -c > $COMBOFILE
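
# Each line of COMBOFILE now has the form:  FreqTogether FirstWord SecondWord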

#########################################################################
#
# Join the files by 1st word, get:
# FirstWord    FirstWordFreq     FreqTogether     SecondWord
#
#########################################################################

join -j2 2 $FREQFILE $COMBOFILE |
sort +3 > $COMBOFIL1                   # sort by SecondWord (the 4th field)

#########################################################################
#
# Join the files by 2nd word, get:
# SecondWord SecondWordFreq FirstWord FirstWordFreq FreqTogether
#
#########################################################################

join -j2 4 $FREQFILE $COMBOFIL1 |
awk '
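# Ajk = FreqTogether / sqrt(FirstWordFreq * SecondWordFreq); note that this
# is the square root of the Aij measure quoted above, i.e. Ajk * Ajk = Aij.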
{
  Ajk = $5 / ( sqrt($2) * sqrt($4) )
  printf("%-35s %-35s %g\n", $3, $1, Ajk)
}' |
sort -T $SRTD +2 -rn > $REPORTFILE     # sort by Ajk value, largest first


#
# (A possible refinement, not implemented here: print only word pairs
#  whose Ajk values lie between 0.1 and 1.)
#
/bin/rm -f $COMBOFILE   # remove work files
/bin/rm -f $COMBOFIL1
/bin/rm -f $FREQFILE
/bin/rm -f $TEMPFILE
SHAR_EOF
chmod +x 'p1'
fi
if test -f 'sl'
then
        echo shar: "will not over-write existing file 'sl'"
else
cat << \SHAR_EOF > 'sl'
#!/bin/sh
awk '
# Build the stop-word table once, before any input is read:
BEGIN {
    sl["a"] = 1;
    sl["b"] = 1
    sl["c"] = 1
    sl["d"] = 1
    sl["e"] = 1
    sl["f"] = 1
    sl["g"] = 1
    sl["h"] = 1
    sl["i"] = 1
    sl["j"] = 1
    sl["k"] = 1
    sl["l"] = 1
    sl["m"] = 1
    sl["n"] = 1
    sl["o"] = 1
    sl["p"] = 1
    sl["q"] = 1
    sl["r"] = 1
    sl["s"] = 1
    sl["t"] = 1
    sl["u"] = 1
    sl["v"] = 1
    sl["w"] = 1
    sl["x"] = 1
    sl["y"] = 1
    sl["z"] = 1
    sl["about"] = 1;
    sl["after"] = 1;
    sl["against"] = 1;
    sl["all"] = 1;
    sl["also"] = 1;
    sl["an"] = 1;
    sl["and"] = 1;
    sl["another"] = 1;
    sl["any"] = 1;
    sl["are"] = 1;
    sl["as"] = 1;
    sl["at"] = 1;
    sl["back"] = 1;
    sl["be"] = 1;
    sl["because"] = 1;
    sl["been"] = 1;
    sl["before"] = 1;
    sl["being"] = 1;
    sl["between"] = 1;
    sl["both"] = 1;
    sl["but"] = 1;
    sl["by"] = 1;
    sl["came"] = 1;
    sl["can"] = 1;
    sl["come"] = 1;
    sl["could"] = 1;
    sl["day"] = 1;
    sl["did"] = 1;
    sl["do"] = 1;
    sl["down"] = 1;
    sl["each"] = 1;
    sl["even"] = 1;
    sl["first"] = 1;
    sl["for"] = 1;
    sl["from"] = 1;
    sl["get"] = 1;
    sl["go"] = 1;
    sl["good"] = 1;
    sl["great"] = 1;
    sl["had"] = 1;
    sl["has"] = 1;
    sl["have"] = 1;
    sl["he"] = 1;
    sl["her"] = 1;
    sl["here"] = 1;
    sl["him"] = 1;
    sl["his"] = 1;
    sl["how"] = 1;
    sl["i"] = 1;
    sl["if"] = 1;
    sl["in"] = 1;
    sl["into"] = 1;
    sl["is"] = 1;
    sl["it"] = 1;
    sl["its"] = 1;
    sl["just"] = 1;
    sl["know"] = 1;
    sl["last"] = 1;
    sl["life"] = 1;
    sl["like"] = 1;
    sl["little"] = 1;
    sl["long"] = 1;
    sl["made"] = 1;
    sl["make"] = 1;
    sl["man"] = 1;
    sl["many"] = 1;
    sl["may"] = 1;
    sl["me"] = 1;
    sl["men"] = 1;
    sl["might"] = 1;
    sl["more"] = 1;
    sl["most"] = 1;
    sl["mr"] = 1;
    sl["much"] = 1;
    sl["must"] = 1;
    sl["my"] = 1;
    sl["never"] = 1;
    sl["new"] = 1;
    sl["no"] = 1;
    sl["not"] = 1;
    sl["now"] = 1;
    sl["of"] = 1;
    sl["off"] = 1;
    sl["old"] = 1;
    sl["on"] = 1;
    sl["one"] = 1;
    sl["only"] = 1;
    sl["or"] = 1;
    sl["other"] = 1;
    sl["our"] = 1;
    sl["out"] = 1;
    sl["over"] = 1;
    sl["own"] = 1;
    sl["people"] = 1;
    sl["right"] = 1;
    sl["said"] = 1;
    sl["same"] = 1;
    sl["see"] = 1;
    sl["she"] = 1;
    sl["should"] = 1;
    sl["since"] = 1;
    sl["so"] = 1;
    sl["some"] = 1;
    sl["state"] = 1;
    sl["still"] = 1;
    sl["such"] = 1;
    sl["take"] = 1;
    sl["than"] = 1;
    sl["that"] = 1;
    sl["the"] = 1;
    sl["their"] = 1;
    sl["them"] = 1;
    sl["then"] = 1;
    sl["there"] = 1;
    sl["these"] = 1;
    sl["they"] = 1;
    sl["this"] = 1;
    sl["those"] = 1;
    sl["three"] = 1;
    sl["through"] = 1;
    sl["time"] = 1;
    sl["to"] = 1;
    sl["too"] = 1;
    sl["two"] = 1;
    sl["under"] = 1;
    sl["up"] = 1;
    sl["us"] = 1;
    sl["used"] = 1;
    sl["very"] = 1;
    sl["was"] = 1;
    sl["way"] = 1;
    sl["we"] = 1;
    sl["well"] = 1;
    sl["were"] = 1;
    sl["what"] = 1;
    sl["when"] = 1;
    sl["where"] = 1;
    sl["which"] = 1;
    sl["while"] = 1;
    sl["who"] = 1;
    sl["will"] = 1;
    sl["with"] = 1;
    sl["work"] = 1;
    sl["world"] = 1;
    sl["would"] = 1;
    sl["year"] = 1;
    sl["years"] = 1;
    sl["you"] = 1;
    sl["your"] = 1;
}

# For each input line, print only the words that are not in the stop-list:
{
  out = ""
  for (i = 1 ; i <= NF ; i++)
  {
    if (sl[$i] != 1) out = out " " $i
  }
  print out
}'
SHAR_EOF
chmod +x 'sl'
fi
exit 0
#       End of shell archive

-Mitchell F. Wyle            wyle@ethz.uucp
Institut fuer Informatik     wyle%ifi.ethz.ch@relay.cs.net
ETH Zentrum
8092 Zuerich, Switzerland    +41 1 256-5237

------------------------------

END OF IRList Digest
********************