[comp.theory.info-retrieval] IRList Digest V4 #42

FOXEA@VTCC1 (07/29/88)

IRList Digest           Thursday, 28 July 1988      Volume 4 : Issue 42

Today's Topics:
   Email - Address for Charles Meadow
   Query - Definition of hypertext/hypermedia
   Reply - Suffixing, stemming
   Discussion - Metamorph, stemming, online search style
              - Online search style
              - Metamorph
   Announcement - Forum on small-systems database products
                - NTIS demo on Japanese research
                - Thesis defense on comparing extended Boolean schemes

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu or fox@fox.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet (soon will be foxea@vtcc1)

----------------------------------------------------------------------

Date:         Wed, 27 Jul 88 08:58:06 CST
From:         Jeff Huestis <C81350JH@WUVMD>
Subject:      Address for Charles T. Meadow

Ed: do you have an email address for Charles T. Meadow? ...

--Jeff

------------------------------

Date: Wed, 27 Jul 88 14:47 EDT
From: VENTURA%21514%atc.bendix.com@RELAY.CS.NET
Subject: What exactly IS "hypertext"/"hypermedia"?


Does anyone have a good (succinct) definition of what hypertext/-media is?
I am trying to figure out whether or not an application I am working on
qualifies.


CA Ventura

------------------------------

Date: Fri, 22 Jul 88 16:42:38 EDT
From: Donna Harman <harman@nav.icst.nbs.gov>
Subject: reply to stemming query in IRDIGEST


[Note: to send mail to Donna, do not use the above address (at
least I could not get it to work) - instead try
   harman%icst-nav@icst-osi.arpa
Be careful in later correspondence since "Reply" may use the
one you see above under "From" rather than what I have given. - Ed.]

I don't know how to reply to the IRDIGEST, so I am trying it this way.

[Note: You did fine - use addresses in the header of each IRList
or as explained in the Welcome message. - Ed.]


Reply to the query on suffixing:

In the interest of answering the actual question, I am supplying four
references--my paper on stemming performance, and three papers on
actual algorithms.

      Harman D., "A Failure Analysis on the Limitation of Suffixing
                  in an Online Environment", Proceedings of the Tenth
                  Annual International Conference on Research and
                  Development in Information Retrieval, New Orleans, 1987.

      Lovins J.B., "Development of a Stemming Algorithm", Mechanical
                  Translation and Computational Linguistics 11, March 1968.
                  (this is the description of the Lovins stemming algorithm
                   which has been extended for use as the SMART stemmer).

      Porter M.F.  "An Algorithm for Suffix Stripping", Program, Vol 14,
                   July 1980.
                   (this is a newer algorithm, removing fewer suffixes)

      Ulmschneider J. and Doszkocs T. "A Practical Stemming Algorithm for
                   Online Search Assistance", Online Review 7(4), 1983.
                   (this is a description of how to tailor-build a stemming
                   algorithm for a given collection)


In the interest of rabid discussion on stemming, I will put forth the following
strawman for debate.

      Stemming is not an improvement on full word retrieval except in two
   situations:
      1)  storage is a problem--stems store in less space, although the inverted
          file is not smaller (same number of postings, just organized under a
          smaller number of terms)
      2)  the number of documents is small and/or recall is much more important
          than precision.

Fire away!

[Note: Since there are conflicting results regarding the value of
stemming and that seems to depend on the stemming algorithm and the
collection being used for the tests, why not just try to figure out
what combination of cases is best rather than make such a categorical
statement as you have done above? - Ed.]
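
The storage point in the strawman above (fewer terms, same postings) can
be illustrated with a small sketch. Everything here is an assumption made
for illustration: the suffix list, the toy_stem helper, and the three
sample documents are hypothetical, and the stripping rule is a crude
longest-suffix match, not the Lovins, Porter, or SMART rule set.

```python
from collections import defaultdict

# Hypothetical toy suffix list for illustration only;
# this is NOT the Lovins or Porter rule set.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize=lambda w: w):
    """Map each (normalized) term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index


def index_size(index):
    """Return (number of distinct terms, total number of postings)."""
    return len(index), sum(len(postings) for postings in index.values())


# Made-up sample collection.
DOCS = {
    1: "stemming algorithms index documents",
    2: "the stemmer indexed a document",
    3: "searchers query indexes",
}

full_index = build_index(DOCS)            # full-word index
stem_index = build_index(DOCS, toy_stem)  # stemmed index

print("full words:", index_size(full_index))
print("stems:     ", index_size(stem_index))
```

On this sample the stemmed index has fewer distinct terms while the total
number of postings is unchanged. With document-level postings as modeled
here, the total would shrink only when a single document contains two
variants of the same stem.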

------------------------------

Date: Mon, 25 Jul 88 19:14:54 EDT
From: MARCUS@Lids.mit.edu   (Richard Marcus)
Subject: Metamorph Stemming Search Costs and Style

Ed,

I have comments on three subjects in recent IRList Digests
which seem to be interrelated in various ways:

(1) Metamorph -- Ed, I admire your restraint in attempting to report on
this effort, which has received so much hype and provided so few technical
details by which to judge it. I don't have any more details on Metamorph
as such, but there was an interesting article in BYTE (May, 1988;
p 297ff) by Roy E. Kimbrell which describes an apparently related
"N-Gram" method attributed to Raymond D'Amore and Clinton Mah of
PAR Government Systems Corp (McLean, VA). This N-Gram approach uses many

[Note: full address is 1840 Michael Faraday Dr., Suite 300,
 Reston, VA 22090-5341 and switchboard is 703/478-9690 - Ed.]

of the Salton SMART techniques (weighted vectors, cosine matching,
clustering, stemming, etc.) but applied to letter strings, or n-grams,
WITHIN words.  Although I would argue against statistical, non-word methods
as techniques of CHOICE, at least the methods are reasonably well
explained and some indication of experiments with a test corpus is
given (but no details or comparison with other methods).
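
As a rough illustration of the general idea described above (vectors of
letter n-grams taken within words, compared by cosine matching), here is
a minimal sketch. It is not the D'Amore and Mah method, whose details are
not given here: the function names, the choice of trigrams, and the use
of raw counts as weights are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count letter n-grams within each word (never across word boundaries)."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(weight * b[gram] for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed example strings; raw counts stand in for whatever
# weighting scheme a real system would apply.
query = char_ngrams("stemming")
doc_a = char_ngrams("stemmed words")
doc_b = char_ngrams("boolean queries")

print(round(cosine(query, doc_a), 3))
print(round(cosine(query, doc_b), 3))
```

Because "stemming" and "stemmed" share several trigrams, the query scores
higher against the first string than the second without any explicit
stemming step, which is the sense in which n-gram matching acts as a form
of conflation.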

(2) Ed, your pointers to Aalbersberg [IRLD:4(38)] on stemming were
good starters.  Coincidentally, a stemming (conflation) algorithm in the C
programming language is given by Kimbrell in the above-mentioned BYTE
article.  Let me also add that Julie Lovins, a linguist, developed
a nice stemming algorithm under our Intrex Project (Lovins, Mechanical
Translation 11:22-31[1968]) which has been used to good effect by us
and a number of other organizations.  A useful evaluation of the
algorithm was reported by Julie in the Journal of ASIS
[22(1)28-40; January, 1971].

[Note: the Lovins method is the basis for what is used in SMART - Ed.]

One interesting point is how drastically the evaluation depends on
the context.  Salton has, I believe, reported on small but significant
effectiveness for simple stemmers in SMART. Donna Harman has reported
(Proceedings 1988 RIAO Conference, pp. 839-848) on experiments with
the NLM IRX system that stemming doesn't help at all. Harman suggests
that the batch-oriented IRX context might be the reason for the
non-utility of stemming and that an interactive context would probably
yield different results.  Our own research supports the latter;
experiments with our highly interactive CONIT system (see, e.g.,
Marcus, Journal ASIS 34(6):381-404; Nov., 1983) have demonstrated
the critical importance of stemming in that context.

(3) Costs Affecting Search Styles -- Bill Joel (supported by
Jeff Huestis) is right on! Cost is a critical component of context.
The Telebase Easynet front end system owes a large part of its success
to techniques for holding down online costs.  We have reported
(see, e.g., Marcus, Proceedings ASIS 85; 22:289-292) how cost factors
markedly influence search behavior online.  Despite exponential increases
in benefits/costs factors, we have not yet reached the point where
online users can derive anything like the full effectiveness of the
interactive capabilities on computers (although we're working toward
that goal with our 'smart Boolean' approach).

---Dick Marcus, MIT Lab for Information and Decision Systems...

------------------------------

Date:     25 Jul 88 17:03:00 EDT
From:     Nahum (N.) Goldmann <ACOUST@BNR.CA>
Subject:  Please post.  Thanks. (re:Do online costs affect search styl

In response to Dr. Joel's request on IRLIST, the key factor in negotiating
a search online under cost pressure is KNOWLEDGE OF THE
SEARCH SUBJECT.  I discussed this in detail in Chapters 2 and 10
of my book (ONLINE RESEARCH AND RETRIEVAL, TAB Professional
and Reference Books).  This knowledge is generally associated with
the END-USER of information, as opposed to the INTERMEDIARY (information
broker).  Your analogy with a library is entirely correct, except that a
sane specialist would never ask a librarian to search the stacks
on his/her behalf (precisely because it has to be interactive).

I believe that it is better for some (the end-user) to negotiate the search
online, but necessary for others (the intermediary) to define it beforehand.

Nahum Goldmann
acoust@bnr
Tel. (613)763-2329

------------------------------

Date:         Sun, 24 Jul 88 10:10:09 EDT
From:         Tung-Ying Chang <EC6C6003@TWNMOE10.bitnet>
Subject:      Metamorph


Dear Professor Fox,
     I have received volume 4, issues 36-40, and have tried to review the
comments and materials which you mentioned in issue 40.  I read the article
"Word ladders and a tower of Babel lead to computational heights defying
assault" in Scientific American, Aug. 1987. I believe this is the
article which Defense Science mentioned in regard to Bell Labs' research.
It gives no technical details, only a general description.
     I agree with you that we don't need to discuss commercial systems
unless there is something new.  I suspect most of the "new things" are
covered by commercial secrecy. Anyway, I am interested in web structure
and morpheme retrieval. Thank you very much.
     Good luck.
                  Tung-Ying Chang
~ Tung-Ying Chang     Professor Fox        7/24/88 Metamorph

------------------------------

Date:         Mon, 25 Jul 88 15:04:27 EST
From:         "James S. Cowie" <JCOWIE@YALEVM>
Subject:      PCDBMS-L at YALEVM


Greetings, IRLIST people...

   Just a brief note to inform you that due to a great positive response to
initial inquiries, there now exists a Listserv forum for discussion of
small-systems database products in academic or library contexts.  All are
welcome.  The new list is PCDBMS-L at YALEVM.  Products to be discussed
include Paradox, NotaBene, Quattro, Dbase, Rbase, DataEase, Reflex,
Revelation, etc.

                                              yours truly,

                                              James Cowie
                                              Yale University Library
                                              Systems Office
~ James S. Cowie      Irlist               7/25/88 PCDBMS-L
Acknowledge-To: <JCOWIE@YALEVM>

------------------------------

Date: Wed, 27 Jul 88 13:07:59 EDT
From: Edward A. Fox <fox>
Subject: NTIS demonstration

On Friday July 29 at 2pm (in the Idea Salon in CPAP, at
104 Draper Road, Blacksburg, VA) there will be a
demonstration by Tim Feinstein of NTIS of their
system to access Japanese research work.  All are invited.
For more information, contact John Dickey, Center for Public
Administration and Policy, VPI&SU (703) 961-5133/5830.

------------------------------

Date: Wed, 27 Jul 88 13:04:49 EDT
From: Edward A. Fox <fox@fox.cs.vt.edu>
Subject: defense

Whay C. Lee will have his MS thesis defense on Friday,
July 29 at 10am in McBryde room 558.  The title of
his thesis is "Experimental Comparison of Schemes
for Interpreting Boolean Queries".
  All are invited. - Ed Fox

------------------------------

END OF IRList Digest
********************