[comp.theory.info-retrieval] IRList Digest V4 #34 - resent without tabs, after mail problems

FOXEA@VTCC1 (07/04/88)
IRList Digest           Tuesday, 17 May 1988      Volume 4 : Issue 34

Today's Topics:
   Abstract - Selected abstracts appearing in SIGIR FORUM (part 2 of 2)

News addresses are
   Internet or CSNET: fox@vtopus.cs.vt.edu
   BITNET: foxea@vtvax3.bitnet

----------------------------------------------------------------------

Date: Tue, 17 May 88 09:10:51 CDT
From: "Dr. Raghavan" <raghavan%raghavansun%usl.csnet@RELAY.CS.NET>
Subject: Abstracts from SIGIR Forum  [Part II of II - Ed.]

[Note: this is the final part, continued from previous issue - Ed.]

                        ABSTRACTS [Note: continued - Ed.]

       ON MODELING OF INFORMATION RETRIEVAL CONCEPTS IN VECTOR SPACES
       S.K.M. Wong, W. Ziarko, V.V. Raghavan, and P.C.N. Wong, Department of
       Computer Science, University of Regina, Regina, Canada S4S 0A2
          The Vector Space Model (VSM) has been adopted in information
       retrieval as a means of coping with inexact representation of
       documents and queries, and the resulting difficulties in determining
       the relevance of a document relative to a given query.  The major
       problem in employing this approach is that the explicit
       representation of term vectors is not known a priori.  Consequently,
       earlier researchers made the assumption that the vectors
       corresponding to terms are pairwise orthogonal.  Such an assumption
       is clearly unrealistic.  Although attempts have been made to
       compensate for this assumption by some separate, corrective steps,
       such methods are ad hoc and, in most cases, formally inconsistent.
          In this paper, a generalization of the VSM, called the GVSM, is
       advanced.  The developments provide a solution not only for the
       computation of a measure of similarity (correlation) between terms,
       but also for the incorporation of these similarities into the
       retrieval process.
          The major strength of the GVSM derives from the fact that it is
       theoretically sound and elegant.  Furthermore, experimental
       evaluation of the model on several test collections indicates that
       the performance is better than that of the VSM.  Experiments have
       been performed on some variations of the GVSM, and all these results
       have also been compared to those of the VSM, based on inverse
       document frequency weighting.  These results and some ideas for the
       efficient implementation of the GVSM are discussed.
       (ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 12, No. 2, pp. 299-321, 1987)
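
       [Note: a minimal sketch, not the authors' GVSM construction.  It
       only illustrates the general idea of dropping the pairwise-
       orthogonality assumption: if term_corr[i][j] holds an assumed
       correlation between terms i and j, query-document similarity can be
       written as the bilinear form q' G d, which reduces to the ordinary
       inner product when G is the identity.  The function name, both
       matrices, and all weights are made up for illustration. - Ed.]

```python
def similarity(query, doc, term_corr):
    """Return sum over i, j of query[i] * term_corr[i][j] * doc[j]."""
    return sum(
        q_i * term_corr[i][j] * d_j
        for i, q_i in enumerate(query)
        for j, d_j in enumerate(doc)
    )


# Hypothetical 3-term vocabulary: terms 0 and 1 are related, term 2 is not.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
CORRELATED = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

query = [1.0, 0.0, 0.0]  # mentions only term 0
doc = [0.0, 2.0, 0.0]    # mentions only term 1

# Orthogonal terms: no shared term, so the score is zero.
orthogonal_score = similarity(query, doc, IDENTITY)
# Correlated terms: the related term contributes to the score.
correlated_score = similarity(query, doc, CORRELATED)
```

       With orthogonal terms the document is missed entirely; with the
       assumed correlation it receives a non-zero score.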



       TERM CO-OCCURRENCE IN CITED/CITING JOURNAL ARTICLES AS A MEASURE OF
       DOCUMENT SIMILARITY
       Donna Trivison, 1453 Elbur Avenue, Lakewood, OH 44107
          Term co-occurrences were measured in pairs of cited/citing research
       articles selected over the period of time from 1971 until 1983 from a
       core literature in the field of information science.  A consistent
       pattern of term similarity was observed in these article pairs.  In
       contrast, document similarity was extremely low in randomly paired
       articles selected from the same core data base.  In 77% of
       cited/citing articles, there were more co-occurrences of significant
       terms than there were in 87% of the same articles paired randomly.
       The study served to quantify terminology-relatedness.  A comparison
       of the similarity of cited/citing literature of various ages resulted
       in an indication of the amount of new terminology entering the field.
       And, because a clear delineation was achieved between the similarity
       of cited/citing articles and the similarity of non-cited/citing
       articles, the results were extended to define an expected success
       rate of a matching procedure in one context of information retrieval.
       (INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 183-194, 1987)
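
       [Note: a minimal sketch of counting term co-occurrences between two
       texts.  The abstract does not say how "significant" terms were
       chosen, so the stop list, the tokenizer, and the sample titles
       below are assumptions for illustration, not the study's
       procedure. - Ed.]

```python
import re

# Assumed stop list; the abstract does not define "significant" terms.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def significant_terms(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def co_occurrences(text_a, text_b):
    """Count distinct significant terms that appear in both texts."""
    return len(significant_terms(text_a) & significant_terms(text_b))


# Made-up titles standing in for a citing article, the article it cites,
# and a randomly paired article.
citing = "Term weighting in automatic retrieval of documents"
cited = "Automatic indexing and the weighting of terms in documents"
unrelated = "Cataloging rules in the public library"

related_count = co_occurrences(citing, cited)
random_count = co_occurrences(citing, unrelated)
```

       No stemming is applied, so "term" and "terms" count as different
       terms here.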



       KNOWLEDGE-SPARSE AND KNOWLEDGE-RICH LEARNING IN INFORMATION RETRIEVAL
       Roy Rada, National Library of Medicine, Bethesda, MD 20894
          This paper reviews some aspects of the relationship between the
       large and growing fields of machine learning (ML) and information
       retrieval (IR).  Learning programs are described along several
       dimensions.  One dimension refers to the degree of dependence of an
       ML + IR program on users, thesauri, or documents.  This paper
       emphasizes the role of the thesaurus in ML + IR work.  ML + IR
       programs are also classified in a dimension that extends from
       knowledge-sparse learning at one end to knowledge-rich learning at
       the other.  Knowledge-sparse learning depends largely on user yes-no
       feedback or on word frequencies across documents to guide
       adjustments in the IR system.  Knowledge-rich learning depends on
       more complex sources of feedback, such as the structure within a
       document or thesaurus, to direct changes in the knowledge bases on
       which an intelligent IR system depends.  New advances in computer
       hardware make the knowledge-sparse learning programs that depend on
       word occurrences in documents more practical.  Advances in
       artificial intelligence bode well for knowledge-rich learning.
       (INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 195-210, 1987)



       KNOWLEDGE RESOURCE TOOLS FOR ACCESSING LARGE TEXT FILES
       Donald E. Walker, Artificial Intelligence and Information Science
       Research, Bell Communications Research, 435 South Street MRE 2A379,
       Morristown, NJ 07960
          This paper provides an overview of a research program just being
       defined at Bellcore.  The objective is to develop facilities for
       working with large document collections that provide more refined
       access to the information contained in these ``source'' materials
       than is possible through current information retrieval procedures.
       The tools being used for this purpose are machine-readable
       dictionaries, encyclopedias, and related ``resources'' that provide
       geographical, biographical, and other kinds of specialized
       knowledge.  A major feature of the research program is the
       exploitation of the reciprocal relationships between sources and
       resources.  These interactions between texts and tools are intended
       to support experts who organize and use information in a
       workstation environment.  Two systems under development will be
       described to illustrate the approach:  one providing capabilities
       for full-text subject assessment; the other for concept elaboration
       while reading text.  Progress in the research depends critically on
       developments in artificial intelligence, computational linguistics,
       and information science to provide a scientific base, and on
       software engineering, database management, and distributed systems
       to provide the technology.
       (PROCEEDINGS OF THE FIRST CONFERENCE OF THE UNIVERSITY OF WATERLOO
       CENTER FOR THE NEW OXFORD ENGLISH DICTIONARY, Waterloo, Canada,
       pp. 11-24, November, 1985)



       PICTURES OF RELEVANCE:  A GEOMETRIC ANALYSIS OF SIMILARITY MEASURES
       William P. Jones, Microelectronics and Computer Technology
       Corporation, P.O. Box 200195, Austin, Texas 78720 and George W.
       Furnas, Bell Communications Research, 435 South Street, Morristown,
       N.J. 07960
          We want computer systems that can help us assess the similarity or
       relevance of existing objects (e.g., documents, functions, commands,
       etc.) to a statement of our current needs (e.g., the query).  Towards
       this end, a variety of similarity measures have been proposed.
       However, the relationship between a measure's formula and its
       performance is not always obvious.  A geometric analysis is advanced
       and its utility demonstrated through its application to six
       conventional information retrieval similarity measures and a seventh
       spreading activation measure.  All seven similarity measures work
       with a representational scheme wherein a query and the database
       objects are represented as vectors of term weights.  A geometric
       analysis characterizes each similarity measure by the nature of its
       iso-similarity contours in an n-space containing query and object
       vectors.  This analysis reveals important differences among the
       similarity measures and suggests conditions in which these
       differences will affect retrieval performance.  The cosine
       coefficient, for example, is shown to be insensitive to
       between-document differences in the magnitude of term weights while
       the inner product measure is sometimes overly affected by such
       differences.  The context-sensitive spreading activation measure may
       overcome both of these limitations and deserves further study.  The
       geometric analysis is intended to complement, and perhaps to guide,
       the empirical analysis of similarity measures.
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 6, pp. 420-442, 1987)
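
       [Note: a minimal sketch of the contrast drawn above between the
       cosine coefficient and the inner product, using the standard
       definitions of both measures.  The vectors and the scaling factor
       are made-up examples, not data from the paper. - Ed.]

```python
import math


def inner_product(u, v):
    """Sum of pairwise products of term weights."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Inner product divided by the product of the two vector lengths."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    return inner_product(u, v) / (norm_u * norm_v)


query = [1.0, 1.0, 0.0]
doc = [2.0, 1.0, 0.0]
# Same direction as doc, but every term weight is three times larger.
doc_scaled = [3.0 * w for w in doc]

ip_ratio = inner_product(query, doc_scaled) / inner_product(query, doc)
cos_same = math.isclose(cosine(query, doc), cosine(query, doc_scaled))
```

       Scaling the document's weights multiplies its inner-product score
       by the same factor but leaves its cosine score unchanged.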



        3
       I R:  A NEW APPROACH TO THE DESIGN OF DOCUMENT RETRIEVAL SYSTEMS
       W.B. Croft and R.H. Thompson, Department of Computer and Information
       Science, University of Massachusetts, Amherst, MA 01003
          The most effective method of improving the retrieval performance of
       a document retrieval system is to acquire a detailed specification of
       the user's information need.  The system described in this article,
       IIIR, provides a number of facilities and search strategies based on
       this approach.  The system uses a novel architecture to allow more
       than one system facility to be used at a given stage of a search
       session.  Users influence the system actions by stating goals they
       wish to achieve, by evaluating system output, and by choosing
       particular facilities directly.  The other main features of IIIR are
       an emphasis on domain knowledge used for refining the model of the
       information need, and the provision of a browsing mechanism that
       allows the user to navigate through the knowledge base.
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 6, pp. 389-404, 1987)



       HYPERTEXT:  AN INTRODUCTION AND SURVEY
       Jeff Conklin, Microelectronics and Computer Technology Corp., P.O.
       Box 200195, Austin, TX  78720
          As workstations grow cheaper, more powerful, and more available,
       new possibilities emerge for extending the traditional notion of
       ``flat'' text files by allowing more complex organizations of the
       material.  Mechanisms are being devised which allow direct
       machine-supported references from one textual chunk to another; new
       interfaces provide the user with the ability to interact directly
       with these chunks and to establish new relationships between them.
       These extensions of the traditional text fall under the general
       category of hypertext (also known as nonlinear text).
          This article is a survey of existing hypertext systems, their
       applications, and their design.  It is both an introduction to the
       world of hypertext and, at a deeper cut, a survey of some of the most
       important design issues that go into fashioning a hypertext
       environment.
       (COMPUTER, Vol. 20, No. 9, pp. 17-42, 1987)



       PARALLEL QUERYING OF LARGE DATABASES:  A CASE STUDY
       Harold S. Stone, IBM T.J. Watson Research Center,
          Parallelism by itself does not necessarily lead to higher speed.
       In the case study presented here, the parallel algorithm was far less
       efficient than a good serial algorithm.  The study does, however,
       reveal how to put parallelism to best use:  run the more efficient
       serial algorithm in a parallel manner.
          The case study extends the work of Stanfill and Kahle, who
       presented an algorithm for high-speed querying of a large database.
       They demonstrated the use of a parallel program running on a
       16,000-processor Connection Machine and obtained estimates for the
       running time of the algorithm on a 64K-processor system with queries
       made against a very large database of Reuters news releases.  Their
       results show that the throughput for parallel query analysis is high
       in an absolute sense.  But they did not provide a performance
       analysis of speedup or other aspects of algorithmic behavior that
       would reveal what factors of machine and algorithm design contribute
       most strongly to the performance.  This article provides that
       analysis.
       (COMPUTER, Vol. 20, No. 10, pp. 11-12, 1987)



       HISTORICAL NOTE:  A PERSONALIZED HISTORY OF OCLC
       Frederick G. Kilgour, Founder Trustee, OCLC Online Computer Library
       Center, Inc., Dublin, Ohio
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 5, pp. 381-384, 1987)



       HISTORICAL NOTE:  THE PAST THIRTY YEARS IN INFORMATION RETRIEVAL
       Gerard Salton, Department of Computer Science, Cornell University,
       Ithaca, New York 14853
          The documentation literature of the 1950s is reviewed briefly, and
       some early text processing endeavors are discussed.  Various
       predictions made in 1960 by Mooers about the creative role of
       computers in information retrieval are then considered, and an
       attempt is made to explain why some of the more exciting predictions
       have not been fulfilled.  Conclusions are drawn concerning the limits
       of computer power in text retrieval applications.
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 5, pp. 375-380, 1987)



       HISTORICAL NOTE:  INFORMATION SCIENCE AND TECHNOLOGY:  FROM COORDINATE
       INDEXING TO THE GLOBAL BRAIN
       Cloyd Dake Gull, 8 Pimlico Court, Silver Spring, MD 20906
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 5, pp. 338-366, 1987)



       HISTORICAL NOTE:  SHINING PALACES, SHIFTING SANDS:  NATIONAL INFORMATION
       SYSTEMS
       Harold Wooster, Senior Information Scientist (Retired), Lister Hill
       National Center for Biomedical Communications, National Library of
       Medicine, Department of Health and Human Services, Bethesda, MD 20894
          This article discusses post-Sputnik national information systems
       under three major headings:  Shifting Sands, the false assumptions
       that the Soviets were first in space because of the superiority of
       their educational system and their scientific and technical
       information system, VINITI; The Shining Palaces lists as appendixes
       31 reports since 1958 which propose various forms of a national
       information system, and analyzes 30 National Plans.  The author does
       not presume to favor any of them; in Solid Rock-The Ugly Houses the
       author lists in an appendix the involvement of the federal government
       with scientific and technical information since the first patent act
       of 1790, and discusses what he thinks should be done for the users of
       a national system, the role of technical documentary reports, project
       information systems and scientific journals.  The Summary and
       Conclusions starts with three quotations, written 22 years apart,
       which show that nothing has changed in over two decades.  In a
       Personal Note the author summarizes his forty-year career as an
       information scientist.
       (JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
       No. 5, pp. 321-335, 1987)

------------------------------

END OF IRList Digest
********************