FOXEA@VTCC1 (07/04/88)
IRList Digest Tuesday, 17 May 1988 Volume 4 : Issue 34
Today's Topics:
Abstract - Selected abstracts appearing in SIGIR FORUM (part 2 of 2)
News addresses are
Internet or CSNET: fox@vtopus.cs.vt.edu
BITNET: foxea@vtvax3.bitnet
----------------------------------------------------------------------
Date: Tue, 17 May 88 09:10:51 CDT
From: "Dr. Raghavan" <raghavan%raghavansun%usl.csnet@RELAY.CS.NET>
Subject: Abstracts from SIGIR Forum [Part II of II - Ed.]
[Note: this is the final part, continued from previous issue - Ed.]
ABSTRACTS [Note: continued - Ed.]
ON MODELING OF INFORMATION RETRIEVAL CONCEPTS IN VECTOR SPACES
S.K.M. Wong, W. Ziarko, V.V. Raghavan, and P.C.N. Wong, Department of
Computer Science, University of Regina, Regina, Canada S4S 0A2
The Vector Space Model (VSM) has been adopted in information retrieval
as a means of coping with inexact representation of documents and
queries, and the resulting difficulties in determining the relevance of
a document relative to a given query. The major problem in employing
this approach is that the explicit representation of term vectors is
not known a priori. Consequently, earlier researchers made the
assumption that the vectors corresponding to terms are pairwise
orthogonal. Such an assumption is clearly unrealistic. Although
attempts have been made to compensate for this assumption by some
separate, corrective steps, such methods are ad hoc and, in most cases,
formally inconsistent.
In this paper, a generalization of the VSM, called the GVSM, is
advanced. The developments provide a solution not only for the
computation of a measure of similarity (correlation) between terms, but
also for the incorporation of these similarities into the retrieval
process.
The major strength of the GVSM derives from the fact that it is
theoretically sound and elegant. Furthermore, experimental evaluation
of the model on several test collections indicates that the performance
is better than that of the VSM. Experiments have been performed on some
variations of the GVSM, and all these results have also been compared
to those of the VSM, based on inverse document frequency weighting.
These results and some ideas for the efficient implementation of the
GVSM are discussed.
(ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 12, No. 2, pp. 299-321, 1987)
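[Illustrative sketch, not part of the abstract. It does not reproduce
the GVSM construction from the paper; it only shows, under assumptions,
what dropping the pairwise-orthogonality assumption can look like: the
query-document score becomes q^T G d, where G is a term-term
correlation matrix. The three-term vocabulary, the matrix values, and
the function names are all hypothetical.]

```python
def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def correlated_similarity(query, doc, term_corr):
    """Score q^T G d, where term_corr (G) holds assumed term-term correlations."""
    g_times_doc = [dot(row, doc) for row in term_corr]
    return dot(query, g_times_doc)


# Hypothetical vocabulary: ["car", "automobile", "banana"]
query = [1.0, 0.0, 0.0]   # query mentions only "car"
doc = [0.0, 1.0, 0.0]     # document mentions only "automobile"

# Orthogonal terms: G is the identity, so the two never match.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed correlations: "car" and "automobile" are strongly related.
term_corr = [[1.0, 0.8, 0.0],
             [0.8, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

orthogonal_score = correlated_similarity(query, doc, identity)
correlated_score = correlated_similarity(query, doc, term_corr)
print(orthogonal_score, correlated_score)
```

With the identity matrix the document is missed entirely; with the
assumed correlations it receives a nonzero score even though it shares
no literal term with the query.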
TERM CO-OCCURRENCE IN CITED/CITING JOURNAL ARTICLES AS A MEASURE OF
DOCUMENT SIMILARITY
Donna Trivison, 1453 Elbur Avenue, Lakewood, OH 44107
Term co-occurrences were measured in pairs of cited/citing research
articles selected over the period of time from 1971 until 1983 from a
core literature in the field of information science. A consistent
pattern of term similarity was observed in these article pairs. In
contrast, document similarity was extremely low in randomly paired
articles selected from the same core data base. In 77% of cited/citing
articles, there were more co-occurrences of significant terms than
there were in 87% of the same articles paired randomly. The study
served to quantify terminology-relatedness. A comparison of the
similarity of cited/citing literature of various ages resulted in an
indication of the amount of new terminology entering the field. And,
because a clear delineation was achieved between the similarity of
cited/citing articles and the similarity of non-cited/citing articles,
the results were extended to define an expected success rate of a
matching procedure in one context of information retrieval.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 183-194, 1987)
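[Illustrative sketch, not part of the abstract. The abstract does not
say how "significant terms" were selected or counted, so this assumes
the simplest reading: a co-occurrence is a non-stopword that appears in
both articles of a pair. The stopword list, sample texts, and function
names are hypothetical.]

```python
STOPWORDS = {"the", "of", "in", "a", "and"}


def significant_terms(text, stopwords=STOPWORDS):
    """Lower-case the text and keep the distinct words that are not stopwords."""
    return {word for word in text.lower().split() if word not in stopwords}


def cooccurrence_count(text_a, text_b, stopwords=STOPWORDS):
    """Number of significant terms that the two texts have in common."""
    return len(significant_terms(text_a, stopwords)
               & significant_terms(text_b, stopwords))


citing = "retrieval of documents in the vector space"
cited = "the vector space model and document retrieval"
unrelated = "a history of the patent office"

paired = cooccurrence_count(citing, cited)          # shares retrieval, vector, space
random_pair = cooccurrence_count(citing, unrelated)  # shares nothing
print(paired, random_pair)
```

Note that "documents" and "document" do not match here; a real study
would need some form of term normalization, which this sketch omits.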
KNOWLEDGE-SPARSE AND KNOWLEDGE-RICH LEARNING IN INFORMATION RETRIEVAL
Roy Rada, National Library of Medicine, Bethesda, MD 20894
This paper reviews some aspects of the relationship between the large
and growing fields of machine learning (ML) and information retrieval
(IR). Learning programs are described along several dimensions. One
dimension refers to the degree of dependence of an ML + IR program on
users, thesauri, or documents. This paper emphasizes the role of the
thesaurus in ML + IR work. ML + IR programs are also classified in a
dimension that extends from knowledge-sparse learning at one end to
knowledge-rich learning at the other. Knowledge-sparse learning depends
largely on user yes-no feedback or on word frequencies across documents
to guide adjustments in the IR system. Knowledge-rich learning depends
on more complex sources of feedback, such as the structure within a
document or thesaurus, to direct changes in the knowledge bases on
which an intelligent IR system depends. New advances in computer
hardware make the knowledge-sparse learning programs that depend on
word occurrences in documents more practical. Advances in artificial
intelligence bode well for knowledge-rich learning.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 23, No. 3, pp. 195-210, 1987)
KNOWLEDGE RESOURCE TOOLS FOR ACCESSING LARGE TEXT FILES
Donald E. Walker, Artificial Intelligence and Information Science
Research, Bell Communications Research, 435 South Street MRE 2A379,
Morristown, NJ 07960
This paper provides an overview of a research program just being
defined at Bellcore. The objective is to develop facilities for working
with large document collections that provide more refined access to the
information contained in these ``source'' materials than is possible
through current information retrieval procedures. The tools being used
for this purpose are machine-readable dictionaries, encyclopedias, and
related ``resources'' that provide geographical, biographical, and
other kinds of specialized knowledge. A major feature of the research
program is the exploitation of the reciprocal relationships between
sources and resources. These interactions between texts and tools are
intended to support experts who organize and use information in a
workstation environment. Two systems under development will be
described to illustrate the approach: one providing capabilities for
full-text subject assessment; the other for concept elaboration while
reading text. Progress in the research depends critically on
developments in artificial intelligence, computational linguistics, and
information science to provide a scientific base, and on software
engineering, database management, and distributed systems to provide
the technology.
(PROCEEDINGS OF THE FIRST CONFERENCE OF THE UNIVERSITY OF WATERLOO
CENTER FOR THE NEW OXFORD ENGLISH DICTIONARY, Waterloo, Canada,
pp. 11-24, November, 1985)
PICTURES OF RELEVANCE: A GEOMETRIC ANALYSIS OF SIMILARITY MEASURES
William P. Jones, Microelectronics and Computer Technology Corporation,
P.O. Box 200195, Austin, Texas 78720 and George W. Furnas, Bell
Communications Research, 435 South Street, Morristown, N.J. 07960
We want computer systems that can help us assess the similarity or
relevance of existing objects (e.g., documents, functions, commands,
etc.) to a statement of our current needs (e.g., the query). Towards
this end, a variety of similarity measures have been proposed. However,
the relationship between a measure's formula and its performance is not
always obvious. A geometric analysis is advanced and its utility
demonstrated through its application to six conventional information
retrieval similarity measures and a seventh spreading activation
measure. All seven similarity measures work with a representational
scheme wherein a query and the database objects are represented as
vectors of term weights. A geometric analysis characterizes each
similarity measure by the nature of its iso-similarity contours in an
n-space containing query and object vectors. This analysis reveals
important differences among the similarity measures and suggests
conditions in which these differences will affect retrieval
performance. The cosine coefficient, for example, is shown to be
insensitive to between-document differences in the magnitude of term
weights while the inner product measure is sometimes overly affected by
such differences. The context-sensitive spreading activation measure
may overcome both of these limitations and deserves further study. The
geometric analysis is intended to complement, and perhaps to guide, the
empirical analysis of similarity measures.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 6, pp. 420-442, 1987)
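[Illustrative sketch, not part of the abstract. It shows the contrast
drawn above between the cosine coefficient and the inner product,
using the standard textbook definitions of both. The vectors are
made-up term weights; the spreading activation measure is not
attempted here.]

```python
import math


def inner_product(query, doc):
    """Sum of products of matching term weights."""
    return sum(q * d for q, d in zip(query, doc))


def cosine(query, doc):
    """Inner product divided by the product of the two vector lengths."""
    norm_q = math.sqrt(inner_product(query, query))
    norm_d = math.sqrt(inner_product(doc, doc))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return inner_product(query, doc) / (norm_q * norm_d)


query = [1.0, 1.0, 0.0]
doc = [2.0, 1.0, 0.0]
scaled_doc = [2.0 * w for w in doc]  # same direction, twice the magnitude

# The inner product doubles when the document's weights double ...
print(inner_product(query, doc), inner_product(query, scaled_doc))
# ... while the cosine is unchanged (up to floating-point rounding).
print(cosine(query, doc), cosine(query, scaled_doc))
```

Geometrically, the cosine depends only on the angle between the two
vectors, so documents along the same ray from the origin are
indistinguishable to it, whereas the inner product grows with length.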
I^3R: A NEW APPROACH TO THE DESIGN OF DOCUMENT RETRIEVAL SYSTEMS
W.B. Croft and R.H. Thompson, Department of Computer and Information
Science, University of Massachusetts, Amherst, MA 01003
The most effective method of improving the retrieval performance of a
document retrieval system is to acquire a detailed specification of the
user's information need. The system described in this article, IIIR,
provides a number of facilities and search strategies based on this
approach. The system uses a novel architecture to allow more than one
system facility to be used at a given stage of a search session. Users
influence the system actions by stating goals they wish to achieve, by
evaluating system output, and by choosing particular facilities
directly. The other main features of IIIR are an emphasis on domain
knowledge used for refining the model of the information need, and the
provision of a browsing mechanism that allows the user to navigate
through the knowledge base.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 6, pp. 389-404, 1987)
HYPERTEXT: AN INTRODUCTION AND SURVEY
Jeff Conklin, Microelectronics and Computer Technology Corp., P.O. Box
200195, Austin, TX 78720
As workstations grow cheaper, more powerful, and more available, new
possibilities emerge for extending the traditional notion of ``flat''
text files by allowing more complex organizations of the material.
Mechanisms are being devised which allow direct machine-supported
references from one textual chunk to another; new interfaces provide
the user with the ability to interact directly with these chunks and to
establish new relationships between them. These extensions of the
traditional text fall under the general category of hypertext (also
known as nonlinear text).
This article is a survey of existing hypertext systems, their
applications, and their design. It is both an introduction to the world
of hypertext and, at a deeper cut, a survey of some of the most
important design issues that go into fashioning a hypertext
environment.
(COMPUTER, Vol. 20, No. 9, pp. 17-42, 1987)
PARALLEL QUERYING OF LARGE DATABASES: A CASE STUDY
Harold S. Stone, IBM T.J. Watson Research Center,
Parallelism by itself does not necessarily lead to higher speed. In the
case study presented here, the parallel algorithm was far less
efficient than a good serial algorithm. The study does, however, reveal
how to put parallelism to best use: run the more efficient serial
algorithm in a parallel manner.
The case study extends the work of Stanfill and Kahle, who presented an
algorithm for high-speed querying of a large database. They
demonstrated the use of a parallel program running on a
16,000-processor Connection Machine and obtained estimates for the
running time of the algorithm on a 64K-processor system with queries
made against a very large database of Reuters news releases. Their
results show that the throughput for parallel query analysis is high in
an absolute sense. But they did not provide a performance analysis of
speedup or other aspects of algorithmic behavior that would reveal what
factors of machine and algorithm design contribute most strongly to the
performance. This article provides that analysis.
(COMPUTER, Vol. 20, No. 10, pp. 11-12, 1987)
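[Illustrative sketch, not part of the abstract and not the algorithm
analyzed in the article. It only shows the general shape of "running a
serial algorithm in a parallel manner": split the database into shards,
run the same serial search on each shard, and merge the results. The
sample documents, the shard scheme, and the function names are
hypothetical, and the serial search here is a plain scan chosen for
brevity. Threads in CPython will not speed up CPU-bound work, so this
is structural only.]

```python
from concurrent.futures import ThreadPoolExecutor


def serial_search(shard, term):
    """Serial scan: ids of the documents in this shard that contain the term."""
    return [doc_id for doc_id, text in shard if term in text.lower().split()]


def parallel_search(docs, term, workers=4):
    """Run the same serial search on each shard concurrently, then merge."""
    shards = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda shard: serial_search(shard, term), shards))
    return sorted(doc_id for part in parts for doc_id in part)


docs = [
    (0, "Reuters reports oil prices rise"),
    (1, "new retrieval system announced"),
    (2, "oil exports fall"),
    (3, "hypertext survey published"),
]

hits = parallel_search(docs, "oil", workers=2)
print(hits)
```

The merged result is the same as running the serial search over the
whole collection; only the work is divided.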
HISTORICAL NOTE: A PERSONALIZED HISTORY OF OCLC
Frederick G. Kilgour, Founder Trustee, OCLC Online Computer Library
Center, Inc., Dublin, Ohio
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 381-384, 1987)
HISTORICAL NOTE: THE PAST THIRTY YEARS IN INFORMATION RETRIEVAL
Gerard Salton, Department of Computer Science, Cornell University,
Ithaca, New York 14853
The documentation literature of the 1950s is reviewed briefly, and some
early text processing endeavors are discussed. Various predictions made
in 1960 by Mooers about the creative role of computers in information
retrieval are then considered, and an attempt is made to explain why
some of the more exciting predictions have not been fulfilled.
Conclusions are drawn concerning the limits of computer power in text
retrieval applications.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 375-380, 1987)
HISTORICAL NOTE: INFORMATION SCIENCE AND TECHNOLOGY: FROM COORDINATE
INDEXING TO THE GLOBAL BRAIN
Cloyd Dake Gull, 8 Pimlico Court, Silver Spring, MD 20906
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 338-366, 1987)
HISTORICAL NOTE: SHINING PALACES, SHIFTING SANDS: NATIONAL INFORMATION
SYSTEMS
Harold Wooster, Senior Information Scientist (Retired), Lister Hill
National Center for Biomedical Communications, National Library of
Medicine, Department of Health and Human Services, Bethesda, MD 20894
This article discusses post-Sputnik national information systems under
three major headings: Shifting Sands, the false assumptions that the
Soviets were first in space because of the superiority of their
educational system and their scientific and technical information
system, VINITI; The Shining Palaces lists as appendixes 31 reports
since 1958 which propose various forms of a national information
system, and analyzes 30 National Plans. The author does not presume to
favor any of them; in Solid Rock-The Ugly Houses the author lists in an
appendix the involvement of the federal government with scientific and
technical information since the first patent act of 1790, and discusses
what he thinks should be done for the users of a national system, the
role of technical documentary reports, project information systems and
scientific journals. The Summary and Conclusions starts with three
quotations, written 22 years apart, which show that nothing has changed
in over two decades. In a Personal Note the author summarizes his
forty-year career as an information scientist.
(JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, Vol. 38,
No. 5, pp. 321-335, 1987)
------------------------------
END OF IRList Digest
********************