[comp.text] How should uniqbib look for "near" duplicates?

bin@primate.wisc.edu (Brain in Neutral) (12/23/88)

A short while ago, I posted "uniqbib", a program for eliminating
duplicates from bibliographic databases in refer format.  My original
motivation was to allow the results of several overlapping lookbib
queries to be combined while filtering out references that were hits
on more than one query.  In such cases you know the entries will be
identical.

I've been in correspondence with several people who have expressed
an interest in looking for "near" duplicates, such as might arise
when entries are added to a bibliographic database by different
people.  In such cases, entries may be "the same" to a human reader
but slightly different as text - a journal title might be abbreviated
by one person and not the other.
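For example (a contrived pair, purely for illustration), these two
refer entries describe the same paper but differ in the %J field:

	%A J. Q. Author
	%T An Example Paper
	%J Communications of the ACM
	%D 1985

	%A J. Q. Author
	%T An Example Paper
	%J Comm. ACM
	%D 1985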

Several strategies for finding near duplicates have been suggested to
me, and I've thought of several others.  I'm asking for comment from
the net on this issue.  Given two entries, how would you determine
whether they are the same?  (Phrased another way, how would you
estimate the distance between two entries?)
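
To make the question concrete, here is one rough possibility (a
sketch only - it is not what uniqbib currently does, and the function
names are made up): treat each %X field value as a string and score
corresponding fields by edit distance, normalized by the length of
the longer value, so that 0 means identical and 1 means completely
different.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* classic dynamic-programming edit (Levenshtein) distance */
static int
editdist(const char *a, const char *b)
{
	int la = strlen(a), lb = strlen(b);
	int *prev = malloc((lb + 1) * sizeof(int));
	int *cur = malloc((lb + 1) * sizeof(int));
	int i, j, d;

	for (j = 0; j <= lb; j++)
		prev[j] = j;
	for (i = 1; i <= la; i++) {
		int *tmp;

		cur[0] = i;
		for (j = 1; j <= lb; j++) {
			int sub = prev[j-1] + (a[i-1] != b[j-1]);
			int del = prev[j] + 1;
			int ins = cur[j-1] + 1;

			cur[j] = sub;
			if (del < cur[j]) cur[j] = del;
			if (ins < cur[j]) cur[j] = ins;
		}
		tmp = prev; prev = cur; cur = tmp;
	}
	d = prev[lb];
	free(prev);
	free(cur);
	return d;
}

/* 0.0 = identical, 1.0 = entirely different */
double
fielddist(const char *a, const char *b)
{
	int la = strlen(a), lb = strlen(b);
	int longer = la > lb ? la : lb;

	return longer ? (double) editdist(a, b) / longer : 0.0;
}

int
main(int argc, char **argv)
{
	if (argc == 3)
		printf("%.2f\n", fielddist(argv[1], argv[2]));
	return 0;
}

An entry-level score might then be an average of the field scores,
with some threshold deciding "near duplicate" - where to put that
threshold, and whether some fields (%A, %T) should count for more
than others, is part of what I'm asking about.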

I would prefer that responses be posted.  Thanks.

Paul DuBois
dubois@primate.wisc.edu	rhesus!dubois
bin@primate.wisc.edu	rhesus!bin

perlman@tut.cis.ohio-state.edu (Gary Perlman) (12/26/88)

In article <455@rhesus.primate.wisc.edu>, bin@primate.wisc.edu (Brain in Neutral) writes:
>I've been in correspondence with several people who have expressed
>an interest in looking for "near" duplicates, such as might arise
>when entries are added to a bibliographic database by different
>people.  In such cases, entries may be "the same" to a human reader
>but slightly different as text - a journal title might be abbreviated
>by one person and not the other.

OCLC (Online Computer Library Center) is a non-profit company formed
to establish, maintain, and operate a computerized library network
(among other things).  I was told that they have a database of about
18 million entries, mostly on books.  They are very interested in the
problem of detecting duplicate bibliographic records.  In their annual
review of OCLC research (July 1987-June 1988), I saw some work in that
area.  Tom Hickey would be a good person to ask for pointers, although
I do not think he is actively working in that area.  One research
project is called "Duplicate Detection and the 'Species Problem,'" which
was managed by John Bunge.  The other is called "Clustering Equivalent
Bibliographic Records," which was managed by Elaine Svenonius (a visiting
scholar at the time, so she may not be there now).

	OCLC Online Computer Library Center, Inc.
	6565 Frantz Road
	Dublin, OH 43017
	Phone: 614-764-6000

Another area of research that seems relevant is the work at Bellcore
by Sue Dumais (with others) on Latent Semantic Indexing.  They had a
paper in the last or second-to-last ACM SIGCHI Conference Proceedings.
She can probably be reached at std@bellcore.com.  I do not seem to
have the physical address.
-- 
Gary Perlman                   Department of Computer and Information Science
perlman@cis.ohio-state.edu     The Ohio State University
614-292-2566                   2036 Neil Avenue Mall
                               Columbus, OH 43210-1277

jbayer@ispi.UUCP (Jonathan Bayer) (12/29/88)

In article <455@rhesus.primate.wisc.edu>, bin@primate.wisc.edu (Brain in Neutral) writes:
> A short while ago, I posted "uniqbib", a program for eliminating
> duplicates from bibliographic databases in refer format.
> 
> Several strategies for finding near duplicates have been suggested to
> me, and I've thought of several others.  I'm asking for comment from
> the net on this issue.  Given two entries, how would you determine
> whether they are the same?  (Phrased another way, how would you
> estimate the distance between two entries?)

Try the Soundex algorithm.  It will be able to match two words which
are spelled differently but are basically the same phonetically.  It
will not be able to match an abbreviation with a full word, however.
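
In case it helps, here is a simplified sketch of the classic Soundex
rules (keep the first letter, map the remaining consonants to digits,
drop vowels and adjacent repeats, pad to four characters).  It leaves
out the usual H/W refinement and isn't taken from any particular
implementation:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* map a letter to its Soundex digit; '0' for vowels, H, W, Y */
static char
scode(int c)
{
	switch (toupper(c)) {
	case 'B': case 'F': case 'P': case 'V':
		return '1';
	case 'C': case 'G': case 'J': case 'K':
	case 'Q': case 'S': case 'X': case 'Z':
		return '2';
	case 'D': case 'T':
		return '3';
	case 'L':
		return '4';
	case 'M': case 'N':
		return '5';
	case 'R':
		return '6';
	default:		/* vowels, H, W, Y, non-letters */
		return '0';
	}
}

/* out must have room for 5 characters */
void
soundex(const char *name, char *out)
{
	int i = 0;
	char prev;

	while (*name && !isalpha((unsigned char) *name))
		name++;
	if (*name == '\0') {
		strcpy(out, "0000");
		return;
	}
	out[i++] = toupper((unsigned char) *name);
	prev = scode((unsigned char) *name++);
	while (*name && i < 4) {
		char c = scode((unsigned char) *name++);

		if (c != '0' && c != prev)
			out[i++] = c;
		prev = c;
	}
	while (i < 4)
		out[i++] = '0';
	out[i] = '\0';
}

int
main(int argc, char **argv)
{
	char buf[5];
	int i;

	for (i = 1; i < argc; i++) {
		soundex(argv[i], buf);
		printf("%s\t%s\n", argv[i], buf);
	}
	return 0;
}

This gives the same code (L152) for "Levenshtein" and "Levenstein",
for example, but "Journal" and "J." still come out different, which
is the abbreviation problem mentioned above.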

Jonathan Bayer
Intelligent Software Products, Inc.


-- 
life used to be so simple.