bin@primate.wisc.edu (Brain in Neutral) (12/23/88)
A short while ago, I posted "uniqbib", a program for eliminating duplicates from bibliographic databases in refer format. My motivation originally was to allow the results of several overlapping lookbib queries to filter those references that were hits on more than one query. In such cases you know the entries will be identical.

I've been in correspondence now with several people who have expressed an interest in looking for "near" duplicates, such as might arise when entries are added to bibliographic databases by different people. In this instance, entries may be "the same" to a human, but actually slightly different: a journal title might be abbreviated by one person and not by the other.

Several strategies for finding near duplicates have been suggested to me, and I've thought of several others. I'm asking for comment from the net on this issue. Given two entries, how would you determine whether they are the same? (Phrased another way, how would you estimate the distance between two entries?)

I would prefer that responses be posted. Thanks.

Paul DuBois
dubois@primate.wisc.edu    rhesus!dubois
bin@primate.wisc.edu       rhesus!bin
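One way to make the "distance between two entries" question concrete is sketched below. It is only an illustration, not how uniqbib works: the function names `parse_refer` and `entry_distance` are hypothetical, the normalization rules are an assumption, and the score is simply one minus the average string similarity (from Python's standard `difflib`) over the fields both entries share.

```python
import re
from difflib import SequenceMatcher


def parse_refer(entry):
    """Parse one refer entry ('%X value' lines) into {key: normalized text}.

    Assumption: continuation lines (lines not starting with '%') are
    ignored in this sketch, and repeated keys such as %A are joined.
    """
    fields = {}
    for line in entry.strip().splitlines():
        match = re.match(r"%(\S)\s+(.*)", line)
        if not match:
            continue
        key, value = match.groups()
        # Normalize: lowercase, drop punctuation, collapse whitespace.
        value = re.sub(r"[^a-z0-9 ]", "", value.lower())
        value = " ".join(value.split())
        fields[key] = (fields.get(key, "") + " " + value).strip()
    return fields


def entry_distance(a, b):
    """Return a distance in [0, 1]: 0.0 for identical entries, 1.0 if
    the two entries have no field keys in common."""
    fa, fb = parse_refer(a), parse_refer(b)
    shared = set(fa) & set(fb)
    if not shared:
        return 1.0
    sims = [SequenceMatcher(None, fa[k], fb[k]).ratio() for k in shared]
    return 1.0 - sum(sims) / len(sims)
```

Two entries could then be treated as near duplicates when `entry_distance` falls below some threshold. Choosing that threshold, and deciding whether some fields (say, title and year) should weigh more than others, is exactly the open question.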
bin@primate.wisc.edu (Brain in Neutral) (12/23/88)
From article <455@rhesus.primate.wisc.edu>, by bin@primate.wisc.edu (Brain in Neutral):
> A short while ago, I posted "uniqbib", a program for eliminating
> duplicates from bibliographic databases in refer format. My motivation
> originally was to allow the results of several overlapping lookbib
> queries to filter those references that were hits on more than one
> query.

Make that "to allow the results of several overlapping lookbib queries to be combined while filtering...". Arg.

Paul DuBois
dubois@primate.wisc.edu    rhesus!dubois
bin@primate.wisc.edu       rhesus!bin
perlman@tut.cis.ohio-state.edu (Gary Perlman) (12/26/88)
In article <455@rhesus.primate.wisc.edu> bin@primate.wisc.edu
>I've been in correspondence now with several people who have expressed
>an interest in looking for "near" duplicates, e.g., such as might arise
>when entries are added to bibliographic databases by different people.
>In this instance, entries may be "the same" to a human, but actually
>slightly different - a journal title might be abbreviated by one person
>and not the other.

OCLC (Online Computer Library Center) is a non-profit company formed to establish, maintain, and operate a computerized library network (among other things). I was told that they have a database of about 18 million entries, mostly on books. They are very interested in the problem of detecting duplicate bibliographic records.

In their annual review of OCLC research (July 1987-June 1988), I saw some work in that area. Tom Hickey would be a good person to ask for pointers, although I do not think he is actively working in that area. One research project is called "Duplicate Detection and the 'Species Problem,'" which was managed by John Bunge. The other is called "Clustering Equivalent Bibliographic Records," which was managed by Elaine Svenonius (a visiting scholar at the time, so she may no longer be there).

OCLC Online Computer Library Center, Inc.
6565 Frantz Road
Dublin, OH 43017
Phone: 614-764-6000

Another area of research that seems relevant is the work at Bellcore by Sue Dumais (with others) on Latent Semantic Indexing. They had a paper in the last or second-to-last ACM SIGCHI Conference Proceedings. She can probably be reached at std@bellcore.com. I do not seem to have the physical address.
--
Gary Perlman                  Department of Computer and Information Science
perlman@cis.ohio-state.edu    The Ohio State University
614-292-2566                  2036 Neil Avenue Mall, Columbus, OH 43210-1277
jbayer@ispi.UUCP (Jonathan Bayer) (12/29/88)
In article <455@rhesus.primate.wisc.edu>, bin@primate.wisc.edu (Brain in Neutral) writes:
> A short while ago, I posted "uniqbib", a program for eliminating
>
> Several strategies for finding near duplicates have been suggested to
> me, and I've thought of several others. I'm asking for comment from
> the net on this issue. Given two entries, how would you determine
> whether they are the same. (phrased another way, how would you estimate
> the distance between two entries?)
>

Try the Soundex algorithm. It can match two words that are spelled differently but are basically the same, that is, words that sound alike. It will not be able to match an abbreviation with a full word, however.

Jonathan Bayer
Intelligent Software Products, Inc.
-- life used to be so simple.
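For readers who have not met Soundex, here is a minimal sketch of the common four-character variant (first letter plus three digits). It is an illustration only, not code from any poster in this thread. The function name `soundex` and the handling of non-letters are assumptions, and other Soundex variants differ in how they treat "h" and "w".

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in enumerate(
    ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1
):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of word, or "" if it
    contains no ASCII letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")  # code of the previous coded position
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")  # pad short codes with zeros
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so the two names would be treated as a match. The limitation mentioned above also shows up: `soundex("Journal")` is `J654` while `soundex("J.")` is `J000`, so an abbreviated journal title is not matched to its full form.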