[net.wanted.sources] Wanted: software to help with 'bib' database management

roy@phri.UUCP (Roy Smith) (05/14/85)

	We use the 'bib' bibliographic data base system here, but I
suspect this applies to 'refer' users as well.  We have a lot of trouble
keeping duplicate entries out of our data base.

	People working on a new manuscript keep their own private
reference files which have only the references that don't appear in the
system ref file.  Periodically, they tag them with their name and date
(we use %W for this) and mail them to one person who adds them to the
system list.

	Invariably, no matter how careful people are, dups pop up.  We
have identified 2 sources of real dups and a source of pseudo dups.

	1) Somebody cites a paper which is not in the system file, so
they put it in their private file.  Before that entry gets added to the
system file, someone else does the same thing.  Eventually, both people,
each convinced that they have eliminated all dups, submit their
references for inclusion in the master list.

	2) A reference is put in the system-wide file twice, but bib
doesn't complain because one of them has a typo in it.  Eventually,
somebody notices the typo, submits a correction request, and all of a
sudden, people's known-to-be-unique citations generate error messages
from bib.

	3) Early in the year, someone puts a reference in the data base
and finds a citation (a set of keywords) that matches it uniquely.  They
send their manuscript out; later in the year it comes back for
revisions.  By this time, someone else has added another ref to the
master list which is not a dup but matches the same keywords.  The
citation that worked earlier in the year is now ambiguous, because it
doesn't have any of the disambiguating keywords in it.

	What we need is an automated way to search the reference file(s)
for near duplicates.  The scheme I have in mind would take a reference
file, parse it into individual references, and for each reference,
generate a list of keywords.  So far, this is just the 'invert' front
end.  It would then report any entry in the system file which matches
75% or more of that reference's keywords.  Has anybody already done
anything like this, or
have any hints as to how I should proceed?
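
	A minimal sketch of the scheme described above, in Python, might
look like the following.  It rests on several assumptions that are not
taken from 'bib' or 'invert': references are blank-line-separated
records whose fields start with a '%' letter (the 'refer' layout); a
keyword is any lowercased run of letters or digits at least 3 characters
long that is not in a small made-up stop list; the %W tag field is
skipped because it holds the submitter's name and date; and the 75%
figure is measured against the keywords of the reference being checked.
The names parse_references, keywords, and near_duplicates are
hypothetical.

```python
import re

# Assumption: a tiny illustrative stop list, not the one 'invert' uses.
STOP_WORDS = {"the", "and", "for", "with", "from"}


def parse_references(text):
    """Split a refer-style file into references (blank-line separated).

    Each reference becomes a list of (field_letter, value) pairs.
    Lines that do not start with '%' continue the previous field.
    """
    references = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = []
        for line in block.splitlines():
            line = line.strip()
            if line.startswith("%") and len(line) >= 2:
                fields.append((line[1], line[2:].strip()))
            elif line and fields:
                letter, value = fields[-1]
                fields[-1] = (letter, value + " " + line)
        if fields:
            references.append(fields)
    return references


def keywords(reference, skip_fields="W"):
    """Return the set of keywords for one parsed reference."""
    words = set()
    for letter, value in reference:
        if letter in skip_fields:
            continue  # assumption: %W holds name/date tags, not content
        for word in re.findall(r"[a-z0-9]+", value.lower()):
            if len(word) >= 3 and word not in STOP_WORDS:
                words.add(word)
    return words


def near_duplicates(new_refs, system_refs, threshold=0.75):
    """Return (new_index, system_index, fraction) for each near match.

    fraction is the share of the new reference's keywords that also
    occur in the system reference.
    """
    system_keys = [keywords(ref) for ref in system_refs]
    hits = []
    for i, ref in enumerate(new_refs):
        keys = keywords(ref)
        if not keys:
            continue  # nothing to compare
        for j, other in enumerate(system_keys):
            fraction = len(keys & other) / len(keys)
            if fraction >= threshold:
                hits.append((i, j, fraction))
    return hits
```

	To use it, read the private file and the system file into
strings, pass each through parse_references, and hand the two lists to
near_duplicates.  Running the system file against itself (and ignoring
pairs where both indices are equal) would look for dups already in the
master list.  This compares every pair of references, so it is slow for
a large data base; an inverted index from keyword to reference would cut
the work down.  A one-character typo still changes a keyword, which is
why the threshold is below 100%.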

-- 
allegra!phri!roy (Roy Smith)
System Administrator, Public Health Research Institute