[bionet.molbio.genbank] Suggestion for keywords in genbank

roy@alanine.phri.nyu.edu (Roy Smith) (02/01/91)

	Twice in the last couple of days, the same thing has happened to
me.  J. Random biologist walks into my office and wants to find an entry in
genbank.  I try hard to extract some useful keywords.  In this case, I got
from said JRB the name of a gene, FemA.  Unfortunately, FEMA isn't a
keyword that the entry is indexed under, but "FEMA PROTEIN" is, which we
only discover by some trial and error.

	Obviously, a query of FEMA should match the "FEMA PROTEIN" keyword
supplied by the submitter, but what's the best way to make that work?  One
strategy is to have the searching program (be it IRX or anything else) be
smart enough to do partial matches.  Another would be to have the database
maintainers/indexers be smart enough to realize that while PROTEIN by
itself would not make a very good keyword, FEMA by itself would and turn
the submitter's "FEMA PROTEIN" into "FEMA PROTEIN, FEMA".
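
A minimal sketch of the second strategy (expanding a submitted keyphrase
into the phrase plus its useful component words) might look like the
following.  The function name and the list of "too generic" words are
assumptions made for illustration; they are not part of GenBank, IRX, or
any existing indexing tool.

```python
# Hypothetical list of words too generic to stand alone as keywords.
GENERIC_WORDS = {"PROTEIN", "GENE", "SEQUENCE", "DNA", "RNA"}


def expand_keyphrase(phrase, generic=GENERIC_WORDS):
    """Return the keyphrase plus each component word that is not generic."""
    # Normalize case and whitespace so "FemA  protein" and "FEMA PROTEIN" agree.
    phrase = " ".join(phrase.upper().split())
    terms = [phrase]
    for word in phrase.split():
        # Skip generic words, and avoid duplicating a one-word phrase.
        if word not in generic and word not in terms:
            terms.append(word)
    return terms


print(expand_keyphrase("FEMA PROTEIN"))  # ['FEMA PROTEIN', 'FEMA']
```

The hard part is the generic-word list, which somebody still has to
maintain; the code only moves the effort around.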

	As a programmer who writes database searching software, I'd prefer
the latter solution, since it makes my life easier at the expense of
somebody else's effort.  I imagine the database maintainers feel just the
other way.  I'd be interested to hear comments from other people about what
is the best way to generate good keyword/keyphrase indices for genbank.

	I suppose I could always just take the keyword index IG provides
and re-work it to split keyphrases into their component words, but after my
flame last week about personalized reformatting of files on the
distribution tapes, I'd probably just end up getting into a shouting match
with myself :-)
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

Davison@UH.EDU (Dan Davison) (02/01/91)

> 	Twice in the last couple of days, the same thing has happened to
> me.  J. Random biologist walks into my office and wants to find an entry in
> genbank.  I try hard to extract some useful keywords.  In this case, I got
> from said JRB the name of a gene, FemA.  Unfortunately, FEMA isn't a
> keyword that the entry is indexed under, but "FEMA PROTEIN" is [...]
> I'd be interested to hear comments from other people about what
> is the best way to generate good keyword/keyphrase indices for genbank.


Geez, I thought it was only me.  This happens a lot; one of the most
brain-damaged things about GenBank in its flat-file incarnation is
that the "keywords" are essentially a joke.  J. Random Faculty Member
or J. Random Gradual Student asks for this kind of thing once a week
or so.  And people always want such things from the e-mail server (UH's,
not GenBank's; presumably they want it from GB too).  I have just
about finished an awk script to let people do lookups by keyword, but
it uses the GB-provided keyword index.  The solution is to do a baby
IRX: index all the words in the headers.  (If the Gene-Server grant is
funded I think I'll do this; thanks for the idea, Roy.)
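
As a rough illustration of the "baby IRX" idea, the sketch below builds
an inverted index from every word in an entry's header lines to the
LOCUS names that contain it.  It is written in Python rather than awk,
it is not how IRX itself works, and it assumes only the basic flat-file
layout (a LOCUS line starts an entry, ORIGIN starts the sequence, and
"//" ends the entry).  The sample entry is invented.

```python
import re
from collections import defaultdict


def index_headers(lines):
    """Map each header word (upper-cased) to the set of LOCUS names using it."""
    index = defaultdict(set)
    locus = None
    in_header = False
    for line in lines:
        if line.startswith("LOCUS"):
            locus = line.split()[1]
            in_header = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_header = False  # sequence data and terminators are not indexed
        if in_header and locus:
            # Drop the leading line-type tag (LOCUS, KEYWORDS, ...) when present.
            # Indented sub-tags such as AUTHORS are indexed as ordinary words.
            text = line if line[:1].isspace() else line.partition(" ")[2]
            for word in re.findall(r"[A-Za-z0-9]+", text):
                index[word.upper()].add(locus)
    return index


# Invented entry, for illustration only.
sample = """\
LOCUS       EXAMPLE1     120 bp    DNA
DEFINITION  Hypothetical femA gene, complete cds.
KEYWORDS    femA protein.
ORIGIN
        1 acgtacgtac
//
""".splitlines()

index = index_headers(sample)
print(sorted(index["FEMA"]))  # ['EXAMPLE1']
```

With every header word indexed, a query for FEMA finds the entry whether
the submitter wrote "FEMA PROTEIN" or anything else containing the word.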


The thing to do is to somehow get the flat file format "keyword" line
improved.  Since there is YAGCC in the wind (Yet Another GenBank
Contract Change) I won't hold my breath, but it sure would be nice if
the GenBank curator project would have the curators attach
"significant" (to others in the field) keywords.  Paul?



> but after my flame last week about personalized reformatting of
> files on the distribution tapes, I'd probably just end up getting
> into a shouting match with myself :-)

I know how you feel...see my .sig...

dan

-- 
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of
Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
Disclaimer: As always, I speak only for myself, and, usually, only to
myself.

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/01/91)

Roy,

	The utility of the GenBank keywords has been problematic for
many years.  The standardization of vocabulary for such a complex
subject as ours is not a trivial task, but I acknowledge from my own
experience that examples of suboptimal keyword choices are not hard to
find in the database.  Please note that the index file provided with
the database merely compiles what is on the KEYWORDS line in the flat
files and does not attempt any additional classification.  The more
astute could always try utilities such as grep on this file.

	On GOS we have surmounted this problem through the use of IRX
which basically indexes every word in the database and makes keyword
searches trivial.  The National Library of Medicine has developed
(with considerable effort) a standard terminology called MeSH (Medical
Subject Headings).  However, at this stage it would require much, much
more effort and money to try and rework all of the GenBank keyword
entries than to simply adopt the IRX approach and invert the database
for keyword searches.  Our colleagues at LANL are now working on the
RDBMS version of the database and perhaps they can elaborate on how
keywords are treated in the relational format.

				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/01/91)

Dan,

	Be careful if you want to permit keyword searches by e-mail.
Even though we get asked this all the time, there is a good reason for
not allowing it.  The potential for horrendous amounts of output if an
unfortunate choice of words is made is very high.  That is why we
restrict the use of IRX to the dial-up account.  In IRX, if a person
does a search on something specific like "gene" 8-) the program warns
them that they have found 1,465,345,567 entries and asks whether they
would like to rephrase their query.

	Although a warning mechanism, e.g., sending back the number of
entries, could be built into e-mail, this is obviously more work than
the dial-up approach which allows people to scrutinize the output
first instead of just sending it back through the mail system.
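
The warning mechanism described above could be sketched roughly as
follows: count the hits first, and if there are too many, mail back only
the count and a request to rephrase.  The function name, the cutoff of
100, and the word-to-entries dictionary are all assumptions made for
illustration; this is not the behavior of any existing e-mail server.

```python
MAX_HITS = 100  # hypothetical cutoff; a real server would tune this


def answer_query(index, word, max_hits=MAX_HITS):
    """Return the text of a reply for a one-word keyword query.

    `index` is assumed to map upper-cased words to sets of entry names.
    """
    hits = sorted(index.get(word.upper(), ()))
    if len(hits) > max_hits:
        # Too much output to mail back: report the count instead.
        return f"{len(hits)} entries match '{word}'; please rephrase your query."
    if not hits:
        return f"No entries match '{word}'."
    return "\n".join(hits)
```

The user pays one extra round trip through the mail system, but the
server never sends a multi-megabyte reply by accident.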

	Perhaps a compromise for e-mail keyword searches would be just
to return the DEFINITION lines describing the sequences and then let
the user send additional mail to retrieve entries of interest.  The
worst output one could get from a search of release 66 would be about
41,000 DEFINITION lines in a mail message!  That would be less than a
4 megabyte mail message 8-) 8-) 8-)!!!  "Bonne chance" as they say in
France!

Dave

Davison@UH.EDU (Dan Davison) (02/01/91)

Dave K. notes:



> The potential for horrendous amounts of output if an
> unfortunate choice of words is made is very high. 

Yes, and my solution is...

> 	Although a warning mechanism, e.g., sending back the number of
> entries, could be built into e-mail, 

and their

> 	Perhaps a compromise for e-mail keyword searches would be just
> to return the DEFINITION lines 

and accession numbers.  The results can still get pretty big, so I
have a "top 100" (actually " ...| head -100 | ...") in the script.
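
A sketch of that approach in Python (not the awk script mentioned here,
which is not shown in this thread): pull the accession number and
DEFINITION text from each entry, keep the ones whose definition contains
the query word, and stop after the first 100, much as "head -100" would.
The function names, the substring match, and the sample entries and
accession numbers are invented for illustration.

```python
from itertools import islice


def summaries(lines):
    """Yield (accession, definition) for each entry in a flat file."""
    accession, definition, field = None, [], None
    for line in lines:
        if line.startswith("//"):            # end of entry
            if accession:
                yield accession, " ".join(definition)
            accession, definition, field = None, [], None
        elif line[:1].isspace():              # continuation of the current field
            if field == "DEFINITION":
                definition.append(line.strip())
        else:                                 # a new field tag in column 1
            field, _, rest = line.partition(" ")
            if field == "DEFINITION":
                definition.append(rest.strip())
            elif field == "ACCESSION" and rest.split():
                accession = rest.split()[0]


def top_hits(lines, word, limit=100):
    """Return at most `limit` (accession, definition) pairs mentioning `word`."""
    matches = (s for s in summaries(lines) if word.upper() in s[1].upper())
    return list(islice(matches, limit))


# Invented entries, for illustration only.
sample = """\
LOCUS       EXAMPLE1     120 bp    DNA
DEFINITION  Hypothetical femA gene,
            complete cds.
ACCESSION   X00001
ORIGIN
        1 acgtacgtac
//
LOCUS       EXAMPLE2     90 bp    DNA
DEFINITION  Hypothetical unrelated gene.
ACCESSION   X00002
//
""".splitlines()

print(top_hits(sample, "fema"))
```

Handling continuation lines explicitly, as `summaries` does, is one way
to avoid tripping over multi-line DEFINITION fields.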

The major problem is that I can't program to save myself, so awk
chokes on several lines in GenBank.  Until I get the time to do this,
the whole thing's on hold.

dan
-- 
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of
Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
Disclaimer: As always, I speak only for myself, and, usually, only to
myself.