roy@alanine.phri.nyu.edu (Roy Smith) (02/01/91)
Twice in the last couple of days, the same thing has happened to me. J. Random biologist walks into my office and wants to find an entry in genbank. I try hard to extract some useful keywords. In this case, I got from said JRB the name of a gene, FemA. Unfortunately, FEMA isn't a keyword that the entry is indexed under, but "FEMA PROTEIN" is, which we only discovered by some trial and error.

Obviously, a query of FEMA should match the "FEMA PROTEIN" keyword supplied by the submitter, but what's the best way to make that work? One strategy is to have the searching program (be it IRX or anything else) be smart enough to do partial matches. Another would be to have the database maintainers/indexers be smart enough to realize that while PROTEIN by itself would not make a very good keyword, FEMA by itself would, and turn the submitter's "FEMA PROTEIN" into "FEMA PROTEIN, FEMA".

As a programmer who writes database searching software, I'd prefer the latter solution, since it makes my life easier at the expense of somebody else's effort. I imagine the database maintainers feel just the other way.

I'd be interested to hear comments from other people about what is the best way to generate good keyword/keyphrase indices for genbank. I suppose I could always just take the keyword index IG provides and re-work it to split keyphrases into their component words, but after my flame last week about personalized reformatting of files on the distribution tapes, I'd probably just end up getting into a shouting match with myself :-)

--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
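The idea of splitting keyphrases into their component words, while keeping generic words such as PROTEIN from becoming keywords on their own, might look roughly like the sketch below. The stop list `GENERIC_WORDS` and the function name `expand_keyphrase` are hypothetical, made up for illustration; nothing here reflects how GenBank or IG actually builds its index.

```python
# Hypothetical stop list: words too generic to be useful keywords alone.
GENERIC_WORDS = {"PROTEIN", "GENE", "SEQUENCE", "DNA", "RNA"}

def expand_keyphrase(phrase):
    """Return the original keyphrase plus each component word that is
    specific enough to stand alone as a keyword."""
    phrase = phrase.strip().upper()
    expanded = [phrase]
    for word in phrase.split():
        if word not in GENERIC_WORDS and word not in expanded:
            expanded.append(word)
    return expanded

print(expand_keyphrase("FEMA PROTEIN"))  # ['FEMA PROTEIN', 'FEMA']
```

The hard part is the stop list itself: deciding which words are too generic is exactly the judgment call the post hopes the indexers would make.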
Davison@UH.EDU (Dan Davison) (02/01/91)
> Twice in the last couple of days, the same thing has happened to
> me. J. Random biologist walks into my office and wants to find an entry in
> genbank. I try hard to extract some useful keywords. In this case, I got
> from said JRB the name of a gene, FemA. Unfortunately, FEMA isn't a
> keyword that the entry is indexed under, but "FEMA PROTEIN" is
[...]
> I'd be interested to hear comments from other people about what
> is the best way to generate good keyword/keyphrase indices for genbank.

Geez, I thought it was only me. This happens a lot; one of the most brain-damaged things about GenBank in its flat-file incarnation is that the "keywords" are essentially a joke. J. Random Faculty Member or J. Random Gradual Student asks for this kind of thing once a week or so. And people always want such things from the e-mail server (UH's, not GenBank's; presumably they want it from GB too). I have just about finished an awk script to let people do lookups by keyword, but it uses the GB-provided keyword index.

The solution is to do a baby IRX: index all the words in the headers. (If the Gene-Server grant is funded I think I'll do this; thanks for the idea, Roy.)

The thing to do is to somehow get the flat file format "keyword" line improved. Since there is YAGCC in the winds (Yet Another GenBank Contract Change) I won't hold my breath, but it sure would be nice if the GenBank curator project would have the curators attach "significant" (to others in the field) keywords. Paul?

> but after my flame last week about personalized reformatting of
> files on the distribution tapes, I'd probably just end up getting
> into a shouting match with myself :-)

I know how you feel...see my .sig...

dan
--
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
Disclaimer: As always, I speak only for myself, and, usually, only to myself.
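A "baby IRX" along the lines suggested above, indexing every word in the entry headers, could be sketched as follows. This is a minimal illustration, not the awk script mentioned in the post and not how IRX works internally. It assumes the usual flat-file layout in which an entry starts with a LOCUS line, the sequence follows an ORIGIN line, and `//` ends the entry; the function name `index_headers` and the sample entry are hypothetical.

```python
import re
from collections import defaultdict

def index_headers(lines):
    """Build an inverted index mapping each header word to the set of
    LOCUS names whose header contains it. Sequence data after ORIGIN
    is skipped."""
    index = defaultdict(set)
    locus = None
    in_header = False
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else None
            in_header = locus is not None
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_header = False
        if in_header:
            # Field names such as DEFINITION get indexed too; a real
            # version would probably want to drop them.
            for word in re.findall(r"[A-Za-z0-9]+", line):
                index[word.upper()].add(locus)
    return index

# Made-up entry, for illustration only.
sample = """\
LOCUS       EXAMPLE1   120 bp
DEFINITION  Hypothetical femA protein gene.
KEYWORDS    FemA protein.
ORIGIN
        1 acgtacgtac
//
""".splitlines()

index = index_headers(sample)
print(sorted(index["FEMA"]))  # ['EXAMPLE1']
```

With every header word indexed, a query of FEMA finds the entry whether the submitter wrote "FEMA PROTEIN" on the KEYWORDS line or only mentioned the gene in the DEFINITION.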
kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/01/91)
Roy,

The utility of the GenBank keywords has been problematic for many years. The standardization of vocabulary for such a complex subject as ours is not a trivial task, but I acknowledge from my own experience that examples of suboptimal keyword choices are not hard to find in the database.

Please note that the index file provided with the database merely compiles what is on the KEYWORDS line in the flat files and does not attempt any additional classifications. For the more astute, one could always try utilities such as grep, etc., on this file. On GOS we have surmounted this problem through the use of IRX which basically indexes every word in the database and makes keyword searches trivial.

The National Library of Medicine has developed (with considerable effort) a standard terminology called MeSH (Medical Subject Headings). However, at this stage it would require much, much more effort and money to try and rework all of the GenBank keyword entries than to simply adopt the IRX approach and invert the database for keyword searches.

Our colleagues at LANL are now working on the RDBMS version of the database and perhaps they can elaborate on how keywords are treated in the relational format.

Sincerely,

Dave Kristofferson
GenBank Manager
kristoff@genbank.bio.net
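For readers without grep to hand, the kind of partial-match search over the keyword index suggested above can be approximated in a few lines. This is only a sketch: the layout of the real index file is not described in this thread, so the list of keyphrases below is hypothetical, as is the function name `grep_keywords`.

```python
def grep_keywords(query, index_lines):
    """Case-insensitive substring search, roughly what `grep -i` does."""
    needle = query.lower()
    return [line for line in index_lines if needle in line.lower()]

# Hypothetical keyphrases; the real index file layout may differ.
index_lines = [
    "FEMA PROTEIN",
    "FEMALE STERILE",
    "BETA-LACTAMASE",
]
print(grep_keywords("fema", index_lines))
# ['FEMA PROTEIN', 'FEMALE STERILE']
```

The second hit shows the cost of plain substring matching: it finds "FEMA PROTEIN" but also anything that merely contains the letters. Matching on whole words would be stricter.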
kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/01/91)
Dan,

Be careful if you want to permit keyword searches by e-mail. Even though we get asked this all the time, there is a good reason for not allowing it. The potential for horrendous amounts of output if an unfortunate choice of words is made is very high. That is why we restrict the use of IRX to the dial-up account. In IRX if a person does a search on something specific like "gene" 8-) the program warns them that they have found 1,465,345,567 entries and would they like to consider rephrasing their query.

Although a warning mechanism, e.g., sending back the number of entries, could be built into e-mail, this is obviously more work than the dial-up approach which allows people to scrutinize the output first instead of just sending it back through the mail system.

Perhaps a compromise for e-mail keyword searches would be just to return the DEFINITION lines describing the sequences and then let the user send additional mail to retrieve entries of interest. The worst output one could get from a search of release 66 would be about 41,000 DEFINITION lines in a mail message! That would be less than a 4 megabyte mail message 8-) 8-) 8-)!!!

"Bon chance" as they say in France!

Dave
Davison@UH.EDU (Dan Davison) (02/01/91)
Dave K. notes:

> The potential for horrendous amounts of output if an
> unfortunate choice of words is made is very high.

Yes, and my solution is...

> Although a warning mechanism, e.g., sending back the number of
> entries, could be built into e-mail,

and their

> Perhaps a compromise for e-mail keyword searches would be just
> to return the DEFINITION lines

and accession numbers. The results can still get pretty big, so I have a "top 100" (actually " ...| head -100 | ...") in the script.

The major problem is that I can't program to save myself, so awk chokes on several lines in GenBank. Until I get the time to do this, the whole thing's on hold.

dan
--
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
Disclaimer: As always, I speak only for myself, and, usually, only to myself.
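The combination discussed in this exchange (report the number of matches, return only accession numbers and DEFINITION text, and cap the list at 100) might be put together as in the sketch below. It is written in Python rather than awk and is not the script described above; `build_reply`, `MAX_HITS`, and the sample accession numbers are all hypothetical, and the keyword lookup that produces `hits` is assumed to exist elsewhere.

```python
MAX_HITS = 100  # mirrors the "head -100" cap mentioned in the post

def build_reply(query, hits, limit=MAX_HITS):
    """Format an e-mail reply body: total hit count first, then at most
    `limit` lines of accession number plus DEFINITION text.

    `hits` is a list of (accession, definition) tuples, assumed to come
    from some earlier keyword lookup that is not shown here."""
    lines = ["%d entries matched '%s'." % (len(hits), query)]
    if len(hits) > limit:
        lines.append("Showing the first %d; consider a narrower query." % limit)
    for accession, definition in hits[:limit]:
        lines.append("%s  %s" % (accession, definition))
    return "\n".join(lines)

# Made-up accession numbers and definitions, for illustration only.
hits = [("X00001", "Hypothetical femA protein gene."),
        ("X00002", "Hypothetical femB protein gene."),
        ("X00003", "Hypothetical example gene.")]
print(build_reply("fem", hits, limit=2))
```

Putting the count on the first line gives the user the same warning the dial-up IRX session would, and the cap keeps a query like "gene" from turning into a multi-megabyte mail message.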