kazic@ANTARES.MCS.ANL.GOV (09/25/90)
Rob,

You are right, I think it is quite simple, at least for 98% of the entries. I have been told that many have tried without success, but I can't speak to their experience. Mine so far has been encouraging. As an aside, I thought LDL will soon be able to manipulate Sybase tables, which is what the relational version of GenBank is supposed to be in, so I don't know whether you need to write a parser.

I have written a fair proportion of the elements of a feature table parser in awk, as part of some code to strip out facts relevant to E. coli for inclusion in my Prolog database. What it does so far is flatten a number of variations in phrasing, then output each feature, whether GenBank or EMBL, as a separate Prolog fact. The list of variations is certainly incomplete.

What I have to go back and write is the parser which recognizes a keyword in the one-line text comment (after the trigger word and coordinates) and substitutes that keyword for the trigger word in a new fact. The keyword will consist of a list containing combinations of "important words". The important words are gene names, protein names, and names of functional sites in genes (e.g. promoter, operator, etc.). The parser will be in Prolog.

I have deferred writing it until I see the relational version, which I believe will come out as version 64. I understand it is considerably different from the flat file format and possibly better, so I concluded I had extracted much of the exercise value from what I did and could wait to complete it until I had seen the new format.

You should be aware that there are inconsistencies of usage which could hang a parser (this is why I flattened; my choices are based on my own biological experience and informal polling of my friends as to word usage). I had planned to use GenBank and PIR as sources for many of the gene and protein names, along with a list of genetic loci for E. coli which I have.
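
The flatten-then-emit step described above can be sketched roughly as follows. This is a minimal Python illustration, not the awk code itself. The synonym table, the `feature/6` fact layout, the entry names, and the function names are all assumptions made up for the example; the real list of variations and the real fact format may look quite different.

```python
import re

# Hypothetical synonym table: each variant phrasing maps to one canonical
# key. These entries are illustrative only, not the actual list.
CANONICAL = {
    "prm": "promoter",
    "promoter": "promoter",
    "promotor": "promoter",
    "pept": "peptide",
    "peptide": "peptide",
}

def flatten(key):
    """Map a variant key to its canonical form (unknown keys pass through)."""
    k = key.strip().lower()
    return CANONICAL.get(k, k)

def atom(text):
    """Render text as a Prolog atom, quoting it when it is not a plain
    lowercase identifier. Embedded single quotes are doubled."""
    if re.fullmatch(r"[a-z][A-Za-z0-9_]*", text):
        return text
    return "'" + text.replace("'", "''") + "'"

def to_fact(source, entry, key, start, end, comment):
    """Emit one feature as one Prolog fact (assumed layout: feature/6)."""
    return "feature({}, {}, {}, {}, {}, {}).".format(
        atom(source), atom(entry), atom(flatten(key)),
        int(start), int(end), atom(comment))

if __name__ == "__main__":
    # Made-up records in a simplified (source, entry, key, from, to, comment)
    # form; a real feature table would need parsing first.
    records = [
        ("genbank", "ECOLAC", "PRM", 1, 80, "lac promoter"),
        ("embl", "ECLAC", "Promotor", 1, 80, "lac 5' region"),
    ]
    for rec in records:
        print(to_fact(*rec))
```

Keeping the synonym table as plain data means new variations can be added as they turn up without touching the code that writes the facts.
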
I have written code to extract the gene names from GenBank, but have not dealt with the PIR names yet. In general I suggest you organize the vocabulary first; I found I spent a considerable amount of time looking at entries to get a sense of usage.

Toni