[bionet.molbio.genome-program] parser for feature table

kazic@ANTARES.MCS.ANL.GOV (09/25/90)

Rob,

You are right, I think it is quite simple, at least for 98% of the entries.  I
have been told that many have tried without success, but I can't speak to
their experience.  Mine so far has been encouraging.  As an aside, I thought
LDL will soon be able to manipulate Sybase tables, which is the format the
relational version of GenBank is supposed to be in.  So I don't know if you
need to write a parser.

I have written a fair proportion of the elements for a feature table parser in
awk as part of some code to strip out facts relevant to E. coli for inclusion
in my Prolog database.  What it does so far is flatten a number of variations
in phrasing, then output each feature, whether GenBank or EMBL, as a
separate Prolog fact.  The list of variations is certainly incomplete.  What I
have to go back and write is the parser which recognizes a keyword in the
1-line text comment (after the trigger word and coordinates), and substitutes
that for the trigger word in a new fact.  The keyword will consist of a list
containing combinations of "important words".  The important words are gene
names, protein names, and names of functional sites in genes (e.g. promoter,
operator, etc.).  The parser will be in Prolog.  I have deferred writing it
until I see the relational version, which I believe will come out as version
64.  I understand it is considerably different from the flat file format and
possibly better, so I concluded I had extracted much of the exercise value
from what I did and could wait to complete it until I had seen the new format.
You should be aware that there are inconsistencies of usage which could hang a
parser (this is why I flattened; my choices are based on my own biological
experience and informal polling as to word usage among my friends).
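
To make the steps above concrete, here is a minimal sketch in Python (not
the awk or Prolog code itself).  Everything specific in it is an assumption
made for illustration: the one-line layout "trigger start end comment", the
synonym table, the vocabulary of important words, and the shape of the
feature/5 fact.

```python
import re

# Assumed, illustrative synonym table: variant phrasing -> canonical phrasing.
VARIANTS = {
    "promotor": "promoter",
    "rbs": "ribosome binding site",
    "orf": "open reading frame",
}

# Assumed vocabulary of "important words" (gene names and site names).
IMPORTANT = ["lacZ", "lacI", "promoter", "operator"]


def flatten(text):
    """Replace each known variant with its canonical form (whole words only)."""
    for variant, canonical in VARIANTS.items():
        pattern = r"\b%s\b" % re.escape(variant)
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text


def keyword(comment):
    """Return the important words found in the comment, in order of appearance."""
    known = {w.lower(): w for w in IMPORTANT}
    words = re.findall(r"[A-Za-z0-9]+", comment)
    return [known[w.lower()] for w in words if w.lower() in known]


def quote(atom):
    """Write a string as a quoted Prolog atom (embedded quotes are doubled)."""
    return "'" + atom.replace("'", "''") + "'"


def to_fact(entry, line):
    """Turn one 'trigger start end comment' line into one Prolog fact.

    If the comment contains important words, the list of them replaces the
    trigger word; otherwise the trigger word is kept.
    """
    parts = line.split(None, 3)
    trigger, start, end = parts[0], int(parts[1]), int(parts[2])
    comment = flatten(parts[3]) if len(parts) > 3 else ""
    found = keyword(comment)
    if found:
        head = "[" + ",".join(quote(w) for w in found) + "]"
    else:
        head = quote(trigger)
    return "feature(%s,%s,%d,%d,%s)." % (quote(entry), head, start, end, quote(comment))
```

Flattening runs before the keyword lookup so that the vocabulary only has to
list one spelling of each term.  A line with no recognized word keeps its
trigger word, which makes the gaps in the vocabulary easy to spot later.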

I had planned to use GenBank and PIR as sources for many of the gene and
protein names, along with a list of genetic loci for coli which I have.  I
have written code to extract the gene names from GenBank, but have not dealt
with the PIR names yet.  In general I suggest you organize the vocabulary; I
found I spent a considerable amount of time looking at entries to get a sense
of usage.
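
One way to get that sense of usage faster is to tally the words that appear
in the comments and list the ones the vocabulary does not yet cover.  The
sketch below is only an illustration in Python; the function names and the
idea of passing the comments in as a plain list are assumptions, not part of
the code described above.

```python
import re
from collections import Counter


def usage_counts(comments):
    """Count how often each word (case-folded) occurs across the comments."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", comment)
        counts.update(w.lower() for w in words)
    return counts


def uncovered(counts, vocabulary):
    """List words missing from the vocabulary, most frequent first."""
    known = {w.lower() for w in vocabulary}
    missing = [(w, n) for w, n in counts.items() if w not in known]
    # Sort by descending count, then alphabetically so ties are stable.
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))
```

Frequent uncovered words are candidates either for the vocabulary or for the
list of variations to flatten.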

Toni