[bionet.molbio.genome-program] GenBank Parser

read@cs.utexas.edu (Rob) (09/23/90)

*  Does anyone have a lexer/parser for the GenBank Feature Table?

*  Would anyone want one if I wrote one?

I am seeking to avoid duplication of effort.  I work for the Biological
Workstation Group at the University of Texas Center for High
Performance Computing (UT-CHPC).  We are going to construct a logic
programming-based query interface to the feature table using the
experimental knowledge base LDL.  At present, LDL reads data
from a "flat" file in its own format.  Therefore, I need to translate
the GenBank distribution into that format.
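
To make the translation step concrete, here is a minimal sketch in Python
(chosen only for brevity).  It assumes the flat-file layout in which a
feature key starts in column 6 and its location or qualifiers start in
column 22; that layout is an assumption to check against the published
definition.  The helper names read_features and to_fact are hypothetical,
and the fact syntax it prints is a made-up placeholder, not LDL's real
input format.

```python
def read_features(lines):
    """Group feature-table lines into dicts with key, location, qualifiers.

    Assumes the first 21 columns hold the feature key (blank on
    continuation lines) and the rest holds a location or a /qualifier.
    """
    features = []
    for line in lines:
        key, body = line[:21].strip(), line[21:].strip()
        if key:
            # A non-blank key field starts a new feature.
            features.append({"key": key, "location": body, "qualifiers": []})
        elif not features or not body:
            continue  # stray or empty line: ignore in this sketch
        elif body.startswith("/"):
            features[-1]["qualifiers"].append(body)
        elif features[-1]["qualifiers"]:
            # Continuation of the most recent qualifier's text.
            features[-1]["qualifiers"][-1] += " " + body
        else:
            # Continuation of a location that wrapped onto a second line.
            features[-1]["location"] += body
    return features


def to_fact(feature):
    """Render one feature as a placeholder fact (NOT actual LDL syntax)."""
    quals = ", ".join('"%s"' % q.replace('"', "'")
                      for q in feature["qualifiers"])
    return 'feature("%s", "%s", [%s]).' % (
        feature["key"], feature["location"], quals)


if __name__ == "__main__":
    sample = [
        "     %-16s%s" % ("CDS", "join(12..78,"),
        " " * 21 + "134..202)",
        " " * 21 + '/product="example protein"',
        "     %-16s%s" % ("misc_feature", "300..320"),
    ]
    for feat in read_features(sample):
        print(to_fact(feat))
```

The location strings are carried through untouched here; interpreting them
is a separate parsing problem.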

I am a computer science student and new to the biological aspects
of this area.  From a computer science point of view, it seems
pretty obvious that the published feature table definition should
be easily parsed using widely available lexical analyzer and
parser generator tools, like Flex/Lex and Bison/Yacc.
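
As a rough illustration of the lexer/parser split, here is a minimal
sketch written in Python rather than Lex/Yacc, purely to keep it short.
It handles only an assumed subset of location expressions: single
positions, ranges such as 12..78, and operators such as join(...) or
complement(...).  The grammar, the token names, and the function names
(tokenize, parse_location) are my own assumptions, not taken from the
published definition.

```python
import re

# Token patterns for an assumed subset of the location syntax.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<RANGE>\.\.)
  | (?P<NAME>[A-Za-z_]+)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<COMMA>,)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on unknown input."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()


def _expect(tokens, i, kind):
    """Return (value, next_index) if token i has the given kind."""
    if i >= len(tokens) or tokens[i][0] != kind:
        raise ValueError(f"expected {kind} at token {i}")
    return tokens[i][1], i + 1


def _parse(tokens, i):
    """Recursive descent: location := NUMBER [.. NUMBER] | NAME ( list )."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, value = tokens[i]
    if kind == "NUMBER":
        start, i = int(value), i + 1
        if i < len(tokens) and tokens[i][0] == "RANGE":
            end_text, i = _expect(tokens, i + 1, "NUMBER")
            return ("span", start, int(end_text)), i
        return ("site", start), i
    if kind == "NAME":
        _, i = _expect(tokens, i + 1, "LPAREN")
        args = []
        while True:
            arg, i = _parse(tokens, i)
            args.append(arg)
            if i < len(tokens) and tokens[i][0] == "COMMA":
                i += 1
                continue
            break
        _, i = _expect(tokens, i, "RPAREN")
        return (value, args), i
    raise ValueError(f"unexpected token {value!r}")


def parse_location(text):
    """Parse a whole location string into a nested tuple tree."""
    tokens = list(tokenize(text))
    node, i = _parse(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing tokens after location")
    return node


if __name__ == "__main__":
    print(parse_location("join(12..78,134..202)"))
```

A generated Lex/Yacc parser would have the same two layers; the point is
only that the location syntax is regular enough to tokenize with a
handful of patterns.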

So I figure someone out there has already built a parser.
If not, I will do so.

Thanks for any response and looking for electronic contacts,
Rob

Robert L. Read				University of Texas
read@cs.utexas.edu			Center for High Performance
(512)-477-1240				Computation

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (09/25/90)

In article <922@dimebox.cs.utexas.edu> read@cs.utexas.edu (Rob) writes:
>*  Does anyone have a lexer/parser for the GenBank Feature Table?
>*  Would anyone want one if I wrote one?

This is an excellent idea.  But it depends on a "Definition of GenBank", a
document that (to my knowledge) was never made public.  Has this been released?
Is it complete?  Does it include a complete BNF of the entire data structure
(not just the features table)?  Does it define allowed values of all parameters
in the structure?  If these things are not in place and documented, your parser
is doomed eventually...  (I spent many years attempting as a GenBank Advisor to
get them to define the database, and I don't think they have.)

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

usenet@nlm.nih.gov (usenet news poster) (09/25/90)

In article <1885@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>In article <922@dimebox.cs.utexas.edu> read@cs.utexas.edu (Rob) writes:
>>*  Does anyone have a lexer/parser for the GenBank Feature Table?
>>*  Would anyone want one if I wrote one?
>
>This is an excellent idea.  But it depends on a "Definition of GenBank", a
>document that (to my knowledge) was never made public...

We have a parser which translates the new GenBank format to the data formats
defined in the GenInfo ASN.1 definition.  It should be released with
the NCBI software toolkit distribution later this fall.  Sources and
documentation will be placed in the anonymous ftp site "ncbi.nlm.nih.gov" 
subdirectories ./toolbox and ./tech-reports when they are available
(not yet).

David States
National Center for Biotechnology Information
National Library of Medicine

roy@phri.nyu.edu (Roy Smith) (09/25/90)

states@artemis.NLM.NIH.GOV (David States) writes:
-> data formats defined in the GenInfo ASN.1 definition.

What is "GenInfo ASN.1"?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

usenet@nlm.nih.gov (usenet news poster) (09/29/90)

In article <1990Sep25.013551.9336@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>
>What is "GenInfo ASN.1"?

ASN.1 is Abstract Syntax Notation One, an International Organization for
Standardization (ISO) standard for the description of data.  It is designed
to facilitate exchange of data between application programs.  Automated
tools are available to read ASN.1 definitions and generate source code to
read the data files, etc.

GenInfo is a trademark of the National Library of Medicine (NLM) applied to
a number of databases and services being developed at the National
Center for Biotechnology Information (NCBI), a recently created center 
within the NLM.

More information on the data formats we are using can be obtained
by anonymous FTP to the server ncbi.nlm.nih.gov.  Look at tech-report.1.txt
in the ./tech-reports directory.

David States
states@ncbi.nlm.nih.gov