[bionet.molbio.genome-program] feature table parsers - what's it all mean?

MICPRF@lure.latrobe.edu.au (10/03/90)

Hi,
Excuse my profound ignorance, but all this discussion about Genbank
parsers and feature tables etc. has finally aroused my curiosity to
the point where I simply must risk being flamed. Can someone please
explain to me (first-grade level please!) what a feature table is,
what a parser does with it and what one can use that for?
If this is too elementary for general consumption, email to my
kindergarten please at micprf@lure.latrobe.edu.au. Thanks,
				Paul Fisher.

kristoff@genbank.bio.net (David Kristofferson) (10/03/90)

In a nutshell, the GenBank features table is a list of "features" such
as coding regions that are found in the accompanying sequence data.  A
parser is a computer program that attempts to decipher this
information and transform it for use in computations.  A simple
example might be as follows.  If the features table lists a coding
region from base x to base y, a parser would automatically read this
information and pass it to a translation program which would translate
the sequence between these bases.  Without a parser, a human would
have to read the features table and input the starting and ending base
into the translation program.
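
The example above can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions: the feature line layout
("CDS" followed by "x..y") is a simplified stand-in for the real
feature table, the function names coding_regions and translate are
hypothetical, and the standard genetic code is assumed.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Assumed, simplified feature line: the key "CDS" followed by "x..y".
CDS_LINE = re.compile(r"^\s*CDS\s+(\d+)\.\.(\d+)\s*$")


def coding_regions(feature_lines):
    """Return (start, end) pairs for every simple CDS line found."""
    regions = []
    for line in feature_lines:
        match = CDS_LINE.match(line)
        if match:
            regions.append((int(match.group(1)), int(match.group(2))))
    return regions


def translate(sequence, start, end):
    """Translate bases start..end (1-based, inclusive) into amino acids."""
    region = sequence[start - 1:end].upper()
    # Codons not in the table (e.g. containing N) are reported as X.
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X")
        for i in range(0, len(region) - 2, 3)
    )


features = ["     source          1..13", "     CDS             3..11"]
sequence = "ccATGGCCTAAgg"
for start, end in coding_regions(features):
    print(start, end, translate(sequence, start, end))
```

The parser (coding_regions) does the reading that a human would
otherwise do by eye, and hands the start and end bases straight to the
translation step.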

Hope this clarifies things.

 
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

pkarp@NCBI.NLM.NIH.GOV (Peter Karp) (10/03/90)

Let me answer the question in a different way.  From a computer
science point of view the GenBank features table is a language: every
possible feature is a sentence (string) constructed from an alphabet
of symbols (characters).  A grammar is a formal mechanism for
describing languages in a very precise way: a grammar specifies very
clearly what sentences are legal within that language and what
sentences are not.  Thus, by writing a grammar for the GenBank feature
table, GenBank is telling us what the allowable syntax of features is
-- without it we can only guess as to what strings we (and our
programs) can expect to see in feature tables.

A grammar is only a specification of a language -- it is not a
computer program for recognizing the language.  A parser is such a
program.  It takes as input a sentence in the language, and breaks the
sentence down into its constituent parts (for example, a parser of
English would break an English sentence down into verbs, nouns,
adjectives, etc.).  Parsers typically take different sorts of actions
when they see different grammatical elements (an English parser would
take one sort of action when it sees a verb and a different action
when it sees a noun).  Usually one cannot write a parser without
referring to a grammar of the language.  A parser generator is a
program that automatically creates a parser program from a grammar
(the Unix YACC program is a parser generator).
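
To make the grammar/parser distinction concrete, here is a minimal
hand-written sketch in Python.  The grammar in the comment is a toy
invented for illustration (it is NOT the official GenBank feature
table grammar), and the names tokenize and parse_location are
hypothetical.  A parser generator such as YACC would produce the
equivalent of parse_location from the grammar alone.

```python
import re

# Toy grammar (an assumption for illustration only):
#   location := span | "join" "(" span ( "," span )* ")"
#   span     := NUMBER ".." NUMBER
TOKEN = re.compile(r"\d+|\.\.|join|[(),]")


def tokenize(text):
    """Split text into tokens; reject characters outside the alphabet."""
    compact = text.replace(" ", "")
    tokens = TOKEN.findall(compact)
    if "".join(tokens) != compact:
        raise ValueError(f"illegal characters in {text!r}")
    return tokens


def parse_location(text):
    """Recursive-descent parser: return a list of (start, end) spans."""
    tokens = tokenize(text)
    pos = 0

    def expect(value):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != value:
            raise ValueError(f"expected {value!r} at token {pos}")
        pos += 1

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError(f"expected a number at token {pos}")
        pos += 1
        return int(tokens[pos - 1])

    def span():
        start = number()
        expect("..")
        return (start, number())

    if pos < len(tokens) and tokens[pos] == "join":
        expect("join")
        expect("(")
        spans = [span()]
        while pos < len(tokens) and tokens[pos] == ",":
            expect(",")
            spans.append(span())
        expect(")")
    else:
        spans = [span()]

    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return spans


print(parse_location("join(1..5,8..12)"))
```

Each function corresponds to one rule of the grammar, which is why a
written grammar makes the parser nearly mechanical to produce, and why
strings outside the grammar are rejected instead of being guessed at.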

The message here, as any computer scientist would have told you ten or
fifteen years ago, is that anyone who defines a language (such as a
text file that holds a database) that is to be used by a large number
of people, will make those people's lives MUCH easier if they define a
grammar for that language and make that grammar publicly available,
preferably in a format that can be used by a publicly-available
parser-generator program.  Without such a grammar people can only
guess as to the syntax of the language, people certainly cannot
generate parser programs automatically, and even writing parser
programs by hand can be VERY difficult.  Further, the very act of
writing a grammar makes the authors of such a language think very hard
about its properties: some languages are much easier than others for
computers to parse, and often the act of writing a grammar will cause
the language authors to transform the language into a new language
that is easier to parse.  By analogy, releasing a language without a
grammar is like having a society whose legal system operates without
written laws -- there are a lot of grey areas that will make many
people very nervous.

In summary, IF you release a computer database as a text file, PLEASE
consider writing a grammar for the language that you are using.

End of sermon.

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (10/04/90)

I am happy to see a discussion on parsing GenBank after all these years.

The feature table is only part of the problem.  For example, entries of GenBank
now end with a // (an idea taken from the EMBL database) so that programs can
distinguish where entries end.  Before Matt Yarus suggested this to me and I
brought the suggestion to the GenBank staff, it was difficult to tell where
the entry ended.  Indeed, since there is no definition of GenBank, some
programs give one hacked up entry formats that do not end with a // nor do they
have the same format as is on the GenBank tapes.  The authors of these programs
don't understand that the output of their programs should have a // at the end
simply because there is no standard definition of the format.
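
As a rough sketch of why the terminator matters, the Python below
splits a stream of lines into entries on a line consisting only of //.
The function name split_entries and the choice to treat a missing
final // as an error are assumptions made for this illustration, not
part of any GenBank definition.

```python
def split_entries(lines):
    """Group lines into entries, each terminated by a line holding only //."""
    entries, current = [], []
    for line in lines:
        if line.rstrip() == "//":
            entries.append(current)
            current = []
        else:
            current.append(line.rstrip("\n"))
    # Leftover non-blank lines mean the last entry never saw its //.
    if any(text.strip() for text in current):
        raise ValueError("last entry is not terminated by //")
    return entries


flat_file = ["LOCUS A\n", "ORIGIN\n", "//\n", "LOCUS B\n", "//\n"]
print(split_entries(flat_file))
```

A program whose output omits the // would fail this check, which is
exactly the sort of disagreement a written format definition settles.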

SUMMARY: we need a FULL DEFINITION of GenBank, not just the feature tables!!!

Example: we should be able to parse out the topology of an entry (circular or
linear sequence) and the references.  The topic is wider than most people have
discussed so far, and I don't understand why GenBank has resisted creating the
definition for so long.  I made the suggestion that a parsable form with an
associated DOCUMENT and DEFINITION is a requirement for the database AT LEAST 8
YEARS AGO.

How many more years will we have to wait for this?

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

davison@MENUDO.UH.EDU (Dan Davison) (10/04/90)

> SUMMARY: we need a FULL DEFINITION of GenBank, not just the feature
> tables!!! 

A complete definition is available from the GenBank staff at Los
Alamos.  Last I knew the name of the document was "dnaform.dc3".  It's
what I use to guide my programming (sic) that reads GenBank entries.


> Example: we should be able to parse out the topology of an entry
> (circular or linear sequence) and the references.  The topic is
> wider than most people have discussed so far, and I don't understand
> why GenBank has resisted creating the definition for so long. 

GenBank et al. can defend themselves, but such a definition does exist
and all you'd have had to do was ask...I did in 1982, that's how I
found out about it.  It has been updated regularly since.

> I made the suggestion that a parsable form with an associated
> DOCUMENT and DEFINITION is a requirement for the database AT LEAST 8
> YEARS AGO. 

Another person who wants to break even more existing software.  GB
staff are very conservative about changes.  This drives most
non-GB'ers nuts (because of course we know what's right ;->) but
they're right to be conservative on such matters.


> How many more years will we have to wait for this?

This (genbank-bb) is the place to discuss changes...I'm sure the GB
folk have their flameproof suits on whenever they go near a keyboard.
Fire away.

dan

--
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of
Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU

"Comparing bad weather to rape: 'if it's inevitable, just relax and
enjoy it'"  Clayton Williams, Republican candidate for Governor of
Texas. ...And THIS is the kind of person and attitude most Texans find
acceptable...in 1990...very sad.

Disclaimer: As always, I speak only for myself, and, usually, only to
myself.

kristoff@genbank.bio.net (David Kristofferson) (10/04/90)

Dan,

	No problem with flames.  I enjoy a bit of controversy as long
as everything is above board and honest (and it doesn't get too
time-consuming).  I am a bit too new on the GenBank block to explain
all of the old history, but others at GenBank can reply to Tom's
concerns.  I will point out though that a lot of GenBank's recent
efforts have been devoted to developing the RDBMS version of the
database instead of concentrating on the flat file version (although
we apparently concentrated on the flat file format just enough to tick
you off 8-)!!).  Please note that this is an off-the-cuff remark
though and a more authoritative response should come from LANL.

	I would suggest once again, however, that GenBank issues
should be directed to the GENBANK-BB (bionet.molbio.genbank) newsgroup
instead of the Genome newsgroup.  The posting addresses for that group
are:


Address					Location	Network
-------					--------	-------
genbankb@irlearn.ucd.ie			Ireland		EARN/BITNET
genbankb@uk.ac.daresbury		U.K.		JANET
genbank-bb@bmc.uu.se			Sweden		Internet
genbank-bb@genbank.bio.net		U.S.A. 		Internet/BITNET


The group can be accessed on USENET (bionet.molbio.genbank) or e-mail
subscriptions can be requested by mailing to one of the following
addresses (don't send subscription requests to the newsgroup addresses
above, please).


Address					Location	Network
-------					--------	-------
biosci@irlearn.ucd.ie			Ireland		EARN/BITNET
biosci@uk.ac.daresbury			U.K.		JANET
biosci@bmc.uu.se			Sweden		Internet
biosci@genbank.bio.net			U.S.A. 		Internet/BITNET
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net