[bionet.molbio.genome-program] bnf for genbank

michael%genome@LANL.GOV (Michael J. Cinkosky) (09/25/90)

Tom Schneider writes:

> In article <922@dimebox.cs.utexas.edu> read@cs.utexas.edu (Rob) writes:
> >*  Does anyone have a lexer/parser for the GenBank Feature Table?
> >*  Would anone want one if wrote one?
> 
> This is an excellent idea.  But it depends on a "Definition of Genbank", a
> document that (to my knowledge) not ever made public.  Has this been released?
> Is it complete?  Does it include a complete BNF of the entire data structure
> (not just the features table)?  Does it define allowed values of all parameters
> in the structure?  If these things are not in place and documented, your parser
> is doomed eventually...  (I spent many years attempting as a GenBank Advisor to
> get them to define the database, and I don't think they have.)
> 
> >Robert L. Read				University of Texas
> >read@cs.utexas.edu			Center for High Performance
> >(512)-477-1240				Computation
> 
> Tom Schneider
> National Cancer Institute
> Laboratory of Mathematical Biology
> Frederick, Maryland  21702-1201

Tom is only partly correct.  There is not a bnf for the entire genbank
format, but there is a bnf for the new feature table format ( the bnf is
part of the feature table document).  This is all that is
required in order to write the lexer/parser originally requested.
(However, it should be mentioned that Dan Davison has had trouble
producing a lexical analyser based on that bnf (basically by converting
the bnf into a lex file), and I am not aware of any other attempts.
It is possible that the bnf presented in the feature table description
document contains an error, but no one here has had a chance to chase
down the problem.)

Michael Cinkosky
GenBank Computation Domain Leader
Los Alamos National Laboratory
Los Alamos, NM  87545

read@cs.utexas.edu (Rob) (10/02/90)

Hi.  The Feature Table Grammar published in the document
"The DDBJ/EMBL/GenBank Feature Table: Definition", Version 1.03
August 6,1990, is technically ambiguous.  There are two legal parses
of some strings.

The ambiuities are related to the many values that a qualifier value
may have.  Some qualifiers (e.g. /number) have integer values.  However,
a single integer is also a legal location specifier.  So we have two
derivations:
0) 	 	simple_value->
		location->
		absolute_location->
		local_location->
		base_position->
		INTEGER
1)		simple_value->
		INTEGER
Similarly, the grammar expresses that a <simple-value> may be
a symbol, but a symbol may also be a <feature-label>.  Thus:
0)		simple_value->
		location->
		feature_name->
		feature_label->
		SYMBOL
1)		simple_value->
		SYMBOL
are both legal derivations of this grammar.

These technical ambiguities can be resolved by making some
assumptions:
	0)  That a lexer can discriminate the controlled vocabularies
	of enumerated qualifiers, and that there are no unquoted 
	strings used for qualifier values that are not feature labels
	or from the controlled vocabulary.  This allows us to 
	replace <SYMBOL> in the rule for <simple-value> with
	<CONTROL>, a token for lexically recognized words in the
	controlled vocabularies.
	1)  That a lexer can discriminate qualifiers which take 
	integer values.  (This is not possible if a qualifier can
	have integer values and location values.)  This allows us
	to remove the INTEGER token form the simple-value rule.

These assumptions might be enforcible by the GenBank administration.

Dr. Dan Davison (davison@menudo.uh.edu) provided me with the
original bison grammar and the idea that the grammar was ambiguous.

The grammar below shows my modifications.  It "bisons" successfully 
with 10 shift/reduce errors and 0 reduce/reduce errors.

---------------------------------------------------------------
Robert L. Read 		GENTOOLS PROJECT -- University of Texas
Sarah K. Barron		Center for High Performance Computing
Matthew Witten
read@cs.utexas.edu
---------------------------------------------------------------

/* Cut Here */

/* WARNING -- This seemingly reasonable decision to make <SYMBOL>'s,
<TEXT_STRINGS> and <LITERAL_SEQUENCES> lexically recognized tokens
may have to be amended.  I don't think text_strings and literal_sequences
can be lexically distinguished.  It is unclear how much of the
decoding must be done in the parser and what can be done in the
lexer.
*/



/*  This feature table grammar was originally produce by Dan Davison of   */
/*  UH  -- davison@menudo.uh.edu 					   */
/*  I have modified it in an attempt to disambiguate it as reasonably as   */
/*  as possible; I make some assumptions about how values are used which   */
/*     0)  might not be intended by GenBank				   */
/*     1)  might not be enforced by GenBank			           */
/*  See comments below for details					   */
/*  Robert L. Read 	GenTools Project - UT-CHPC			   */

%{
#define YYSTYPE double
#define YYDEBUG
#include <stdio.h>
#include <math.h>
%}
%token SYMBOL
%token INTEGER
%token ELLIPSE
%token DOUBLE_COLON
%token COMPLEMENT
%token JOIN
%token ORDER
%token GROUP
%token ONE_OF
%token REPLACE
%token TEXT_STRING
%token HEADER_1
%token HEADER_2
%token INTEGER_QUAL    /* This token is produced by the lexer for
			the hopefully finite set of qualifier names
			which can be followed by a location. */
%token CONTROL  	/* This token is should be provided by
			the lexer when a "controlled vocabulary"
			value is recognized.  This may be a
			painful or incomplete context sensitive 
			lexing task.   In particular, it assumes
			that each qualifier that can have an unquoted
			string value has a finite set of values.  */
%% /* grammer rules and actions follow */

feature_table:		  feature_table_header feature_table_body
;
feature_table_header:	  HEADER_1
			| HEADER_2
;
feature_table_body:	  feature
			| feature_table_body feature
;
feature:		  feature_key feature_details
;
feature_key:		  SYMBOL
			| '-'
;
feature_details:	  location qualifier_list
			| location
;
			 
location: 		  absolute_location
			|  feature_name
			| '"' literal_sequence '"'
			| functional_operator '(' location_list ')'
;
/* I think these rules as written made the parser look too far
ahead.  I expanded "path" in the rules below. */
absolute_location:	  local_location
			| SYMBOL ':' local_location
			| SYMBOL DOUBLE_COLON SYMBOL
;
feature_name:		  SYMBOL DOUBLE_COLON SYMBOL ':' SYMBOL
			| SYMBOL ':' SYMBOL 
			| SYMBOL
;
literal_sequence:	  SYMBOL
;
functional_operator:	| COMPLEMENT
			| JOIN
			| ORDER
			| GROUP
			| ONE_OF
			| REPLACE
;
location_list:		  location
			| location_list ',' location
;
local_location:		  base_position
			| between_position
			| base_range
;

/* These rules had to be expanded to remove reduce/reduce conflict 
path:			  database DOUBLE_COLON primary_accession
			| primary_accession
;
feature_label:		  SYMBOL
;
*/

base_position:		  INTEGER
			| low_base_bound
			| high_base_bound
			| two_base_bound
;
between_position:	  base_position '^' base_position
;
base_range:		  base_position ELLIPSE base_position
;
/* These rules had to be expanded to remove reduce/reduce error 
database:		  SYMBOL
;
primary_accession:	  SYMBOL
;
*/
low_base_bound:		  '>' INTEGER
;
high_base_bound:	  '<' INTEGER
;
two_base_bound:		  base_position '.' base_position
;
qualifier_list:		  qualifier
			| qualifier_list qualifier
;
/* Here I am assuming the lexer can distinguish a location-
parametered qualifier from others. */
qualifier:		  '/' qualifier_name
			| '/' qualifier_name '=' value
			| '/' integer_qualifier '=' INTEGER
;
integer_qualifier:	  INTEGER_QUAL
;
qualifier_name:		  SYMBOL
;
value:			  simple_value
			| '(' value_list ')'
			| '(' tagged_value_list ')'
;

/* In the published grammar, and integer qualifier value is
ambiguous because:
	 	simple_value->
		location->
		absolute_location->
		local_location->
		base_position->
		INTEGER
is legal, so either that path must be changed or INTEGER must be
excluded for simple_value.  I therefore count (above) on
being able to discriminate INTEGER valued qualifiers from others.
Similarly,
		simple_value->
		location->
		feature_name->
		feature_label->
		SYMBOL,
so SYMBOL cannot be part of legal value.  I count on being able
to lexically distinguish the the "controlled vocabularies" which
are legal values for particular qualifiers from "symbols".
*/
simple_value:		  location
			| reference_number 
			| '"' TEXT_STRING '"'
			| CONTROL
;
value_list:		  value
			| value_list '.' value
;
tagged_value_list:	  tagged_value
			| SYMBOL ':' '(' value_list ')'
			| SYMBOL ':' '(' tagged_value_list ')'
			| tagged_value_list ',' tagged_value
;

/*
This rule produces no reduce/reduce errors.  We might use it if
we wanted to allow INTEGERS as tagged values.
tagged_value:		  tag ':' '(' value_list ')'
			| tag ':' '(' tagged_value_list ')'
			| tag ':' INTEGER
			| tag ':' '"' TEXT_STRING '"'
			| tag ':' SYMBOL
			| tag ':' reference_number
;
*/
tagged_value:		  tag ':' value
;
tag:			  SYMBOL
;
reference_number:	  '[' INTEGER ']'
;

%%
yyerror (s)
	char *s;
{
	printf ("%s\n", s);
}

davison@MENUDO.UH.EDU (Dan Davison) (10/02/90)

Howdy, all.  

I have one comment on the Feature Table Grammar by Rob Read.

Since I don't know if he posted to all the relevant newgroups, at the
cost additional bandwidth I will include his entire posting after
excerpting the section I want to comment on.


> /* I think these rules as written made the parser look too far
> ahead.  I expanded "path" in the rules below. */
> absolute_location:	  local_location
> 			| SYMBOL ':' local_location
> 			| SYMBOL DOUBLE_COLON SYMBOL
> ;
[...]


There was a reason for looking very far ahead, and perhaps Mike C. or
Dave B. of GenBank can comment on my reasoning.

As I read the original document, I thought that there could arise
situations where there were, in effect, "infinite forward references".
(I think this was one of my comments on the document a year ago, but
it might be that I never posted it). So that, in order to resolve an
absolute location, you might have to read through the ENTIRE file and
possibly another (even all) files to resolve the reference.

I still think this is the case, but since Dave and Mike know my CS
prowess (none) and the excellent job done on my hacking around by Rob,
I will leave this to wiser heads.



dna dna
ps I used the March version of the feature table document, I think,
not the August version as Rob did -- I haven't looked at the August
version.

--
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of
Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
Disclaimer: As always, I speak only for myself, and, usually, only to
myself.
===========================================================================
Rob's original posting in full:

> 
> 
> Hi.  The Feature Table Grammar published in the document
> "The DDBJ/EMBL/GenBank Feature Table: Definition", Version 1.03
> August 6,1990, is technically ambiguous.  There are two legal parses
> of some strings.
> 
> The ambiuities are related to the many values that a qualifier value
> may have.  Some qualifiers (e.g. /number) have integer values.  However,
> a single integer is also a legal location specifier.  So we have two
> derivations:
> 0) 	 	simple_value->
> 		location->
> 		absolute_location->
> 		local_location->
> 		base_position->
> 		INTEGER
> 1)		simple_value->
> 		INTEGER
> Similarly, the grammar expresses that a <simple-value> may be
> a symbol, but a symbol may also be a <feature-label>.  Thus:
> 0)		simple_value->
> 		location->
> 		feature_name->
> 		feature_label->
> 		SYMBOL
> 1)		simple_value->
> 		SYMBOL
> are both legal derivations of this grammar.
> 
> These technical ambiguities can be resolved by making some
> assumptions:
> 	0)  That a lexer can discriminate the controlled vocabularies
> 	of enumerated qualifiers, and that there are no unquoted 
> 	strings used for qualifier values that are not feature labels
> 	or from the controlled vocabulary.  This allows us to 
> 	replace <SYMBOL> in the rule for <simple-value> with
> 	<CONTROL>, a token for lexically recognized words in the
> 	controlled vocabularies.
> 	1)  That a lexer can discriminate qualifiers which take 
> 	integer values.  (This is not possible if a qualifier can
> 	have integer values and location values.)  This allows us
> 	to remove the INTEGER token form the simple-value rule.
> 
> These assumptions might be enforcible by the GenBank administration.
> 
> Dr. Dan Davison (davison@menudo.uh.edu) provided me with the
> original bison grammar and the idea that the grammar was ambiguous.
> 
> The grammar below shows my modifications.  It "bisons" successfully 
> with 10 shift/reduce errors and 0 reduce/reduce errors.
> 
> ---------------------------------------------------------------
> Robert L. Read 		GENTOOLS PROJECT -- University of Texas
> Sarah K. Barron		Center for High Performance Computing
> Matthew Witten
> read@cs.utexas.edu
> ---------------------------------------------------------------
> 
> /* Cut Here */
> 
> /* WARNING -- This seemingly reasonable decision to make <SYMBOL>'s,
> <TEXT_STRINGS> and <LITERAL_SEQUENCES> lexically recognized tokens
> may have to be amended.  I don't think text_strings and literal_sequences
> can be lexically distinguished.  It is unclear how much of the
> decoding must be done in the parser and what can be done in the
> lexer.
> */
> 
> 
> 
> /*  This feature table grammar was originally produce by Dan Davison of   */
> /*  UH  -- davison@menudo.uh.edu 					   */
> /*  I have modified it in an attempt to disambiguate it as reasonably as   */
> /*  as possible; I make some assumptions about how values are used which   */
> /*     0)  might not be intended by GenBank				   */
> /*     1)  might not be enforced by GenBank			           */
> /*  See comments below for details					   */
> /*  Robert L. Read 	GenTools Project - UT-CHPC			   */
> 
> %{
> #define YYSTYPE double
> #define YYDEBUG
> #include <stdio.h>
> #include <math.h>
> %}
> %token SYMBOL
> %token INTEGER
> %token ELLIPSE
> %token DOUBLE_COLON
> %token COMPLEMENT
> %token JOIN
> %token ORDER
> %token GROUP
> %token ONE_OF
> %token REPLACE
> %token TEXT_STRING
> %token HEADER_1
> %token HEADER_2
> %token INTEGER_QUAL    /* This token is produced by the lexer for
> 			the hopefully finite set of qualifier names
> 			which can be followed by a location. */
> %token CONTROL  	/* This token is should be provided by
> 			the lexer when a "controlled vocabulary"
> 			value is recognized.  This may be a
> 			painful or incomplete context sensitive 
> 			lexing task.   In particular, it assumes
> 			that each qualifier that can have an unquoted
> 			string value has a finite set of values.  */
> %% /* grammer rules and actions follow */
> 
> feature_table:		  feature_table_header feature_table_body
> ;
> feature_table_header:	  HEADER_1
> 			| HEADER_2
> ;
> feature_table_body:	  feature
> 			| feature_table_body feature
> ;
> feature:		  feature_key feature_details
> ;
> feature_key:		  SYMBOL
> 			| '-'
> ;
> feature_details:	  location qualifier_list
> 			| location
> ;
> 			 
> location: 		  absolute_location
> 			|  feature_name
> 			| '"' literal_sequence '"'
> 			| functional_operator '(' location_list ')'
> ;
> /* I think these rules as written made the parser look too far
> ahead.  I expanded "path" in the rules below. */
> absolute_location:	  local_location
> 			| SYMBOL ':' local_location
> 			| SYMBOL DOUBLE_COLON SYMBOL
> ;
> feature_name:		  SYMBOL DOUBLE_COLON SYMBOL ':' SYMBOL
> 			| SYMBOL ':' SYMBOL 
> 			| SYMBOL
> ;
> literal_sequence:	  SYMBOL
> ;
> functional_operator:	| COMPLEMENT
> 			| JOIN
> 			| ORDER
> 			| GROUP
> 			| ONE_OF
> 			| REPLACE
> ;
> location_list:		  location
> 			| location_list ',' location
> ;
> local_location:		  base_position
> 			| between_position
> 			| base_range
> ;
> 
> /* These rules had to be expanded to remove reduce/reduce conflict 
> path:			  database DOUBLE_COLON primary_accession
> 			| primary_accession
> ;
> feature_label:		  SYMBOL
> ;
> */
> 
> base_position:		  INTEGER
> 			| low_base_bound
> 			| high_base_bound
> 			| two_base_bound
> ;
> between_position:	  base_position '^' base_position
> ;
> base_range:		  base_position ELLIPSE base_position
> ;
> /* These rules had to be expanded to remove reduce/reduce error 
> database:		  SYMBOL
> ;
> primary_accession:	  SYMBOL
> ;
> */
> low_base_bound:		  '>' INTEGER
> ;
> high_base_bound:	  '<' INTEGER
> ;
> two_base_bound:		  base_position '.' base_position
> ;
> qualifier_list:		  qualifier
> 			| qualifier_list qualifier
> ;
> /* Here I am assuming the lexer can distinguish a location-
> parametered qualifier from others. */
> qualifier:		  '/' qualifier_name
> 			| '/' qualifier_name '=' value
> 			| '/' integer_qualifier '=' INTEGER
> ;
> integer_qualifier:	  INTEGER_QUAL
> ;
> qualifier_name:		  SYMBOL
> ;
> value:			  simple_value
> 			| '(' value_list ')'
> 			| '(' tagged_value_list ')'
> ;
> 
> /* In the published grammar, and integer qualifier value is
> ambiguous because:
> 	 	simple_value->
> 		location->
> 		absolute_location->
> 		local_location->
> 		base_position->
> 		INTEGER
> is legal, so either that path must be changed or INTEGER must be
> excluded for simple_value.  I therefore count (above) on
> being able to discriminate INTEGER valued qualifiers from others.
> Similarly,
> 		simple_value->
> 		location->
> 		feature_name->
> 		feature_label->
> 		SYMBOL,
> so SYMBOL cannot be part of legal value.  I count on being able
> to lexically distinguish the the "controlled vocabularies" which
> are legal values for particular qualifiers from "symbols".
> */
> simple_value:		  location
> 			| reference_number 
> 			| '"' TEXT_STRING '"'
> 			| CONTROL
> ;
> value_list:		  value
> 			| value_list '.' value
> ;
> tagged_value_list:	  tagged_value
> 			| SYMBOL ':' '(' value_list ')'
> 			| SYMBOL ':' '(' tagged_value_list ')'
> 			| tagged_value_list ',' tagged_value
> ;
> 
> /*
> This rule produces no reduce/reduce errors.  We might use it if
> we wanted to allow INTEGERS as tagged values.
> tagged_value:		  tag ':' '(' value_list ')'
> 			| tag ':' '(' tagged_value_list ')'
> 			| tag ':' INTEGER
> 			| tag ':' '"' TEXT_STRING '"'
> 			| tag ':' SYMBOL
> 			| tag ':' reference_number
> ;
> */
> tagged_value:		  tag ':' value
> ;
> tag:			  SYMBOL
> ;
> reference_number:	  '[' INTEGER ']'
> ;
> 
> %%
> yyerror (s)
> 	char *s;
> {
> 	printf ("%s\n", s);
> }
> 
>