michael%genome@LANL.GOV (Michael J. Cinkosky) (09/25/90)
Tom Schneider writes: > In article <922@dimebox.cs.utexas.edu> read@cs.utexas.edu (Rob) writes: > >* Does anyone have a lexer/parser for the GenBank Feature Table? > >* Would anone want one if wrote one? > > This is an excellent idea. But it depends on a "Definition of Genbank", a > document that (to my knowledge) not ever made public. Has this been released? > Is it complete? Does it include a complete BNF of the entire data structure > (not just the features table)? Does it define allowed values of all parameters > in the structure? If these things are not in place and documented, your parser > is doomed eventually... (I spent many years attempting as a GenBank Advisor to > get them to define the database, and I don't think they have.) > > >Robert L. Read University of Texas > >read@cs.utexas.edu Center for High Performance > >(512)-477-1240 Computation > > Tom Schneider > National Cancer Institute > Laboratory of Mathematical Biology > Frederick, Maryland 21702-1201 Tom is only partly correct. There is not a bnf for the entire genbank format, but there is a bnf for the new feature table format ( the bnf is part of the feature table document). This is all that is required in order to write the lexer/parser originally requested. (However, it should be mentioned that Dan Davison has had trouble producing a lexical analyser based on that bnf (basically by converting the bnf into a lex file), and I am not aware of any other attempts. It is possible that the bnf presented in the feature table description document contains an error, but no one here has had a chance to chase down the problem.) Michael Cinkosky GenBank Computation Domain Leader Los Alamos National Laboratory Los Alamos, NM 87545
read@cs.utexas.edu (Rob) (10/02/90)
Hi. The Feature Table Grammar published in the document "The DDBJ/EMBL/GenBank Feature Table: Definition", Version 1.03 August 6,1990, is technically ambiguous. There are two legal parses of some strings. The ambiuities are related to the many values that a qualifier value may have. Some qualifiers (e.g. /number) have integer values. However, a single integer is also a legal location specifier. So we have two derivations: 0) simple_value-> location-> absolute_location-> local_location-> base_position-> INTEGER 1) simple_value-> INTEGER Similarly, the grammar expresses that a <simple-value> may be a symbol, but a symbol may also be a <feature-label>. Thus: 0) simple_value-> location-> feature_name-> feature_label-> SYMBOL 1) simple_value-> SYMBOL are both legal derivations of this grammar. These technical ambiguities can be resolved by making some assumptions: 0) That a lexer can discriminate the controlled vocabularies of enumerated qualifiers, and that there are no unquoted strings used for qualifier values that are not feature labels or from the controlled vocabulary. This allows us to replace <SYMBOL> in the rule for <simple-value> with <CONTROL>, a token for lexically recognized words in the controlled vocabularies. 1) That a lexer can discriminate qualifiers which take integer values. (This is not possible if a qualifier can have integer values and location values.) This allows us to remove the INTEGER token form the simple-value rule. These assumptions might be enforcible by the GenBank administration. Dr. Dan Davison (davison@menudo.uh.edu) provided me with the original bison grammar and the idea that the grammar was ambiguous. The grammar below shows my modifications. It "bisons" successfully with 10 shift/reduce errors and 0 reduce/reduce errors. --------------------------------------------------------------- Robert L. Read GENTOOLS PROJECT -- University of Texas Sarah K. Barron Center for High Performance Computing Matthew Witten read@cs.utexas.edu --------------------------------------------------------------- /* Cut Here */ /* WARNING -- This seemingly reasonable decision to make <SYMBOL>'s, <TEXT_STRINGS> and <LITERAL_SEQUENCES> lexically recognized tokens may have to be amended. I don't think text_strings and literal_sequences can be lexically distinguished. It is unclear how much of the decoding must be done in the parser and what can be done in the lexer. */ /* This feature table grammar was originally produce by Dan Davison of */ /* UH -- davison@menudo.uh.edu */ /* I have modified it in an attempt to disambiguate it as reasonably as */ /* as possible; I make some assumptions about how values are used which */ /* 0) might not be intended by GenBank */ /* 1) might not be enforced by GenBank */ /* See comments below for details */ /* Robert L. Read GenTools Project - UT-CHPC */ %{ #define YYSTYPE double #define YYDEBUG #include <stdio.h> #include <math.h> %} %token SYMBOL %token INTEGER %token ELLIPSE %token DOUBLE_COLON %token COMPLEMENT %token JOIN %token ORDER %token GROUP %token ONE_OF %token REPLACE %token TEXT_STRING %token HEADER_1 %token HEADER_2 %token INTEGER_QUAL /* This token is produced by the lexer for the hopefully finite set of qualifier names which can be followed by a location. */ %token CONTROL /* This token is should be provided by the lexer when a "controlled vocabulary" value is recognized. This may be a painful or incomplete context sensitive lexing task. In particular, it assumes that each qualifier that can have an unquoted string value has a finite set of values. */ %% /* grammer rules and actions follow */ feature_table: feature_table_header feature_table_body ; feature_table_header: HEADER_1 | HEADER_2 ; feature_table_body: feature | feature_table_body feature ; feature: feature_key feature_details ; feature_key: SYMBOL | '-' ; feature_details: location qualifier_list | location ; location: absolute_location | feature_name | '"' literal_sequence '"' | functional_operator '(' location_list ')' ; /* I think these rules as written made the parser look too far ahead. I expanded "path" in the rules below. */ absolute_location: local_location | SYMBOL ':' local_location | SYMBOL DOUBLE_COLON SYMBOL ; feature_name: SYMBOL DOUBLE_COLON SYMBOL ':' SYMBOL | SYMBOL ':' SYMBOL | SYMBOL ; literal_sequence: SYMBOL ; functional_operator: | COMPLEMENT | JOIN | ORDER | GROUP | ONE_OF | REPLACE ; location_list: location | location_list ',' location ; local_location: base_position | between_position | base_range ; /* These rules had to be expanded to remove reduce/reduce conflict path: database DOUBLE_COLON primary_accession | primary_accession ; feature_label: SYMBOL ; */ base_position: INTEGER | low_base_bound | high_base_bound | two_base_bound ; between_position: base_position '^' base_position ; base_range: base_position ELLIPSE base_position ; /* These rules had to be expanded to remove reduce/reduce error database: SYMBOL ; primary_accession: SYMBOL ; */ low_base_bound: '>' INTEGER ; high_base_bound: '<' INTEGER ; two_base_bound: base_position '.' base_position ; qualifier_list: qualifier | qualifier_list qualifier ; /* Here I am assuming the lexer can distinguish a location- parametered qualifier from others. */ qualifier: '/' qualifier_name | '/' qualifier_name '=' value | '/' integer_qualifier '=' INTEGER ; integer_qualifier: INTEGER_QUAL ; qualifier_name: SYMBOL ; value: simple_value | '(' value_list ')' | '(' tagged_value_list ')' ; /* In the published grammar, and integer qualifier value is ambiguous because: simple_value-> location-> absolute_location-> local_location-> base_position-> INTEGER is legal, so either that path must be changed or INTEGER must be excluded for simple_value. I therefore count (above) on being able to discriminate INTEGER valued qualifiers from others. Similarly, simple_value-> location-> feature_name-> feature_label-> SYMBOL, so SYMBOL cannot be part of legal value. I count on being able to lexically distinguish the the "controlled vocabularies" which are legal values for particular qualifiers from "symbols". */ simple_value: location | reference_number | '"' TEXT_STRING '"' | CONTROL ; value_list: value | value_list '.' value ; tagged_value_list: tagged_value | SYMBOL ':' '(' value_list ')' | SYMBOL ':' '(' tagged_value_list ')' | tagged_value_list ',' tagged_value ; /* This rule produces no reduce/reduce errors. We might use it if we wanted to allow INTEGERS as tagged values. tagged_value: tag ':' '(' value_list ')' | tag ':' '(' tagged_value_list ')' | tag ':' INTEGER | tag ':' '"' TEXT_STRING '"' | tag ':' SYMBOL | tag ':' reference_number ; */ tagged_value: tag ':' value ; tag: SYMBOL ; reference_number: '[' INTEGER ']' ; %% yyerror (s) char *s; { printf ("%s\n", s); }
davison@MENUDO.UH.EDU (Dan Davison) (10/02/90)
Howdy, all. I have one comment on the Feature Table Grammar by Rob Read. Since I don't know if he posted to all the relevant newgroups, at the cost additional bandwidth I will include his entire posting after excerpting the section I want to comment on. > /* I think these rules as written made the parser look too far > ahead. I expanded "path" in the rules below. */ > absolute_location: local_location > | SYMBOL ':' local_location > | SYMBOL DOUBLE_COLON SYMBOL > ; [...] There was a reason for looking very far ahead, and perhaps Mike C. or Dave B. of GenBank can comment on my reasoning. As I read the original document, I thought that there could arise situations where there were, in effect, "infinite forward references". (I think this was one of my comments on the document a year ago, but it might be that I never posted it). So that, in order to resolve an absolute location, you might have to read through the ENTIRE file and possibly another (even all) files to resolve the reference. I still think this is the case, but since Dave and Mike know my CS prowess (none) and the excellent job done on my hacking around by Rob, I will leave this to wiser heads. dna dna ps I used the March version of the feature table document, I think, not the August version as Rob did -- I haven't looked at the August version. -- dr. dan davison/dept. of biochemical and biophysical sciences/univ. of Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU Disclaimer: As always, I speak only for myself, and, usually, only to myself. =========================================================================== Rob's original posting in full: > > > Hi. The Feature Table Grammar published in the document > "The DDBJ/EMBL/GenBank Feature Table: Definition", Version 1.03 > August 6,1990, is technically ambiguous. There are two legal parses > of some strings. > > The ambiuities are related to the many values that a qualifier value > may have. Some qualifiers (e.g. /number) have integer values. However, > a single integer is also a legal location specifier. So we have two > derivations: > 0) simple_value-> > location-> > absolute_location-> > local_location-> > base_position-> > INTEGER > 1) simple_value-> > INTEGER > Similarly, the grammar expresses that a <simple-value> may be > a symbol, but a symbol may also be a <feature-label>. Thus: > 0) simple_value-> > location-> > feature_name-> > feature_label-> > SYMBOL > 1) simple_value-> > SYMBOL > are both legal derivations of this grammar. > > These technical ambiguities can be resolved by making some > assumptions: > 0) That a lexer can discriminate the controlled vocabularies > of enumerated qualifiers, and that there are no unquoted > strings used for qualifier values that are not feature labels > or from the controlled vocabulary. This allows us to > replace <SYMBOL> in the rule for <simple-value> with > <CONTROL>, a token for lexically recognized words in the > controlled vocabularies. > 1) That a lexer can discriminate qualifiers which take > integer values. (This is not possible if a qualifier can > have integer values and location values.) This allows us > to remove the INTEGER token form the simple-value rule. > > These assumptions might be enforcible by the GenBank administration. > > Dr. Dan Davison (davison@menudo.uh.edu) provided me with the > original bison grammar and the idea that the grammar was ambiguous. > > The grammar below shows my modifications. It "bisons" successfully > with 10 shift/reduce errors and 0 reduce/reduce errors. > > --------------------------------------------------------------- > Robert L. Read GENTOOLS PROJECT -- University of Texas > Sarah K. Barron Center for High Performance Computing > Matthew Witten > read@cs.utexas.edu > --------------------------------------------------------------- > > /* Cut Here */ > > /* WARNING -- This seemingly reasonable decision to make <SYMBOL>'s, > <TEXT_STRINGS> and <LITERAL_SEQUENCES> lexically recognized tokens > may have to be amended. I don't think text_strings and literal_sequences > can be lexically distinguished. It is unclear how much of the > decoding must be done in the parser and what can be done in the > lexer. > */ > > > > /* This feature table grammar was originally produce by Dan Davison of */ > /* UH -- davison@menudo.uh.edu */ > /* I have modified it in an attempt to disambiguate it as reasonably as */ > /* as possible; I make some assumptions about how values are used which */ > /* 0) might not be intended by GenBank */ > /* 1) might not be enforced by GenBank */ > /* See comments below for details */ > /* Robert L. Read GenTools Project - UT-CHPC */ > > %{ > #define YYSTYPE double > #define YYDEBUG > #include <stdio.h> > #include <math.h> > %} > %token SYMBOL > %token INTEGER > %token ELLIPSE > %token DOUBLE_COLON > %token COMPLEMENT > %token JOIN > %token ORDER > %token GROUP > %token ONE_OF > %token REPLACE > %token TEXT_STRING > %token HEADER_1 > %token HEADER_2 > %token INTEGER_QUAL /* This token is produced by the lexer for > the hopefully finite set of qualifier names > which can be followed by a location. */ > %token CONTROL /* This token is should be provided by > the lexer when a "controlled vocabulary" > value is recognized. This may be a > painful or incomplete context sensitive > lexing task. In particular, it assumes > that each qualifier that can have an unquoted > string value has a finite set of values. */ > %% /* grammer rules and actions follow */ > > feature_table: feature_table_header feature_table_body > ; > feature_table_header: HEADER_1 > | HEADER_2 > ; > feature_table_body: feature > | feature_table_body feature > ; > feature: feature_key feature_details > ; > feature_key: SYMBOL > | '-' > ; > feature_details: location qualifier_list > | location > ; > > location: absolute_location > | feature_name > | '"' literal_sequence '"' > | functional_operator '(' location_list ')' > ; > /* I think these rules as written made the parser look too far > ahead. I expanded "path" in the rules below. */ > absolute_location: local_location > | SYMBOL ':' local_location > | SYMBOL DOUBLE_COLON SYMBOL > ; > feature_name: SYMBOL DOUBLE_COLON SYMBOL ':' SYMBOL > | SYMBOL ':' SYMBOL > | SYMBOL > ; > literal_sequence: SYMBOL > ; > functional_operator: | COMPLEMENT > | JOIN > | ORDER > | GROUP > | ONE_OF > | REPLACE > ; > location_list: location > | location_list ',' location > ; > local_location: base_position > | between_position > | base_range > ; > > /* These rules had to be expanded to remove reduce/reduce conflict > path: database DOUBLE_COLON primary_accession > | primary_accession > ; > feature_label: SYMBOL > ; > */ > > base_position: INTEGER > | low_base_bound > | high_base_bound > | two_base_bound > ; > between_position: base_position '^' base_position > ; > base_range: base_position ELLIPSE base_position > ; > /* These rules had to be expanded to remove reduce/reduce error > database: SYMBOL > ; > primary_accession: SYMBOL > ; > */ > low_base_bound: '>' INTEGER > ; > high_base_bound: '<' INTEGER > ; > two_base_bound: base_position '.' base_position > ; > qualifier_list: qualifier > | qualifier_list qualifier > ; > /* Here I am assuming the lexer can distinguish a location- > parametered qualifier from others. */ > qualifier: '/' qualifier_name > | '/' qualifier_name '=' value > | '/' integer_qualifier '=' INTEGER > ; > integer_qualifier: INTEGER_QUAL > ; > qualifier_name: SYMBOL > ; > value: simple_value > | '(' value_list ')' > | '(' tagged_value_list ')' > ; > > /* In the published grammar, and integer qualifier value is > ambiguous because: > simple_value-> > location-> > absolute_location-> > local_location-> > base_position-> > INTEGER > is legal, so either that path must be changed or INTEGER must be > excluded for simple_value. I therefore count (above) on > being able to discriminate INTEGER valued qualifiers from others. > Similarly, > simple_value-> > location-> > feature_name-> > feature_label-> > SYMBOL, > so SYMBOL cannot be part of legal value. I count on being able > to lexically distinguish the the "controlled vocabularies" which > are legal values for particular qualifiers from "symbols". > */ > simple_value: location > | reference_number > | '"' TEXT_STRING '"' > | CONTROL > ; > value_list: value > | value_list '.' value > ; > tagged_value_list: tagged_value > | SYMBOL ':' '(' value_list ')' > | SYMBOL ':' '(' tagged_value_list ')' > | tagged_value_list ',' tagged_value > ; > > /* > This rule produces no reduce/reduce errors. We might use it if > we wanted to allow INTEGERS as tagged values. > tagged_value: tag ':' '(' value_list ')' > | tag ':' '(' tagged_value_list ')' > | tag ':' INTEGER > | tag ':' '"' TEXT_STRING '"' > | tag ':' SYMBOL > | tag ':' reference_number > ; > */ > tagged_value: tag ':' value > ; > tag: SYMBOL > ; > reference_number: '[' INTEGER ']' > ; > > %% > yyerror (s) > char *s; > { > printf ("%s\n", s); > } > >