MICPRF@lure.latrobe.edu.au (10/03/90)
Hi, Excuse my profound ignorance, but all this discussion about Genbank parsers and feature tables etc. has finally aroused my curiosity to the point where I simply must risk being flamed. Can someone please explain to me (first-grade level please!) what a feature table is, what a parser does with it and what one can use that for? If this is too elementary for general consumption, email to my kindergarten please at micprf@lure.latrobe.edu.au. Thanks, Paul Fisher.
kristoff@genbank.bio.net (David Kristofferson) (10/03/90)
In a nutshell, the GenBank features table is a list of "features" such as coding regions that are found in the accompanying sequence data. A parser is a computer program that attempts to decipher this information and transform it for use in computations. A simple example might be as follows. If the features table lists a coding region from base x to base y, a parser would automatically read this information and pass it to a translation program which would translate the sequence between these bases. Without a parser, a human would have to read the features table and input the starting and ending base into the translation program. Hope this clarifies things. -- Sincerely, Dave Kristofferson GenBank Manager kristoff@genbank.bio.net
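To make the coding-region example concrete, here is a minimal Python sketch. It assumes a simplified feature line of the form `CDS x..y` and a toy codon table covering only the codons in the sample sequence. The names `parse_cds_ranges` and `translate` and the sample data are made up for illustration and do not come from any GenBank software.

```python
import re

# Partial codon table, just enough for the toy sequence below;
# a real translation program would carry all 64 codons.
CODON_TABLE = {"ATG": "M", "GCT": "A", "AAA": "K", "TGG": "W", "TAA": "*"}

def parse_cds_ranges(feature_lines):
    """Return (start, end) pairs, 1-based inclusive, for simple 'CDS x..y' lines."""
    ranges = []
    for line in feature_lines:
        m = re.match(r"\s*CDS\s+(\d+)\.\.(\d+)\s*$", line)
        if m:
            ranges.append((int(m.group(1)), int(m.group(2))))
    return ranges

def translate(seq, start, end):
    """Translate bases start..end (1-based inclusive), stopping at a stop codon."""
    coding = seq[start - 1:end].upper()
    protein = []
    for i in range(0, len(coding) - 2, 3):
        # 'X' marks a codon that is missing from the toy table.
        aa = CODON_TABLE.get(coding[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical feature lines and sequence, invented for this sketch.
features = ["     source          1..20", "     CDS             4..18"]
sequence = "ccgatggctaaatggtaacc"

for start, end in parse_cds_ranges(features):
    print(start, end, translate(sequence, start, end))
```

The parser does the step a human would otherwise do by hand: it pulls the start and end base out of the features table and hands them to the translation routine.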
pkarp@NCBI.NLM.NIH.GOV (Peter Karp) (10/03/90)
Let me answer the question in a different way. From a computer science point of view the GenBank features table is a language: every possible feature is a sentence (string) constructed from an alphabet of symbols (characters). A grammar is a formal mechanism for describing languages in a very precise way: a grammar specifies very clearly what sentences are legal within that language and what sentences are not. Thus, by writing a grammar for the GenBank feature table, GenBank is telling us what the allowable syntax of features is -- without it we can only guess as to what strings we (and our programs) can expect to see in feature tables. A grammar is only a specification of a language -- it is not a computer program for recognizing the language. A parser is such a program. It takes as input a sentence in the language, and breaks the sentence down into its constituent parts (for example, a parser of English would break an English sentence down into verbs, nouns, adjectives, etc.). Parsers typically take different sorts of actions when they see different grammatical elements (an English parser would take one sort of action when it sees a verb and a different action when it sees a noun). Usually one cannot write a parser without referring to a grammar of the language. A parser generator is a program that automatically creates a parser program from a grammar (the Unix YACC program is a parser generator). The message here, as any computer scientist would have told you ten or fifteen years ago, is that anyone who defines a language (such as a text file that holds a database) that is to be used by a large number of people, will make those people's lives MUCH easier if they define a grammar for that language and make that grammar publicly available, preferably in a format that can be used by a publicly-available parser-generator program.
Without such a grammar people can only guess as to the syntax of the language, people certainly cannot generate parser programs automatically, and even writing parser programs by hand can be VERY difficult. Further, the very act of writing a grammar makes the authors of such a language think very hard about its properties: some languages are much easier than others for computers to parse, and often the act of writing a grammar will cause the language authors to transform the language into a new language that is easier to parse. By analogy, releasing a language without a grammar is like having a society whose legal system operates without written laws -- there are a lot of grey areas that will make many people very nervous. In summary, IF you release a computer database as a text file, PLEASE consider writing a grammar for the language that you are using. End of sermon.
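As an illustration of the grammar/parser distinction, the sketch below writes down a small grammar for location strings as a comment and then a hand-written recursive-descent parser that follows it rule by rule. The grammar is a toy invented for this example, not the official GenBank feature table grammar, and `parse_location` is a hypothetical name.

```python
import re

# Hypothetical toy grammar (NOT the official GenBank one):
#   location ::= range
#              | "complement" "(" location ")"
#              | "join" "(" location { "," location } ")"
#   range    ::= INT ".." INT

TOKEN = re.compile(r"\d+|\.\.|[(),]|[a-z]+")

def tokenize(text):
    tokens = TOKEN.findall(text)
    # Anything the token pattern skipped (other than spaces) is illegal.
    if "".join(tokens) != text.replace(" ", ""):
        raise ValueError("illegal character in %r" % text)
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, wanted):
        if self.peek() != wanted:
            raise ValueError("expected %r, found %r" % (wanted, self.peek()))
        self.pos += 1

    def location(self):
        tok = self.peek()
        if tok == "complement":
            self.pos += 1
            self.expect("(")
            inner = self.location()
            self.expect(")")
            return ("complement", inner)
        if tok == "join":
            self.pos += 1
            self.expect("(")
            parts = [self.location()]
            while self.peek() == ",":
                self.pos += 1
                parts.append(self.location())
            self.expect(")")
            return ("join", parts)
        return self.range()

    def range(self):
        start = self.integer()
        self.expect("..")
        return ("range", start, self.integer())

    def integer(self):
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise ValueError("expected an integer, found %r" % tok)
        self.pos += 1
        return int(tok)

def parse_location(text):
    parser = Parser(text)
    tree = parser.location()
    if parser.peek() is not None:
        raise ValueError("trailing input: %r" % parser.peek())
    return tree

print(parse_location("join(1..10,complement(20..30))"))
```

Each method corresponds to one rule of the grammar, which is why the parser is hard to write when no grammar has been published. A parser generator such as YACC would produce equivalent code from the grammar itself.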
toms@fcs260c2.ncifcrf.gov (Tom Schneider) (10/04/90)
I am happy to see a discussion on parsing GenBank after all these years. The feature table is only part of the problem. For example, entries of GenBank now end with a // (an idea taken from the EMBL database) so that programs can distinguish where entries end. Before Matt Yarus suggested this to me and I brought the suggestion to the GenBank staff, it was difficult to tell where an entry ended. Indeed, since there is no definition of GenBank, some programs produce hacked-up entry formats that do not end with a //, nor do they have the same format as is on the GenBank tapes. The authors of these programs don't understand that the output of their programs should have a // at the end, simply because there is no standard definition of the format. SUMMARY: we need a FULL DEFINITION of GenBank, not just the feature tables!!! Example: we should be able to parse out the topology of an entry (circular or linear sequence) and the references. The topic is wider than most people have discussed so far, and I don't understand why GenBank has resisted creating the definition for so long. I made the suggestion that a parsable form with an associated DOCUMENT and DEFINITION is a requirement for the database AT LEAST 8 YEARS AGO. How many more years will we have to wait for this? Tom Schneider National Cancer Institute Laboratory of Mathematical Biology Frederick, Maryland 21702-1201 toms@ncifcrf.gov
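A rough Python sketch of what the // terminator buys a program: entries can be split without guessing, and an unterminated final entry can be rejected. The topology check is an assumption made for this sketch (the word "circular" appearing on the LOCUS line), which is exactly the kind of guess a full format definition would remove. The function names and sample entries are invented.

```python
def split_entries(lines):
    """Yield each entry as a list of lines, using '//' as the terminator."""
    entry = []
    for line in lines:
        if line.strip() == "//":
            yield entry
            entry = []
        else:
            entry.append(line.rstrip("\n"))
    # Leftover non-blank lines mean the last entry never saw its '//'.
    if any(l.strip() for l in entry):
        raise ValueError("last entry is not terminated by //")

def topology(entry):
    """Assumed convention: 'circular' on the LOCUS line, otherwise linear."""
    for line in entry:
        if line.startswith("LOCUS"):
            return "circular" if "circular" in line.lower().split() else "linear"
    raise ValueError("entry has no LOCUS line")

# Made-up entries for illustration; not real GenBank records.
flat_file = """LOCUS       TOYPLASMID   20 bp    DNA   circular
ORIGIN
        1 ccgatggcta aatggtaacc
//
LOCUS       TOYGENE      12 bp    DNA
ORIGIN
        1 atggctaaat gg
//
"""

entries = list(split_entries(flat_file.splitlines()))
print([topology(e) for e in entries])
```

A program whose output omits the final // would fail the check in `split_entries`, which is the behaviour one wants if the terminator is part of the format.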
davison@MENUDO.UH.EDU (Dan Davison) (10/04/90)
> SUMMARY: we need a FULL DEFINITION of GenBank, not just the feature
> tables!!!
A complete definition is available from the GenBank staff at Los Alamos. Last I knew the name of the document was "dnaform.dc3". It's what I use to guide my programming (sic) that reads GenBank entries.
> Example: we should be able to parse out the topology of an entry
> (circular or linear sequence) and the references. The topic is
> wider than most people have discussed so far, and I don't understand
> why GenBank has resisted creating the definition for so long.
GenBank et al. can defend themselves, but such a definition does exist and all you'd have had to do was ask...I did in 1982, that's how I found out about it. It has been updated regularly since.
> I made the suggestion that a parsable form with an associated
> DOCUMENT and DEFINITION is a requirement for the database AT LEAST 8
> YEARS AGO.
Another person who wants to break even more existing software. GB staff are very conservative about changes. This drives most of us non-GB'ers nuts (because of course we know what's right ;->) but they're right to be conservative on such matters.
> How many more years will we have to wait for this?
This (genbank-bb) is the place to discuss changes...I'm sure the GB folk have their flameproof suits on whenever they go near a keyboard. Fire away.
dan
--
dr. dan davison/dept. of biochemical and biophysical sciences/univ. of Houston/4800 Calhoun/Houston,TX 77054-5500/davison@uh.edu/DAVISON@UHOU
"Comparing bad weather to rape: 'if it's inevitable, just relax and enjoy it'" Clayton Williams, Republican candidate for Governor of Texas. ...And THIS is the kind of person and attitude most Texans find acceptable...in 1990...very sad.
Disclaimer: As always, I speak only for myself, and, usually, only to myself.
kristoff@genbank.bio.net (David Kristofferson) (10/04/90)
Dan, No problem with flames. I enjoy a bit of controversy as long as everything is above board and honest (and it doesn't get too time-consuming). I am a bit too new on the GenBank block to explain all of the old history, but others at GenBank can reply to Tom's concerns. I will point out though that a lot of GenBank's recent efforts have been devoted to developing the RDBMS version of the database instead of concentrating on the flat file version (although we apparently concentrated on the flat file format just enough to tick you off 8-)!!). Please note that this is an off-the-cuff remark though and a more authoritative response should come from LANL. I would suggest once again, however, that GenBank issues should be directed to the GENBANK-BB (bionet.molbio.genbank) newsgroup instead of the Genome newsgroup. The posting addresses for that group are:
Address                      Location   Network
-------                      --------   -------
genbankb@irlearn.ucd.ie      Ireland    EARN/BITNET
genbankb@uk.ac.daresbury     U.K.       JANET
genbank-bb@bmc.uu.se         Sweden     Internet
genbank-bb@genbank.bio.net   U.S.A.     Internet/BITNET
The group can be accessed on USENET (bionet.molbio.genbank) or e-mail subscriptions can be requested by mailing to one of the following addresses (don't send subscription requests to the newsgroup addresses above, please).
Address                      Location   Network
-------                      --------   -------
biosci@irlearn.ucd.ie        Ireland    EARN/BITNET
biosci@uk.ac.daresbury       U.K.       JANET
biosci@bmc.uu.se             Sweden     Internet
biosci@genbank.bio.net       U.S.A.     Internet/BITNET
--
Sincerely, Dave Kristofferson GenBank Manager kristoff@genbank.bio.net