[bionet.general] GETOB - A GenBank Features Extraction Program

frist@ccu.umanitoba.ca (05/04/91)
*****************************************************************************

              GETOB - A GenBank Feature Extraction Program

                         Brian Fristensky, PhD
                         University of Manitoba
                         Winnipeg, CANADA

*****************************************************************************

Since there have been a number of requests recently for programs
that can extract subsequences using the GenBank/EMBL/DDBJ Features Table 
Format, I am posting this announcement to those who may benefit from 
such a program. GETOB is a Pascal program that can evaluate any legal
FEATURES expression, and write the result to an output file. It is an extremely
versatile and complete tool that was written as part of my XYLEM package.

The XYLEM package is a general-purpose set of tools for maintaining and
manipulating complete databases (e.g. GenBank, PIR, EMBL, VecBase, LiMB) or
subsets of these databases. XYLEM makes it possible to retrieve entries from
databases by name, accession number or keyword; to create database subsets; and
to extract features from GenBank entries. Tools are also provided to perform
operations on database subsets, such as randomization or translation. Partly
due to my teaching load, and partly due to keeping up with changes in the
Features definition, I don't yet have all the documentation written to formally
release XYLEM to the world at large. I hope to be able to do this in the next
couple of months. However, to those who need a GenBank parser NOW, this may be
your ticket.

-------------------
WHAT GETOB CAN DO

Given a namefile containing GenBank LOCUS names or ACCESSION numbers, such as
 
            POTPR1A 
            POTPSTH2
            POTPSTH21 
            POTSTHA  

and an instruction file containing commands telling GETOB what features to
get (e.g. CDS, mRNA, stem_loop, prim_transcript; in short, any legal feature),
GETOB will create those features for each entry listed in namefile, and write
them to an output file (outfile). For example, the CDS (protein coding
sequence) for the first sequence specified in namefile would be written to
outfile as:

           >POTPR1A:1
           atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
           agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
           aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
           aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
           ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
           ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
           gaagataag
           ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
           agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
           aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
           gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
           tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
           aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
           atccgttaccgagcctacattcagcctgttgatgcttcaaaatga

A second file, containing a log of operations performed in evaluating the
Features expression, would look like this:

            GETOB          Version 0.94       1 May 1991

            POTPR1A:1
                 join           
                     (
                          295                 603

                          1011                 1355

                     )


            /note="pathogenesis-related protein (prp1)"
            /codon_start=295
            //----------------------------------------------
  
In the example, the CDS from entry POTPR1A has been written in two chunks,
corresponding to the two exon portions of the coding sequence. Each location
retrieved in constructing the object is written as a separate chunk of
sequence. This makes it easier to see the different pieces that were put
together to form the final product. (One additional point: outfile format is
directly compatible with Bill Pearson's FASTA programs.)
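
The chunk boundaries can be checked against the log by hand. As a minimal
sketch (in Python, not part of GETOB or XYLEM; the function name and the toy
sequence are made up for illustration), evaluating a join() over 1-based,
inclusive locations amounts to slicing out each span and keeping the pieces
in order:

```python
def join_chunks(sequence, spans):
    """Return one chunk of sequence per (start, end) span.

    Positions are 1-based and inclusive, as in the Features Table.
    """
    chunks = []
    for start, end in spans:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"span {start}..{end} lies outside the sequence")
        chunks.append(sequence[start - 1:end])
    return chunks


# Toy data for illustration only, not a real GenBank entry.
toy = "aaaccccggggtttt"
pieces = join_chunks(toy, [(4, 7), (12, 15)])
print(pieces)            # one chunk per location
print("".join(pieces))   # the assembled object

# The two locations from the POTPR1A log give the chunk lengths to expect.
potpr1a_spans = [(295, 603), (1011, 1355)]
chunk_lengths = [end - start + 1 for start, end in potpr1a_spans]
print(chunk_lengths)
```

For the POTPR1A locations this gives 309 and 345 bases, which agrees with the
two chunks shown in outfile (six full 50-base lines plus 9 and 45 bases,
respectively).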

Comparing the message file with outfile lets you check the results by hand.
This can be very important for catching 1) errors in the database (not
uncommon, but the GenBank staff will never be able to find and correct these
errors unless YOU report them!), 2) errors in the program (none that I know
about, but this is a very complex program), and 3) things you don't really
want (e.g. pseudogenes, which will be documented in the qualifier lines).
 
GETOB can also be used to get specific labeled objects from a given entry.
Examples:


                @k30576:polyprotein
                @k30576:/label=polyprotein
                @x10345:/product="hsp70"
                @j00879:group(1..2200,mutation_37)

The first two constructs given above are equivalent. Both will extract the
feature labeled polyprotein. The third construct shows that any qualifier
(here /product=) can be specified. If none is specified, as in the first
example, then /label= is assumed. One limitation, however, is that the label
sought must be unique within the entry in its first 15 characters, including
double quotes ("). Otherwise, only the first matching label expression will be
evaluated. Finally, the last example shows that a mutant sequence can be
constructed by first specifying an expression that evaluates to a sequence
(i.e. 1..2200) and then a labeled expression that, upon evaluation, uses
replace() to modify that sequence. The usages shown in examples 3 and 4 above
represent extensions to the DDBJ/EMBL/GenBank Features Table Format.
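
To make the defaulting rule concrete, here is a rough sketch in Python of how
the four constructs could be split into their parts. This is NOT how GETOB
itself parses them (GETOB is written in Pascal); the function name and the
returned tuple layout are assumptions made for this illustration, and the
group() expression is only recognized, not evaluated.

```python
def parse_object_spec(spec):
    """Split '@entry:target' into (entry, kind, qualifier, value)."""
    if not spec.startswith("@"):
        raise ValueError("object specification must start with '@'")
    entry, colon, target = spec[1:].partition(":")
    if not (entry and colon and target):
        raise ValueError("expected '@entry:target'")
    if target.startswith("/"):
        # Explicit qualifier, e.g. /label=polyprotein or /product="hsp70".
        qualifier, equals, value = target[1:].partition("=")
        if not (qualifier and equals):
            raise ValueError("expected '/qualifier=value'")
        return entry, "labeled", qualifier, value
    if "(" in target:
        # An operator expression such as group(...); left unevaluated here.
        return entry, "expression", None, target
    # A bare name: /label= is assumed.
    return entry, "labeled", "label", target


for spec in ("@k30576:polyprotein",
             "@k30576:/label=polyprotein",
             '@x10345:/product="hsp70"',
             "@j00879:group(1..2200,mutation_37)"):
    print(parse_object_spec(spec))
```

The first two specifications come out identical, which is the equivalence
described above.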


-------------------
CAVEATS THAT MAY KEEP YOU FROM GOING RIGHT OUT AND GETTING THE PROGRAMS

1) At present, the only documentation available is in the form of Unix manual
   pages. In the future, I hope to have a readable User's Guide available, with
   lots of examples. I simply haven't had time to write this yet!

2) To run GETOB, you need to compile the entire XYLEM Package. XYLEM has so far
   only been run under SUN and ATT SysV Unix. For systems on which the C shell
   is available, shell scripts are provided that automate the installation
   process.

3) GETOB reads entries in GenBank flat-file (tape) format, with one 
   qualification: annotation and sequence are split between separate files,
   with an index file logically linking the two. Thus, to use GETOB you have
   to be able to generate a database subset containing canonical GenBank
   flatfile entries, and then use splitdb to reorganize the data as
   described.  A future version of GETOB may be able to directly read GenBank
   files with both annotation and sequence together, but for now you need
   to do things within the context of XYLEM. In fact, XYLEM is intended for
   maintenance of complete databases (GenBank, EMBL, PIR, VecBase, LiMB), and
   you may wish to use it as your general database package. 

4) At present, all XYLEM programs are written to behave as Unix commands,
   which makes them not terribly user-friendly. For example, the syntax for
   getob is
 
   getob [-frcn] infile namefile anofile seqfile indfile message outfile

   An X-Windows front end is currently under development, and a menu-driven
   shell script is provided that makes retrieval of database entries very
   easy.

5) Non-Unix systems
   These programs adhere religiously to Standard Pascal, and are written with
   portability in mind. The only non-standard features are file I/O calls,
   which are carefully isolated in specialized subroutines. A programmer's
   guide is included with step-by-step instructions for adapting the
   programs to different Pascal compilers, with the help of Tom Schneider's
   module program.

6) At present, GETOB only works with GenBank entries. Since EMBL and DDBJ
   also use the same Features format, it should be easy to adapt GETOB to
   read these databases as well. I just haven't gotten around to doing it yet.
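
For anyone who wants to drive getob from a script, the syntax line given in
the caveats above can be wrapped along these lines. This is only a sketch in
Python: the helper name and the example file names are invented, and it
assumes the options in [-frcn] may be combined into a single argument such as
-fr. The meaning of each option is not covered here; see the manual pages.

```python
# Positional file arguments, in the order given in the getob syntax line.
GETOB_FILE_ARGS = ("infile", "namefile", "anofile", "seqfile",
                   "indfile", "message", "outfile")


def getob_command(files, flags=""):
    """Build an argument list for getob from a dict of the seven file names."""
    missing = [name for name in GETOB_FILE_ARGS if name not in files]
    if missing:
        raise ValueError("missing file arguments: " + ", ".join(missing))
    unknown = set(flags) - set("frcn")
    if unknown:
        raise ValueError("flags outside [-frcn]: " + "".join(sorted(unknown)))
    argv = ["getob"]
    if flags:
        # Assumption: options can be bundled after a single dash.
        argv.append("-" + flags)
    argv.extend(files[name] for name in GETOB_FILE_ARGS)
    return argv


# Hypothetical file names, chosen only for this example.
example_files = {
    "infile": "cds.inf",
    "namefile": "potato.nam",
    "anofile": "potato.ano",
    "seqfile": "potato.seq",
    "indfile": "potato.ind",
    "message": "potato.msg",
    "outfile": "potato.out",
}
print(" ".join(getob_command(example_files)))
```

The resulting list can be handed to subprocess.run() on a system where getob
is installed and on the search path.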

--------------------------
HOW TO GET GETOB AND XYLEM
 
The XYLEM Package can be obtained by anonymous FTP from ccu.umanitoba.ca.
XYLEM is stored as a compressed tar file in the directory

/var/spool/ftp/pub/psgendb

If you don't know how to use FTP, send me a message and I'll mail you 
instructions.

===============================================================================
Brian Fristensky                |  
Department of Plant Science     | "There's a big ... machine in the sky...
University of Manitoba          | some kind of electric snake... coming
Winnipeg, MB R3T 2N2  CANADA    | straight at us." 
frist@ccu.umanitoba.ca          | "Shoot it," said my attorney.
Office phone:   204-474-6085    |"Not yet," I said,"I want to study its habits"
FAX:            204-275-5128    |H.S. Thompson, FEAR & LOATHING IN LAS VEGAS
===============================================================================