frist@ccu.umanitoba.ca (05/04/91)
*****************************************************************************
GETOB - A GenBank Feature Extraction Program
Brian Fristensky, PhD
University of Manitoba
Winnipeg, CANADA
*****************************************************************************
Since there have been a number of requests recently for programs
that can extract subsequences using the GenBank/EMBL/DDBJ Features Table
Format, I am posting this announcement to those who may benefit from
such a program. GETOB is a Pascal program that can evaluate any legal
FEATURES expression, and write the result to an output file. It is an extremely
versatile and complete tool that was written as part of my XYLEM package.
The XYLEM package is a general purpose set of tools for maintaining and
manipulating complete databases (eg. GenBank, PIR, EMBL, VecBase, LiMB) or
subsets of these databases. XYLEM makes it possible to retrieve entries from
databases by name, accession number or keyword; to create database subsets; and
to extract features from GenBank entries. Tools are also provided to perform
operations on database subsets, such as randomization or translation. Partly
due to my
teaching load, and partly due to keeping up with changes in the Features
definition, I don't yet have all the documentation written to formally release
XYLEM to the world at large. I hope to be able to do this in the next couple of
months. However, to those who need a GenBank parser NOW, this may be your
ticket.
-------------------
WHAT GETOB CAN DO
Given a namefile containing GenBank LOCUS names or ACCESSION numbers, such as
POTPR1A
POTPSTH2
POTPSTH21
POTSTHA
and an instruction file containing commands telling GETOB what features to
get (eg. CDS, mRNA, stem_loop, prim_transcript... in short, any legal feature),
GETOB will create those features for each entry listed in namefile, and write
them to an output file (outfile). For example, the CDS (protein coding
sequence) for the first sequence specified in namefile would be written to
outfile as:
>POTPR1A:1
atggcagaagtgaagttgcttggtctaaggtatagtccttttagccatag
agttgaatgggctctaaaaattaagggagtgaaatatgaatttatagagg
aagatttacaaaataagagccctttacttcttcaatctaatccaattcac
aagaaaattccagtgttaattcacaatggcaagtgcatttgtgagtctat
ggtcattcttgaatacattgatgaggcatttgaaggcccttccattttgc
ctaaagacccttatgatcgcgctttagcacgattttgggctaaatacgtc
gaagataag
ggggcagcagtgtggaaaagtttcttttcgaaaggagaggaacaagagaa
agctaaagaggaagcttatgagatgttgaaaattcttgataatgagttca
aggacaagaagtgctttgttggtgacaaatttggatttgctgatattgtt
gcaaatggtgcagcactttatttgggaattcttgaagaagtatctggaat
tgttttggcaacaagtgaaaaatttccaaatttttgtgcttggagagatg
aatattgcacacaaaacgaggaatattttccttcaagagatgaattgctt
atccgttaccgagcctacattcagcctgttgatgcttcaaaatga
A second file (the message file), containing a log of operations performed in
evaluating the Feature expression, would look like this:
GETOB Version 0.94 1 May 1991
POTPR1A:1
join
(
295 603
1011 1355
)
/note="pathogenesis-related protein (prp1)"
/codon_start=295
//----------------------------------------------
In the example, the CDS from entry POTPR1A has been written in two chunks,
corresponding to the two exon portions of the coding sequence. Each location
retrieved in constructing the object is written as a separate chunk of
sequence. This makes it easier to see the different pieces that were put
together to form the final product. (One additional point: outfile format is
directly compatible with Bill Pearson's FASTA programs.)
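The chunk-per-location behaviour described above can be illustrated with a
short Python sketch. This is not GETOB's code (GETOB is written in Pascal);
the function names extract_join and format_chunks are hypothetical, and the
50-character line width is an assumption based on the example output.

```python
def extract_join(sequence, locations):
    """Return one chunk per (start, end) location.

    Coordinates are 1-based and inclusive, as in the Features Table.
    """
    chunks = []
    for start, end in locations:
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError("location %d..%d is outside the sequence" % (start, end))
        chunks.append(sequence[start - 1:end])
    return chunks


def format_chunks(name, chunks, width=50):
    """Write a FASTA-style record in which each chunk starts on a new line.

    The width of 50 is an assumption taken from the example output.
    """
    lines = [">" + name]
    for chunk in chunks:
        for i in range(0, len(chunk), width):
            lines.append(chunk[i:i + width])
    return "\n".join(lines)


if __name__ == "__main__":
    toy_sequence = "acgtacgtac"
    pieces = extract_join(toy_sequence, [(2, 4), (7, 10)])
    print(format_chunks("TOY:1", pieces))
```

With the locations from the log above, the spans 295..603 and 1011..1355 cover
309 and 345 bases respectively.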
Comparing the message file with outfile makes human intervention possible.
This can be very important in catching 1) errors in the database (not uncommon,
but the GenBank staff will never be able to find and correct these errors
unless YOU report them!), 2) errors in the program (none that I know about, but
this is a very complex program), and 3) things you don't really want
(eg. pseudogenes, which will be documented in the qualifier lines).
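Part of this comparison could be automated. The Python sketch below checks
that the number of bases written to outfile for each object equals the total
length of the locations recorded in the message file. It assumes both files
look like the single example shown above (an object name such as POTPR1A:1 on
its own line, one "start end" pair per location, and "//" closing each log
entry). The function names are hypothetical, and operators other than a
simple join are not handled.

```python
import re

# An object name on its own line, eg. POTPR1A:1 (qualifier lines start with "/").
NAME_RE = re.compile(r"^([^/\s]\S*:\d+)$")
# A location line holding two integers, eg. "295 603".
SPAN_RE = re.compile(r"^(\d+)\s+(\d+)$")


def spans_from_message(text):
    """Map each object name in the message file to its (start, end) pairs."""
    spans = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        name = NAME_RE.match(line)
        span = SPAN_RE.match(line)
        if name:
            current = name.group(1)
            spans[current] = []
        elif span and current is not None:
            spans[current].append((int(span.group(1)), int(span.group(2))))
        elif line.startswith("//"):
            current = None
    return spans


def lengths_from_outfile(text):
    """Map each '>name' record in outfile to its total sequence length."""
    lengths = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = 0
        elif line and current is not None:
            lengths[current] += len(line)
    return lengths


def mismatches(message_text, outfile_text):
    """Return (name, expected, found) for objects whose lengths disagree."""
    lengths = lengths_from_outfile(outfile_text)
    bad = []
    for name, locs in spans_from_message(message_text).items():
        expected = sum(end - start + 1 for start, end in locs)
        if lengths.get(name) != expected:
            bad.append((name, expected, lengths.get(name)))
    return bad
```

An empty list from mismatches() only says that the lengths agree. Judging
whether a feature is one you actually want (eg. not a pseudogene) still means
reading the qualifier lines.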
GETOB can also be used to get specific labeled objects from a given entry.
Examples:
@k30576:polyprotein
@k30576:/label=polyprotein
@x10345:/product="hsp70"
@j00879:group(1..2200,mutation_37)
The first two constructs given above are equivalent. Both will extract the
feature called polyprotein. The third construct shows that any qualifier
(here, /product=) can be used to identify a feature. If no qualifier is
specified, as in the first example, then /label= is assumed. One limitation,
however, is that the label sought must be
unique within the entry in its first 15 characters including double quotes (").
Otherwise, only the first matching label expression will be evaluated.
Finally, the last example shows that a mutant sequence can be constructed
by first specifying an expression that evaluates to a sequence (ie. 1..2200)
and then a labeled expression that upon evaluation, uses replace() to modify
that sequence. The usages shown in examples 3 & 4 above represent extensions to
the DDBJ/EMBL/GenBank Features Table Format.
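To make the defaulting rule concrete, here is a minimal Python sketch of how
the first three forms of reference might be split into entry, qualifier and
value. It is an illustration only, not GETOB's parser: parse_reference is a
hypothetical name, and operator expressions such as the group(...) form in the
fourth example are deliberately left out.

```python
def parse_reference(ref):
    """Split '@entry:target' into (entry, qualifier, value).

    A target without a leading '/' is treated as a /label= value.
    """
    if not ref.startswith("@"):
        raise ValueError("reference must start with '@': %r" % ref)
    entry, sep, target = ref[1:].partition(":")
    if not sep or not entry or not target:
        raise ValueError("expected '@entry:target': %r" % ref)
    if target.startswith("/"):
        qualifier, eq, value = target[1:].partition("=")
        if not eq or not qualifier:
            raise ValueError("expected '/qualifier=value': %r" % ref)
        return entry, qualifier, value
    if "(" in target:
        # Operator expressions (eg. group(...)) are outside this sketch.
        raise NotImplementedError("operator expressions are not handled")
    return entry, "label", target


if __name__ == "__main__":
    for ref in ("@k30576:polyprotein",
                "@k30576:/label=polyprotein",
                '@x10345:/product="hsp70"'):
        print(parse_reference(ref))
```

The first two references yield the same result, which is the equivalence
described above.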
-------------------
CAVEATS THAT MAY KEEP YOU FROM GOING RIGHT OUT AND GETTING THE PROGRAMS
1) At present, the only documentation available is in the form of Unix manual
pages. In the future, I hope to have a readable User's Guide available, with
lots of examples. I simply haven't had time to write this yet!
2) To run GETOB, you need to compile the entire XYLEM Package. XYLEM has so far
only been run under SUN and ATT SysV Unix. For systems on which the c-shell
is available, shell scripts are provided that automate the installation
process.
3) GETOB reads entries in GenBank flat-file (tape) format, with one
qualification: annotation and sequence are split between separate files,
with an index file logically linking the two. Thus, to use GETOB you have
to be able to generate a database subset containing canonical GenBank
flatfile entries, and then use splitdb to reorganize the data as
described. A future version of GETOB may be able to directly read GenBank
files with both annotation and sequence together, but for now you need
to do things within the context of XYLEM. In fact, XYLEM is intended for
maintenance of complete databases (GenBank, EMBL, PIR, VecBase, LiMB), and
you may wish to use it as your general database package.
4) At present, all XYLEM programs are written to behave as Unix commands,
which makes them not terribly user-friendly. For example, the syntax for
getob is
getob [-frcn] infile namefile anofile seqfile indfile message outfile
An X-Windows front end is currently under development, and a menu-driven
shell script is provided that makes retrieval of database entries very
easy.
5) Non-Unix systems
These programs adhere religiously to Standard Pascal, and are written with
portability in mind. The only non-standard features are file I/O calls,
which are carefully isolated in specialized subroutines. A programmer's
guide is included with step-by-step instructions for adapting the
programs to different Pascal compilers, with the help of Tom Schneider's
module program.
6) At present, GETOB only works with GenBank entries. Since EMBL and DDBJ
also use the same Features format, it should be easy to adapt GETOB to
read these databases as well. I just haven't gotten around to doing it yet.
--------------------------
HOW TO GET GETOB AND XYLEM
The XYLEM Package can be obtained by anonymous FTP from ccu.umanitoba.ca.
XYLEM is stored as a compressed tar file in the directory
/var/spool/ftp/pub/psgendb
If you don't know how to use FTP, send me a message and I'll mail you
instructions.
===============================================================================
Brian Fristensky |
Department of Plant Science | "There's a big ... machine in the sky...
University of Manitoba | some kind of electric snake... coming
Winnipeg, MB R3T 2N2 CANADA | straight at us."
frist@ccu.umanitoba.ca | "Shoot it," said my attorney.
Office phone: 204-474-6085 |"Not yet," I said,"I want to study its habits"
FAX: 204-275-5128 |H.S. Thompson, FEAR & LOATHING IN LAS VEGAS
===============================================================================