[bionet.molbio.genbank] obtaining Genbank

jim@crom2.uucp (James P. H. Fuller) (03/30/91)

     What's the best strategy for obtaining Genbank if you're stuck out 
here in uucp-only-land?  I finally found the right button to push to get
the EMBL mail server to send me prices and order forms for the tapes and
CDs they distribute; is there a similar address for Genbank information?
I'm looking for the data in a format searchable by fasta as distributed
by the University of Virginia (fasta.shar).  I presently have a bionet
feed paid for out of personal funds and it's painful to see the updates
scroll past and then get dropped on the floor by C news expire after only
a week on the system.  
    I'm not averse to spending a reasonable amount of money to obtain
Genbank, but I'm not interested in spending thousands just to get a
proprietary format and a self-proclaimed "user-friendly" front end.  I thought
(briefly) of trying to get it from UH via the Princeton BITFTP server but
suspect that if I tried to get anything that huge via BITFTP the people
at Princeton would blackball me for the rest of eternity (and quite right,
too.)  I'm not asking the net.people to do my homework for me, but I would
greatly appreciate any bright ideas and/or pointers to appropriate sources
of information.  Remember, I can't FTP.
                                              Thanks very much,
                                              James P. H. Fuller

---------------------------------- -----------------------------------------
 crom2 Athens Public Access Unix  |  i486 AT, 16 mg RAM, 600 mg online
                                  |  AT&T Unix System V release 3.2
 Molecular Biology                |  Tbit PEP 19200bps - V.32 - V.42/V.42bis
 Population Biology               |     
 Ecological Modelling             |  Admin: James P. H. Fuller
 Bionet/Usenet/cnews/nn           |  {jim,root}%crom2@nstar.rn.com
---------------------------------- -----------------------------------------

kristoff@genbank.bio.net (David Kristofferson) (04/01/91)

GenBank can be obtained on tapes, CDROM or floppies for very
reasonable prices.  You can call us at 415-962-7364 for details.  In
general these types of questions can be directed to our main address
genbank@genbank.bio.net.

				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (04/01/91)

P.S. - In regards to FASTA searching of GenBank I'll post our e-mail
server instructions for your reading enjoyment.

----------------------------------------------------------------------
FASTA Server Help
  
GenBank now offers the FASTA program for nucleic acid sequence and
protein similarity searching of sequence databases.  You can access
the GenBank FASTA Server through a number of different networks,
including Internet, BITNET, EARN, NETNORTH and JANET.

The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank.  A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their
paper:
 
	Pearson, W.R. and Lipman, D.J. 1988.  Improved Tools for 
	Biological Sequence Comparison.  Proc. Natl. Acad. Sci., 
	85: 2444-2448.
  
If you use FASTA as a research tool, we ask that this reference be
cited in your paper. The results of the FASTA search will be returned
to your local mail file as soon as they are processed and can be saved
in a separate disk file.

The following databases are currently available for FASTA searches:

   Designator                  Database
   ----------                  --------
   GenBank/all                 Latest GenBank quarterly release PLUS 
                               sequences added since last release.
   GenBank/new                 GenBank sequences added since last release.
   GenBank/primate             GenBank subdivisions
   GenBank/rodent
   GenBank/other_mammalian
   GenBank/other_vertebrate
   GenBank/invertebrate
   GenBank/plant
   GenBank/organelle
   GenBank/bacterial
   GenBank/structural_rna
   GenBank/viral
   GenBank/phage
   GenBank/synthetic
   GenBank/unannotated

   GenPept/all		       Translated protein reading frames from
			       the latest GenBank release.  Note that
			       GenPept contains translations only of
			       reading frames that are explicitly
			       mentioned in the GenBank sequence entry
			       annotations!
   GenPept/new		       Translated protein reading frames from
			       GenBank daily updates (translated from 
			       GenBank/new).

   EMBL/all                    Latest EMBL Data Library release PLUS
                               sequences added since last release.
   EMBL/new                    EMBL sequences added since last release.
   EMBL/bacteriophage	       EMBL subdivisions
   EMBL/fungi
   EMBL/invertebrate
   EMBL/organelle
   EMBL/other_mammalian
   EMBL/other_vertebrate
   EMBL/plant
   EMBL/primate
   EMBL/prokaryote
   EMBL/rodent
   EMBL/synthetic
   EMBL/unannotated
   EMBL/viral


   SWISS-PROT/all              All of the SWISS-PROT protein database.

GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database.  GenPept is produced by GenBank and
consists of translations of open reading frames as documented in the
sequence entry annotations.


Accessing the FASTA program

To access the program, send an electronic mail message containing the 
formatted query sequence (as described below) to the following Internet 
address:

	SEARCH@GENBANK.BIO.NET   

If you are not on Internet, you may need to change the format of the 
address.  Consult your systems manager to determine the correct address.

Obtaining Help

If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message.  Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line). For
additional help on using FASTA, contact GenBank at (415) 962-7307 or
send an electronic mail message to the address:

	CONSULTANT@GENBANK.BIO.NET

Formatting a Query

Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search.  The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below.  These lines are typed into the body of
the mail message in the order shown below:

 Search 
Parameter	Mandatory			Explanation

DATALIB		   Yes		This line specifies the database to be 
				searched (as described in the beginning of
				this text) for the query sequence and must 
				be included in the message.  
KTUP		   No		This line identifies the Ktup value which 
				specifies the sensitivity of the search. 
				Values range between 3 and 6 for nucleic acid
				searches and between 1 and 2 for protein 
				searches. Lower values specify more sensitive 
				searches but require more time to complete.  
				For DNA sequences longer than 200 base pairs, 
				use a Ktup value of 4 or greater; lower values
 				are unnecessary and take longer to complete.  
				Protein searches will benefit from having a 
				Ktup value of 1 if you expect significant 
				matches with evolutionary amino acid 
				replacements but few exact amino acid matches.
				The default value is 4 for nucleic acids and
				1 for proteins.
SCORES		   No		This line specifies the number of best-ranked 
				sequences to be listed in the results.  The 
				default value is 100.
ALIGNMENTS	   No		This line identifies the maximum number of 
				best-ranked sequences to be aligned in the 
				results.  The default value is 20.
BEGIN		   Yes		This line must be included in the message.  No 
				other information is typed on it.

The remainder of the message contains the query sequence in either
Pearson FASTA format or in IntelliGenetics format.

Preparing Files for Similarity Searches

Only one sequence query is allowed per mail query.  The query sequence
that you would like searched in the database must be contained in its
own file.  Your sequence file must be in either Pearson format or
IntelliGenetics format.  GenBank database file format is not currently
accepted; however, it is possible to use an editor to change the file
to Pearson format as described below.  Note: all lines must be less
than 80 characters in length; longer lines will be truncated.

Pearson Format

Pearson is the preferred format to use for query sequences.  The format 
includes a mandatory comment line beginning with a greater-than sign ">" 
followed by the name of the sequence, a space, and an optional note 
about the sequence.  The sequence data begin on the next line without 
the greater-than sign.  For example:

>AGMREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg 
tttccatttactttggattatacgtcattataaa

IntelliGenetics Format

If your sequence was derived using one of the IntelliGenetics programs,
it can be used for a FASTA search.  Comment lines are optional and
begin with a semi-colon ";".  The name of the sequence and the
sequence data appear on separate lines without a semicolon.  At the
end of the sequence data a number must follow to indicate if the
sequence is linear (1) or circular (2).  For example:

;Monkey SV40-like genomic segment promoting transcription.
AGMREP4
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg 
tttccatttactttggattatacgtcattataaa1
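
As a rough illustration of how the two accepted formats relate, the
following Python sketch rewrites an IntelliGenetics-style entry as Pearson
format.  It is not part of the GenBank software; the function name and the
60-column line width are assumptions made for this example, and it only
handles the simple layout described above (optional ";" comments, a name
line, then sequence data ending in 1 or 2).

```python
def intelligenetics_to_pearson(text, width=60):
    """Return (pearson_text, topology) for one IntelliGenetics-style entry."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Comment lines start with ";"; the first one becomes the Pearson note.
    comments = [line[1:].strip() for line in lines if line.startswith(";")]
    rest = [line for line in lines if not line.startswith(";")]
    if len(rest) < 2:
        raise ValueError("expected a name line followed by sequence data")
    name, data = rest[0], "".join(rest[1:])
    # The final character says whether the sequence is linear or circular.
    if data[-1] not in ("1", "2"):
        raise ValueError("sequence must end with 1 (linear) or 2 (circular)")
    topology = "linear" if data[-1] == "1" else "circular"
    residues = data[:-1]
    header = f">{name} {comments[0]}" if comments else f">{name}"
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + body), topology


entry = """;Monkey SV40-like genomic segment promoting transcription.
AGMREP4
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa1
"""
pearson, topology = intelligenetics_to_pearson(entry)
print(pearson)
print(topology)
```

Pearson format has no field for the linear/circular flag, so the sketch
returns it separately rather than discarding it silently.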

GenBank Flat-File Format

GenBank database file format is NOT accepted for query searches.  The
files contain annotation data and residue numbers that cannot be
recognized by FASTA.  For example:

(annotation data)
 1 ccccttcaaa tctattacaa ggtgagcgtc tcgccaaggc aatgaaatcg caatatgatg  
61 taaccttgcg ctttggatta gacggactgt taaacggcaa
           
These files can be used only if they are changed to follow Pearson
format.  The files must be stripped of annotation data and the numbers
in the sequence; the mandatory comment line (starting with ">") must
then be added.
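
The stripping step described above can be done in an editor, or with a few
lines of code.  The Python sketch below is one possible way to do it and is
not a GenBank tool; the function name, the entry name MYSEQ and the note
text are hypothetical, and the caller is assumed to have already removed
the annotation data so that only the numbered sequence lines remain.

```python
import re


def genbank_to_pearson(name, note, sequence_lines, width=60):
    """Convert numbered GenBank sequence lines into Pearson format text."""
    # Keep only letters: this drops the residue numbers and the spaces
    # between the blocks of ten residues.
    residues = "".join(re.sub(r"[^A-Za-z]", "", line) for line in sequence_lines)
    header = f">{name} {note}".rstrip()
    if len(header) >= 80:
        raise ValueError("comment line must be less than 80 characters")
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + body)


flat_file_lines = [
    " 1 ccccttcaaa tctattacaa ggtgagcgtc tcgccaaggc aatgaaatcg caatatgatg",
    "61 taaccttgcg ctttggatta gacggactgt taaacggcaa",
]
print(genbank_to_pearson("MYSEQ", "hypothetical example entry", flat_file_lines))
```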

Sending the Query Sequence

Use your local mail program to send GenBank your query sequence.  Most
mail programs allow you to import a file into the mail message.  You
can import your sequence file into the mail message on the line after
"Begin".  Please follow the format in the following example of a FASTA
request PRECISELY, but note that the program is case-insensitive, i.e.
either upper or lower case letters may be used.

This is an example of a mail message sent for a FASTA search.  Note that 
the first four lines are a mail header that is automatically created 
when you address a mail message.  Nothing need be entered for the 
Subject.  Each line of information must be less than 80 
characters in length.  Longer lines will be truncated.  

From:  drbob@someaddress.somewhere.edu Tue Jun 14 21:36:38 1988
Date:  14 Jun 1988 2129:02-PDT
To:    SEARCH@GENBANK.BIO.NET  
Subject:  

The text that you enter into the body of the message begins with DATALIB 
(do not add blank lines in the message):

DATALIB GenBank/other_mammalian
KTUP 4
SCORES 100
ALIGNMENTS 20
BEGIN
>BOVPRL GenBank entry BOVPRL from gbmam file. 907 nucleotides.
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa
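
For those who prefer to generate the message body from a script, here is a
minimal Python sketch that assembles the lines in the order shown above and
checks the Ktup ranges and the 80-character limit given in this help text.
The function name is made up for this example, and treating GenPept and
SWISS-PROT as the protein databases is inferred from the database list;
actually mailing the result is left to your local mail program.

```python
PROTEIN_PREFIXES = ("genpept/", "swiss-prot/")  # protein databases in the list


def build_fasta_query(datalib, sequence_text, ktup=None, scores=None,
                      alignments=None):
    """Return the body of a FASTA Server request as a single string."""
    lines = [f"DATALIB {datalib}"]
    if ktup is not None:
        is_protein = datalib.lower().startswith(PROTEIN_PREFIXES)
        low, high = (1, 2) if is_protein else (3, 6)
        if not low <= ktup <= high:
            raise ValueError(f"KTUP must be between {low} and {high} for {datalib}")
        lines.append(f"KTUP {ktup}")
    if scores is not None:
        lines.append(f"SCORES {scores}")
    if alignments is not None:
        lines.append(f"ALIGNMENTS {alignments}")
    lines.append("BEGIN")
    # Blank lines are not allowed in the message, so drop any from the sequence.
    lines.extend(line.rstrip() for line in sequence_text.splitlines() if line.strip())
    for line in lines:
        if len(line) >= 80:
            raise ValueError(f"line has {len(line)} characters; must be under 80")
    return "\n".join(lines) + "\n"


query = """>BOVPRL example query
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa
"""
print(build_fasta_query("GenBank/other_mammalian", query,
                        ktup=4, scores=100, alignments=20))
```

Optional lines that are left out simply fall back to the server defaults
described in the parameter table.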

The sequence is then sent to the FASTA Server at GenBank.  Once your
message is received, it is placed in a batch queue and processed in
the order it is received.  Two queues called the fast and slow queues
process FASTA requests.  The slow queue handles nucleic acid searches
of "genbank/all" and "embl/all."  All other requests are placed in the
fast queue.  Searches submitted to the fast queue require less CPU
time and are completed more quickly than those sent to the slow queue.
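
The queue rule above is simple enough to state in code.  This Python
fragment is only a restatement of the paragraph, not the server's actual
logic; the function name is hypothetical.

```python
SLOW_QUEUE_DATABASES = {"genbank/all", "embl/all"}


def queue_for(datalib):
    """Return "slow" or "fast" for a DATALIB designator (case-insensitive)."""
    if datalib.strip().lower() in SLOW_QUEUE_DATABASES:
        return "slow"
    return "fast"


for name in ("GenBank/all", "EMBL/all", "GenBank/primate", "SWISS-PROT/all"):
    print(name, "->", queue_for(name))
```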

If you would like to know the status of the queues being processed,
you can send a mail message to the FASTA Server address
(SEARCH@GENBANK.BIO.NET) containing the word "QUEUE" on a single line
of the mail message (Leave the Subject field blank). 
The fast queue is labeled with the letter "d"; the slow queue is
labeled with "e".

You cannot have more than one search waiting in the slow queue at any
one time.  If you send an additional search to the slow queue before
your first request has been processed, the initial search will be
cancelled.  At MOST you can have one executing search and one waiting
job in the slow queue at the same time.  Multiple jobs are currently
permitted in the fast queue but please limit your zeal since others
also use the service.  For example, submitting ten jobs simultaneously
to the fast queue would definitely be in bad taste.  We would prefer
it if, after submitting 2 - 4 jobs to the fast queue, you wait until
your results are received before submitting additional runs.

 
Handling the Results of a FASTA Search

When the results are returned, use your local mail program to retrieve
them.  You can transfer the results of a FASTA search to a separate
disk file to free up space in your mail directory.  Consult the
documentation for your local mail program for the commands to transfer
and read mail.  If you wish to obtain sequences of interest, use the
e-mail retrieval server mentioned below or the IRX searching system
available through the GenBank On-line Service.  Contact GenBank for
details (415-962-7364).


Interpreting the Results of a FASTA Search

The mail message returned after the FASTA search will contain the 
sequence name and length, the database searched, and the scoring matrix 
used.  When searching all of GenBank, each subdivision of GenBank will 
also be displayed.

To achieve a rapid yet sensitive search, the FASTA program uses a 
hierarchy of steps to determine scores for the sequences searched in the 
database.  There are cut off points in each of the scoring steps so that 
only high scoring sequences are used in subsequent searching steps.  

Three scores are tallied and reported:  INITN, INIT1,  and OPT.  Each of 
these scores is assigned to a sequence based on its rank at a specific 
point in the similarity searching process.

In comparing the query sequence to a sequence in the database, the 
following steps are taken to determine the three scores:

1.  First, the ktup value is used to establish a matrix for comparing 
    sequences.  A value of 4 for a nucleic acid means that each group of 
    4 consecutive residues of the query sequence and the database  
    sequence will be compared. The sequences are compared on two 
    perpendicular axes, and a diagonal line is created wherever runs 
    of ktup matching residues occur between the two sequences.
2.  By joining match regions along the same diagonal that are not 
    separated by excessive mismatches, initial regions of high similarity 
    are identified.  The 10 best diagonal regions of high similarity are 
    used for further analysis.  
3.  An INIT1 score is then assigned to each region of high similarity.
4.  Next, FASTA attempts to join regions on the diagonal and assign 
    them an INITN score.  The INITN score is determined by adding the 
    INIT1 scores of the two regions to be joined and subtracting a 
    constant value of 20 as a joining penalty.  If the combined value 
    is less than the INIT1 score of either region, the regions are 
    not joined.  In this case, the INITN score will be equal to the 
    INIT1 score of each region.  Only the sequences that have an INITN 
    score above a set cutoff point are kept for possible alignment.
5.  Sequences with the highest INITN scores are then used for a 
    Needleman-Wunsch/Smith-Waterman alignment to determine their OPT score.  
    The OPT score is used to evaluate the alignments produced by FASTA.    
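
To make steps 1 and 4 more concrete, here is a small Python sketch.  It is
a simplified illustration and not the FASTA source code: it only counts
exact ktup matches per diagonal (no scoring matrix and no mismatch
handling), and the joining function simply applies the penalty of 20 as
described above.  Function names are invented for this example, and
returning the larger INIT1 score when regions are not joined is an
assumption about how the unjoined case is reported.

```python
from collections import defaultdict

JOIN_PENALTY = 20  # joining penalty stated in step 4


def ktup_diagonal_hits(query, target, ktup):
    """Count exact ktup-word matches on each diagonal.

    A diagonal is identified by (position in target - position in query).
    """
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    hits = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in positions.get(target[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


def joined_score(init1_a, init1_b, penalty=JOIN_PENALTY):
    """Apply the joining rule to the INIT1 scores of two regions."""
    combined = init1_a + init1_b - penalty
    # The regions are not joined if joining would score below either one alone.
    if combined < max(init1_a, init1_b):
        return max(init1_a, init1_b)
    return combined


print(ktup_diagonal_hits("ACGTACGT", "ACGTACGT", 4))
print(joined_score(84, 58))  # joining pays off
print(joined_score(96, 10))  # penalty outweighs the second region
```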

A histogram of the score distributions for both the INITN and INIT1 
scores will be displayed in the results.  The score value is given in 
the left column and the number of sequences that were in that interval 
is displayed in the two columns to the right.  In the following example, 
there were 377 sequences with INITN scores that were greater than 12 but 
less than or equal to 16.  In the graphic histogram, "+" and "-" 
characters are used to distinguish the bars for INITN and INIT1 scores, 
respectively, if the numbers of scores differ.

Example:

     initn  init1
<  4    16    16:========
   8     0     0:
  12     1     1:=
  16   377   377:==================================================
  20  1272  1272:==================================================
  24  2224  2224:==================================================
  28  2717  2717:==================================================
  32  3147  3147:==================================================
  36  2921  2921:==================================================
  40  2064  2064:==================================================
  44  1243  1243:==================================================
  48   568   568:==================================================
  52   269   269:==================================================
  56   105   105:==================================================
  60    43    43:======================
  64    21    22:===========
  68     7     7:====
  72     3     3:==
  76    18    19:---------+
  80    11    11:======
  84    16    17:--------+
  88     8     8:====
  92     0     0:
  96     1     1:=
 100     0     0:
 104     1     0:+
 108     0     0:
 112     0     0:
 116     1     0:+
 120     0     0:
 124     1     0:+
 128     0     0:
 132     0     0:
 136     1     1:=
 140     0     0:
 144     0     0:
 148     0     0:
 152     0     0:
 156     0     0:
 160     0     0:
>160     0     0:

KEY:	 +	initn scores
	 -	init1 scores
	 =	no. of initn scores same as no. of init1 scores
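
If you save the results to a file, the histogram rows can be read back
with a short script.  The following Python sketch is an assumption about
how one might do this, based only on the layout of the example above; the
function name is hypothetical, and it lists the intervals where the INITN
and INIT1 counts differ.

```python
def parse_histogram(text):
    """Return a list of (interval_label, initn_count, init1_count) tuples."""
    rows = []
    for line in text.splitlines():
        left, sep, _bar = line.partition(":")
        if not sep:
            continue  # header or blank line
        fields = left.split()
        # The first row is printed as "<  4", so rejoin the sign and number.
        if len(fields) > 1 and fields[0] in ("<", ">"):
            fields = [fields[0] + fields[1]] + fields[2:]
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # not a histogram row (for example the KEY line)
        rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


sample = """     initn  init1
<  4    16    16:========
  60    43    43:======================
  76    18    19:---------+
 104     1     0:+
>160     0     0:
"""
rows = parse_histogram(sample)
print(rows)
print([label for label, initn, init1 in rows if initn != init1])
```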


The statistics of the search will be given after the histogram, including 
the total number of residues in the database searched, the number of 
sequences searched, the average INITN and INIT1 scores with their 
respective standard deviations, the number of scores that were above the 
cutoff value, the value for ktup, and the value for fact.

searched 19156002 residues in 17047 sequences
mean initn score:  31.8 (s.d.= 8.44)
mean init1 score:  31.8 (s.d.= 8.44)
161 scores better than 55 saved, ktup: 4, fact: 4
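
The statistics lines have a regular layout, so they can be picked apart
with regular expressions.  This Python sketch assumes the wording shown in
the example above; the function name and dictionary keys are made up for
illustration.

```python
import re


def parse_search_stats(text):
    """Extract the numbers from the statistics lines of a FASTA result."""
    stats = {}
    m = re.search(r"searched\s+(\d+)\s+residues\s+in\s+(\d+)\s+sequences", text)
    if m:
        stats["residues"] = int(m.group(1))
        stats["sequences"] = int(m.group(2))
    for name in ("initn", "init1"):
        m = re.search(
            rf"mean\s+{name}\s+score:\s*([\d.]+)\s*\(s\.d\.=\s*([\d.]+)\)", text)
        if m:
            stats[f"mean_{name}"] = float(m.group(1))
            stats[f"sd_{name}"] = float(m.group(2))
    m = re.search(
        r"(\d+)\s+scores\s+better\s+than\s+(\d+)\s+saved,"
        r"\s*ktup:\s*(\d+),\s*fact:\s*(\d+)", text)
    if m:
        stats["saved"], stats["cutoff"], stats["ktup"], stats["fact"] = map(
            int, m.groups())
    return stats


example = """searched 19156002 residues in 17047 sequences
mean initn score:  31.8 (s.d.= 8.44)
mean init1 score:  31.8 (s.d.= 8.44)
161 scores better than 55 saved, ktup: 4, fact: 4
"""
print(parse_search_stats(example))
```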

The name and scores for the top 100 best-ranking sequences, as
determined by their INITN score, will be presented in the results.  In
addition, the optimized alignments for the top 20 ranking sequences
are given as shown below.  (Please note that the default values are
100 and 20 but may be more or less depending on the parameters and
query sequence submitted.)  Only the region that was considered
significant by the program will be displayed.


The best scores are:				      	initn init1  opt
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi   134   134   140
>M13TG117  - Phage M13tg117 cloning vector in 5' end of  122    84    87
>SYNPUC92B - Plasmid PUC9-2, a modified pUC9 vector wi   114    74    80
>M13TG115  - Phage M13tg115 cloning vector in the 5' en  103    63    63
>MUSP53MR  - Mouse p53 cellular tumor antigen mRNA, com   96    96    96
 .
 .
 .


Alignments:

>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi 
initn= 134   init1= 134   opt= 140       80.0% identity in 65 nt overlap

               10        20        30        40        50        60
M13MP5 ATGACCATGATTACGAATTCCGGAATTCCGGAATT-CCGGAATTCCG--GAATTCC--CC
       X:::::::::::::::::: :::::::::: :::: v^::   ::::  ::  : :  ::
SYNPUC ATGACCATGATTACGAATTGCGGAATTCCGCAATTCCCGGGGATCCGTCGACCTGCAGCC
               10        20        30        40        50        60

               70        80      
M13MP5 AAGCTTGGGAATTCCGGAATT
       :::::                
SYNPUC AAGCTGCAAGCTTGCAGCTTG
               70        80 

 .
 .
 .

Library scan:  0:05:20  total CPU time:  0:05:23


After all the alignments are printed out, the CPU time used for the 
library (database) scan and the total CPU time will be displayed.

The following table shows the symbols used in the alignment and their 
representation.
__________________________________________________________________

Symbol	  Representation	

 :	  an exact match

 .	  an ambiguous match or a match with a conservatively replaced 
	  amino acid

 -	  a gap in the sequence

 X	  boundaries of the initial region that are associated with 
	  the INIT1 score

 ^ and v  boundaries shifted during the final optimization step which 
	  replace "X"	

__________________________________________________________________

Interpreting the Scores

The OPT score is derived from the alignment and is generally the best
score to evaluate the alignments produced by FASTA.  Please note that
the program prints the scores in the order given by the INITN scores
and not the OPT scores.  Sequences with high INITN scores usually
have high OPT scores, but this is not always true.  Also, the OPT
scores are determined for only some of the database sequences;
therefore their mean and standard deviation are not calculated.  These
statistics can only be calculated for INITN and INIT1 scores.  For
more information on interpreting the scores produced by a FASTA
search, consult Pearson and Lipman's paper presented at the beginning
of the help text.
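
Since the listing is ordered by INITN rather than OPT, it can be handy to
re-sort it yourself.  The Python sketch below assumes the line layout of
the "best scores" example earlier in this text (a ">" name, " - ", a
description, then three integers); the regular expression and function
name are illustrative only.

```python
import re

SCORE_LINE = re.compile(r"^>(\S+)\s+-\s+(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_best_scores(text):
    """Return one dict per score line, in the order the lines appear."""
    hits = []
    for line in text.splitlines():
        m = SCORE_LINE.match(line)
        if m:
            name, description, initn, init1, opt = m.groups()
            hits.append({"name": name, "description": description,
                         "initn": int(initn), "init1": int(init1),
                         "opt": int(opt)})
    return hits


listing = """>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi   134   134   140
>M13TG117  - Phage M13tg117 cloning vector in 5' end of  122    84    87
>SYNPUC92B - Plasmid PUC9-2, a modified pUC9 vector wi   114    74    80
>M13TG115  - Phage M13tg115 cloning vector in the 5' en  103    63    63
>MUSP53MR  - Mouse p53 cellular tumor antigen mRNA, com   96    96    96
"""
by_opt = sorted(parse_best_scores(listing), key=lambda h: h["opt"], reverse=True)
for hit in by_opt:
    print(hit["name"], hit["initn"], hit["init1"], hit["opt"])
```

Alignment header lines, which carry no trailing scores, do not match the
pattern and are skipped.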

Calculating Time Usage

The processing time for a FASTA search depends on:  the size of the 
query sequence, the database selected, the ktup value, the number of 
requests in the batch queue and the load on the GenBank computer. 

Retrieving DataBank Entries found with FASTA

Database entries can be retrieved by either locus name or accession
number.  To use the GenBank Retrieval System, send an electronic
message to RETRIEVE@GENBANK.BIO.NET containing as text (leave the
Subject: line blank) either an accession number, or an entry name, but
not both.  The message text should contain exactly one word.

The data banks are searched in the order: GenBank New Data, GenBank
current release, EMBL New Data, EMBL current release, GenPept New
Data, GenPept current release, and Swiss-Prot until a match is found.
If an entry exists in both GenBank and EMBL with the same accession
number (the usual case), a query on the accession number will return
the GenBank version of the entry.  If the EMBL-format version is
required, it can be retrieved from the file server at
NETSERV@EMBL.BITNET (for instructions send a message containing the
line HELP to that address).  To retrieve GenPept entries, use the
LOCUS name of the corresponding GenBank entry followed by a _1, or _n
where n represents the nth coding region in that GenBank entry.  For
example, ASNTUBBA_1 is the GenPept LOCUS name for the translation from
GenBank entry ASNTUBBA.
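
The naming rule for GenPept entries and the one-word requirement for
RETRIEVE messages can be sketched in Python as follows.  The helper names
are hypothetical; sending the message is again left to your mail program.

```python
def genpept_locus(genbank_locus, coding_region=1):
    """Build the GenPept LOCUS name for the nth coding region of an entry."""
    if coding_region < 1:
        raise ValueError("coding regions are numbered from 1")
    return f"{genbank_locus}_{coding_region}"


def retrieve_request_body(identifier):
    """Return the message text for RETRIEVE: exactly one word on one line."""
    words = identifier.split()
    if len(words) != 1:
        raise ValueError("send either an accession number or an entry name, not both")
    return words[0] + "\n"


print(genpept_locus("ASNTUBBA"))
print(retrieve_request_body(genpept_locus("ASNTUBBA")), end="")
```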

An electronic version of the sequence data submission form used by the
sequence data banks is also available through the RETRIEVE server.  To
receive a copy, send a message containing the word DATASUB as the only
line.  Instructions for completing and submitting the form are
included.


Obtaining FASTA 

The FASTA program (and other related programs) can be purchased for 
VAX/VMS, SUN/Unix, IBM-PC and Macintosh computers. To obtain the program 
for one of these systems, contact Dr. William Pearson at:

	Department of Biochemistry
	Box 440 Jordan Hall
	University of Virginia
	Charlottesville, VA 22908      

	or send electronic mail to:  wrp@Virginia.BITNET

You can also obtain the programs by anonymous FTP from 
uvaarpa.virginia.edu by retrieving the file public_access/fasta.shar.

End of FASTA Server Help