kristoff@NET.BIO.NET (Dave Kristofferson) (11/10/89)
The GenBank On-line Service Servers for both FASTA nucleic acid and
protein sequence similarity searches and for entry retrieval are
operational and open to users on any accessible network. I include
the instructions for use below. The addresses for the servers are as
follows:
address                     purpose
-------                     -------
search@genbank.bio.net      FASTA searching (GenBank, EMBL,
                            SWISS-PROT, including latest daily
                            updates of GenBank and EMBL databases)
retrieve@genbank.bio.net    retrieval of search hits / entries by
                            locus name or by accession number
                            (GenBank and EMBL, SWISS-PROT to come)
Instructions can also be obtained by sending a mail message (without a
Subject: line) containing the single word HELP to either of the above
addresses.
Sincerely,
Dave Kristofferson
GenBank On-line Service Manager
kristoff@net.bio.net
----------------------------------------------------------------------
FASTA Server Help
GenBank now offers the FASTA program for nucleic acid sequence and
protein similarity searching of sequence databases. You can access
the GenBank FASTA Server through a number of different networks,
including Internet, BITNET, EARN, NETNORTH and JANET.
The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank. A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their
paper:
Pearson, W.R. and Lipman, D.J. 1988. Improved Tools for
Biological Sequence Comparison. Proc. Natl. Acad. Sci.,
85: 2444-2448.
If you use FASTA as a research tool, we ask that this reference be
cited in your paper. The results of the FASTA search will be returned
to your local mail file as soon as they are processed and can be saved
in a separate disk file.
The following databases are currently available for FASTA searches:
Designator                  Database
----------                  --------
GenBank/all                 Latest GenBank quarterly release PLUS
                            sequences added since last release.
GenBank/new                 GenBank sequences added since last release.
GenBank/primate             GenBank subdivisions
GenBank/rodent
GenBank/other_mammalian
GenBank/other_vertebrate
GenBank/invertebrate
GenBank/plant
GenBank/organelle
GenBank/bacterial
GenBank/structural_rna
GenBank/viral
GenBank/phage
GenBank/synthetic
GenBank/unannotated
EMBL/all                    Latest EMBL Data Library release PLUS
                            sequences added since last release.
EMBL/new                    EMBL sequences added since last release.
SWISS-PROT/all              All of the SWISS-PROT protein database.
GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database.
Check this on-line help for new databases available in the future for
FASTA searches.
Accessing the FASTA program
To access the program, send an electronic mail message containing the
formatted query sequence (as described below) to the following Internet
address:
SEARCH@GENBANK.BIO.NET
If you are not on Internet, you may need to change the format of the
address. Consult your systems manager to determine the correct address.
Obtaining Help
If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message. Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line).
Formatting a Query
Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search. The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below. These lines are typed into the body of
the mail message in the order shown below:
Search
Parameter    Mandatory   Explanation
DATALIB      Yes         This line specifies the database to be
                         searched (as described in the beginning of
                         this text) for the query sequence and must
                         be included in the message.
KTUP         No          This line identifies the Ktup value which
                         specifies the sensitivity of the search.
                         Values range between 3 and 6 for nucleic acid
                         searches and between 1 and 2 for protein
                         searches. Lower values specify more sensitive
                         searches but require more time to complete.
                         For DNA sequences longer than 200 base pairs,
                         use a Ktup value of 4 or greater; lower values
                         are unnecessary and take longer to complete.
                         Protein searches will benefit from having a
                         Ktup value of 1 if you expect significant
                         matches with evolutionary amino acid
                         replacements but few exact amino acid matches.
                         The default value is 4 for nucleic acids and
                         1 for proteins.
SCORES       No          This line specifies the number of best-ranked
                         sequences to be listed in the results. The
                         default value is 100.
ALIGNMENTS   No          This line identifies the maximum number of
                         best-ranked sequences to be aligned in the
                         results. The default value is 20.
BEGIN        Yes         This line must be included in the message. No
                         other information is typed on it.
The remainder of the message contains the query sequence in either
Pearson FASTA format or in IntelliGenetics format.
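The parameter lines above could be assembled programmatically. The
following is only a minimal sketch in Python, not part of the GenBank
service: the function name build_fasta_query and its arguments are
hypothetical, and it assumes that SWISS-PROT is the only protein
database (so that the protein Ktup range applies to it alone). It also
enforces the rules stated elsewhere in this help text that lines must
be shorter than 80 characters and that the message contains no blank
lines.

```python
def build_fasta_query(datalib, sequence_text, ktup=None, scores=None,
                      alignments=None):
    """Return the body of a mail message for the FASTA server.

    sequence_text is the query, already in Pearson or IntelliGenetics
    format. All names here are illustrative, not an official interface.
    """
    # Assumption: SWISS-PROT is the only protein database on offer.
    is_protein = datalib.upper().startswith("SWISS-PROT")
    if ktup is not None:
        low, high = (1, 2) if is_protein else (3, 6)
        if not low <= ktup <= high:
            raise ValueError(f"KTUP must be between {low} and {high}")

    lines = [f"DATALIB {datalib}"]            # mandatory
    if ktup is not None:
        lines.append(f"KTUP {ktup}")          # optional
    if scores is not None:
        lines.append(f"SCORES {scores}")      # optional
    if alignments is not None:
        lines.append(f"ALIGNMENTS {alignments}")  # optional
    lines.append("BEGIN")                     # mandatory, nothing else on it
    lines.extend(sequence_text.strip().splitlines())

    for line in lines:
        if not line.strip():
            raise ValueError("blank lines are not allowed in the message")
        if len(line) >= 80:
            raise ValueError("lines must be shorter than 80 characters")
    return "\n".join(lines) + "\n"
```

Omitting ktup, scores or alignments leaves the corresponding line out
of the message, so the server defaults described above would apply.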
Preparing Files for Similarity Searches
Only one sequence query is allowed per mail query. The query sequence
that you would like searched in the database must be contained in its
own file. Your sequence file must be in either Pearson format or
IntelliGenetics format. GenBank database file format is not currently
accepted; however, it is possible to use an editor to change the file
to Pearson format as described below. Note: all lines must be less
than 80 characters in length; longer lines will be truncated.
Pearson Format
Pearson is the preferred format to use for query sequences. The format
includes a mandatory comment line beginning with a greater-than sign ">"
followed by the name of the sequence, a space, and an optional note
about the sequence. The sequence data begin on the next line without
the greater-than sign. For example:
>AGREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa
IntelliGenetics Format
If your sequence was derived using one of the IntelliGenetics programs,
it can be used for a FASTA search. Comment lines are optional and
begin with a semicolon ";". The name of the sequence and the
sequence data appear on separate lines without a semicolon. A number
must follow the end of the sequence data to indicate whether the
sequence is linear (1) or circular (2). For example:
;Monkey SV40-like genomic segment promoting transcription.
AGMREP4
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa1
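As an illustration of the layout just described, here is a small Python
sketch that writes a sequence in IntelliGenetics format. The function
name to_intelligenetics and the 60-residue line width are assumptions
made for this example; the format rules themselves (optional ";"
comment, name line, sequence lines, trailing 1 or 2) are taken from the
text above.

```python
def to_intelligenetics(name, residues, comment=None, circular=False,
                       width=60):
    """Format a sequence as IntelliGenetics text (illustrative helper)."""
    if not residues:
        raise ValueError("empty sequence")
    lines = []
    if comment:
        lines.append(";" + comment)           # optional comment line
    lines.append(name)                        # sequence name, no semicolon
    # Sequence data, split into lines well under the 80-character limit.
    lines.extend(residues[i:i + width]
                 for i in range(0, len(residues), width))
    # Terminator: 1 for a linear sequence, 2 for a circular one.
    lines[-1] += "2" if circular else "1"
    return "\n".join(lines) + "\n"
```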
GenBank Flat-File Format
GenBank database file format is NOT accepted for query searches. The
files contain annotation data and residue numbers that cannot be
recognized by FASTA. For example:
(annotation data)
1 ccccttcaaa tctattacaa ggtgagcgtc tcgccaaggc aatgaaatcg caatatgatg
61 taaccttgcg ctttggatta gacggactgt taaacggcaa
These files can be used only if they are changed to follow Pearson
format. The files must be stripped of annotation data and the numbers
in the sequence; the mandatory comment line (starting with ">") must
then be added.
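The conversion described above can be done in an editor, or with a few
lines of code. The Python sketch below is one possible approach under
stated assumptions: the function name genbank_to_pearson is
hypothetical, the caller is assumed to have already removed the
annotation data and to pass only the numbered sequence lines, and the
60-residue output width is a choice made for this example.

```python
import re


def genbank_to_pearson(name, sequence_lines, note="", width=60):
    """Turn numbered GenBank sequence lines into Pearson format.

    sequence_lines holds only the sequence portion of the entry; the
    annotation data must already have been stripped by the caller.
    """
    # Remove residue numbers and all spacing from each line.
    residues = "".join(re.sub(r"[\d\s]", "", line)
                       for line in sequence_lines)
    # Mandatory comment line: ">", name, a space, optional note.
    header = ">" + name + (" " + note if note else "")
    chunks = [residues[i:i + width]
              for i in range(0, len(residues), width)]
    return "\n".join([header] + chunks) + "\n"
```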
Sending the Query Sequence
Use your local mail program to send GenBank your sequence query. Most
mail programs allow you to import a file into the mail message. You
can import your sequence file into the mail message on the line after
"Begin". Please follow the format in the following example of a FASTA
request PRECISELY, but note that the program is case-insensitive, i.e.
either upper or lower case letters may be used.
This is an example of a mail message sent for a FASTA search. Note that
the first four lines are a mail header that is automatically created
when you address a mail message. Nothing need be entered for the
Subject. Each line of information must be less than 80
characters in length. Longer lines will be truncated.
From: drbob@someaddress.somewhere.edu Tue Jun 14 21:36:38 1988
Date: 14 Jun 1988 2129:02-PDT
To: SEARCH@GENBANK.BIO.NET
Subject:
The text that you enter into the body of the message begins with DATALIB
(do not add blank lines in the message):
DATALIB GenBank/other_mammalian
KTUP 4
SCORES 100
ALIGNMENTS 20
BEGIN
>BOVPRL GenBank entry BOVPRL from gbmam file. 907 nucleotides.
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa
The sequence is then sent to the FASTA Server at GenBank. Once your
message is received, it is placed in a batch queue and processed in
the order it is received. Two queues called the fast and slow queues
process FASTA requests. The slow queue handles nucleic acid searches
of "genbank/all" and "embl/all." All other requests are placed in the
fast queue. Searches submitted to the fast queue require less CPU
time and are completed more quickly than those sent to the slow queue.
If you would like to know the status of the queues being processed,
you can send a mail message to the FASTA Server address
(SEARCH@GENBANK.BIO.NET) containing the word "QUEUE" on a single line
of the mail message (Leave the Subject field blank).
The fast queue is labeled with the letter "d"; the slow queue is
labeled with "e".
You cannot have more than one search waiting in the slow queue at any
one time. If you send an additional search to the slow queue before
your first request has been processed, the initial search will be
cancelled. At MOST you can have one executing search and one waiting
job in the slow queue at the same time. Multiple jobs are currently
permitted in the fast queue.
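The queue assignment rule can be summarized in a few lines of Python.
This is only a sketch of the rule as stated above, not the server's own
code; the names SLOW_QUEUE_DATABASES, QUEUE_LABELS and queue_for are
invented for the example.

```python
# Databases whose searches go to the slow queue, per the text above.
SLOW_QUEUE_DATABASES = {"genbank/all", "embl/all"}

# Letters used to label the queues in the QUEUE status report.
QUEUE_LABELS = {"fast": "d", "slow": "e"}


def queue_for(datalib):
    """Return "slow" or "fast" for a DATALIB designator.

    Designators are compared without regard to case, since the server
    is described as case-insensitive.
    """
    if datalib.lower() in SLOW_QUEUE_DATABASES:
        return "slow"
    return "fast"
```

For example, queue_for("GenBank/all") gives "slow", while
queue_for("GenBank/primate") gives "fast".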
Handling the Results of a FASTA Search
When the results are returned, use your local mail program to retrieve
them. You can transfer the results of a FASTA search to a separate
disk file to free up space in your mail directory. Consult the
documentation for your local mail program for the commands to transfer
and read mail. If you wish to obtain sequences of interest, use the
IRX searching system available through the GenBank On-line Service.
Contact GenBank for details.
Interpreting the Results of a FASTA Search
The mail message returned after the FASTA search will contain the
sequence name and length, the database searched, and the scoring matrix
used. When searching all of GenBank, each subdivision of GenBank will
also be displayed.
To achieve a rapid yet sensitive search, the FASTA program uses a
hierarchy of steps to determine scores for the sequences searched in the
database. There are cut off points in each of the scoring steps so that
only high scoring sequences are used in subsequent searching steps.
Three scores are tallied and reported: INITN, INIT1, and OPT. Each of
these scores is assigned to a sequence based on its rank at a specific
point in the similarity searching process.
In comparing the query sequence to a sequence in the database, the
following steps are taken to determine the three scores:
1. First, the ktup value is used to establish a matrix for comparing
sequences. A value of 4 for a nucleic acid means that each group of
4 consecutive residues of the query sequence and the database
sequence will be compared. The sequences are compared on two
perpendicular axes, and a diagonal line is created wherever a run of
ktup residues matches between the two sequences.
2. By joining match regions along the same diagonal that are not
separated by excessive mismatches, initial regions of high similarity
are identified. The 10 best diagonal regions of high similarity are
used for further analysis.
3. An INIT1 score is then assigned to each region of high similarity.
4. Next, FASTA attempts to join regions on the diagonal and assign
them an INITN score. The INITN score is determined by adding the
INIT1 scores of the two regions to be joined and subtracting a constant
value of 20 as a joining penalty. If the combined value of the
region is less than the INIT1 score of either region, the regions are
not joined. In this case, the INITN score will be equal to the
INIT1 score of each region. Only the sequences that have an INITN
score above a set cutoff point are kept for possible alignment.
5. Sequences with the highest INITN scores are then used for a
Needleman-Wunsch/Smith-Waterman alignment to determine their OPT score.
The OPT score is used to evaluate the alignments produced by FASTA.
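Two of the steps above lend themselves to a short illustration. The
Python sketch below is a simplified teaching example and not the FASTA
program itself: ktup_diagonal_hits counts exact ktup-word matches per
diagonal (step 1), and joined_score applies the joining penalty of 20
from step 4. The function names are invented here, real FASTA scores
regions with a scoring matrix rather than a simple count, and reporting
the larger INIT1 score when two regions are not joined is an assumption
made for this example.

```python
from collections import Counter

JOIN_PENALTY = 20  # constant joining penalty quoted in step 4


def ktup_diagonal_hits(query, subject, ktup):
    """Count exact ktup-word matches on each diagonal.

    A diagonal is identified by the offset i - j, where i and j are the
    start positions of the matching word in query and subject.
    """
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


def joined_score(init1_a, init1_b, penalty=JOIN_PENALTY):
    """Score for joining two regions, or the better INIT1 if not joined."""
    combined = init1_a + init1_b - penalty
    best_single = max(init1_a, init1_b)
    # Regions are not joined if joining scores below either region alone.
    if combined < best_single:
        return best_single  # assumption: report the larger INIT1 score
    return combined
```

With these definitions, joining regions that score 84 and 58 yields
84 + 58 - 20 = 122, whereas regions scoring 84 and 15 are left unjoined
because 79 is less than 84.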
A histogram of the score distributions for both the INITN and INIT1
scores will be displayed in the results. The score value is given in
the left column and the number of sequences that were in that interval
is displayed in the two columns to the right. In the following example,
there were 377 sequences with INITN scores that were greater than 12 but
less than or equal to 16. In the graphic histogram, "+"'s and "-"'s
are used to distinguish the bars for INITN and INIT1 scores,
respectively, if the numbers of scores differ.
Example:
initn init1
< 4 16 16:========
8 0 0:
12 1 1:=
16 377 377:==================================================
20 1272 1272:==================================================
24 2224 2224:==================================================
28 2717 2717:==================================================
32 3147 3147:==================================================
36 2921 2921:==================================================
40 2064 2064:==================================================
44 1243 1243:==================================================
48 568 568:==================================================
52 269 269:==================================================
56 105 105:==================================================
60 43 43:======================
64 21 22:===========
68 7 7:====
72 3 3:==
76 18 19:---------+
80 11 11:======
84 16 17:--------+
88 8 8:====
92 0 0:
96 1 1:=
100 0 0:
104 1 0:+
108 0 0:
112 0 0:
116 1 0:+
120 0 0:
124 1 0:+
128 0 0:
132 0 0:
136 1 1:=
140 0 0:
144 0 0:
148 0 0:
152 0 0:
156 0 0:
160 0 0:
>160 0 0:
KEY: + initn scores
- init1 scores
= no. of initn scores same as no. of init1 scores
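The interval rule behind the left column can be sketched in Python as
follows. This is an illustration only: the function names are invented,
and the handling of the first row ("< 4") and last row (">160") is an
assumption, since the text above defines only the interior rows (for
example, the row labeled 16 counts scores greater than 12 and less than
or equal to 16).

```python
from collections import Counter


def histogram_row(score, width=4, top=160):
    """Return the row label under which an integer score is counted."""
    if score > top:
        return ">%d" % top               # assumed overflow row
    # Smallest multiple of `width` that is >= score.
    upper = -(-score // width) * width
    if upper <= width:
        return "< %d" % width            # assumed: lowest scores share a row
    return str(upper)


def tally(scores):
    """Count how many scores fall into each histogram row."""
    return Counter(histogram_row(s) for s in scores)
```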
The statistics of the search will be given after the histogram including
the total number of residues in the database searched, the number of
sequences searched, the average INITN and INIT1 scores with their
respective standard deviations, the number of scores that were above the
cutoff value, the value for ktup, and the value for fact.
searched 19156002 residues in 17047 sequences
mean initn score: 31.8 (s.d.= 8.44)
mean init1 score: 31.8 (s.d.= 8.44)
161 scores better than 55 saved, ktup: 4, fact: 4
The name and scores for the top 100 best-ranking sequences, as
determined by their INITN score, will be presented in the results. In
addition, the optimized alignments for the top 20 ranking sequences are
given as shown below. (Please note that the default values are 100 and
20 but may be less depending on the query sequence submitted.) Only the
region that was considered significant by the program will be displayed.
The best scores are: initn init1 opt
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi 134 134 140
>M13TG117 - Phage M13tg117 cloning vector in 5' end of 122 84 87
>SYNPUC92B - Plasmid PUC9-2, a modified pUC9 vector wi 114 74 80
>M13TG115 - Phage M13tg115 cloning vector in the 5' en 103 63 63
>MUSP53MR - Mouse p53 cellular tumor antigen mRNA, com 96 96 96
.
.
.
Alignments:
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi
initn= 134 init1= 134 opt= 140 80.0% identity in 65 nt overlap
10 20 30 40 50 60
M13MP5 ATGACCATGATTACGAATTCCGGAATTCCGGAATT-CCGGAATTCCG--GAATTCC--CC
X:::::::::::::::::: :::::::::: :::: v^:: :::: :: : : ::
SYNPUC ATGACCATGATTACGAATTGCGGAATTCCGCAATTCCCGGGGATCCGTCGACCTGCAGCC
10 20 30 40 50 60
70 80
M13MP5 AAGCTTGGGAATTCCGGAATT
:::::
SYNPUC AAGCTGCAAGCTTGCAGCTTG
70 80
.
.
.
Library scan: 0:05:20 total CPU time: 0:05:23
After all the alignments are printed out, the CPU time used for the
library (database) scan and the total CPU time will be displayed.
The following table shows the symbols used in the alignment and what
they represent.
__________________________________________________________________
Symbol Representation
: an exact match
. an ambiguous match or a match with a conservatively replaced
amino acid
- a gap in the sequence
X boundaries of the initial region that are associated with
the INIT1 score
^ and v boundaries shifted during the final optimization step which
replace "X"
__________________________________________________________________
Interpreting the Scores
The OPT score is derived from the alignment and is generally the best
score to evaluate the alignments produced by FASTA. Please note that
the program prints the scores in the order given by the INITN scores
and not the OPT scores. Sequences with high INITN scores usually have
high OPT scores, but this is not always true. Also, the OPT scores are
determined for only some of the database sequences, so their mean and
standard deviation are not calculated. These statistics can only be
calculated for INITN and INIT1 scores. For more information on
interpreting the scores produced by a FASTA search, consult Pearson
and Lipman's paper cited at the beginning of the help text.
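Because the list of best scores is ordered by INITN rather than OPT, it
can be convenient to re-sort it after receiving the results. The Python
sketch below shows one way to do this. It assumes that each line of the
list has the layout shown in the example earlier (">", entry name, " - ",
description, then three integers); the names SCORE_LINE,
parse_best_scores and rank_by_opt are invented for this example.

```python
import re

# Assumed layout: >NAME - description initn init1 opt
SCORE_LINE = re.compile(
    r"^>(\S+)\s+-\s+(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_best_scores(lines):
    """Extract name, description and the three scores from result lines."""
    hits = []
    for line in lines:
        match = SCORE_LINE.match(line)
        if match:
            name, description, initn, init1, opt = match.groups()
            hits.append({"name": name, "description": description,
                         "initn": int(initn), "init1": int(init1),
                         "opt": int(opt)})
    return hits


def rank_by_opt(hits):
    """Return the hits ordered from highest to lowest OPT score."""
    return sorted(hits, key=lambda hit: hit["opt"], reverse=True)
```

Lines that do not match the assumed layout, such as the rows of dots,
are skipped.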
Calculating Time Usage
The processing time for a FASTA search depends on the size of the
query sequence, the database selected, the ktup value, the number of
requests in the batch queue and the load on the GenBank computer.
Retrieving DataBank Entries found with FASTA
Database entries can be retrieved by either locus name or accession
number. To use the GenBank Retrieval System, send an electronic
message to RETRIEVE@GENBANK.BIO.NET containing as text (leave the
Subject: line blank) either an accession number, or an entry name, but
not both. The message text should contain exactly one word.
Currently both GenBank and EMBL, but not SWISS-PROT, entries are
accessible.
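A retrieval request is simple enough to compose by hand, but the rules
above (no Subject, exactly one word of text) can also be checked in
code. The following Python sketch uses the standard library's
email.message module to build such a message without sending it. The
function name build_retrieve_message and the sender address in the test
are hypothetical, and leaving out the Subject header entirely is
assumed here to be equivalent to leaving it blank.

```python
from email.message import EmailMessage

RETRIEVE_ADDRESS = "RETRIEVE@GENBANK.BIO.NET"


def build_retrieve_message(sender, identifier):
    """Build (but do not send) a retrieval request.

    identifier is a single locus name or accession number.
    """
    words = identifier.split()
    if len(words) != 1:
        raise ValueError("the message text must contain exactly one word")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = RETRIEVE_ADDRESS
    # No Subject header is set: the Subject must be left blank.
    message.set_content(words[0] + "\n")
    return message
```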
Obtaining FASTA
The FASTA program (and other related programs) can be purchased for
VAX/VMS, SUN/Unix, IBM-PC and Macintosh computers. To obtain the program
for one of these systems, contact Dr. William Pearson at:
Department of Biochemistry
Box 440 Jordan Hall
University of Virginia
Charlottesville, VA 22908
or send electronic mail to: wrp@Virginia.BITNET
You can also obtain the programs by anonymous FTP from
uvaarpa.virginia.edu; the file is public_access/fasta.shar.
End of FASTA Server Help