[bionet.molbio.proteins] SWISS-PROT via fastp-mail for homology searches, PLEASE.

TSPRINGER@BIONET-20.ARPA (04/15/88)

From: Timothy Springer <TSPRINGER@BIONET-20.ARPA>

      Searching for homologies in protein databases is much trickier than I
 ever anticipated.  The problem is that all are incomplete and the degree of
 incompleteness varies greatly.  The local protein database with which I
 started using fastp, based on NBRF and assuredly updated very frequently,
 lacked the sequences most interesting to me.  When I started searching
 supposedly the same database on the NBRF computer in Washington, D.C., some
 very interesting homologies appeared.  The protein sequence, which was
 published some 2 years ago, somehow was not present in our own version of
 the database.
      The NBRF computer is excellent in responsiveness.  An account for $400
 may be had on NBRF just like on Intelligenetics.  (See PIR bulletins).  We
 have given up on Intelligenetics for any online work (but see about batch jobs
 below).  We typically wait 6 to 14 hours on the terminal for a fastp search.
 The postdocs and graduate students in the lab will patiently do shifts, check-
 ing the terminal every 30 min for any sign of a response.  If there is one,
 they have to respond to the interactive prompt, lest they be timed out and
 lose all.  I have given up in disgust after waiting for 6 hours with no
 response (during nonpeak time).  Either the system is hopelessly overloaded or
 the efficient Lipman-Pearson algorithm is implemented awkwardly; I also hear
 mumbling about lots of file openings and closings taking lots of time.  On
 NBRF, the online search is completed within about 2 minutes.  The PIR as well
 as the NEW database may be used.  The NBRF also has an excellent program for
 alignment, ALIGN, which is quick, and alignments with randomized sequences
 readily generate the statistical significance of the alignment.  I have yet to
 see anything giving meaningful statistics on the Intelligenetics programs.
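
 To illustrate the randomization idea behind statistics of this kind, here
 is a minimal Python sketch.  It is not the NBRF ALIGN program: the scoring
 scheme (match 1, mismatch 0, gap -1), the function names, and the number of
 shuffles are all assumptions chosen for illustration.  The real alignment
 score is compared with the scores obtained after shuffling one sequence,
 and the difference is expressed in standard deviations.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment score (toy scoring, assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_significance(a, b, trials=100, seed=0):
    """Return (real, mean_random, sd_random, z) using shuffled copies of b."""
    rng = random.Random(seed)
    real = global_score(a, b)
    residues = list(b)
    random_scores = []
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, randomized order
        random_scores.append(global_score(a, "".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    # z is undefined when every shuffled score is identical
    z = (real - mean) / sd if sd > 0 else float("nan")
    return real, mean, sd, z
```

 A real score several standard deviations above the shuffled mean suggests
 that the alignment is unlikely to be due to composition alone; how many
 standard deviations to demand is a judgment call not settled here.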
      My enthusiasm for Intelligenetics skyrocketed last week when I read the
 PROTEIN-ANALYSIS  bulletin board messages by Amos Bairoch on the SWISS-PROT
 database he has compiled and made available.  While cDNA sequences may appear
 within a few months of publication in the nucleic acid databases, it may take
 a year or two for them to be translated and appear in the protein databases
 (even though the translated sequence was right there in the publication).  So
 Dr. Bairoch (as have others for other databases) has translated the nucleic
 acid databases via computer program and included these additional sequences in
 SWISS-PROT, which with 1,654,416 residues in 6,102 sequences is the biggest
 nonredundant database I have seen.
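
 As a rough illustration of translating a nucleic acid entry by program,
 here is a minimal Python sketch.  It is not the program used to build
 SWISS-PROT; it assumes the input is already a coding sequence in the
 correct reading frame and uses the standard genetic code.  The names
 CODON_TABLE and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

 For example, translate("ATGGCCAAATAA") gives "MAK".  Locating the correct
 start codon and reading frame in a real database entry is the harder part
 and is not attempted here.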
      Hot to try it, I ran into the same endless hours of waiting at the termi-
 nal with no response.  Pouring out my frustrations to Vickie Johncox at In-
 telligenetics, I learned about fastp-mail and batch jobs. (Documented under
 HELP FASTP-MAIL and HELP BATCH).  Fastp mail is a way of sending the search to
 the Sun computer via mail, with results coming back in less than a minute ex-
 actly as advertised (Why can't we get responsiveness like that online?).  How-
 ever, fastp mail cannot search the SWISS-PROT database!  For this you must
 write a batch file, and I don't recommend using the BFASTP command to build
 your file because it was set up before the availability of SWISS-PROT and
 doesn't have that option in it.  Instead, you should write your own batch
 file.  See documentation in HELP BAT-FASTP.  A model batch file follows which
 probes for homologies with a sequence in the file YOUR.PEP:

      @TAKE <BIONET>BATCH.CMD
      @XFASTP
      *YOUR.PEP
      *2
      *
      *YOUR.REC
      *50
      *
      *
      *20
      @LOGOUT


      You could name this file YOUR.CTL and send it from your wordprocessor via
 kermit or edit it yourself on Bionet.  You may then submit it to the batch
 queue: SUBMIT YOUR.CTL /TIME 00:10:00.  This allows for 10 minutes of cpu
 time, much more than the 5 minutes needed.  You can check on status with INFO
 BATCH.  Your results with the 50 best scores and 20 best alignments will
 appear in the file YOUR.REC.  Note that YOUR.PEP should be in the
 Intelligenetics file format rather than in the format recommended for
 fastp: lines of comments preceded by ";", one line of title not preceded by
 a ";", and then the sequence of no more than 499 residues in one-letter
 code, followed by a "1".  If you have what the computer thinks is an extra
 sequence (like a line preceded by a "<") you will get an extra query about
 which sequence to search, which will throw off the batch file.
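
 For those preparing these files on their own machine before sending them
 over by kermit, the two files described above can be sketched in Python as
 follows.  This is only an illustration: the function names are
 hypothetical, wrapping the sequence at 60 residues per line is an
 assumption, and the answers "2" and the empty replies in the control file
 are copied verbatim from the model batch file without knowing which
 prompts they answer.

```python
MAX_RESIDUES = 499  # limit stated in the post


def format_query(title, sequence, comments=()):
    """Build the text of a query file: ';' comments, a title, sequence, '1'."""
    residues = "".join(sequence.split()).upper()
    if not residues.isalpha():
        raise ValueError("sequence must contain only one-letter residue codes")
    if len(residues) > MAX_RESIDUES:
        raise ValueError(f"{len(residues)} residues exceeds {MAX_RESIDUES}")
    if not title or title.startswith((";", "<")):
        raise ValueError("title line must not start with ';' or '<'")
    lines = [f";{c}" for c in comments]
    lines.append(title)
    body = residues + "1"  # the terminating "1" follows the last residue
    lines.extend(body[i:i + 60] for i in range(0, len(body), 60))
    return "\n".join(lines) + "\n"


def format_control(query_file, result_file, n_scores=50, n_alignments=20):
    """Build the batch control file, mirroring the model file line by line."""
    # "2" and the empty answers are copied from the model; meaning assumed.
    answers = [query_file, "2", "", result_file,
               str(n_scores), "", "", str(n_alignments)]
    lines = ["@TAKE <BIONET>BATCH.CMD", "@XFASTP"]
    lines += ["*" + answer for answer in answers]
    lines.append("@LOGOUT")
    return "\n".join(lines) + "\n"
```

 Writing format_query(...) to YOUR.PEP and format_control("YOUR.PEP",
 "YOUR.REC") to YOUR.CTL gives the pair of files used above.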
      Our results?  An exciting hit with a sequence in SWISS-PROT not present
 in other databases, and which was published in 1987.
      The moral?  Intelligenetics/Bionet/good guys/gals, could we get SWISS-
 PROT availability on fastp-mail?  The PIR database is antiquated by SWISS-
 PROT.  And batch files are a real pain compared to fastp-mail.  Who wants to
 wait overnight for one search?
      Even better?  Could we get responsiveness online?  Could we be connected
 directly to programs and computers that would do this for us rather than hav-
 ing to use a mail connection?  The only problem I can see is that you would
 become so popular that you would be swamped with users and demands for your
 time and help.  And rather than just using Bionet as a way to get a taste of
 molecular biology computing before becoming frustrated by its slowness and
 moving on to other resources, or other program families such as those by the
 University of Wisconsin Genetics Computer Group, users might become devoted,
 longtime customers.

-------

KRISTOFFERSON@BIONET-20.ARPA (04/16/88)

From: David Kristofferson <Kristofferson@BIONET-20.ARPA>


Dear Dr. Springer:

	You raise many valid points that are currently problems with
our system and that we are already in the process of addressing.
Please allow me to answer your queries in order.  For the benefit of
other bboard readers I have highlighted sections of Dr. Springer's
original bulletin with a > in the left hand column.

>     Searching for homologies in protein databases is much trickier than I
>ever anticipated.  The problem is that all are incomplete and the degree of
>incompleteness varies greatly.  The local protein database with which I
>started using fastp, based on NBRF and assuredly updated very frequently,
>lacked the sequences most interesting to me.  When I started searching
>supposedly the same database on the NBRF computer in Washington, D.C., some
>very interesting homologies appeared.  The protein sequence, which was
>published some 2 years ago, somehow was not present in our own version of
>the database.

BIONET currently has release 15.0 of PIR on-line.  This includes both
the sequences in the PROTEIN.DAT file and the NEW.DAT file.  It is
likely that PIR gets the latest version of the database up first on
their own machine, so one possibility may be that you are using a new
release that we have not yet received.  Another possibility may be
that the default parameters or algorithms used in the different
database searching programs may be dissimilar and some of the more
marginal hits may vary between each program.  We would like to know
what your query sequence was and what the hits were that were not
found on our system before reaching a final conclusion on this issue.

>     The NBRF computer is excellent in responsiveness.  An account for $400
>may be had on NBRF just like on Intelligenetics.  (See PIR bulletins).  We
>have given up on Intelligenetics for any online work (but see about batch jobs
>below).  

The PIR facility is another NIH-funded resource which, as Dr. Springer
mentioned, is available to researchers.  Many factors can affect the
response time on a machine ranging from obvious differences in
hardware to the number of users who access the system.  BIONET (note!:
not IntelliGenetics) in this respect is suffering from the
enthusiastic response of people who want access to the broad range of
software, databases, communications facilities, etc., which we
provide.  We are by far the largest of the resources.  Our number of
users grew 35% last year (to over 700 laboratories) without a
corresponding increase in our budget for new hardware.  The DEC-20
that BIONET currently uses was praised by the reviewers of the initial
BIONET proposal in 1983 as being an excellent choice and was lauded as
a machine with a user friendly interface.  Times have changed!

Through the initiative of our own staff, BIONET sought out and
obtained a donation of a new central computer facility from Sun
Microsystems as we announced earlier this year.  This was done without
any assistance from the NIH.  I am happy to be able to tell you that
the first shipment of these machines arrived yesterday and we will
soon be bringing this hardware on-line.  However, as you note below,
it is possible for further increases in the number of users to
eventually bring any system to its knees.  The advantage of our new
configuration is that it will be easily expandable and so it should
enable us to grow with the demand.  One can not, of course, expect
further gifts as demand increases, so it is our hope that the NIH will
provide us with adequate funds to allow this system to grow in the
future.  The demand for the service is obviously immense.  Sun has
given us enough equipment to lay the foundation for our new system,
but, without additions, it too can become overloaded.

>We typically wait 6 to 14 hours on the terminal for a fastp search.
>The postdocs and graduate students in the lab will patiently do shifts, check-
>ing the terminal every 30 min for any sign of a response.  If there is one,
>they have to respond to the interactive prompt, lest they be timed out and
>lose all.  I have given up in disgust after waiting for 6 hours with no
>response (during nonpeak time).  Either the system is hopelessly overloaded or
>the efficient Lipman-Pearson algorithm is implemented awkwardly; I also hear
>mumbling about lots of file openings and closings taking lots of time.  

The time that you cite for the FASTP search astonished our staff
members as it runs contrary to our other experience with users.  We
really need to investigate this further with the query sequence that
was used.  We have tested two different implementations of the
algorithm (from sources outside of IntelliGenetics) and did not detect
any significant difference between them, so we do not believe that
this is the cause.  The FASTP-MAIL program that we have on-line, NOT
the interactive version, will do an entire PIR database search in as
little as 30 seconds during off-peak hours and in about 20-30 minutes
when the load is heavy.  The situation that you describe is clearly
intolerable, but, if you encounter something like this, *please* call
us.  If something is wrong with the system (and this instance seems
definitely unusual) we want to know about it and try to fix it.  While
we acknowledge that the DEC is a heavily loaded machine it should not
be taking that long for a FASTP search.  

In any event we have been working on and plan to complete soon
enhanced -MAIL versions of the programs for both nucleic acid and
protein database searches.  These mail servers will remove the heavy
computational tasks from the DEC and should dramatically improve the
response time for users who want to perform other tasks.  Eventually
the DEC will be phased out but this will probably occur over the
period of about a year.

>On
>NBRF, the online search is completed within about 2 minutes.  The PIR as well
>as the NEW database may be used.  

I'm glad to hear that you are satisfied with their service.  Our
FASTP-MAIL program can give comparable results and also uses the NEW
database. 

>The NBRF also has an excellent program for
>alignment, ALIGN, which is quick, and alignments with randomized sequences
>readily generate the statistical significance of the alignment.  I have yet to
>see anything giving meaningful statistics on the Intelligenetics programs.

I have not personally used the ALIGN program but from your description
it sounds like a useful tool.  We have acquired the PIR software and
have plans to make some of it available on the new BIONET mVAX
account.  BIONET does not own this machine, however, and time on it is
limited, so I can not make promises at this point as to what we will
ultimately provide.  Regarding the IntelliGenetics programs, the
SEQ:SEARCH:HOMOLOGY option provides statistics on the significance of
alignments.  The database searching program IFIND does not do this.

>     My enthusiasm for Intelligenetics skyrocketed last week when I read the
>PROTEIN-ANALYSIS  bulletin board messages by Amos Bairoch on the SWISS-PROT
>database he has compiled and made available.  While cDNA sequences may appear
>within a few months of publication in the nucleic acid databases, it may take
>a year or two for them to be translated and appear in the protein databases
>(even though the translated sequence was right there in the publication).  So
>Dr. Bairoch (as have others for other databases) has translated the nucleic
>acid databases via computer program and included these additional sequences in
>SWISS-PROT, which with 1,654,416 residues in 6,102 sequences is the biggest
>nonredundant database I have seen.

I should point out that Amos (a friend of mine) relies on PIR for
their data.  SWISS-PROT builds on PIR, i.e., it relies on the labor of
the staff at PIR in addition to Amos's and the EMBL's efforts.

>     Hot to try it, I ran into the same endless hours of waiting at the termi-
>nal with no response.  Pouring out my frustrations to Vickie Johncox at In-
>telligenetics, I learned about fastp-mail and batch jobs. (Documented under
>HELP FASTP-MAIL and HELP BATCH).  

One small but important point: Vickie Johncox is at BIONET, not
IntelliGenetics.  She and the rest of us are paid by the NIH, not by
IG.  We run a non-profit NIH-funded service for the research
community.  At several points in the original message BIONET is viewed
as identical with IntelliGenetics.  We are only a non-profit
department in the company.  IntelliGenetics also writes and markets
other software products which are not available on BIONET.  BIONET, on
the other hand, provides access to contributed academic software such
as FASTP and many other programs which are not available to commercial
customers of IntelliGenetics.

>Fastp mail is a way of sending the search to
>the Sun computer via mail, with results coming back in less than a minute ex-
>actly as advertised (Why can't we get responsiveness like that online?).  

I'm glad that our advertisements are endorsed <grin>!  We can't
provide responsiveness like that on-line until the compute-intensive
jobs are removed from the DEC.  The Sun is a fast machine and is not
as loaded as the DEC.  The Sun currently in use has been on loan from
IntelliGenetics and BIONET will shortly be transferring these jobs to
our own machine.  IntelliGenetics has been assisting BIONET in this
manner (also with the mVAX), but IG is still a small company, must pay
for all of the expenses of its own operation, building facilities,
etc., and does not have the resources of Sun Microsystems or the NIH.
Ultimately BIONET needs additional support from elsewhere.

>However, fastp mail cannot search the SWISS-PROT database!  

This is in the process of being implemented and will be finished very
soon.  We had to first reformat SWISS-PROT for use with FASTP (which
requires a special database format) and this was finished a few weeks
back.  The changes to FASTP-MAIL are not great, but we are currently
deciding whether to go this route or implement a FASTA-MAIL program
based on new code provided to us by Bill Pearson.  We are
investigating both options and I can assure you that this is one of
our highest priority items.

>For this you must
>write a batch file, and I don't recommend using the BFASTP command to build
>your file because it was set up before the availability of SWISS-PROT and
>doesn't have that option in it.  

This is also a very simple modification to BFASTP and was already in
the queue prior to your message.

>Instead, you should write your own batch
>file.  See documentation in HELP BAT-FASTP.  A model batch file follows which
>probes for homologies with a sequence in the file YOUR.PEP:

      @TAKE <BIONET>BATCH.CMD
      @XFASTP
      *YOUR.PEP
      *2
      *
      *YOUR.REC
      *50
      *
      *
      *20
      @LOGOUT


>     You could name this file YOUR.CTL and send it from your wordprocessor via
>kermit or edit it yourself on Bionet.  You may then submit it to the batch
>queue: SUBMIT YOUR.CTL /TIME 00:10:00.  This allows for 10 minutes of cpu
>time, much more than the 5 minutes needed.  

Curiously this is the same program run by batch that is taking so much
of your time to run interactively.  We really need to investigate this
problem further.  Please contact us.

(Dr. Springer gave detailed instructions for using the program in his
original message which I omit here.)

>     Our results?  An exciting hit with a sequence in SWISS-PROT not present
>in other databases, and which was published in 1987.
>     The moral?  Intelligenetics/Bionet/good guys/gals, could we get SWISS-
>PROT availability on fastp-mail?  

Coming *very* soon.

>The PIR database is antiquated by SWISS-PROT.  

No, it isn't, as I explained above.  This is an issue for the PIR staff
to answer more completely.  Users should appreciate the amount of
effort invested by the PIR staff in processing their data.  Without
them BIONET would not have the PIR database and SWISS-PROT would also
be impacted.

>And batch files are a real pain compared to fastp-mail.  Who wants to
>wait overnight for one search?

Agreed, but please keep in mind that BIONET is priced so low that we
have become victimized by our own success.  Despite the problems that
we acknowledge with the system we have users who tell us that BIONET
is still one of the best deals available.

>     Even better?  Could we get responsiveness online?  Could we be connected
>directly to programs and computers that would do this for us rather than hav-
>ing to use a mail connection?  The only problem I can see is that you would
>become so popular that you would be swamped with users and demands for your
>time and help.  

We are already in this situation and as you can see popularity breeds
unpopularity if resources do not keep up with demand.  Responsiveness
will improve as discussed above.  The Suns will be accessed remotely
by mail servers until we have everything prepared to start shifting
users directly on to those machines.  That is going to require further
software development, new documentation, etc., so the transition will
take place over approximately a year.

>And rather than just using Bionet as a way to get a taste of
>molecular biology computing before becoming frustrated by its slowness and
>moving on to other resources, or other program families such as those by the
>University of Wisconsin Genetics Computer Group, users might become devoted,
>longtime customers.

This is another reason why one should note the distinction between
BIONET and IntelliGenetics.  It is unfortunate for IntelliGenetics
that the demands on BIONET are giving its software a black eye.  IG
software running on BIONET is displayed in its worst hardware
environment, but this situation continues because of the history of
the original grant.  Any software which is placed in an environment
where everybody wants to use it and does not have to pay much to
access it will suffer a drastic reduction in response time.  As
evidenced by your comments above, IntelliGenetics' involvement with
BIONET is not always a boon for the company!  

I have outlined some of the steps above that BIONET is taking to
improve the resource.  There are other steps such as closing the
resource to new users, increasing fees to cover more of the expenses,
etc.  We are trying to avoid taking more draconian measures and have
been working hard to expand our resources for the community.  I hope
that this message answers some, if not all, of your concerns, and I
hope that the user community will endure along with us for a few more
months while we go through this transition to the new Sun system.  I
realize that recent times have not been easy on either some of our
users or on our staff, but we are working hard to improve things (as
evidenced by the time and date that you should see stamped on this
message.  My wife isn't going to be too happy if I don't get home soon
on Friday night!).

				Sincerely,

				David Kristofferson, Ph.D.
				BIONET Resource Manager

				kristofferson@bionet-20.arpa

-------