TSPRINGER@BIONET-20.ARPA (04/15/88)
From: Timothy Springer <TSPRINGER@BIONET-20.ARPA>

Searching for homologies in protein databases is much trickier than I ever anticipated. The problem is that all are incomplete and the degree of incompleteness varies greatly. The local protein database based on NBRF, which is assuredly updated very frequently and with which I started using fastp, lacked the sequences most interesting to me. When I started searching supposedly the same database on the NBRF computer in Washington, D.C., some very interesting homologies appeared. The protein sequence, which was published some 2 years ago, somehow was not present in our own version of the database.

The NBRF computer is excellent in responsiveness. An account for $400 may be had on NBRF just like on Intelligenetics. (See PIR bulletins). We have given up on Intelligenetics for any online work (but see about batch jobs below). We typically wait 6 to 14 hours on the terminal for a fastp search. The postdocs and graduate students in the lab will patiently do shifts, checking the terminal every 30 min for any sign of a response. If there is one, they have to respond to the interactive prompt, lest they be timed out and lose all. I have given up in disgust after waiting for 6 hours with no response (during nonpeak time). Either the system is hopelessly overloaded or the efficient Lipman-Pearson algorithm is implemented awkwardly; I also hear mumbling about lots of file openings and closings taking lots of time.

On NBRF, the online search is completed within about 2 minutes. The PIR as well as the NEW database may be used. The NBRF also has an excellent program for alignment, ALIGN, which is quick, and alignments with randomized sequences readily generate the statistical significance of the alignment. I have yet to see anything giving meaningful statistics on the Intelligenetics programs.
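The idea of judging an alignment against randomized sequences can be illustrated with a small Python sketch. It is not the ALIGN program: the scoring scheme (match +1, mismatch -1, gap -2), the function names, and the number of shuffles are all assumptions chosen for brevity, where a real tool would use a mutation data matrix and tuned gap penalties. The sketch scores the real pair, scores the first sequence against many shuffled copies of the second, and reports how many standard deviations the real score lies above the shuffled mean.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (toy scoring, an assumption)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_z_score(a, b, trials=100, seed=0):
    """Standard deviations by which the real score exceeds shuffled scores."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    real = global_score(a, b)
    residues = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, random order
        scores.append(global_score(a, "".join(residues)))
    spread = statistics.stdev(scores)
    if spread == 0:
        return float("inf")
    return (real - statistics.mean(scores)) / spread
```

A score only a few standard deviations above the shuffled mean is weak evidence of relatedness; shuffling keeps the amino acid composition fixed, so the comparison isolates the contribution of residue order.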
My enthusiasm for Intelligenetics skyrocketed last week when I read the PROTEIN-ANALYSIS bulletin board messages by Amos Bairoch on the SWISS-PROT database he has compiled and made available. While cDNA sequences may appear within a few months of publication in the nucleic acid databases, it may take a year or two for them to be translated and appear in the protein databases (even though the translated sequence was right there in the publication). So Dr. Bairoch (as have others for other databases) has translated the nucleic acid databases via computer program and included these additional sequences in SWISS-PROT, which with 1,654,416 residues in 6,102 sequences is the biggest nonredundant database I have seen.

Hot to try it, I ran into the same endless hours of waiting at the terminal with no response. Pouring out my frustrations to Vickie Johncox at Intelligenetics, I learned about fastp-mail and batch jobs. (Documented under HELP FASTP-MAIL and HELP BATCH). Fastp mail is a way of sending the search to the Sun computer via mail, with results coming back in less than a minute exactly as advertised (Why can't we get responsiveness like that online?). However, fastp mail cannot search the SWISS-PROT database! For this you must write a batch file, and I don't recommend using the BFASTP command to build your file because it was set up before the availability of SWISS-PROT and doesn't have that option in it. Instead, you should write your own batch file. See documentation in HELP BAT-FASTP. A model batch file follows which probes for homologies with a sequence in the file YOUR.PEP:

    @TAKE <BIONET>BATCH.CMD
    @XFASTP
    *YOUR.PEP
    *2
    *
    *YOUR.REC
    *50
    *
    *
    *20
    @LOGOUT

You could name this file YOUR.CTL and send it from your wordprocessor via kermit or edit it yourself on Bionet. You may then submit it to the batch queue: SUBMIT YOUR.CTL /TIME 00:10:00. This allows for 10 minutes of cpu time, much more than the 5 minutes needed. You can check on status with INFO BATCH.
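For anyone preparing several searches, the control file can be generated rather than typed. The Python sketch below only reproduces the model file from this message with the file names and counts filled in. The function name and parameter names are hypothetical, the one-entry-per-line layout is an assumption about how batch control files are written, and the meaning of the "2" answer and of the bare "*" answers is not stated in the message, so they are passed through unchanged.

```python
def build_fastp_control(query_file, output_file, second_answer="2",
                        n_scores=50, n_alignments=20):
    """Return the text of a batch control file modeled on the posted example."""
    lines = [
        "@TAKE <BIONET>BATCH.CMD",
        "@XFASTP",
        f"*{query_file}",       # query sequence file, YOUR.PEP in the post
        f"*{second_answer}",    # meaning not given in the post (assumption: kept as-is)
        "*",                    # bare answer copied from the model file
        f"*{output_file}",      # results file, YOUR.REC in the post
        f"*{n_scores}",         # number of best scores to report
        "*",
        "*",
        f"*{n_alignments}",     # number of best alignments to report
        "@LOGOUT",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the control file locally before transferring it, e.g. via kermit.
    with open("YOUR.CTL", "w") as handle:
        handle.write(build_fastp_control("YOUR.PEP", "YOUR.REC"))
```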
Your results with the 50 best scores and 20 best alignments will appear in the file YOUR.REC. Note that YOUR.PEP should be in the Intelligenetics file format, with lines of comments preceded by ";", one line of title not preceded by a ";", and then the sequence with no more than 499 residues in one letter code, followed by a "1", rather than in the format recommended for fastp. If you have what the computer thinks is an extra sequence (like a line preceded by a "<") you will get an extra query about which sequence to search, which will throw off the batch file.

Our results? An exciting hit with a sequence in SWISS-PROT not present in other databases, and which was published in 1987.

The moral? Intelligenetics/Bionet/good guys/gals, could we get SWISS-PROT availability on fastp-mail? The PIR database is antiquated by SWISS-PROT. And batch files are a real pain compared to fastp-mail. Who wants to wait overnight for one search?

Even better? Could we get responsiveness online? Could we be connected directly to programs and computers that would do this for us rather than having to use a mail connection? The only problem I can see is that you would become so popular that you would be swamped with users and demands for your time and help. And rather than just using Bionet as a way to get a taste of molecular biology computing before becoming frustrated by its slowness and moving on to other resources, or other program families such as those by the University of Wisconsin Genetics Computer Group, users might become devoted, longtime customers.
-------
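The query file layout described in the message above (comment lines starting with ";", one title line, then the sequence ending in "1", at most 499 residues) can be sketched in Python as follows. This is a minimal illustration, not a tool from BIONET: the function name is hypothetical, and the 60-character line width and the upper-casing of residues are assumptions that the message does not specify.

```python
MAX_RESIDUES = 499  # limit quoted in the message


def format_query(title, sequence, comments=(), width=60):
    """Return query-file text: ';' comments, a title line, sequence plus '1'."""
    residues = "".join(sequence.split()).upper()
    if not residues:
        raise ValueError("sequence is empty")
    if len(residues) > MAX_RESIDUES:
        raise ValueError(f"sequence has {len(residues)} residues, limit is {MAX_RESIDUES}")
    if not title or title.startswith(";"):
        raise ValueError("title line must be present and must not start with ';'")
    lines = [f";{comment}" for comment in comments]
    lines.append(title)
    body = residues + "1"  # terminating "1" as described in the message
    # Line width is an assumption; the message gives no wrapping rule.
    lines.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(lines) + "\n"
```

Checking the residue count before submission is worthwhile because a rejected file would only be discovered after the batch job has run.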
KRISTOFFERSON@BIONET-20.ARPA (04/16/88)
From: David Kristofferson <Kristofferson@BIONET-20.ARPA>

Dear Dr. Springer:

You raise many valid points that are currently problems with our system and that we are already in the process of addressing. Please allow me to answer your queries in order. For the benefit of other bboard readers I have highlighted sections of Dr. Springer's original bulletin with a > in the left hand column.

> Searching for homologies in protein databases is much trickier than I
>ever anticipated. The problem is that all are incomplete and the degree of
>incompleteness varies greatly. The local protein database based on NBRF which
>is assuredly updated very frequently with which I started using fastp lacked
>the most interesting sequences to me. When I started searching supposedly the
>same database on the NBRF computer in Washington, D.C. some very interesting
>homologies appeared. The protein sequence which was published some 2 years
>ago somehow was not present in our own version of the database.

BIONET currently has release 15.0 of PIR on-line. This includes both the sequences in the PROTEIN.DAT file and the NEW.DAT file. It is likely that PIR gets the latest version of the database up first on their own machine, so one possibility may be that you are using a new release that we have not yet received. Another possibility may be that the default parameters or algorithms used in the different database searching programs may be dissimilar and some of the more marginal hits may vary between each program. We would like to know what your query sequence was and what the hits were that were not found on our system before reaching a final conclusion on this issue.

> The NBRF computer is excellent in responsiveness. An account for $400
>may be had on NBRF just like on Intelligenetics. (See PIR bulletins). We
>have given up on Intelligenetics for any online work (but see about batch jobs
>below).

The PIR facility is another NIH-funded resource which, as Dr.
Springer mentioned, is available to researchers. Many factors can affect the response time on a machine, ranging from obvious differences in hardware to the number of users who access the system. BIONET (note!: not IntelliGenetics) in this respect is suffering from the enthusiastic response of people who want access to the broad range of software, databases, communications facilities, etc., which we provide. We are by far the largest of the resources. Our number of users grew 35% last year (to over 700 laboratories) without a corresponding increase in our budget for new hardware. The DEC-20 that BIONET currently uses was praised by the reviewers of the initial BIONET proposal in 1983 as being an excellent choice and was lauded as a machine with a user-friendly interface. Times have changed!

Through the initiative of our own staff, BIONET sought out and obtained a donation of a new central computer facility from Sun Microsystems as we announced earlier this year. This was done without any assistance from the NIH. I am happy to be able to tell you that the first shipment of these machines arrived yesterday and we will soon be bringing this hardware on-line. However, as you note below, it is possible for further increases in the number of users to eventually bring any system to its knees. The advantage of our new configuration is that it will be easily expandable and so it should enable us to grow with the demand. One cannot, of course, expect further gifts as demand increases, so it is our hope that the NIH will provide us with adequate funds to allow this system to grow in the future. The demand for the service is obviously immense. Sun has given us enough equipment to lay the foundation for our new system, but, without additions, it too can become overloaded.

>We typically wait 6 to 14 hours on the terminal for a fastp search.
>The postdocs and graduate students in the lab will patiently do shifts, checking
>the terminal every 30 min for any sign of a response.
>If there is one, they have to respond to the interactive prompt, lest they be
>timed out and lose all. I have given up in disgust after waiting for 6 hours
>with no response (during nonpeak time). Either the system is hopelessly
>overloaded or the efficient Lipman-Pearson algorithm is implemented awkwardly;
>I also hear mumbling about lots of file openings and closings taking lots of time.

The time that you cite for the FASTP search astonished our staff members as it runs contrary to our other experience with users. We really need to investigate this further with the query sequence that was used. We have tested two different implementations of the algorithm (from sources outside of IntelliGenetics) and did not detect any significant difference between them, so we do not believe that this is the cause. The FASTP-MAIL program that we have on-line, NOT the interactive version, will do an entire PIR database search in as little as 30 seconds during off-peak hours and in about 20-30 minutes when the load is heavy. The situation that you describe is clearly intolerable, but, if you encounter something like this, *please* call us. If something is wrong with the system (and this instance seems definitely unusual) we want to know about it and try to fix it. While we acknowledge that the DEC is a heavily loaded machine, it should not be taking that long for a FASTP search.

In any event we have been working on, and plan to complete soon, enhanced -MAIL versions of the programs for both nucleic acid and protein database searches. These mail servers will remove the heavy computational tasks from the DEC and should dramatically improve the response time for users who want to perform other tasks. Eventually the DEC will be phased out, but this will probably occur over the period of about a year.

>On NBRF, the online search is completed within about 2 minutes. The PIR as well
>as the NEW database may be used.

I'm glad to hear that you are satisfied with their service.
Our FASTP-MAIL program can give comparable results and also uses the NEW database.

>The NBRF also has an excellent program for
>alignment, ALIGN, which is quick, and alignments with randomized sequences
>readily generate the statistical significance of the alignment. I have yet to
>see anything giving meaningful statistics on the Intelligenetics programs.

I have not personally used the ALIGN program but from your description it sounds like a useful tool. We have acquired the PIR software and have plans to make some of it available on the new BIONET mVAX account. BIONET does not own this machine, however, and time on it is limited, so I cannot make promises at this point as to what we will ultimately provide. Regarding the IntelliGenetics programs, the SEQ:SEARCH:HOMOLOGY option provides statistics on the significance of alignments. The database searching program IFIND does not do this.

> My enthusiasm for Intelligenetics skyrocketed last week when I read the
>PROTEIN-ANALYSIS bulletin board messages by Amos Bairoch on the SWISS-PROT
>database he has compiled and made available. While cDNA sequences may appear
>within a few months of publication in the nucleic acid databases, it may take
>a year or two for them to be translated and appear in the protein databases
>(even though the translated sequence was right there in the publication). So
>Dr. Bairoch (as have others for other databases) has translated the nucleic
>acid databases via computer program and included these additional sequences in
>SWISS-PROT, which with 1,654,416 residues in 6,102 sequences is the biggest
>nonredundant database I have seen.

I should point out that Amos (a friend of mine) relies on PIR for their data. SWISS-PROT builds on PIR, i.e., it relies on the labor of the staff at PIR in addition to Amos's and the EMBL's efforts.

> Hot to try it, I ran into the same endless hours of waiting at the terminal
>with no response.
>Pouring out my frustrations to Vickie Johncox at Intelligenetics,
>I learned about fastp-mail and batch jobs. (Documented under
>HELP FASTP-MAIL and HELP BATCH).

One small but important point: Vickie Johncox is at BIONET, not IntelliGenetics. She and the rest of us are paid by the NIH, not by IG. We run a non-profit NIH-funded service for the research community. At several points in the original message BIONET is viewed as identical with IntelliGenetics. We are only a non-profit department in the company. IntelliGenetics also writes and markets other software products which are not available on BIONET. BIONET, on the other hand, provides access to contributed academic software such as FASTP and many other programs which are not available to commercial customers of IntelliGenetics.

>Fastp mail is a way of sending the search to
>the Sun computer via mail, with results coming back in less than a minute
>exactly as advertised (Why can't we get responsiveness like that online?).

I'm glad that our advertisements are endorsed <grin>! We can't provide responsiveness like that on-line until the compute-intensive jobs are removed from the DEC. The Sun is a fast machine and is not as loaded as the DEC. The Sun currently in use has been on loan from IntelliGenetics and BIONET will shortly be transferring these jobs to our own machine. IntelliGenetics has been assisting BIONET in this manner (also with the mVAX), but IG is still a small company, must pay for all of the expenses of its own operation, building facilities, etc., and does not have the resources of Sun Microsystems or the NIH. Ultimately BIONET needs additional support from elsewhere.

>However, fastp mail cannot search the SWISS-PROT database!

This is in the process of being implemented and will be finished very soon. We had to first reformat SWISS-PROT for use with FASTP (which requires a special database format) and this was finished a few weeks back.
The changes to FASTP-MAIL are not great, but we are currently deciding whether to go this route or implement a FASTA-MAIL program based on new code provided to us by Bill Pearson. We are investigating both options and I can assure you that this is one of our highest priority items.

>For this you must
>write a batch file, and I don't recommend using the BFASTP command to build
>your file because it was set up before the availability of SWISS-PROT and
>doesn't have that option in it.

This is also a very simple modification to BFASTP and was already in the queue prior to your message.

>Instead, you should write your own batch
>file. See documentation in HELP BAT-FASTP. A model batch file follows which
>probes for homologies with a sequence in the file YOUR.PEP:

    @TAKE <BIONET>BATCH.CMD
    @XFASTP
    *YOUR.PEP
    *2
    *
    *YOUR.REC
    *50
    *
    *
    *20
    @LOGOUT

> You could name this file YOUR.CTL and send it from your wordprocessor via
>kermit or edit it yourself on Bionet. You may then submit it to the batch
>queue: SUBMIT YOUR.CTL /TIME 00:10:00. This allows for 10 minutes of cpu
>time, much more than the 5 minutes needed.

Curiously, this is the same program run by batch that is taking so much of your time to run interactively. We really need to investigate this problem further. Please contact us. (Dr. Springer gave detailed instructions for using the program in his original message which I omit here.)

> Our results? An exciting hit with a sequence in SWISS-PROT not present
>in other databases, and which was published in 1987.

> The moral? Intelligenetics/Bionet/good guys/gals, could we get
>SWISS-PROT availability on fastp-mail?

Coming *very* soon.

>The PIR database is antiquated by SWISS-PROT.

No, it isn't, as I explained above. This is an issue for the PIR staff to answer more completely. Users should appreciate the amount of effort invested by the PIR staff in processing their data. Without them BIONET would not have the PIR database and SWISS-PROT would also be impacted.
>And batch files are a real pain compared to fastp-mail. Who wants to
>wait overnight for one search?

Agreed, but please keep in mind that BIONET is priced so low that we have become victimized by our own success. Despite the problems that we acknowledge with the system we have users who tell us that BIONET is still one of the best deals available.

> Even better? Could we get responsiveness online? Could we be connected
>directly to programs and computers that would do this for us rather than
>having to use a mail connection? The only problem I can see is that you would
>become so popular that you would be swamped with users and demands for your
>time and help.

We are already in this situation and as you can see popularity breeds unpopularity if resources do not keep up with demand. Responsiveness will improve as discussed above. The Suns will be accessed remotely by mail servers until we have everything prepared to start shifting users directly on to those machines. That is going to require further software development, new documentation, etc., so the transition will take place over approximately a year.

>And rather than just using Bionet as a way to get a taste of
>molecular biology computing before becoming frustrated by its slowness and
>moving on to other resources, or other program families such as those by the
>University of Wisconsin Genetics Computer Group, users might become devoted,
>longtime customers.

This is another reason why one should note the distinction between BIONET and IntelliGenetics. It is unfortunate for IntelliGenetics that the demands on BIONET are giving its software a black eye. IG software running on BIONET is displayed in its worst hardware environment, but this situation continues because of the history of the original grant. Any software which is placed in an environment where everybody wants to use it and does not have to pay much to access it will suffer a drastic reduction in response time.
As evidenced by your comments above, IntelliGenetics' involvement with BIONET is not always a boon for the company!

I have outlined some of the steps above that BIONET is taking to improve the resource. There are other steps such as closing the resource to new users, increasing fees to cover more of the expenses, etc. We are trying to avoid taking more draconian measures and have been working hard to expand our resources for the community.

I hope that this message answers some, if not all, of your concerns, and I hope that the user community will endure along with us for a few more months while we go through this transition to the new Sun system. I realize that recent times have not been easy on either some of our users or on our staff, but we are working hard to improve things (as evidenced by the time and date that you should see stamped on this message. My wife isn't going to be too happy if I don't get home soon on Friday night!).

Sincerely,

David Kristofferson, Ph.D.
BIONET Resource Manager

kristofferson@bionet-20.arpa
-------