rsmith%mbcrr@HARVARD.HARVARD.EDU (Randall Smith) (07/26/90)
************************************************************
Announcing Rel. 4.0 of the MBCRR's Protein Pattern Library
and Search Tool (PLSEARCH)
************************************************************

The MBCRR Protein Pattern Library is a database of
"consensus-like" protein sequence patterns, each pattern derived
from a set of homologous sequences in the SWISS-PROT Protein
Sequence Database. Families of related protein sequences are
identified by running the entire SWISS-PROT database against
itself (using BLAST, the NLM/NCBI's new high-speed similarity
search tool); the resulting set of pair-wise scores is then
clustered into families using a maximal-linkage clustering
algorithm. A pattern construction algorithm (Smith and Smith
1990, PNAS 87:118-122) is then used to generate a single pattern
for each family; the patterns, which we call amino acid class
covering (AACC) patterns, are functionally equivalent to
'regular expression' patterns and represent the conserved
primary sequence elements common to all members of each family.

This new release of the pattern library (based on SWISS-PROT
rel. 13) contains 5199 entries: 2026 patterns derived from all
families of 2 or more members (encompassing 10664 of the 13837
sequences in SWISS-PROT rel. 13) plus the remaining 3173
"non-related" sequences (i.e. from those loci that did not
cluster into any family).

The MBCRR distributes the pattern library with a dynamic
programming-based search tool (PLSEARCH) for matching and
aligning newly generated protein sequences against the pattern
database. We have shown that covering patterns can be more
diagnostic for family membership than any of the individual
sequences used to construct a pattern (see Smith and Smith,
1990); thus pattern searches can be a more sensitive search
technique than traditional sequence vs. sequence database search
tools.

Also included in the package is our new multi-sequence alignment
program (PIMA: Pattern-Induced Multi-Alignment).
This program is now being used routinely by the Human Retrovirus
and AIDS Sequence Database Group (Los Alamos Natl. Labs) to
multi-align HIV protein sequences for phylogenetic analyses.

PLSEARCH is written in 'C' and can run under both Unix and VMS
operating systems; PIMA employs Unix shell scripts and thus is
currently a Unix-only implementation. The entire package is
available electronically and is free of charge to non-profit
organizations (commercial users must arrange payment of a
distribution fee). Copies can be obtained:

1) directly from the MBCRR via INTERNET anonymous ftp:
   mbcrr.harvard.edu (134.174.51.4); the package is in a single
   compressed tar file in the 'plsearch' subdirectory,

2) by electronic mail from the Univ. of Houston Genbank-Server:
   genbank-server@bchs.uh.edu (INTERNET) or
   genbank-server%bchs.uh.edu@cunyvm (BITNET/EARN).
   Send a mail message containing the line "SEND UNIX HELP" to
   start; the files are in the Unix area and are uuencoded,
   compressed text files of approximately 300K each. The package
   is also available in the same form via anonymous FTP to
   lavaca.uh.edu (129.7.1.19), in ~ftp/pub/genbank-server/Unix,
   as plsrchaa, plsrchab, plsrchac, etc.

or

3) by electronic mail via the EMBL File Server: send the message
   "HELP SOFTWARE" to netserv@embl.bitnet to obtain specifics on
   retrieving the files.

When using anonymous FTP or e-mail, please transfer files during
off-hours (after 5 PM, the remote machine's local time); when
e-mailing, ask for only a few files at once to avoid filling up
your mail spool area or mailbox.

------------------------------------------------------------
Randall Smith and Temple Smith
Molecular Biology Computer Research Resource, Galleria Level 1
Dana-Farber Cancer Institute and School of Public Health
Harvard University
44 Binney St., Boston MA 02115 USA
(617)732-3746
INTERNET: rsmith@mbcrr.harvard.edu
BITNET: rsmith%mbcrr@husc6.bitnet
------------------------------------------------------------