[bit.listserv.info-gcg] Prosite

BAIROCH@CGECMU51.BITNET (Amos Bairoch) (02/07/90)

<        I, too, am interested in consensus sequence databases. I have a
<document from EMBL called "PROSITE: A dictionary of protein sites and
<patterns" which seems to have quite a lot of work put into it, but it has
<two major problems. First, it is hardcopy (very disappointing for something
<from the EMBL biocomputing group), and second, it has no table of contents
<or index. So far I haven't managed to find a listing for signal peptides.
<
<        If anyone knows of a source of consensus sequence information in
<computer-readable form, could they please post a message to this list?
<
<
<Stephen Clark
<
<clark@mshri.utoronto.ca  (Internet)
<sinai@utoroci            (Netnorth/Bitnet)


In November I posted to this list (as well as other lists) a news
bulletin explaining that, while release 4 of PROSITE is available
in printed form, future releases will be distributed in
computer-readable form.

In fact, PROSITE has been available in computer-readable form all along,
in my sequence analysis package (PC/Gene). But since I have now decided to
make the data bank public domain, I had to define a new distribution
format, and it was not possible to do so for release 4.

So before flaming on the fact that it is "very disappointing for something
from the EMBL biocomputing group"  it would be nice if the info that's
available on the net was read. Furthermore the introduction of the PROSITE
book says that:

"This dictionary will very soon be available, on-line, in Europe on the
 EMBL Data Library file server, and in the U.S.A. on the GenBank on-line
 service computer facility."

---------------------------------------------------------------------

<it has no table of contents

There is a table of contents (pages 5 to 9).

---------------------------------------------------------------------

<So far I haven't managed to find a listing for signal peptides.

And you will never see one in PROSITE: signal peptides are not found using
consensus patterns, but using a matrix (profile) approach, as implemented
in the very reliable method of von Heijne [1], which is available in the
majority of sequence analysis packages.

[1] Von Heijne G.
    A new method for predicting signal sequence cleavage sites.
    Nucleic Acids Res. 14:4683-4690(1986).

In addition to von Heijne's paper, three other publications describe
methods for finding signal peptides that are similar to, or based on,
that method.

Folz R.J., Gordon J.I.
Computer-assisted predictions of signal peptidase processing sites.
Biochem. Biophys. Res. Commun. 146:870-877(1987).

Pascarella S., Bossa F.
CLEAVAGE: a microcomputer program for predicting signal sequence cleavage
sites.
CABIOS 5:53-54(1989).

Popowicz A.M., Dash P.F.
SIGSEQ: a computer program for predicting signal sequence cleavage sites.
CABIOS 4:405-406(1988).
The program for this method is available on the EMBL file server
(get dos_software:SIGSEQ$.UUE)
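
To make the pattern versus matrix (profile) distinction above concrete,
here is a minimal Python sketch. It is not von Heijne's published matrix
or any of the programs cited: the aligned sites, the three-residue width,
the uniform background frequencies, and all names (TOY_SITES, TOY_PATTERN,
build_weight_matrix, best_window) are assumptions made up for illustration.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Invented, pre-aligned example sites (NOT real signal peptide data).
TOY_SITES = ["AVA", "ALA", "AVA", "SLA", "AVG"]

# A consensus pattern that accepts exactly the residues seen in TOY_SITES.
TOY_PATTERN = re.compile("[AS][VL][AG]")


def build_weight_matrix(aligned_sites, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Background frequencies are assumed uniform (1/20), a simplification.
    """
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_sites) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in aligned_sites]
        matrix.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score_windows(sequence, matrix):
    """Return (offset, score) for every window as wide as the matrix."""
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Unknown residue codes contribute a neutral score of 0.
        score = sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(window))
        results.append((start, score))
    return results


def best_window(sequence, matrix):
    """Return the (offset, score) pair with the highest score."""
    return max(score_windows(sequence, matrix), key=lambda item: item[1])


TOY_MATRIX = build_weight_matrix(TOY_SITES)
```

A pattern gives a yes/no answer, while a matrix ranks every window. On
the made-up sequence "MKTLLAIAQE", TOY_PATTERN.search() finds nothing
because I is not an allowed residue at the second position, yet
best_window() still ranks the window at offset 5 (AIA) highest. A real
method would then need a score threshold, which this sketch leaves out.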

---------------------------------------------------------------------

<        If anyone knows of a source of consensus sequence information in
<computer-readable form, could they please post a message to this list?

PROSITE release 5.0 will be available on the EMBL file server sometime
in March.

*****************************************************************************
* Amos Bairoch                * Email: bairoch@cgecmu51                     *
* Dept. Medical Biochemistry  * Tel  : +(41 22) 61 84 92                    *
* CMU                         ***********************************************
* 1, rue Michel Servet        *                                             *
* 1211 Geneva 4               * H(2)O is hot water, CO(2) is cold water     *
* Switzerland                 * --High school chemistry exam response-      *
*****************************************************************************

clark@MSHRI.UTORONTO.CA (02/07/90)

   Amos Bairoch  (bairoch@cgecmu51), author/compiler/distributor of PROSITE,
writes (quoting me):

/<        I, too, am interested in consensus sequence databases. I have a
/<document from EMBL called "PROSITE: A dictionary of protein sites and
/<patterns" which seems to have quite a lot of work put into it, but it has
/<two major problems. First, it is hardcopy (very disappointing for something
/<from the EMBL biocomputing group), and second, it has no table of contents
/<or index. So far I haven't managed to find a listing for signal peptides.
/<
/<Stephen Clark
/
/In November I posted to this list (as well as other lists) a news
/bulletin explaining that, while release 4 of PROSITE is available
/in printed form, future releases will be distributed in
/computer-readable form.

   Yes, I saw your message concerning PROSITE, which is how I got hold of
it. Thank you for sending it to me. Actually, I saw your message on one of
the BIONET lists and don't recall seeing it on info-gcg. It probably came
through when our link was down and bounced. I'll never know how many useful
messages I have missed because of this problem.

/So before flaming on the fact that it is "very disappointing for something
/from the EMBL biocomputing group"  it would be nice if the info that's
/available on the net was read. Furthermore the introduction of the PROSITE
/book says that:
/"This dictionary will very soon be available, on-line, in Europe on the
/ EMBL Data Library file server, and in the U.S.A. on the GenBank on-line
/ service computer facility."

   I'm sorry if you took this comment to be a flame; that wasn't my
intention at all. You have obviously gone to a lot of work to produce this
document that contains loads of useful information. Nevertheless, I _am_
disappointed because, short of keying in all the data by hand, there is no
way for me to hack together something to allow me to search new protein
sequences for any of these motifs. I'll just have to wait for the
computer-readable form. Haven't you heard that molecular biologists are
very impatient?
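
For readers wondering what such a hacked-together motif search might look
like once a machine-readable list exists, here is a minimal Python sketch.
The thread does not describe the distribution format, so the notation is
an assumption: it follows the pattern syntax used in later public PROSITE
releases (x for any residue, [..] for allowed residues, {..} for excluded
residues, (n) or (n,m) for repeats, "-" between positions). The two
entries in MOTIFS are illustrative, and pattern_to_regex and scan are
hypothetical helper names.

```python
import re

# Illustrative entries only; a real run would load them from a data file.
MOTIFS = {
    "N-glycosylation site": "N-{P}-[ST]-{P}",
    "cAMP-dependent kinase phosphorylation site": "[RK](2)-x-[ST]",
}

# One pattern element: a residue spec plus an optional repeat count.
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            part = "."                       # any residue
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"   # excluded residues
        else:
            part = core                      # single residue or [..] set
        if low and high:
            part += "{%s,%s}" % (low, high)
        elif low:
            part += "{%s}" % low
        parts.append(part)
    return "".join(parts)


def scan(sequence, motifs):
    """Return (motif name, 1-based position, matched text), sorted by position."""
    hits = []
    for name, pattern in motifs.items():
        # The lookahead lets overlapping occurrences be reported too.
        regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
        for match in regex.finditer(sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])
```

Calling scan("MKRRASNGTLNPSA", MOTIFS) on this made-up sequence reports
RRAS at position 3 and NGTL at position 7; the later N at position 11 is
skipped because it is followed by P. Short patterns like these match by
chance very often, so a hit is only a hint.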

/---------------------------------------------------------------------
/<it has no table of contents
/
/There is a table of contents (pages 5 to 9).
/---------------------------------------------------------------------

   To flame, or not to flame; the temptation is strong.

   Let me just say, so the people who read this list and haven't seen
PROSITE don't think that I'm a total idiot, that the table of contents does
not mention page numbers for the patterns.

/<So far I haven't managed to find a listing for signal peptides.
/
/And you will never see one in PROSITE: signal peptides are not found using
/consensus patterns, but using a matrix (profile) approach, as implemented
/in the very reliable method of von Heijne [1], which is available in the
/majority of sequence analysis packages.
/
/[1] Von Heijne G.
/    A new method for predicting signal sequence cleavage sites.
/    Nucleic Acids Res. 14:4683-4690(1986).

   Thanks for the reference, I wasn't aware of this work. I'll look it up
in the library as soon as I can, along with the other three you mentioned.
Nevertheless, the distinction between a profile and a pattern is quite
fine (in fact, nonexistent to your average molecular biologist), so it
might be a good idea to include this information in your next release.
Given the fact that so much can be gained from having a reasonable idea
whether a protein is cytoplasmic or secreted/membrane-bound, you might want
to use large letters (and put it on the front cover).

/<        If anyone knows of a source of consensus sequence information in
/<computer-readable form, could they please post a message to this list?
/
/PROSITE release 5.0 will be available on the EMBL file server sometime
/in March.

   Fantastic! Will it be a flat ascii file so that someone can mail it to
me? I don't have access to ftp. :^(   And how about DNA sequences?
_Everybody_ wants to look for regulatory protein binding sites. I vaguely
recall that there was such a thing on BIONET when they were still alive.
Does anyone know if this is right, and if so, is it in the public domain
(considering they were supported by NIH money), or does it belong to
IntelliGenetics? :vQ


Stephen Clark

clark@mshri.utoronto.ca  (Internet)
sinai@utoroci            (Netnorth/Bitnet)

"We should be quite remiss not to emphasize that despite the popularity of
secondary structural prediction schemes, and the almost ritual performance
of these calculations, the information available from this is of limited
reliability. This is true even of the best methods now known, and much more
so of the less successful methods commonly available in sequence analysis
packages. Running a secondary structure prediction on a newly-determined
sequence just because everyone else does so, is to be deplored, and the
fact that the results of such predictions are generally ignored is
insufficient justification for doing and publishing them."
   - Arthur Lesk, 1988