[bionet.software] Prosite Index

clark@mshri.utoronto.ca (06/05/90)

   The Prosite database, compiled by Amos Bairoch at the Centre Medical
Universitaire in Geneva, Switzerland, is a great resource for identifying
short patterns (or motifs) that are typical of proteins of a certain class
or function. Release 5 of the database contains over 300 patterns. The
database has been set up to make it relatively easy to write a program that
can read the patterns and search for them in a protein of unknown function.
It has not been designed to answer the question "What patterns have been
identified for this protein of known function?", even though the database
contains that information. To make these data easily accessible, I have
written the program Proindex which creates an index of all the proteins in
the database. For example, somebody interested in proteolytic enzymes could
immediately find that there are five references to "protease" and eight to
"proteases", then look up their patterns and associated information.

   A small part of the index file is shown here as an example:

ANHYDRASES      00146 Carbonic anhydrases signature.
ANION           00192 Anion exchangers family signature 1.
ANION           00192 Anion exchangers family signature 2.
ANNEXINS        00195 Annexins phospholipid/calcium-binding domain signature.
ANTENNAPEDIA-TY 00032 'Homeobox' antennapedia-type protein signature.
ANTIGEN         00265 Proliferating cell nuclear antigen signature.
ARAC            00040 Bacterial activator proteins, araC family signature.
ARGINASE        00135 Arginase signature 2.
ARGINASE        00135 Arginase signature 1.
ARRESTIN        00267 Arrestin signature.

   Each line is at most 79 characters long and has three fields. The first
field, 15 characters wide, is the keyword; the next, 5 characters wide, is
the pointer to the Prosite documentation; and the last is as much of the
pattern description from the "DE" line as will fit. The keywords are taken
from the DE line of the data file. (Uninformative words such as "site",
"signature", and "family" are screened out during the indexing process.)
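
   As a rough illustration of this record layout, here is a short Python
sketch that extracts keywords from a DE description and builds and parses
index lines. It is not the Proindex source (which is VAX Fortran). The
function names, the single-space separators between fields (inferred from
the sample listing), the skipping of bare numbers, and any stop words
beyond the three named above are assumptions.

```python
import string

KEYWORD_WIDTH = 15   # keyword field width, from the description above
POINTER_WIDTH = 5    # documentation pointer field width
MAX_LINE = 79        # maximum length of an index line

# Assumption: only these three screened-out words are named in the post;
# the real list used by Proindex is longer and is not reproduced here.
STOP_WORDS = {"SITE", "SIGNATURE", "FAMILY"}


def keywords_from_de(description):
    """Return uppercased candidate keywords from a DE description."""
    keywords = []
    for word in description.split():
        # Strip surrounding punctuation such as quotes, commas, periods.
        word = word.strip(string.punctuation).upper()
        # Skip empty tokens, bare numbers, and screened-out words.
        if not word or word.isdigit() or word in STOP_WORDS:
            continue
        # Keywords longer than the field are truncated to fit.
        keywords.append(word[:KEYWORD_WIDTH])
    return keywords


def format_entry(keyword, pointer, description):
    """Build one index line; the description is cut off at MAX_LINE."""
    line = f"{keyword:<{KEYWORD_WIDTH}} {pointer} {description}"
    return line[:MAX_LINE]


def parse_entry(line):
    """Split an index line into (keyword, pointer, description)."""
    keyword = line[:KEYWORD_WIDTH].rstrip()
    start = KEYWORD_WIDTH + 1            # skip the separating space
    pointer = line[start:start + POINTER_WIDTH]
    description = line[start + POINTER_WIDTH + 1:]
    return keyword, pointer, description


if __name__ == "__main__":
    de = "Carbonic anhydrases signature."
    for kw in keywords_from_de(de):
        print(format_entry(kw, "00146", de))
```

   The pointer is kept as a string so that its leading zeros survive a
round trip through format_entry and parse_entry.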

   The source for the program, written in VAX Fortran, the associated
files, and a sorted index for Prosite release 5 have been deposited with
EMBL. Anyone who would like copies of these and who doesn't have access to
the EMBL file server can ask me for them directly.

   To my knowledge, this is the second program to take advantage of the
Prosite database, the first being Kay Hofmann's program for converting it
to a format that can be used by the GCG programs. I expect there will be
many others as this gem of a database becomes more widely distributed.
(If others are already available, please let us know about them!)


Steve Clark

clark@mshri.utoronto.ca  (Internet)
clark@utoroci            (Netnorth/Bitnet)