[bionet.molbio.genbank] consensus sequences, motifs, and patterns

gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") (02/08/91)

In response to recent questions about whether GenBank and PIR should
establish and maintain entries that correspond to motifs and patterns in
sequences: 

My opinion is that it is now well established that consensus sequences,
that is, the representation of a pattern as a single sequence with the
majority or plurality residue/base at each position, are an extremely
poor way to represent patterns.  To this extent I agree with Tom
Schneider. 
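
A minimal sketch of why the consensus loses information, assuming a toy
set of aligned sites invented purely for illustration (the sequences and
the function names below are not from any database or library):

```python
from collections import Counter

# Hypothetical aligned sites, made up for this illustration only.
sites = [
    "TATAAT",
    "TATGAT",
    "TACAAT",
    "GATACT",
    "TATATT",
]

def column_counts(aligned):
    """Count the bases observed in each alignment column."""
    return [Counter(column) for column in zip(*aligned)]

def consensus(aligned):
    """Plurality base per column; ties are broken alphabetically."""
    return "".join(
        min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for counts in column_counts(aligned)
    )

def frequency_matrix(aligned, alphabet="ACGT"):
    """Per-column base frequencies, the raw material for a weight matrix."""
    n = len(aligned)
    return [{base: counts[base] / n for base in alphabet}
            for counts in column_counts(aligned)]

print(consensus(sites))  # TATAAT
for row in frequency_matrix(sites):
    print(row)
```

The consensus reports A at the fifth position exactly as it does at the
second, although A occurs in every site at the second position and in
only three of the five sites at the fifth.  The frequency matrix keeps
that distinction.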

I think that a database of patterns represented as alignments or weight
matrices would be VERY valuable. Such a database would be a great
improvement on the current situation where, at best, you have to search
the database to find each sequence that references a given motif, and
then construct alignments de novo. In cases where keywords and
terminology have not been standardized this is quite difficult.  The
superfamily searching mechanism of PSQ is of course valuable, but is not
perfectly adapted to describing patterns that may be much smaller than
the entire sequence, and hence difficult to locate.  In addition, many
protein sequences show enough divergence that even when you know a motif
is present, it is still difficult to get them correctly aligned. 

However, you have to keep in mind that we have not yet heard the final
word on the best way to represent sequence patterns.  I think that it is
therefore critical that any set of patterns maintain a set of pointers
that directly enable you to access the original sequences used to derive
the pattern.  PROSITE is a good example of how this might be done,
although for protein motifs it would be nice if there were a good compact
encoding that could be used to reconstruct aligned sequences. 
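
One way such pointers might look, as a rough sketch only: each pattern
entry stores, for every member, the identifier of the source entry and
the offset of the motif within it, so the aligned segments can be
re-extracted from the original sequences at any time.  The class names,
identifiers, and sequences below are hypothetical, and the sketch
assumes an ungapped motif of fixed width; it is not the PROSITE format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SitePointer:
    accession: str  # hypothetical identifier of the source sequence entry
    start: int      # 0-based offset of the motif within that entry

@dataclass
class PatternEntry:
    name: str
    width: int      # assumption: ungapped motif of fixed width
    members: list   # list of SitePointer

    def aligned_segments(self, sequences):
        """Re-extract the aligned segments from the original sequences."""
        segments = []
        for pointer in self.members:
            sequence = sequences[pointer.accession]
            segment = sequence[pointer.start:pointer.start + self.width]
            if len(segment) != self.width:
                raise ValueError(
                    f"{pointer.accession}: motif runs past end of sequence")
            segments.append(segment)
        return segments

# Toy sequences, invented for illustration only.
sequences = {
    "SEQ_A": "MKGAGSGKSTLV",
    "SEQ_B": "AAGPGSGKTTQA",
}
entry = PatternEntry(
    name="toy_loop",
    width=6,
    members=[SitePointer("SEQ_A", 2), SitePointer("SEQ_B", 2)],
)
print(entry.aligned_segments(sequences))  # ['GAGSGK', 'GPGSGK']
```

Because only pointers are stored, a better pattern representation can
later be rebuilt from the same members without returning to the
literature.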


I guess it is clear that what I'm suggesting is to not add this 
information to the existing sequence entries, but to have either a 
separate section or distinct kind of entry.  This would seem to be 
especially appropriate since it is clearly derived information and
will require a lot of judgement calls in defining the patterns.

Michael Gribskov
gribskov@ncifcrf.gov

kristoff@genbank.bio.net (David Kristofferson) (02/08/91)

Michael,

	What you are describing fits in with the "boutique" database
concept that would be layered upon the Geninfo backbone database which
NCBI is advocating.  This plan makes a lot of sense as long as the
integration is well thought out and executed properly.

	It might be helpful if someone at NCBI would elaborate on
their plans so that the scientific community would get a better idea
of what is planned for the future.  Given the tempest that we had
about the rather small GenBank/EMBL/DDBJ features table change which
caught some people by surprise, it would seem to make sense to alert
people in advance.  To date, this information about NCBI's plans has
been distributed primarily through a developers meeting last July in
Bethesda and via a rather infrequently used mailing list at NLM.  I
will also note that at a recent meeting which I attended in DC it was
apparent that even some of the people who **should** be informed about
these issues were not up to date on developments.  This seems to make
a clear case that a better PR job is needed here.

				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

kristoff@genbank.bio.net (David Kristofferson) (02/08/91)

P.S. - I just saw that Jim Ostell from NCBI put out an announcement on
BIO-SOFTWARE/bionet.software which included a nice description of some
of their development work.  Although his announcement and NCBI's
mailing list are intended primarily for developers, I would hope that
information suitable for the general scientific community could also
be made available in forums such as this, BIONEWS, or
HUMAN-GENOME-PROGRAM.

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/09/91)

In article <9102071611.AA03799@genbank.bio.net> gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") writes:

I agree with Michael for the most part.  The only point where we differ is:

>...since it is clearly derived information and
>will require a lot of judgement calls in defining the patterns.

I think that a list of acceptable criteria can be drawn up to define the
locations.  For example, CAP (CRP) binding sites can be defined genetically, by
DNase footprinting, by methylation protection or interference and by
ethylation phosphate blockage.  Splice junctions are well defined (for the most
part) by comparing DNA to spliced RNA sequences.  The only tricky step is the
final alignment, but in most cases this can be done closely enough not to be a
problem, and the sites could be realigned by a researcher if desired.  Giving
approximate alignments would begin to address the problem Mike brings up about
difficulty in aligning.  So I don't think that judgement calls are so important
- simply list the kind of data used to define the location.  In general the
sequence data would be used only at the last step to get the exact location.
(See my previous note about being fooled as to what is a pattern).  Clearly
using sequence data alone is not a good idea at this stage.  That is, purely
derived information should either be kept out of the database or be marked as
such.  Then a smart program or researcher could simply ignore the guesses.
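
A small sketch of what "list the kind of data used" could mean in
practice.  The evidence labels, record layout, and identifiers here are
assumptions made up for illustration, not an existing database schema:

```python
from dataclasses import dataclass

# Hypothetical labels for the experimental criteria mentioned above.
EXPERIMENTAL_EVIDENCE = frozenset({
    "genetic",
    "footprinting",
    "methylation",
    "ethylation",
    "rna_comparison",
})

@dataclass(frozen=True)
class SiteRecord:
    accession: str       # hypothetical identifier of the source entry
    position: int        # location of the site in that entry
    evidence: frozenset  # kinds of data used to define the location

    @property
    def purely_derived(self):
        """True when no experimental evidence supports the site."""
        return not (self.evidence & EXPERIMENTAL_EVIDENCE)

def experimentally_supported(records):
    """Drop the guesses: keep only sites with some experimental support."""
    return [r for r in records if not r.purely_derived]

# Toy records, invented for illustration only.
records = [
    SiteRecord("SEQ_A", 120, frozenset({"footprinting", "sequence"})),
    SiteRecord("SEQ_B", 45, frozenset({"sequence"})),
    SiteRecord("SEQ_C", 310, frozenset({"genetic"})),
]
print([r.accession for r in experimentally_supported(records)])
# ['SEQ_A', 'SEQ_C']
```

A site defined by sequence similarity alone stays in the list but is
marked, so a program can skip it with a one-line filter.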

As Mike points out, the last word on site definitions is not in yet.  So we
must store pointers to the original raw data locations in the database.

>Michael Gribskov
>gribskov@ncifcrf.gov

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov