[bionet.molbio.genbank] Eukaryotic cis-acting transcription regulatory elements

<BACKD@QUCDN.QueensU.CA> (02/05/91)

Is there a database of consensus transcription factor recognition sequences
that can be used to scan a promoter sequence? I find running a promoter
sequence against the entire database at low enough "stringency" to
identify short regulatory elements generates a very large number of
irrelevant matches. If not, are there any moves toward compiling and maintaining
such a database?

jeh@REPLICON.LANL.GOV (Jamie Hayden) (02/06/91)

forgive me if this is the second time this goes out, but as I have
not received a copy of the message I posted...

-------------------------------------------------------------------------
There is such a database of Eukaryotic transcription factors, compiled
and maintained by Joseph Locker and Gregory Buzard (described in
DNA Sequence-J. DNA Sequencing and Mapping, Vol 1, 3-11).
The address they give is:
J.Locker
Dept. Pathology
U. of Pittsburgh
Pitt., PA  15261  
USA
(412) 648 9508
(412) 648 1916  fax

GenBank's policy has been to not store consensus sequences, but rather
the sequences from which such consensus sequences were derived.  Times
change, however, and we are currently evaluating the idea of creating
such a division within the database, a subsection of promoters and other
important, conserved regions.

Jamie Hayden
Annotator Coordinator
GenBank
----------------------------------------------------------------------

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/06/91)

In article <9102051618.AA02080@replicon.lanl.gov> jeh@REPLICON.LANL.GOV
(Jamie Hayden) writes:
>
>GenBank's policy has been to not store consensus sequences, but rather
>the sequences from which such consensus sequences were derived.  Times
>change, however, and we are currently evaluating the idea of creating
>such a division within the database, a subsection of promoters and other
>important, conserved regions.

GenBank already has the feature table; this would be completely sufficient to
satisfy everyone IF ONLY IT WERE KEPT UP TO DATE.  There are already hundreds
of binding sites of various kinds, but only certain kinds are recorded in the
database.  You don't need to go looking to create yet another feature of
GenBank, all you need to do is use what is there.

If you create consensus sequences you will be doing a huge disservice to
molecular biology by perpetrating this poor method.

>Jamie Hayden
>Annotator Coordinator
>GenBank

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

pgil@HISTONE.LANL.GOV (Paul Gilna) (02/07/91)

----- Begin Included Message -----


In article <9102051618.AA02080@replicon.lanl.gov> jeh@REPLICON.LANL.GOV
(Jamie Hayden) writes:
>
>GenBank's policy has been to not store consensus sequences, but rather
>the sequences from which such consensus sequences were derived.  Times
>change, however, and we are currently evaluating the idea of creating
>such a division within the database, a subsection of promoters and other
>important, conserved regions.

And Tom Schneider is quick to reply:


If you create consensus sequences you will be doing a huge disservice to
molecular biology by perpetrating this poor method.


	Tom, we would agree, at least in part. We have not consistently
	tracked such sequences in the past, indeed in part because of
	community responses similar in content to yours. It is not our
	intention at this point to place these sequences in the Flat
	File release of the database. Further, were this to be the
	case, we would give you and other members of the community
	ample opportunity to express their concern (or support). The
	primary purpose for these sequences is to aid us in our
	integrity checks placed upon incoming submitted data. We are
	currently completing software development on a module to check
	for the presence of vector sequence in submitted data.
	A preliminary version already in use has been enlightening!
	Such a module looks to a sub database of vector sequences
	to seek out similarities. One can quickly see that this method
	could be extended to other sub-databases, such as one
	containing consensus or signal sequences.
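
The kind of check described above, comparing a submission against a small
sub-database, can be sketched as a shared-word screen. This is only an
illustration under stated assumptions, not GenBank's module: the function
names, the word length of 12, and the short placeholder sequence standing in
for a vector entry are all hypothetical, and only the forward strand is
compared.

```python
def kmers(seq, k):
    """Return the set of all length-k words in seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_for_vector(submission, vectors, k=12):
    """Count, per vector name, the distinct k-mers shared with the submission.

    Vectors with no shared k-mer are left out of the result.
    """
    sub_words = kmers(submission, k)
    hits = {}
    for name, vec_seq in vectors.items():
        shared = len(sub_words & kmers(vec_seq, k))
        if shared:
            hits[name] = shared
    return hits


# Hypothetical sub-database: one placeholder sequence, not a real vector record.
toy_vectors = {
    "toy_vector": "GAATTCGAGCTCGGTACCCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTT",
}

# A submission that carries 30 bases of the placeholder vector in the middle.
contaminated = "ACGTACGTTTGACA" + toy_vectors["toy_vector"][10:40] + "TTGACCATGA"
clean = "ACGT" * 10

hits = screen_for_vector(contaminated, toy_vectors)
clean_hits = screen_for_vector(clean, toy_vectors)
```

Exact shared words suit vector screening because vector contamination is
usually a long, near-identical stretch. Short, degenerate signals would need
a different matching method.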

	So far, this has been discussed as an internal feature. I agree
	that the presence of such sequences in the public release
	database might add clutter to the conventional type of search.
	However, there is clearly a usefulness to the community in
	being able to search their own sequence data for specific signal
	sequences. *If* we place these data in the public domain, you
	may be assured that we will do so in a robust, parseable,
	unambiguous, clutter-free, and well-heralded manner.


Regards,

--paul

Paul Gilna
GenBank, Los Alamos

<BACKD@QUCDN.QueensU.CA> (02/07/91)

Thanks to all of those who responded to my request.
This is a brief summary of my E-mail on the subject:

- David Ghosh (Nucl. Acids Res. 18, 1749-1756) has developed a relational
 database of transcription factor recognition elements called TFD. The
 TFD database can be obtained in one of two ways:
   1. Anonymous FTP from NCBI.NLM.NIH.GOV in /repository/TFD
   2. From EMBL mailer. Send a message to the mailer as follows
      Address: NETSERVER@EMBL.BITNET
               HELP
               HELP TFD
               DIR TFD
     The mailer will return the appropriate help files to get you started.

- Dan Prestridge from Los Alamos Nat'l Labs writes that he has a program,
  SIGNAL SCAN available in MS-DOS or UNIX format that will scan a sequence
  file against the TFD database. The database is included on the disk.
  Mail Dan a request for either format at DXP%LIFE@LANL.GOV. The UNIX
  version can be E-mailed back, the MS-DOS version travels better on a
  diskette, so be prepared to send along two formatted discs.

- Those operating in the UNIX world have a third option.
  MBCRR.Harvard.Edu has a UNIX program called DYNAMIC that uses the TFD
  database directly. D. Ghosh refers to the program in his NAR paper.

- Be aware the TFD database is about 500K.

- Good luck and thanks for all the help.
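
For readers who want to see what scanning a promoter against a list of
recognition elements involves, here is a minimal sketch. It is not SIGNAL SCAN
or DYNAMIC and does not read the TFD files; the element list is a three-entry
illustration, and the names `scan`, `to_regex` and `revcomp` are hypothetical.
Elements are written in IUPAC ambiguity codes and matched on both strands.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(element):
    """Reverse complement of a sequence or IUPAC pattern."""
    return element.upper().translate(COMPLEMENT)[::-1]


def to_regex(element):
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(IUPAC[base] for base in element.upper())


def scan(sequence, elements):
    """Return (name, strand, 1-based start, matched text), sorted by start."""
    sequence = sequence.upper()
    hits = []
    for name, element in elements.items():
        for strand, pattern in (("+", element), ("-", revcomp(element))):
            # A lookahead lets overlapping occurrences all be reported.
            regex = re.compile("(?=(" + to_regex(pattern) + "))")
            for match in regex.finditer(sequence):
                hits.append((name, strand, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[2])


# Illustrative elements only; a real run would load them from a database.
elements = {"TATA_box": "TATAAA", "CAAT_box": "CCAAT", "GC_box": "GGGCGG"}
promoter = "GGGCGGTTCCAATAGCTATAAAGG"
found = scan(promoter, elements)
for hit in found:
    print(hit)
```

An exact six-base element is expected by chance about once every 4096 bases
on each strand of random sequence, so hits from a scan like this are
candidates to check, not evidence of function. A palindromic element is
reported once per strand at the same position.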

Don Back, BACKD@QUCDN.QueensU.CA.BITNET
(613) 545-2982

kristoff@genbank.bio.net (David Kristofferson) (02/07/91)

>	integrity checks placed upon incoming submitted data. We are
>	currently completing software development on a module to check
>	for the presence of vector sequence in submitted data.
>	A preliminary version already in use has been enlightening!
>	Such a module looks to a sub database of vector sequences
>	to seek out similarities. One can quickly see that this method
>	could be extended to other sub-databases, such as one
>	containing consensus or signal sequences.


Paul,

	I would assume, however, that it would be easier to pull out
vector sequences than consensus sequences as one often needs to use
different searching methods for these latter purposes depending upon
the extent of the ambiguity in the consensus sequence.  I think that
it might be dangerous for GenBank staff to try to make those kinds of
assessments.  Instead I would opt for leaving calls on consensus
sequences to GenBank Curators who had specialized knowledge in the
field.  I would not even rely to a great extent on extracting such
information from published literature since there is often
considerable divergence of opinion here.


Now to Tom,

	Ouch!!  Are we bad or what 8-) 8-)??  The thing that I always
find entertaining about this field is that when I was at the NCBI
developers meeting last July, GenBank was being excoriated **for**
including annotations for things like promoter sequences precisely for
reasons along the lines I mentioned above.  It did not appear then
that NCBI intended to include this type of information in their up and
coming GenInfo database, preferring instead a less elaborately
annotated entry.  However, the latest version I have heard indicated
that their position was under revision due to input from yet other
sections of the community.

	Never a dull moment, is there? 8-)  In the absence of a
concrete consensus, GenBank could spend a considerable amount of time
doing and then undoing things.

Dave

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/08/91)

In article <Feb.6.16.08.05.1991.11751@genbank.bio.net> kristoff@genbank.bio.net (David Kristofferson) writes:
>	Ouch!!  Are we bad or what 8-) 8-)??  The thing that I always
>find entertaining about this field is that when I was at the NCBI
>developers meeting last July, GenBank was being excoriated **for**
>including annotations for things like promoter sequences precisely for
>reasons along the lines I mentioned above.  It did not appear then
>that NCBI intended to include this type of information in their up and
>coming GenInfo database, preferring instead a less elaborately
>annotated entry.  However, the latest version I have heard indicated
>that their position was under revision due to input from yet other
>sections of the community.
>
>	Never a dull moment, is there? 8-)  In the absence of a
>concrete consensus, GenBank could spend a considerable amount of time
>doing and then undoing things.

Well!  If last July they were objecting to annotations of promoter LOCATIONS
then some people have gone completely bonkers (besides me! :-).  The
coordinates of many binding sites are very well specified, and it would be a
great service to molecular biologists to record the ones for which there is
experimental evidence (along with what the evidence is).  What should NOT be
recorded in any way is the sequence of the site (because that's redundant) or
any consensus derived from the sites.  These things are total guesses for the
most part.  The best example I can give you is the T7 promoters I work on.  We
know the coordinates at which the T7 RNA polymerase initiates transcription,
down to the base.  What is wrong is to assume that the patterns around that
point are in fact at all associated with transcription!  Now, that may seem to
be a silly statement, but bear with me.  Consider position -3 relative to the
initiation base (0).  It is always an A, and a consensus would place an A
there.  But it turns out that extremely strong normal promoters can be had
which have any of the other 3 bases there.  SO THAT A IS NOT A PART OF THE
PROMOTER!  To read more about this, look at NAR 17:  659-674 (1989).  Consensus
bites the dust.

Also, only one point on the site should be specified.  Anything else
is interpretation.  Just where does a ribosome binding site end?  That's
an open scientific question, not one to be decided arbitrarily.  The solution
is simply to record one point as the 'zero coordinate', perhaps with
an orientation.
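
A record of this kind can be sketched as follows. This is an assumed layout,
not a GenBank feature-table format: the class name, field names, accession
and evidence text are hypothetical. The site is stored as one zero coordinate
plus an orientation, and positions such as -3 are worked out from those two
values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteAnnotation:
    accession: str    # identifier of the database entry (hypothetical here)
    zero: int         # 1-based entry coordinate of the site's zero base
    orientation: int  # +1 if the site reads along the entry, -1 if reversed
    evidence: str     # what experiment located the zero base

    def to_entry(self, relative):
        """Map a site-relative position (0 is the zero base) to an entry coordinate."""
        return self.zero + self.orientation * relative


# Hypothetical promoter whose initiation base sits at entry coordinate 1000.
forward = SiteAnnotation("HYPOTHETICAL01", 1000, +1, "transcript start mapped")
reverse = SiteAnnotation("HYPOTHETICAL01", 1000, -1, "transcript start mapped")

print(forward.to_entry(-3))  # 997
print(reverse.to_entry(-3))  # 1003
```

Nothing in the record says where the site ends, so that question stays open
for whoever analyses the entries.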

Another example is the initiation codon in E. coli.  A few percent (5 or 7?) of
the time it is GTG; most of the time it is ATG.  The consensus model THROWS OUT
DATA and ignores the GTGs, even though they exist.  So no matter how you define
consensus, anything less than the frequencies will require data-destruction.
So I object to having consensus in GenBank because it is a horrible model.
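
The data loss is easy to see in a small sketch. The counts below are made up
to mirror the "few percent" figure (19 ATG and 1 GTG, i.e. 5% GTG); they are
not survey data, and the function names are hypothetical.

```python
from collections import Counter


def column_frequencies(sites):
    """Per-position base frequencies for a list of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned to the same length")
    total = len(sites)
    return [
        {base: count / total for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]


def consensus(sites):
    """Most common base at each position; minority bases are dropped."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


# Illustrative counts only: 19 ATG starts and 1 GTG start.
sites = ["ATG"] * 19 + ["GTG"]

freqs = column_frequencies(sites)
print(freqs[0])          # position 1 keeps both A and G with their frequencies
print(consensus(sites))  # ATG
```

The frequency table still shows that G occurs at the first position. The
consensus string has no trace of it.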

The bottom line:  only experimentally verified data should be stored in
GenBank.  If you don't, you'll have to fix it later (and be red in the face).

I understand that NCBI is going for things that they can do WELL, and that
they are not averse to doing more later.  In the longer run, we will want
these 'signals' because it gets so hard to do surveys as the base gets bigger.

>Dave

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/08/91)

> The
> coordinates of many binding sites are very well specified, and it would be a
> great service to molecular biologists to record the ones for which there is
> experimental evidence (along with what the evidence is).

I don't believe that NCBI felt a collection of promoter data was
inappropriate.  Instead I was under the impression at their July
meeting that they believed this should be included in one of the
"boutique" databases (mentioned in my earlier reply today to Gribskov)
which would be layered on top of GenInfo instead of being part of
GenInfo itself.  Obviously I am not their spokesman, so perhaps Jim
Ostell or someone else at NCBI can elaborate on this further.

Dave Kristofferson