<BACKD@QUCDN.QueensU.CA> (02/05/91)
Is there a database of consensus transcription factor recognition sequences that can be used to scan a promoter sequence? I find that running a promoter sequence against the entire database at low enough "stringency" to identify short regulatory elements generates a very large number of irrelevant matches. If not, are there any moves toward compiling and maintaining such a database?
jeh@REPLICON.LANL.GOV (Jamie Hayden) (02/06/91)
Forgive me if this is the second time this goes out, but as I have not received a copy of the message I posted...

-------------------------------------------------------------------------

There is such a database of eukaryotic transcription factors, compiled and maintained by Joseph Locker and Gregory Buzard (described in DNA Sequence-J. DNA Sequencing and Mapping, Vol 1, 3-11). The address they give is:

J. Locker
Dept. Pathology
U. of Pittsburgh
Pitt., PA 15261 USA
(412) 648 9508
(412) 648 1916 fax

GenBank's policy has been to not store consensus sequences, but rather the sequences from which such consensus sequences were derived. Times change, however, and we are currently evaluating the idea of creating such a division within the database, a subsection of promoters and other important, conserved regions.

Jamie Hayden
Annotator Coordinator
GenBank

----------------------------------------------------------------------
toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/06/91)
In article <9102051618.AA02080@replicon.lanl.gov> jeh@REPLICON.LANL.GOV (Jamie Hayden) writes:
>
>GenBank's policy has been to not store consensus sequences, but rather
>the sequences from which such consensus sequences were derived. Times
>change, however, and we are currently evaluating the idea of creating
>such a division within the database, a subsection of promoters and other
>important, conserved regions.

GenBank already has the feature table; this would be completely sufficient to satisfy everyone IF ONLY IT WERE KEPT UP TO DATE. There are already hundreds of binding sites of various kinds, but only certain kinds are recorded in the database. You don't need to go looking to create yet another feature of GenBank; all you need to do is use what is there.

If you create consensus sequences you will be doing a huge disservice to molecular biology by perpetrating this poor method.

>Jamie Hayden
>Annotator Coordinator
>GenBank

Tom Schneider
National Cancer Institute
Laboratory of Mathematical Biology
Frederick, Maryland 21702-1201
toms@ncifcrf.gov
pgil@HISTONE.LANL.GOV (Paul Gilna) (02/07/91)
----- Begin Included Message -----

In article <9102051618.AA02080@replicon.lanl.gov> jeh@REPLICON.LANL.GOV (Jamie Hayden) writes:
>
>GenBank's policy has been to not store consensus sequences, but rather
>the sequences from which such consensus sequences were derived. Times
>change, however, and we are currently evaluating the idea of creating
>such a division within the database, a subsection of promoters and other
>important, conserved regions.

And Tom Schneider is quick to reply:

If you create consensus sequences you will be doing a huge disservice to molecular biology by perpetrating this poor method.

Tom, we would agree, at least in part. We have not consistently tracked such sequences in the past, indeed in part because of community responses similar in content to yours. It is not our intention at this point to place these sequences in the Flat File release of the database. Further, were this to be the case, we would give you and other members of the community ample opportunity to express their concern (or support).

The primary purpose for these sequences is to aid us in the integrity checks we place upon incoming submitted data. We are currently completing software development on a module to check for the presence of vector sequence in submitted data. A preliminary version already in use has been enlightening! Such a module looks to a sub-database of vector sequences to seek out similarities. One can quickly see that this method could be extended to other sub-databases, such as one containing consensus or signal sequences.

So far, this has been discussed as an internal feature. I agree that the presence of such sequences in the public release database might add clutter to the conventional type of search. However, there is clearly a usefulness to the community in being able to search their own sequence data for specific signal sequences.

*If* we place these data in the public domain, you may be assured that we will do so in a robust, parseable, unambiguous, clutter-free, and well-heralded manner.

Regards,
--paul

Paul Gilna
GenBank, Los Alamos
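
A minimal sketch of the kind of check described above: comparing a submission against a small sub-database of vector sequences. The post does not say how the GenBank module works, so the exact k-mer matching used here, the names `kmers` and `screen`, and the toy sequences are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(submission, vectors, k=12):
    """Report where a submission shares exact k-mers with any vector.

    Returns {vector_name: [0-based start positions in the submission]}.
    Only the forward strand is checked; a real screen would also need
    the reverse complement and some tolerance for mismatches.
    """
    sub = submission.upper()
    report = {}
    for name, vector_seq in vectors.items():
        vector_kmers = kmers(vector_seq, k)
        positions = [i for i in range(len(sub) - k + 1)
                     if sub[i:i + k] in vector_kmers]
        if positions:
            report[name] = positions
    return report


# Toy data, invented for this example (not a real vector record).
vectors = {"toy_vector": "GAATTCGAGCTCGGTACCCGGGGATCC"}
submission = "ACGTACGT" + "GAGCTCGGTACCCG" + "TTTTGGGG"

print(screen(submission, vectors))
```

The same loop would work against a sub-database of signal sequences, although short or ambiguous signals would produce many chance hits under exact matching.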
<BACKD@QUCDN.QueensU.CA> (02/07/91)
Thanks to all of those who responded to my request. This is a brief summary of my E-mail on the subject:

- David Ghosh (Nucl. Acids Res. 18, 1749-1756) has developed a relational database of transcription factor recognition elements called TFD. The TFD database can be obtained in one of two ways:

  1. Anonymous FTP from NCBI.NLM.NIH.GOV in /repository/TFD
  2. From the EMBL mailer. Send a message to the mailer as follows:

     Address: NETSERVER@EMBL.BITNET

     HELP
     HELP TFD
     DIR TFD

     The mailer will return the appropriate help files to get you started.

- Dan Prestridge from Los Alamos Nat'l Labs writes that he has a program, SIGNAL SCAN, available in MS-DOS or UNIX format, that will scan a sequence file against the TFD database. The database is included on the disk. Mail Dan a request for either format at DXP%LIFE@LANL.GOV. The UNIX version can be E-mailed back; the MS-DOS version travels better on a diskette, so be prepared to send along two formatted discs.

- Those operating in the UNIX world have a third option. MBCRR.Harvard.Edu have a UNIX program called DYNAMIC that uses the TFD database directly. D. Ghosh refers to the program in his NAR paper.

- Be aware that the TFD database is about 500K.

- Good luck and thanks for all the help.

Don Back, BACKD@QUCDN.QueensU.CA.BITNET (613) 545-2982
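
For readers who want to see what scanning a promoter against a list of recognition elements involves, here is a minimal sketch. It is not SIGNAL SCAN or DYNAMIC, and it does not read the TFD files. The two motifs, the promoter string, and the names `consensus_to_regex` and `scan` are assumptions chosen for illustration. The only real convention used is the IUPAC nucleotide ambiguity code.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex fragment."""
    return "".join(IUPAC[base] for base in consensus.upper())


def scan(sequence, motifs):
    """Return sorted (0-based position, motif name, matched text) hits.

    Forward strand only. A lookahead is used so that overlapping
    matches are all reported.
    """
    seq = sequence.upper()
    hits = []
    for name, consensus in motifs.items():
        pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
        for match in pattern.finditer(seq):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)


# Illustrative motifs and sequence, not taken from TFD.
MOTIFS = {"TATA-like": "TATAWAW", "CCAAT-like": "CCAAT"}
promoter = "GGCCAATCTGCTATAAATAGGC"

for hit in scan(promoter, MOTIFS):
    print(hit)
```

Short motifs will match by chance in any long sequence, which is the flood of irrelevant matches described in the original question. Restricting the motif list to transcription factor elements reduces the volume but does not remove chance hits.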
kristoff@genbank.bio.net (David Kristofferson) (02/07/91)
> integrity checks placed upon incoming submitted data. We are
> currently completing software development on a module to check
> for the presence of vector sequence in submitted data.
> A preliminary version already in use has been enlightening!
> Such a module looks to a sub database of vector sequences
> to seek out similarities. One can quickly see that this method
> could be extended to other sub-databases, such as one
> containing consensus or signal sequences.

Paul,

I would assume, however, that it would be easier to pull out vector sequences than consensus sequences as one often needs to use different searching methods for these latter purposes depending upon the extent of the ambiguity in the consensus sequence. I think that it might be dangerous for GenBank staff to try to make those kinds of assessments. Instead I would opt for leaving calls on consensus sequences to GenBank Curators who had specialized knowledge in the field. I would not even rely to a great extent on extracting such information from published literature since there is often considerable divergence of opinion here.

Now to Tom,

Ouch!! Are we bad or what 8-) 8-)?? The thing that I always find entertaining about this field is that when I was at the NCBI developers meeting last July, GenBank was being excoriated **for** including annotations for things like promoter sequences precisely for reasons along the lines I mentioned above. It did not appear then that NCBI intended to include this type of information in their up and coming GenInfo database, preferring instead a less elaborately annotated entry. However, the latest version I have heard indicated that their position was under revision due to input from yet other sections of the community.

Never a dull moment, is there? 8-) In the absence of a concrete consensus, GenBank could spend a considerable amount of time doing and then undoing things.

Dave
toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/08/91)
In article <Feb.6.16.08.05.1991.11751@genbank.bio.net> kristoff@genbank.bio.net (David Kristofferson) writes:
> Ouch!! Are we bad or what 8-) 8-)?? The thing that I always
>find entertaining about this field is that when I was at the NCBI
>developers meeting last July, GenBank was being excoriated **for**
>including annotations for things like promoter sequences precisely for
>reasons along the lines I mentioned above. It did not appear then
>that NCBI intended to include this type of information in their up and
>coming GenInfo database, preferring instead a less elaborately
>annotated entry. However, the latest version I have heard indicated
>that their position was under revision due to input from yet other
>sections of the community.
>
> Never a dull moment, is there? 8-) In the absence of a
>concrete consensus, GenBank could spend a considerable amount of time
>doing and then undoing things.

Well! If last July they were objecting to annotations of promoter LOCATIONS then some people have gone completely bonkers (besides me! :-). The coordinates of many binding sites are highly well specified, and it would be a great service to molecular biologists to record the ones for which there is experimental evidence (along with what the evidence is).

What should NOT be recorded in any way is the sequence of the site (because that's redundant) nor any consensus derived from the sites. These things are total guesses for the most part. The best example I can give you is the T7 promoters I work on. We know the coordinates at which the T7 RNA polymerase initiates transcription, down to the base. What is wrong is to assume that the patterns around that point are in fact at all associated with transcription! Now, that may seem to be a silly statement, but bear with me. Consider position -3 relative to the initiation base (0). It is always an A, and a consensus would place an A there. But it turns out that extremely strong normal promoters can be had which have any of the other 3 bases there. SO THAT A IS NOT A PART OF THE PROMOTER! To read more about this, look at NAR 17: 659-674 (1989). Consensus bites the dust.

Also, only one point on the site should be specified. Anything else is interpretation. Just where does a ribosome binding site end? That's an open scientific question, not one to be decided arbitrarily. The solution is simply to record one point as the 'zero coordinate', perhaps with an orientation.

Another example is the initiation codon in E. coli. A few percent (5 or 7?) of the time it is GTG; most of the time it is ATG. The consensus model THROWS OUT DATA and ignores the GTGs, even though they exist. So no matter how you define consensus, anything less than the frequencies will require data-destruction. So I object to having consensus in GenBank because it is a horrible model.

The bottom line: only experimentally verified data should be stored in GenBank. If you don't, you'll have to fix it later (and be red in the face).

I understand that NCBI is going for things that they can do WELL, and that they are not averse to doing more later. In the longer run, we will want these 'signals' because it gets so hard to do surveys as the base gets bigger.

>Dave

Tom Schneider
National Cancer Institute
Laboratory of Mathematical Biology
Frederick, Maryland 21702-1201
toms@ncifcrf.gov
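
The initiation codon example can be made concrete with a small sketch. The counts below (93 ATG, 7 GTG) are invented to match the "few percent" figure in the post and are not measured data. The function names are hypothetical. The per-position information measure (2 bits minus the Shannon entropy, with no small-sample correction) is added here as one possible way to summarize frequencies and is not something the post itself specifies.

```python
import math
from collections import Counter

BASES = "ACGT"


def column_frequencies(sites):
    """Per-position base frequencies for a list of equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in BASES})
    return freqs


def consensus(freqs):
    """Most frequent base at each position; minority bases are dropped."""
    return "".join(max(BASES, key=lambda base: col[base]) for col in freqs)


def information_bits(col):
    """2 bits minus the Shannon entropy of one frequency column."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy


# Invented counts for illustration only.
sites = ["ATG"] * 93 + ["GTG"] * 7
freqs = column_frequencies(sites)

print("consensus:", consensus(freqs))
print("first position frequencies:", freqs[0])
print("bits per position:", [round(information_bits(c), 2) for c in freqs])
```

The consensus comes out as ATG and carries no trace of the GTG sites, while the frequency table keeps them (G at 0.07 in the first position). That is the data loss the post objects to.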
kristoff@GENBANK.BIO.NET (Dave Kristofferson) (02/08/91)
> The
> coordinates of many binding sites are highly well specified, and it would be a
> great service to molecular biologists to record the ones for which there is
> experimental evidence (along with what the evidence is).

I don't believe that NCBI felt a collection of promoter data was inappropriate. Instead I was under the impression at their July meeting that they believed this should be included in one of the "boutique" databases (mentioned in my earlier reply today to Gribskov) which would be layered on top of GenInfo instead of being part of GenInfo itself. Obviously I am not their spokesman, so perhaps Jim Ostell or someone else at NCBI can elaborate on this further.

Dave Kristofferson