jes@mbio.med.upenn.edu (Joe Smith) (08/14/90)
Could someone quote or point me to the policy on review of data submitted to GenBank (and/or the other databases)? Is it entirely up to the submitter, or are there guidelines similar to those for reviewed publications (e.g. at least one reading of both strands)? There must be a policy on this, but it came up last week and I couldn't find the authoritative answer. The specific question was: "under what circumstances is it acceptable for N's to appear in a submitted sequence? (I count 2155 sequences with 'other' bases in GB53)" It would seem to be a difficult issue with regard to submission of 'preliminary' data. <Joe -- Joe Smith University of Pennsylvania jes@mbio.med.upenn.edu Dept. of Biochemistry and Biophysics (215) 898-8348 Philadelphia, PA 19104-6059
pgil%histone@LANL.GOV (Paul Gilna) (08/14/90)
The link between GenBank data and the published literature is an historic artifact which existed simply because this was the only forum from which the databank could obtain the sequence data. As such, it was assumed that these data were "peer reviewed" and complied with editorial directives, such as data determination from both strands. The realities are that the vast majority of sequence data per se are not subjected to any rigorous form of integrity review by the conventional editorial peer review process. In fact, in the past, most of the verification was performed after the fact by the annotation process conducted by databank staff. Our estimates are that as much as 30% of the data appearing in the literature was incorrect. However, this is changing: the combined efforts of the databanks and the journals to encourage direct submissions are beginning to pay off. GenBank currently receives 80% of its data by direct submission. The majority of this data comes in early enough, and is processed fast enough, that any errors spotted by our in-house verification processes can be passed back to the author in time to have the errors corrected for publication. So in a sense the databanks themselves have become an adjunct to the conventional peer-review process. All e-mail submissions are currently passed back to the author for review upon completion of the annotation process (usually within weeks of receipt of the submission). We hope to expand this system soon to encompass all submissions. At the GenBank project it is clear to us that journal publication will not be the primary forum for dissemination of sequence data in the future: journals are already exercising editorial prudence on the quantity of data they choose to print; we estimate that about 10% of the sequence data we receive will not be printed in the published article which reports those data. Accordingly, we have identified the need to apply an even greater degree of data validation than is presently used.
While we currently check entities such as ORFs (and data does not enter the database until it passes these checks), we have embarked on the development of a Sequence Validation Suite of software which will be used to check all sequence data coming in and will incorporate such checks as promoter verification, vector contamination (coming soon), splice site verification, and more. Our recent announcement of the pilot phase of the curator program represents our thoughts on how to enhance further the quality and depth of data integrity by enlisting and supporting the help of the scientific community. The primary message here is that the quality of submitted data is probably better than that of printed data, and is destined to increase through the use of enhanced in-house QC software, through the curator program, and through the use of automated submission processes which will free up our staff to pay more attention to the quality and depth of sequence annotation and integrity. Finally, to answer your specific questions: Our rule of thumb is that ambiguous sequence data must not represent more than 10% of the length of the sequence; beyond that we will split the sequence and note the ambiguous span in the features table. Be aware that we also accept the IUPAC codes for uncertainty in nucleotide sequences; these may have contributed to your "other" figures. We do not at this point apply an editorial standard which dictates that both strands be sequenced to qualify for entry into the database. That datum is easy to collect (through Authorin) and we would prefer to denote the quality attributes rather than use them as editorial standards. With the advent of automated sequencing and image processing techniques, it is conceivable that all data could come with a statistical "confidence index" generated by the base calling software; again this is information that we could easily track and gather.
Though it may seem as if I have digressed in my answer to what might have been conceived as a simple issue, your question touches upon a great many complex issues, many of which have yet to be properly answered. Our approach has been to try to anticipate most of the possible answers in the design of our database systems, and to design in the flexibility to incorporate the answers to as yet unasked questions. Regards, Paul Gilna GenBank Biology Domain Leader GenBank/LANL
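The 10% rule of thumb for ambiguous bases described in the reply above can be sketched in a few lines of Python. This is only an illustration, not GenBank's actual check: the function names are hypothetical, and it assumes that N and every other IUPAC ambiguity code count equally as "ambiguous".

```python
# IUPAC nucleotide codes other than the unambiguous bases A, C, G, T (and U).
# Assumption: all of these count toward the "ambiguous" total.
AMBIGUOUS = set("RYSWKMBDHVN")


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of positions that hold an IUPAC ambiguity code."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in AMBIGUOUS for base in seq) / len(seq)


def exceeds_rule_of_thumb(seq: str, limit: float = 0.10) -> bool:
    """True when ambiguous positions make up more than `limit` of the sequence."""
    return ambiguous_fraction(seq) > limit
```

For example, `exceeds_rule_of_thumb("ACGTN")` is true (one ambiguous base in five), while a 20-base sequence with a single N stays under the limit.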
jes@mbio.med.upenn.edu (Joe Smith) (08/15/90)
Thanks for the information. If I can summarize what you're saying (in a biased sort of way): There are no requirements WRT sequence data quality except as may be required for paper publication, and even that is becoming less applicable since the database(s) are now the first place a sequence appears. If this is accurate, I worry that it may encourage the 'publication' of sequence data 'before its time'. For example - a group submits its sequence as soon as they finish one complete strand (with the expectation of submitting changes later), a competing group waits until they finish both strands with at least two readings at each base. This could delay their submission significantly compared to the other group. I'm not suggesting that this is actually happening, only that the current policy seems to encourage it. As 'publication' of sequence data shifts toward the databases (at least the initial publication), the enticement will become greater. Maybe there needs to be some clear policy on this or some clear indication in the submitted data as to its quality. We already do this (subjectively) to some extent with published articles based on the editorial standards of the journal in which they're published. A related question: I just noticed that a sequence we work with has had changes made in newer releases of the DB. Is there an easy way to check for this when you're working with a private copy of the entry, excerpted from the main database? It's trivial when you have one or two sequences to maintain, but not if you have a whole family. <Joe -- Joe Smith University of Pennsylvania jes@mbio.med.upenn.edu Dept. of Biochemistry and Biophysics (215) 898-8348 Philadelphia, PA 19104-6059
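One way to approach the "related question" above, checking whether privately held copies of entries still match a newer release, is to compare checksums keyed by accession number. The sketch below is hypothetical: it assumes both the private copies and the new release have already been loaded into dictionaries mapping accession number to sequence, and it does not parse any database file format.

```python
import hashlib
from typing import Dict


def digest(seq: str) -> str:
    """Checksum of a sequence, ignoring case and whitespace."""
    cleaned = "".join(seq.split()).upper()
    return hashlib.sha1(cleaned.encode("ascii")).hexdigest()


def changed_entries(private: Dict[str, str], release: Dict[str, str]) -> Dict[str, str]:
    """Map each private accession to 'same', 'changed', or 'missing'."""
    status = {}
    for accession, seq in private.items():
        if accession not in release:
            status[accession] = "missing"   # not present in the new release
        elif digest(seq) == digest(release[accession]):
            status[accession] = "same"
        else:
            status[accession] = "changed"   # sequence differs from the release
    return status
```

Storing the digest alongside each private copy would let a whole family of sequences be rechecked against each new release without keeping the old release around.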
kristoff@genbank.BIO.NET (David Kristofferson) (08/15/90)
Joe, Paul can correct me if I'm wrong on this, but I believe that additional work which results in revision of the sequence data is also referenced in the data bank entries. This raises the specter of one's speedy publication being recorded in the data bank followed by a public listing of references in which all of one's errors are corrected by one's peers. Who would be eager to jump the gun in order to gain such recognition? The fear of public exposure might even have the opposite effect from what you suggest. The basic fact which has been brought up by journal editors repeatedly is that the vast majority of reviewers who get a paper containing sequence data in hardcopy are not going to take the time to enter the data into a computer. Without such analysis it is hard to say that the data has been thoroughly reviewed even though the paper was accepted for publication. The GenBank curator program mentioned by Paul in earlier postings is an attempt to have experts in defined areas actually review the data in the data bank with the aid of software. Of course, it may not be possible to find curators to carefully examine the entire gamut of sequence data, and some data problems will undoubtedly have to be resolved through the time-honored scientific method of repeating and either confirming or contradicting earlier results. Similarity searches with closely related sequences may also reveal problems even if the exact same gene is not resequenced. Even peer-reviewed papers are often shown to be wrong at some later date. As long as good lines of communication are available for getting feedback into the database about possible discrepancies, I believe that the system will be "self-correcting." I personally think that such an open approach should be welcomed. There is always a certain fear of allowing free rein to competition but, in the end, we always seem to relearn the lesson that this approach can produce good results.
-- Sincerely, Dave Kristofferson GenBank On-line Service Manager kristoff@genbank.bio.net
jes@mbio.med.upenn.edu (Joe Smith) (08/15/90)
> ...to gain such recognition? The fear of public exposure might even have
> the opposite effect from what you suggest.
That is one force against premature submissions. Another that I've been reminded of is that the advent of PCR has made the appearance of a sequence in the database nearly equivalent to giving out the clone. Certainly those in competitive situations will be reluctant to do that.
> enter the data into a computer. Without such analysis it is hard to
> say that the data has been thoroughly reviewed even though the paper
> was accepted for publication. ...
Note that the kind of error I'm concerned with would never be caught by analysis of the sequence data - it's experimental error. If I say I read an 'A' at that position, it is entirely dependent on my interpretation of the raw data. If the raw data are poor at that position and I only have one reading of one strand, you'll never know short of an independent repetition of the experiment. I hope the honor system, and the previously mentioned counter-forces continue to uphold the integrity of the databases.
<Joe -- Joe Smith University of Pennsylvania jes@mbio.med.upenn.edu Dept. of Biochemistry and Biophysics (215) 898-8348 Philadelphia, PA 19104-6059
roy@phri.nyu.edu (Roy Smith) (08/15/90)
kristoff@genbank.BIO.NET (David Kristofferson) writes:
> The basic fact which has been brought up by journal editors repeatedly is
> that the vast majority of reviewers who get a paper containing sequence
> data in hardcopy are not going to take the time to enter the data into a
> computer.
Dave, You seem to be implying that this is the fault of the reviewers, that they are not taking the time to do their job properly (or are implying that the editors are implying that). Perhaps what journals should require is that any manuscript containing a non-trivial amount of sequence data be submitted along with N copies of a floppy disk (one for each reviewer) containing the sequence in machine-readable form. Yes, I know that there are all sorts of media compatibility problems, but I think you would be hard pressed to find a lab that couldn't read (or get read for them) a plain ASCII file on a 720k DOS disk.
-- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy "Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)
Ellen, There is no implication in my last message that reviewers should type in the data as part of their routine, although as Roy suggested it might eventually become a standard practice for reviewers to request an electronic copy of the data. How do you think an author might respond right now if, as part of your comments, you requested a disk of the sequence data for review purposes before reaching a decision on the paper? The point I made is simply that this data entry hurdle quite often stands in the way of a thorough review. Given the size of this hurdle, in addition to the not uncommon problem of people sitting on papers and then hurriedly going through them after the editors start bugging them for return of the manuscript, no one should assume that the integrity of the database can be assured by typical reviews alone without experimental verification by other labs. This should come as a surprise to no one. I am encouraged by your thoughtfulness, however. Perhaps you might be interested in contacting Paul about the GenBank curator program?! -- Sincerely, Dave Kristofferson GenBank On-line Service Manager kristoff@genbank.bio.net
kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)
P.S. - I will await LANL's comment on how they think unpublished GenBank data should be officially cited. I would assume that this would minimally include a mention of the authors and the entry's accession number. Also, on one of your other points, as humble as it may be, simple checks of things such as ORF's and the presence of vector sequences do sometimes reveal problems with the data. Obviously it is hard to detect a single base change due to the misreading of a gel, but this is where other experimental confirmation (e.g., is the protein sequence consistent?) would eventually be needed. This kind of info will most likely not come out during the typical course of a review. -- Sincerely, Dave Kristofferson GenBank On-line Service Manager kristoff@genbank.bio.net
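A "simple check" of a claimed ORF, of the kind mentioned above, might look like the following Python sketch. It is an assumption-laden illustration rather than any databank's actual software: it assumes the standard genetic code, a coding region given as a plain string on the coding strand with no introns, and the function name `orf_problems` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def orf_problems(cds: str) -> list:
    """Return reasons why `cds` is not a clean ORF; an empty list means none found."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems
```

As the post notes, a check like this catches frameshifts and premature stops, but a single misread base that merely changes one codon into another passes unnoticed.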
murphy@phri.nyu.edu (Ellen Murphy) (08/16/90)
In article <Aug.15.01.48.02.1990.13594@genbank.BIO.NET> kristoff@genbank.BIO.NET (David Kristofferson) writes:
> The basic fact which has been brought up by journal editors
>repeatedly is that the vast majority of reviewers who get a paper
>containing sequence data in hardcopy are not going to take the time to
>enter the data into a computer.
Surely you are not suggesting that reviewers are expected to type sequences into their computers whenever they get a sequence paper to review? And to what end? Just to verify that what the author claims to be an ORF really is? Are we supposed to request copies of their films so we can re-read their gels? Sequencing gels are raw data like any other, most of which never makes it into manuscripts. As reviewers we have to assume that the data as presented accurately reflects the data collected, even if it is several stages removed. That doesn't mean that I won't comment on the interpretation; most people way overinterpret the sequence features and homologies that they find. If somebody presents a sequence as a promoter, I ask for the S1 data or at least the insertion of the word "putative". However, I haven't noticed journal editors making much of a fuss about this. I do always request (and also do not usually get) a statement of what percent was sequenced on both strands. I also think that any ambiguities should be pointed out, with an explanation of why one reading was chosen over another. I once did manage to correct an error of this sort (the authors had an ambiguous base, chose to go with one strand, ended up in the wrong frame for the C-terminal 20% of the protein, and then chose the wrong ATG to compensate, since the size of the protein was known). The paper came to my attention at the galley stage, and I noticed the error only because I had just finished the sequence of a protein with 30% identity.
The ambiguity was not mentioned in the paper, but was admitted on the telephone; it got corrected before going to press, but just barely. There's no way anybody else could have caught this; certainly the reviewers of this paper couldn't have been expected to, based on the data in the paper. Finally, a question: how is one supposed to refer to sequences in GenBank that have not been, and probably never will be, published elsewhere? I think it's fantastic that people are willing to send unpublished sequences to GenBank and I don't want to discourage the practice by not giving proper credit. Ellen Murphy The Public Health Research Institute murphy@phri.nyu.edu
lamoran@gpu.utcs.utoronto.ca (L.A. Moran) (08/16/90)
David Kristofferson (kristoff@genbank.BIO.NET) writes:
"The basic fact which has been brought up by journal editors repeatedly
is that the vast majority of reviewers who get a paper containing
sequence data in hardcopy are not going to take the time to enter the
data into a computer."
and Roy Smith (roy@alanine.phri.nyu.edu) replies:
"You seem to be implying that this is the fault of the reviewers,
that they are not taking the time to do their job properly (or are
implying that the editors are implying that)."
Ellen Murphy (murphy@phri.nyu.edu) also responded:
"Surely you are not suggesting that reviewers are expected to type
sequences into their computers whenever they get a sequence paper to
review? And to what end? Just to verify that what the author claims
to be an ORF really is?"
If an incorrect sequence is published, the most guilty party is the author.
In many cases it was impossible for the reviewer to recognize that the sequence
was wrong. However, there are examples in the literature that clearly reflect
incompetence on the part of the reviewer (IMHO). Allow me to present some case
histories for discussion.
I. The sequence of a gene is published and aligned with that of an orthologous
gene from another species. Many deletions and insertions are added to align
the sequences. These include one and two base pair deletions in the coding
region which destroy the reading frame. The paper does not mention that
their gene could not possibly encode a homologous protein. (published in
NAR)
CONCLUSION: the authors are stupid and the reviewers incompetent
II. The sequence of a gene is published with a transposition of a 75bp
fragment from one part of the gene to another. The figure showing the
predicted amino acid sequence is correct. (published in PNAS)
CONCLUSION: the authors were careless, reviewers could have detected it
III. A new sequence is aligned with that of a homologous gene from another
species and the alignment includes deletions and substitutions. A much
better alignment can be obtained by assuming a small number of sequencing
errors. (published in MCB)
CONCLUSION: the authors were careless, and so was the reviewer
IV. The sequence of a gene fragment is published and the authors recognize that
it is closely related (probably orthologous) to a gene in another species.
The gene is decoded from the first methionine codon in the available
sequence because there is an upstream in frame stop codon. The presumed
initiation codon is clearly an internal methionine and the stop codon is a
sequencing error. (published in NAR)
CONCLUSION: the authors were stupid and the reviewers were careless
V. A new sequence is published which is similar to one which is already in
the database but the similarity is not noted. (published in JCB)
CONCLUSION: ?
VI. The complete sequence of a gene is published by a laboratory that has
previously published a partial sequence of the same gene. There are several
differences in the regions which overlap but these are not mentioned.
(published in NAR)
CONCLUSION: ?
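Case I above turns on a mechanical point: inside a coding region, a gap whose length is not a multiple of three shifts the reading frame. A reviewer with the alignment in electronic form could flag such gaps with a few lines of Python. This is a hypothetical sketch; it assumes the aligned coding sequence is a plain string using '-' for gap columns, and it reports 0-based alignment columns.

```python
import re


def frameshifting_gaps(aligned_cds: str) -> list:
    """Return (start_column, length) for each gap run whose length is not a multiple of 3."""
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(r"-+", aligned_cds)
        if len(match.group()) % 3 != 0
    ]
```

For instance, `frameshifting_gaps("ATGAAA-CCC---GGG--TAA")` returns `[(6, 1), (16, 2)]`; the three-column gap is not reported. Two nearby gaps can compensate for each other, so a flagged gap is a prompt for a closer look rather than proof of an error.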
Now here's a hypothetical problem for you to grapple with. I have a
database of many examples of a highly conserved gene. Assume that I receive a
paper to review which includes a new sequence of one of these genes. From my
analysis of the sequence I recognize that it almost certainly contains many
errors because there are nucleotide substitutions in highly conserved regions,
unusual codons, and because the sequence does not fit into the phylogeny.
How do I respond as a reviewer of this paper?
There are many examples of sequences in the GenBank database which I know
to be incorrect (see above). Is there any way that my doubts can be
communicated to users of the database? The accuracy of sequences that I
analyze ranges from 95-100% and half of the sequences have an accuracy of less
than 99.6%, i.e. more than 4 errors in every 1000 nucleotides. These are sequences of genes
in a highly conserved gene family where workers are able to compare their data
with published sequences. Imagine what the accuracy of sequences of newly
discovered genes must be?
-Larry Moran
Dept. of Biochemistry
Faculty of Medicine
University of Toronto
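The accuracy figures quoted in the post above convert directly into error counts. The arithmetic is spelled out below as a small Python sketch (the function names are hypothetical): an accuracy of 99.6% corresponds to 4 errors per 1000 nucleotides, and the 95% lower end of the quoted range to 50.

```python
def errors_per_kb(accuracy_percent: float) -> float:
    """Expected errors per 1000 nucleotides at the given per-base accuracy."""
    return (100.0 - accuracy_percent) * 10.0


def expected_errors(accuracy_percent: float, length: int) -> float:
    """Expected number of erroneous bases in a sequence of `length` nucleotides."""
    return (100.0 - accuracy_percent) / 100.0 * length
```

At 99.6% accuracy, a 1500-nucleotide gene would be expected to carry about 6 errors.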
kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)
Larry, Thanks for several rather entertaining examples!
> There are many examples of sequences in the GenBank database which I know
> to be incorrect (see above). Is there any way that my doubts can be
> communicated to users of the database?
Christian Burks and I started the GENBANK-BB (bionet.molbio.genbank) newsgroup over three years ago precisely for these kinds of issues. This forum allows rapid communication with the data bank staff and is open to as many users of the data bank as care to sign on. As I stated in my reply to Ellen, open lines of communication/feedback are obviously essential for any self-correcting mechanism to exist/succeed.
> The accuracy of sequences that I
> analyze ranges from 95-100% and half of the sequences have an accuracy of less
> than 99.6% or 4 errors in every 1000 nucleotides. These are sequences of genes
> in a highly conserved gene family where workers are able to compare their data
> with published sequences. Imagine what the accuracy of sequences of newly
> discovered genes must be?
Whether this is a real problem depends on where the errors are. All scientific measurements are obviously prone to error but there is no easy way to put "error bars" on sequence data. However, unless the data is horrendously flawed, wouldn't you rather have data that was reasonably accurate versus no data at all? At the very least it could serve as a basis for further refinements.
-- Sincerely, Dave Kristofferson GenBank On-line Service Manager kristoff@genbank.bio.net