[bionet.molbio.genbank] Quality of submitted data

jes@mbio.med.upenn.edu (Joe Smith) (08/14/90)

Could someone quote or point me to the policy on review of data
submitted to GenBank (and/or the other databases)?  Is it entirely up
to the submitter, or are there guidelines similar to those for
reviewed publications (e.g. at least one reading of both strands)?

There must be a policy on this, but it came up last week and I
couldn't find the authoritative answer.  The specific question was:
``under what circumstances is it acceptable for N's to appear in a
submitted sequence? (I count 2155 sequences with 'other' bases in
GB53)'' It would seem to be a difficult issue with regard to
submission of 'preliminary' data.

<Joe

--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059

pgil%histone@LANL.GOV (Paul Gilna) (08/14/90)

The link between GenBank data and the published literature is an
historic artifact which existed simply because this was the only forum
from which the databank could obtain the sequence data. As such, it was
assumed that these data were "peer reviewed" and complied with
editorial directives, such as data determination from both strands.

The realities are that the vast majority of sequence data per se are
not subjected to any rigorous form of integrity review by the
conventional editorial peer review process. In fact in the past, most
of the verification was performed after-the-fact by the annotation
process conducted by databank staff. Our estimates are that as much as
30% of the data appearing in the literature was incorrect.

However this is changing: the combined efforts of the databanks and the
journals to encourage direct submissions are beginning to pay off.
GenBank currently receives 80% of its data by direct submission. The
majority of this data comes in early enough, and is processed fast
enough that any errors spotted by our in-house verification processes
can be passed back to the author in time to have the errors corrected
for publication.  So in a sense the databanks themselves have become an
adjunct to the conventional peer-review process.

All e-mail submissions are currently passed back to the author for
review upon completion of the annotation process (usually within weeks
of receipt of the submission). We hope to expand this system soon to
encompass all submissions.

At the GenBank project it is clear to us that journal publication will
not be the primary forum for dissemination of sequence data in the
future:  Journals are already exercising editorial prudence on the
quantity of data they choose to print--we estimate that about 10% of
the sequence data we receive will not be printed in the published
article which reports those data.


Accordingly we have identified the need to apply an even greater degree
of data validation than is presently used. While we currently check
entities such as ORF's (and data does not enter the database until it
passes these checks), we have embarked on the development of a Sequence
Validation Suite of software which will be used to check all sequence
data coming in and will incorporate such checks as promoter
verification, vector contamination (coming soon), splice site
verification, and more. Our recent announcement of the pilot phase of
the curator program represents our thoughts on how to enhance further
the quality and depth of data integrity by enlisting and supporting the
help of the scientific community.
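
To make the kind of ORF check mentioned above concrete, here is a minimal sketch. It is not the GenBank validation software; the function name, the 1-based inclusive coordinates, and the restriction to the standard stop codons on the forward strand of an intron-free region are all assumptions made for illustration.

```python
# Hypothetical sketch of a simple ORF sanity check (not GenBank's own code).
# Assumes the standard genetic code, forward strand, and no introns.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_coding_region(sequence, start, end):
    """Check a claimed coding region given 1-based inclusive coordinates.

    Returns a list of problem descriptions; an empty list means the region
    passed these simple checks.
    """
    problems = []
    cds = sequence.upper()[start - 1:end]
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not begin with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Any stop codon before the final codon interrupts the reading frame.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append("internal stop codon at codon %d" % (index + 1))
    return problems
```

A check like this can only show that an annotation is inconsistent with the sequence; it cannot detect a single misread base that leaves the frame intact.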

The primary message here is that the quality of submitted data is
probably better than that of printed data, and is destined to increase
through the use of enhanced in-house QC software, the curator
program, and automated submission processes which
will free up our staff to pay more attention to the quality and depth
of sequence annotation and integrity.


Finally to answer your specific questions:

Our rule of thumb is that ambiguous sequence data must not represent
more than 10% of the length of sequence; beyond that we will split the
sequence and note the ambiguous span in the features table. Be aware
that we also accept the IUPAC codes for uncertainty in nucleotide
sequences--these may have contributed to your "other" figures.
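
As an illustration of the 10% rule of thumb, the sketch below counts IUPAC ambiguity codes in a sequence and lists the ambiguous spans. The function names and the choice to treat every non-A/C/G/T IUPAC letter as ambiguous are assumptions for this example, not a description of how the databank actually applies the rule.

```python
# Hypothetical helper names; only the 10% threshold comes from the text above.
# IUPAC nucleotide codes other than the four unambiguous bases.
AMBIGUITY_CODES = set("RYSWKMBDHVN")

def ambiguous_fraction(sequence):
    """Return the fraction of positions holding an IUPAC ambiguity code."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    ambiguous = sum(1 for base in seq if base in AMBIGUITY_CODES)
    return ambiguous / len(seq)

def exceeds_rule_of_thumb(sequence, limit=0.10):
    """True when ambiguous positions make up more than `limit` of the sequence."""
    return ambiguous_fraction(sequence) > limit

def ambiguous_spans(sequence):
    """List (start, end) 1-based inclusive runs of ambiguity codes."""
    spans = []
    start = None
    for i, base in enumerate(sequence.upper(), start=1):
        if base in AMBIGUITY_CODES:
            if start is None:
                start = i          # a new ambiguous run begins here
        elif start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:          # run extends to the end of the sequence
        spans.append((start, len(sequence)))
    return spans
```

The span list is the sort of information that could be recorded in a features table when a sequence is split around an ambiguous region.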

We do not at this point apply an editorial standard which dictates that
both strands be sequenced to qualify for entry into the database. That
datum is easy to collect (through Authorin) and we would prefer to
denote the quality attributes rather than use them as editorial
standards.  With the advent of automated sequencing and image
processing techniques, it is conceivable that all data could come with
a statistical "confidence index" generated by the base calling
software; again this is information that we could easily track and
gather.


Though it may seem as if I have digressed in my answer to what might
have been conceived as a simple issue, your question touches upon a
great many complex issues, many of which have yet to be properly
answered.  Our approach has been to try and anticipate most of the
possible answers in the design of our database systems, and design in
the flexibility to incorporate the answers to as yet unasked
questions.

Regards,

Paul Gilna
Genbank Biology Domain Leader
GenBank/LANL

jes@mbio.med.upenn.edu (Joe Smith) (08/15/90)

Thanks for the information.  If I can summarize what you're saying
(in a biased sort of way):

  There are no requirements WRT sequence data quality except as
  may be required for paper publication, and even that is becoming
  less applicable since the database(s) are now the first place a
  sequence appears.

If this is accurate, I worry that it may encourage the 'publication' of
sequence data 'before its time'.  For example - a group submits its
sequence as soon as they finish one complete strand (with the
expectation of submitting changes later), while a competing group waits until
they finish both strands with at least two readings at each base.  This
could delay their submission significantly compared to the other group.

I'm not suggesting that this is actually happening, only that the
current policy seems to encourage it.  As 'publication' of sequence
data shifts toward the databases (at least the initial publication),
the enticement will become greater.  Maybe there needs to be some clear
policy on this or some clear indication in the submitted data as to its
quality.  We already do this (subjectively) to some extent with
published articles based on the editorial standards of the journal in
which they're published.

A related question: I just noticed that a sequence we work with has had
changes made in newer releases of the DB.  Is there an easy way to
check for this when you're working with a private copy of the entry,
excerpted from the main database?  It's trivial when you have one or two
sequences to maintain, but not if you have a whole family.
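
One possible approach to the question above, sketched under stated assumptions: if both the private copies and the new release can be loaded as mappings from accession number to sequence, a digest comparison flags the entries that changed. The function names and the dictionary layout are hypothetical; parsing the actual database files is left out.

```python
import hashlib

def sequence_digest(sequence):
    """Hash a sequence after removing whitespace and case differences."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()

def changed_entries(private_copies, new_release):
    """Compare two {accession: sequence} mappings (an assumed layout).

    Returns (changed, missing): accessions whose sequence differs in the
    new release, and accessions no longer present in it.
    """
    changed, missing = [], []
    for accession, old_seq in sorted(private_copies.items()):
        if accession not in new_release:
            missing.append(accession)
        elif sequence_digest(old_seq) != sequence_digest(new_release[accession]):
            changed.append(accession)
    return changed, missing
```

This only detects changes to the sequence itself; revised annotation or references would need a separate comparison of the other fields in the entry.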

<Joe
--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059

kristoff@genbank.BIO.NET (David Kristofferson) (08/15/90)

Joe,

	Paul can correct me if I'm wrong on this, but I believe that
additional work which results in revision of the sequence data is also
referenced in the data bank entries.  This raises the specter of one's
speedy publication being recorded in the data bank followed by a
public listing of references in which all of one's errors are
corrected by one's peers.  Who would be eager to jump the gun in order
to gain such recognition?  The fear of public exposure might even have
the opposite effect from what you suggest.

	The basic fact which has been brought up by journal editors
repeatedly is that the vast majority of reviewers who get a paper
containing sequence data in hardcopy are not going to take the time to
enter the data into a computer.  Without such analysis it is hard to
say that the data has been thoroughly reviewed even though the paper
was accepted for publication.  The GenBank curator program mentioned
by Paul in earlier postings is an attempt to have experts in defined
areas actually review the data in the data bank with the aid of
software.  Of course, it may not be possible to find curators to
carefully examine the entire gamut of sequence data, and some data
problems will undoubtedly have to be resolved through the time-honored
scientific method of repeating and either confirming or contradicting
earlier results.  Similarity searches with closely related sequences
may also reveal problems even if the exact same gene is not
resequenced.  Even peer-reviewed papers are often shown to be wrong at
some later date.  As long as good lines of communication are available
for getting feedback into the database about possible discrepancies, I
believe that the system will be "self-correcting."

I personally think that such an open approach should be welcomed.
There is always a certain fear of allowing free rein to competition
but, in the end, we always seem to relearn the lesson that this
approach can produce good results.
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

jes@mbio.med.upenn.edu (Joe Smith) (08/15/90)

> ...to gain such recognition?  The fear of public exposure might even have
> the opposite effect from what you suggest.

That is one force against premature submissions.  Another that I've
been reminded of is that the advent of PCR has made the appearance of
a sequence in the database nearly equivalent to giving out the clone.
Certainly those in competitive situations will be reluctant to do
that.

> enter the data into a computer.  Without such analysis it is hard to
> say that the data has been thoroughly reviewed even though the paper
> was accepted for publication. ...

Note that the kind of error I'm concerned with would never be caught
by analysis of the sequence data - it's experimental error.  If I say
I read an 'A' at that position, it is entirely dependent on my
interpretation of the raw data.  If the raw data are poor at that
position and I only have one reading of one strand, you'll never know
short of an independent repetition of the experiment.

I hope the honor system, and the previously mentioned counter-forces
continue to uphold the integrity of the databases.

<Joe
--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059

roy@phri.nyu.edu (Roy Smith) (08/15/90)

kristoff@genbank.BIO.NET (David Kristofferson) writes:
> The basic fact which has been brought up by journal editors repeatedly is
> that the vast majority of reviewers who get a paper containing sequence
> data in hardcopy are not going to take the time to enter the data into a
> computer.

Dave,
	You seem to be implying that this is the fault of the reviewers,
that they are not taking the time to do their job properly (or are implying
that the editors are implying that).  Perhaps what journals should require
is that any manuscript containing a non-trivial amount of sequence data be
submitted along with N copies of a floppy disk (one for each reviewer)
containing the sequence in machine readable form.  Yes, I know that there
are all sorts of media compatibility problems, but I think you would be hard
pressed to find a lab that couldn't read (or get read for them) a plain
ascii file on a 720k DOS disk.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)

Ellen,

	There is no implication in my last message that reviewers
should type in the data as part of their routine, although as Roy
suggested it might eventually become a standard practice for reviewers
to request an electronic copy of the data.  How do you think an author
might respond right now if as part of your comments you requested a
disk of the sequence data for review purposes before reaching a
decision on the paper?

	The point I made is simply that this data entry hurdle quite
often stands in the way of a thorough review.  Given the size of this
hurdle, in addition to the not uncommon problem of people sitting on
papers and then hurriedly going through them after the editors start
bugging them for return of the manuscript, no one should assume that
the integrity of the database can be assured by typical reviews alone
without experimental verification by other labs.  This should come as
a surprise to no one.  

I am encouraged by your thoughtfulness, however.  Perhaps you might be
interested in contacting Paul about the GenBank curator program?!
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)

P.S. - I will await LANL's comment on how they think unpublished
GenBank data should be officially cited. I would assume that this
would minimally include a mention of the authors and the entry's
accession number.

	Also, on one of your other points, as humble as it may be,
simple checks of things such as ORF's and the presence of vector
sequences do sometimes reveal problems with the data.  Obviously it is
hard to detect a single base change due to the misreading of a gel,
but this is where other experimental confirmation (e.g., is the
protein sequence consistent?) would eventually be needed.  This kind
of info will most likely not come out during the typical course of a
review.
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

murphy@phri.nyu.edu (Ellen Murphy) (08/16/90)

In article <Aug.15.01.48.02.1990.13594@genbank.BIO.NET> kristoff@genbank.BIO.NET (David Kristofferson) writes:
>
>	The basic fact which has been brought up by journal editors
>repeatedly is that the vast majority of reviewers who get a paper
>containing sequence data in hardcopy are not going to take the time to
>enter the data into a computer.

     Surely you are not suggesting that reviewers are expected to type
sequences into their computers whenever they get a sequence paper to
review?  And to what end?  Just to verify that what the author claims
to be an ORF really is?  Are we supposed to request copies of their
films so we can re-read their gels?  Sequencing gels are raw data like
any other, most of which never makes it into manuscripts.  As reviewers
we have to assume that the data as presented accurately reflects the data
collected, even if it is several stages removed.  That doesn't mean
that I won't comment on the interpretation; most people way overinterpret
the sequence features and homologies that they find.  If somebody
presents a sequence as a promoter, I ask for the S1 data or at least
the insertion of the word "putative".  However I haven't noticed
journal editors making much of a fuss about this.

     I do always request (and also do not usually get) a statement of what
percent was sequenced on both strands.  I also think that any
ambiguities should be pointed out, with an explanation of why one
reading was chosen over another.  I once did manage to correct an error
of this sort (the authors had an ambiguous base, chose to go with one
strand, ended up in the wrong frame for the C-terminal 20% of the
protein, and then chose the wrong ATG to compensate, since the size of
the protein was known).  The paper came to my attention at the galley
stage, and I noticed the error only because I had just finished the
sequence of a protein with 30% identity.  The ambiguity was not
mentioned in the paper, but was admitted on the telephone; it got
corrected before going to press, but just barely.  There's no way
anybody else could have caught this; certainly the reviewers of this
paper couldn't have been expected to, based on the data in the paper.

    Finally, a question:  how is one supposed to refer to sequences in
GenBank that have not been, and probably never will be, published
elsewhere?  I think it's fantastic that people are willing to send
unpublished sequences to GenBank and I don't want to discourage the practice
by not giving proper credit.

Ellen Murphy
The Public Health Research Institute
murphy@phri.nyu.edu

lamoran@gpu.utcs.utoronto.ca (L.A. Moran) (08/16/90)

David Kristofferson (kristoff@genbank.BIO.NET) writes:
     "The basic fact which has been brought up by journal editors repeatedly 
      is that the vast majority of reviewers who get a paper containing 
      sequence data in hardcopy are not going to take the time to enter the 
      data into a computer."

and Roy Smith (roy@alanine.phri.nyu.edu) replies:
     "You seem to be implying that this is the fault of the reviewers,
      that they are not taking the time to do their job properly (or are 
      implying that the editors are implying that)."

Ellen Murphy (murphy@phri.nyu.edu) also responded:
     "Surely you are not suggesting that reviewers are expected to type
      sequences into their computers whenever they get a sequence paper to
      review?  And to what end?  Just to verify that what the author claims
      to be an ORF really is?"


     If an incorrect sequence is published the most guilty party is the author.
In many cases it was impossible for the reviewer to recognize that the sequence
was wrong. However, there are examples in the literature that clearly reflect
incompetence on the part of the reviewer (IMHO). Allow me to present some case
histories for discussion.

I. The sequence of a gene is published and aligned with that of an orthologous
   gene from another species. Many deletions and insertions are added to align
   the sequence. These include one and two base pair deletions in the coding
   region which destroy the reading frame. The paper does not mention that 
   their gene could not possibly encode a homologous protein. (published in
   NAR)
   CONCLUSION: the authors are stupid and the reviewers incompetent

II. The sequence of a gene is published with a transposition of a 75bp 
   fragment from one part of the gene to another. The figure showing the 
   predicted amino acid sequence is correct. (published in PNAS)
   CONCLUSION: the authors were careless, reviewers could have detected it

III. A new sequence is aligned with that of a homologous gene from another
  species and the alignment includes deletions and substitutions. A much 
  better alignment can be obtained by assuming a small number of sequencing
  errors. (published in MCB)
  CONCLUSION: the authors were careless, and so was the reviewer

IV. The sequence of a gene fragment is published and the authors recognize that
  it is closely related (probably orthologous) to a gene in another species.
  The gene is decoded from the first methionine codon in the available 
  sequence because there is an upstream in frame stop codon. The presumed 
  initiation codon is clearly an internal methionine and the stop codon is a
  sequencing error. (published in NAR)
  CONCLUSION: the authors were stupid and the reviewers were careless

V. A new sequence is published which is similar to one which is already in 
  the database but the similarity is not noted. (published in JCB)
  CONCLUSION: ?

VI. The complete sequence of a gene is published by a laboratory that has
  previously published a partial sequence of the same gene. There are several
  differences in the regions which overlap but these are not mentioned.
  (published in NAR)
  CONCLUSION: ?
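
The frame problem in case I is the kind of thing a reviewer with the alignment in machine-readable form could test mechanically. A minimal sketch, assuming the two rows of a pairwise alignment of coding sequence are available as equal-length strings with "-" marking gaps (the function name and the input layout are assumptions):

```python
def frameshifting_gaps(aligned_a, aligned_b, gap="-"):
    """Find gap runs whose length is not a multiple of 3.

    Both arguments are rows of a pairwise alignment of coding sequence and
    must have equal length. Returns (row, start, length) tuples with
    0-based alignment columns, where row is "a" or "b".
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    found = []
    for label, row in (("a", aligned_a), ("b", aligned_b)):
        column = 0
        while column < len(row):
            if row[column] != gap:
                column += 1
                continue
            start = column
            # Advance to the end of this run of gap characters.
            while column < len(row) and row[column] == gap:
                column += 1
            if (column - start) % 3 != 0:
                found.append((label, start, column - start))
    return found
```

Each reported run shifts the reading frame downstream of it unless another nearby gap compensates, so the output is a list of places to look rather than a verdict.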

    Now here's a hypothetical problem for you to grapple with. I have a 
database of many examples of a highly conserved gene. Assume that I receive a 
paper to review which includes a new sequence of one of these genes. From my 
analysis of the sequence I recognize that it almost certainly contains many 
errors because there are nucleotide substitutions in highly conserved regions, 
unusual codons, and because the sequence does not fit into the phylogeny. 
How do I respond as a reviewer of this paper?

    There are many examples of sequences in the GenBank database which I know
to be incorrect (see above). Is there any way that my doubts can be 
communicated to users of the database? The accuracy of sequences that I 
analyze ranges from 95-100% and half of the sequences have an accuracy of less
than 99.6% or 4 errors in every 1000 nucleotides. These are sequences of genes
in a highly conserved gene family where workers are able to compare their data
with  published sequences. Imagine what the accuracy of sequences of newly
discovered genes must be?
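
To put the accuracy figures in perspective, the arithmetic can be sketched as follows. The independence of errors between positions is an assumption introduced for this illustration, as are the function names; real sequencing errors tend to cluster, so the second function is only a rough guide.

```python
def expected_errors(length, accuracy):
    """Expected number of wrong bases in a sequence of the given length."""
    return length * (1.0 - accuracy)

def probability_error_free(length, accuracy):
    """Chance that every base is right, ASSUMING independent errors."""
    return accuracy ** length
```

With an accuracy of 0.996, `expected_errors(1000, 0.996)` gives about 4 errors per 1000 nucleotides, matching the figure above, and under the independence assumption `probability_error_free(1000, 0.996)` comes out at under 2%.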


-Larry Moran
Dept. of Biochemistry
Faculty of Medicine
University of Toronto

kristoff@genbank.BIO.NET (David Kristofferson) (08/16/90)

Larry,

	Thanks for several rather entertaining examples!

>     There are many examples of sequences in the GenBank database which I know
> to be incorrect (see above). Is there any way that my doubts can be 
> communicated to users of the database? 

Christian Burks and I started the GENBANK-BB (bionet.molbio.genbank)
newsgroup over three years ago precisely for these kinds of issues.
This forum allows rapid communication with the data bank staff and is
open to as many users of the data bank as care to sign on.  As I
stated in my reply to Ellen, open lines of communication/feedback are
obviously essential for any self-correcting mechanism to
exist/succeed.

> The accuracy of sequences that I 
> analyze ranges from 95-100% and half of the sequences have an accuracy of less
> than 99.6% or 4 errors in every 1000 nucleotides. These are sequences of genes
> in a highly conserved gene family where workers are able to compare their data
> with  published sequences. Imagine what the accuracy of sequences of newly
> discovered genes must be?

Whether this is a real problem depends on where the errors are.  All
scientific measurements are obviously prone to error but there is no
easy way to put "error bars" on sequence data.  However, unless the
data is horrendously flawed, wouldn't you rather have data that was
reasonably accurate versus no data at all?  At the very least it could
serve as a basis for further refinements.
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net