[bionet.general] Sequence data direct submission

benton@GENBANK.IG.COM (David Benton) (11/17/89)

In an earlier message, Rob Harper wrote:

>      The key to the submission of data is user education. If the central
>      data libraries would make clear statements about what they want
>      from scientists, and provide them with simple straight forward
>      methods to accomplish these goals, (submission forms, programmes
>      for codifying data, network addresses to drop off data) then I
>      think this would be the most important investment for the future.

While we cannot speak for the EMBL Data Library, DDBJ, or PIR
International, it is safe to say that GenBank does want electronic
submissions and now strongly prefers submissions in "transaction
protocol" (tp) format.  The primary reason for this preference is that
such submissions can be automatically processed on receipt.  This
allows us to provide the submitted data in a much more timely fashion
than we could if the data accompanying every submitted sequence
required human examination, interpretation, and re-formulation.  To
aid sequencers in generating tp-format submissions, we have developed
and are distributing the Authorin program.  We are presently in the
"shake down" period and are processing tp submissions both manually
and automatically, but expect to move to essentially fully automated
handling in the first quarter of next year.

The advantages to GenBank of tp submissions are obvious (less manual
labor for us), but why should a scientist want to go to the trouble of
installing, learning, and using a program (e.g., Authorin) to prepare
transaction protocol submissions?

1.  More rapid turnaround.  Although many scientists may not be
particularly interested in when their own data appear in the database,
the mandatory deposition policies adopted by numerous journals
(mentioned in previous postings to this newsgroup) now make getting
acknowledgment of data deposition a prerequisite to publication.  The
databanks currently provide this acknowledgment by sending the
submitter a database accession number for each submitted sequence.
Right now, the nucleotide sequence databases guarantee a reply to
direct submissions within seven days (and generally reply well within
that period).  But, in tests earlier this year, we found that the
automatic data handling routines in place at Los Alamos National
Laboratory (data collection site for GenBank) could return an
accession number (or the reason the data could not be added to the
database) in about three minutes.  Clearly, any procedure which
requires human intervention cannot be expected to give this kind of
response time.

2.  Improved accuracy of data representation.  The strength and the
weakness of the sequence data submission form is that all the fields
in it are free form: the sequencer can write any kind of information
in any field.  This means that databank staff members must read the
form, interpret what is written according to the databank's data
representation conventions, and then (essentially) re-code the data.
While interpreting the data submission form is much more straightforward
than extracting the same information from a standard scientific publication,
the problem of interpretation is essentially the same.  Authorin provides
several mechanisms for improving data accuracy.  Where the databanks have
established a controlled vocabulary for a field, Authorin enforces the
use of a term from that vocabulary (usually by menu choices).  Other
integrity and consistency checks are also performed (e.g., Authorin
prevents accidental entry of sequence feature locations beyond the ends
of the submitted sequence).  Perhaps more important than the direct
data validation provided is the one-to-one correspondence between
fields in Authorin forms and fields in the GenBank relational database.
Since the fields in the data collection tool and the fields in the database
itself are the same, no transformation (with the possibility of data
definition mismatch) is required.
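
To make the two kinds of checks described above concrete, here is a
minimal sketch in Python.  It is not Authorin's code and does not use
the tp format, neither of which is shown in this message; the names
(MOLECULE_TYPES, Feature, validate_submission) and the vocabulary terms
are hypothetical, chosen only to illustrate a controlled-vocabulary
check and a feature-location bounds check.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary; the actual GenBank term lists
# are not given in this message.
MOLECULE_TYPES = {"DNA", "RNA"}


@dataclass
class Feature:
    key: str    # feature name, e.g. "CDS" (illustrative only)
    start: int  # 1-based position of the first base
    end: int    # 1-based position of the last base


def validate_submission(sequence, molecule_type, features):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Controlled vocabulary: only terms from the fixed list are accepted.
    if molecule_type not in MOLECULE_TYPES:
        problems.append(
            f"molecule type {molecule_type!r} is not in the controlled vocabulary"
        )

    # Consistency: feature locations must lie within the submitted sequence.
    length = len(sequence)
    for feature in features:
        in_bounds = 1 <= feature.start <= length and 1 <= feature.end <= length
        if not in_bounds:
            problems.append(
                f"feature {feature.key!r} ({feature.start}..{feature.end}) "
                f"extends beyond the sequence (length {length})"
            )

    return problems
```

A menu-driven tool would typically prevent the bad entry at input time
rather than report it afterwards; the sketch collects the problems in a
list only to keep the example short.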

3.  Improved completeness.  Since Authorin provides for the collection
of essentially all the data items which can be represented in the database,
it prompts the investigator to submit useful data which might not normally
be published or added to the data submission form.

One less obvious advantage is that, because a data collection tool
like Authorin can collect virtually all the data items which can be
represented in the database, it is possible for a submitter to provide
a much richer and deeper level of annotation than is usually available
through the data submission form or even through publication (and
certainly deeper than a database staff annotator usually has time to
extract from a paper).  This advantage accrues primarily to users of the
database, not to the submitter (of that sequence) or to the database
staff.  We, however, regard (and hope molecular biologists regard)
GenBank as community property and idealistically hope that, since most
submitters are also sequence database users, they will recognize that
"what goes around, comes around".

Authorin for the IBM/PC and compatibles has been in public release
since July.  We have been quite active in encouraging user feedback on
the program.  A maintenance release is planned for first quarter 1990
and a Mac version for autumn 1990.  While Authorin may be somewhat
more complicated for the average molecular biologist to use than a pencil
and a paper form, our goal was to make it easy enough to use without
reference to the user manual except in special cases (where the data
being submitted are inherently complex, for example).  Based on user
reaction thus far, we believe we have largely met this goal.  Of roughly
45 individuals who've used the program and sent us comments, only
one did not want to take the time to use (actually install) the program
because he had one sequence to submit and did not think he would ever
do any more sequencing.  A typical user response follows (I'm using it
because it is the most recent, having arrived by e-mail yesterday, and
is *truly* typical):

    "I have just submitted sequence data to Genbank using AUTHORIN.
     Overall I think it is quite a good package, a clear improvement over
     the old method of sequence preparation.  The use is intuitive for
     experienced computer users, and there is sufficient documentation and
     help for novices.  I have no major criticisms or suggestions for
     improvements."

To reiterate what Paul Gilna said earlier:

Authorin for IBM/PC and compatibles is available free of charge.

  Call                (415) 962-7364,

  or send e-mail to   authorin@genbank.bio.net,

  or write to         GenBank c/o IntelliGenetics
		      700 East El Camino Real
		      Mountain View, CA 94040

At the moment, Authorin distribution is on 360-kb 5-1/4" floppies, so
please send a postal (or, preferably, FED-X-able) address if you write.
(On request, we can send the program on 3-1/2" 720-kb diskettes,
also.)

On-line distribution of Authorin by anonymous ftp, archive server,
etc. is currently under investigation.  Please send recommendations
and suggestions directly to me at the e-mail address below.  FYI: the
Authorin exe file is approximately 320kb; menus and help are stored in
external files which amount to approximately 300kb.


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com