benton@GENBANK.IG.COM (David Benton) (11/17/89)
In an earlier message, Rob Harper wrote:

> The key to the submission of data is user education. If the central
> data libraries would make clear statements about what they want
> from scientists, and provide them with simple straight forward
> methods to accomplish these goals, (submission forms, programmes
> for codifying data, network addresses to drop off data) then I
> think this would be the most important investment for the future.

While we cannot speak for the EMBL Data Library, DDBJ, or PIR International, it is safe to say that GenBank does want electronic submissions and now strongly prefers submissions in "transaction protocol" (tp) format. The primary reason for this preference is that such submissions can be automatically processed on receipt. This allows us to provide the submitted data in a much more timely fashion than we could if the data accompanying every submitted sequence required human examination, interpretation, and re-formulation.

To aid sequencers in generating tp-format submissions, we have developed and are distributing the Authorin program. We are presently in the "shake down" period and are processing tp submissions both manually and automatically, but expect to move to essentially fully automated handling in the first quarter of next year.

The advantages to GenBank of tp submissions are obvious (less manual labor for us), but why should a scientist want to go to the trouble of installing, learning, and using a program such as Authorin to prepare transaction protocol submissions?

1. More rapid turnaround. Although many scientists may not be particularly interested in when their own data appear in the database, the mandatory deposition policies adopted by numerous journals (mentioned in previous postings to this newsgroup) now make getting acknowledgment of data deposition a prerequisite to publication. The databanks currently provide this acknowledgment by sending the submitter a database accession number for each submitted sequence.
Right now, the nucleotide sequence databases guarantee a reply to direct submissions within seven days (and generally reply well within that period). But in tests earlier this year, we found that the automatic data handling routines in place at Los Alamos National Laboratory (data collection site for GenBank) could return an accession number (or the reason the data could not be added to the database) in about three minutes. Clearly, any procedure which requires human intervention cannot be expected to give this kind of response time.

2. Improved accuracy of data representation. The strength and the weakness of the sequence data submission form is that all the fields in it are free form: the sequencer can write any kind of information in any field. This means that databank staff members must read the form, interpret what is written according to the databank's data representation conventions, and then (essentially) re-code the data. While interpreting the data submission form is much more straightforward than extracting the same information from a standard scientific publication, the problem of interpretation is essentially the same.

Authorin provides several mechanisms for improving data accuracy. Where the databanks have established a controlled vocabulary for a field, Authorin enforces the use of a term from that vocabulary (usually by menu choices). Other integrity and consistency checks are also performed (e.g., Authorin prevents accidental entry of sequence feature locations beyond the ends of the submitted sequence).

Perhaps more important than the direct data validation provided is the one-to-one correspondence between fields in Authorin forms and fields in the GenBank relational database. Since the fields in the data collection tool and the fields in the database itself are the same, no transformation (with the possibility of data definition mismatch) is required.

3. Improved completeness.
Since Authorin provides for the collection of essentially all the data items which can be represented in the database, it prompts the investigator to submit useful data which might not normally be published or added to the data submission form.

One less obvious advantage is that, because a data collection tool like Authorin can collect virtually all the data items which can be represented in the database, it is possible for a submitter to provide a much richer and deeper level of annotation than is usually available through the data submission form or even through publication (and certainly deeper than a database staff annotator usually has time to extract from a paper). This advantage accrues primarily to users of the database, not to the submitter (of that sequence) or to the database staff. We, however, regard (and hope molecular biologists regard) GenBank as community property and idealistically hope that, since most submitters are also sequence database users, they will recognize that "what goes around, comes around".

Authorin for the IBM/PC and compatibles has been in public release since July. We have been quite active in encouraging user feedback on the program. A maintenance release is planned for first quarter 1990 and a Mac version for autumn 1990.

While Authorin may be somewhat more complicated for the average molecular biologist to use than a pencil and a paper form, our goal was to make it easy enough to use without reference to the user manual except in special cases (where the data being submitted are inherently complex, for example). Based on user reaction thus far, we believe we have largely met this goal. Of something like 45 individuals who've used the program and sent us comments, only one did not want to take the time to use (actually install) the program because he had one sequence to submit and did not think he would ever do any more sequencing.
A typical user response follows (I'm using it because it is the most recent, having arrived by e-mail yesterday, and is *truly* typical):

"I have just submitted sequence data to Genbank using AUTHORIN. Overall I think it is quite a good package, a clear improvement over the old method of sequence preparation. The use is intuitive for experienced computer users, and there is sufficient documentation and help for novices. I have no major criticisms or suggestions for improvements."

To reiterate what Paul Gilna said earlier: Authorin for IBM/PC and compatibles is available free of charge. Call (415) 962-7364, or send e-mail to authorin@genbank.bio.net, or write to

    GenBank
    c/o IntelliGenetics
    700 East El Camino Real
    Mountain View, CA 94040

At the moment, Authorin distribution is on 360-kb 5-1/4" floppies, so please send a postal (or, preferably, FED-X-able) address if you write. (On request, we can send the program on 3-1/2" 770-kb diskettes, also.)

On-line distribution of Authorin by anonymous ftp, archive server, etc. is currently under investigation. Please send recommendations and suggestions directly to me at the e-mail address below. FYI: the Authorin exe file is approximately 320kb; menus and help are stored in external files which amount to approximately 300kb.

Sincerely,

David Benton
GenBank Manager
415-962-7360
benton@genbank.ig.com
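
Illustrative note: the two integrity checks mentioned under point 2 (a term must come from a controlled vocabulary, and a feature location must not run past the ends of the sequence) can be sketched roughly as below. This is only a minimal sketch, not Authorin's actual code or the tp format. Every name in it (MOLECULE_TYPES, Feature, validate_submission) and the sample vocabulary are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass

# Assumed example vocabulary for one field. The real controlled
# vocabularies are defined by the databanks and are not reproduced here.
MOLECULE_TYPES = {"DNA", "RNA"}


@dataclass
class Feature:
    """A hypothetical sequence feature with 1-based, inclusive positions."""
    key: str
    start: int
    end: int


def validate_submission(sequence, molecule_type, features):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Check 1: the field value must be a term from the controlled vocabulary.
    if molecule_type not in MOLECULE_TYPES:
        problems.append(
            f"molecule type {molecule_type!r} is not in the controlled vocabulary"
        )

    # Check 2: both ends of every feature must lie within the sequence.
    length = len(sequence)
    for feature in features:
        in_bounds = 1 <= feature.start <= length and 1 <= feature.end <= length
        if not in_bounds:
            problems.append(
                f"feature {feature.key!r} location {feature.start}..{feature.end} "
                f"lies outside 1..{length}"
            )

    return problems
```

Calling validate_submission("ACGTACGTAC", "DNA", [Feature("CDS", 1, 10)]) returns an empty list, while a feature ending at position 12 on the same ten-base sequence would be reported. A menu-driven tool would prevent such entries up front instead of reporting them afterwards; the sketch only shows the conditions being enforced.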