[bionet.general] Genome Data Submission and NIH Policy

kristoff@NET.BIO.NET (Dave Kristofferson) (11/10/89)

Dear BIONEWS readers:

I was at the Wolf Trap Genome Sequencing Conference in Washington,
D.C. a couple of weeks back and noted along with the other attendees
that much progress has been made in sequencing technology.  PCR
methods appear to be coming on strong and the use of automated
sequenators for large projects is increasingly practical.  There was a
definite air of optimism that large scale sequencing may commence
earlier than previously anticipated in the Genome Project timeline.

However, the meeting started and ended on a policy question from the
NIH which was not really resolved.  Both Jim Watson at the start of
the meeting and Elke Jordan at the close of the meeting asked for the
conferees' input on the following question:

   $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

  How long should researchers be allowed to sit on data before it is
		     submitted to the databanks?

   $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Watson began the meeting by saying that they don't have a good answer
to this question yet, and the meeting ended in essentially the same
state of affairs with the discussion as usual revolving around
technical issues during the various sessions.

I'd like to make you all aware of the fact that several people at the
NIH regularly read these newsgroups including members of the Genome
Office.  If you have opinions on this subject and would like to make
them known, please post them to one of the following addresses for
this group:

Address					Location	Network
-------					--------	-------
bionews@irlearn.ucd.ie			Ireland		EARN/BITNET
bionews@uk.ac.daresbury			U.K.		JANET
bionews@bmc.uu.se			Sweden		Internet
bionews@net.bio.net			U.S.A. 		Internet/BITNET

(If you don't already participate in this newsgroup you can subscribe
by mailing to the address "biosci" instead of "bionews" at any of the
four sites above.  Please choose the most accessible one for your
location/network.)

I have heard opinions run the gamut from tolerating no delay at all
(i.e., contract out the sequencing effort to technicians at large
centers and require that all of the data be immediately accessible) to
allowing a six month to one year "waiting period" while the sequencers
milk the info for every publication that can be squeezed out of it
(the accompanying question is "how else do they get credit for this
arduous work?").  Waiting until the data is "suitable" for publication
is obviously too subjective a standard to go by (the case of
missing crystallography data was noted in this regard).

Unfortunately for hoarders, however, finding sites of interest in the
sequence data usually requires fairly complete databases for
comparisons, so one tends to shoot oneself in the foot to some extent
by not being open with the information.  I believe that this problem
applies equally to proposals to limit initial release of the data to
members of the community sequencing the species in question (e.g.,
nematode sequence data is pre-released to the nematode sequencing
labs, but not to the rest of the community).  Attempts to keep all of
one's information and study it alone will probably result in
missing much that is of importance.

What do you think?  You have a chance to voice opinions before the
rules are written in stone.  This message has been configured so that
replies to it will be automatically posted to BIONEWS.

				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@net.bio.net

HARPER@cc.helsinki.fi (ROBERT HARPER, FINLAND) (11/14/89)

In article <CMM.0.88.626683887.kristoff@NET.BIO.NET>, kristoff@NET.BIO.NET (Dave Kristofferson) writes:
> 
>   How long should researchers be allowed to sit on data before it is
> 		     submitted to the databanks?
> 

    At the moment there are two methods whereby data can be entered into a
    database. For example at EMBL they have annotators who go through
    journals and pick up sequence data... slow and tedious work. Then there
    is direct submission by authors, and I think that this is the direction
    that most databases would like to pursue.

    Submission could either be by disk or by E-mail. In either case what
    would be a good idea would be a programme to help researchers fill in
    their data in an acceptable form for either EMBL or GenBank... so that
    the annotators would only need to do a minimum amount of work to get the
    data into the database.

    Suggestions:
   	1) Network addresses for submission of data
    	2) Programme to simplify the codifying of data
    	3) Programme available on some server for downloading.

    -=ROB=-
                   

roberts@cshlab.bitnet (11/14/89)

Tom
        I already saw it.

        Basically I think there should be some time limit imposed, probably
a year or so at the outside.  However, I think people should be encouraged
to deposit data faster than that.  Good analytical programs will help make
the gap shorter.
rich

FUCHS@embl.bitnet ("Rainer Fuchs ", EMBL) (11/14/89)

 Rob Harper wrote in a previous message:

>    Submission could either be by disk or by E-mail. In either case what
>    would be a good idea would be a programme to help researchers fill in
>    their data in an acceptable form for either EMBL or GenBank... so that
>    the annotators would only need to do a minimum amount of work to get the
>    data into the database.
>
>    Suggestions:
>       1) Network addresses for submission of data
>       2) Programme to simplify the codifying of data
>       3) Programme available on some server for downloading.

 Data submission to EMBL via e-mail is already becoming more and more
 popular. Computer-readable submission forms are available from the EMBL file
 server. For details on e-mail submissions just send a short message to
 DATALIB@EMBL.

 A program to facilitate data input by researchers themselves is currently
 being developed by GenBank/IG. But I don't share Rob's enthusiasm about this
 kind of program. The main problem is not to get the data in a preprocessed
 form, but to get them *at all*. At EMBL (and I guess, at GenBank too) we
 waste much effort by scanning journals and entering sequence data from
 published articles. Unfortunately, not all publishers are willing to
 collaborate by making accession numbers mandatory for publication.

 I am not sure whether a data entry program would actually help us very much.
 A "normal" scientist does not produce sequence data frequently enough to
 become familiar with such a program. So it's much more work for him to enter
 the data using his computer than to fill in a submission form manually. The
 situation may change when the genome project produces output, so that
 sequence data arise daily. But then, the amount of data will make it
 impracticable for a scientist to enter them into a submission program. We
 should rather think about ways to cooperate with manufacturers of sequencing
 machines to include appropriate software directly in their machines.

 I think that at the moment a submission program will not help us very much,
 and that in the future we need much more sophisticated means of data entry
 and submission than a simple data entry program.

 Rainer

 Disclaimer: this is my personal opinion and is in no way representative of
 the whole EMBL Data Library

 -----------------------------------------------------------------------------
 Rainer Fuchs, Ph.D.                          | Post:    EMBL Data Library
                                              |          European Molecular
 EARN/Bitnet: fuchs@embl.bitnet               |          Biology Laboratory
 Internet: fuchs%embl.bitnet@cunyvm.cuny.edu  |          Meyerhofstr. 1
                                              |          D-6900 Heidelberg
 "Waiter, there's a bug in my soup!"          |          FRG
 "No, Sir, it's not a bug, it's a feature!"   | Phone:   +49-6221-387467

HARPER@cc.helsinki.fi (ROBERT HARPER, FINLAND) (11/15/89)

In article <8911141641.AA08811@net.bio.net>, FUCHS@embl.bitnet ("Rainer Fuchs ", EMBL) writes:

>  Data submission to EMBL via e-mail is already becoming more and more
>  popular. Computer-readable submission forms are available from the EMBL file
>  server. For details on e-mail submissions just send a short message to
>  DATALIB@EMBL.
     
      It is good to know that EMBL is open to submissions via E-mail. At the
      moment, what would you say are the percentages of submissions that
      come over the network and those that are entered directly from
      journals?
 
>  A program to facilitate data input by researchers themselves is currently
>  being developed by Genbank/IG. 

      From another source I heard that a programme called AUTHORIN has
      been developed for the codifying of data. What are the chances of
      making this available on the network?

>  I think that at the moment a submission program will not help us very much,
>  and that in the future we need much more sophisticated means of data entry
>  and submission than a simple data entry program.

      You may be right, but I think your views are rather visionary. It
      may be that for all their sophistication scientists still work with
      pencil and paper first, with computers if they can type, and with
      networks if they are really enthusiastic about communication. 
      If you had a software interface to collect data from sequencers the
      problem still remains of how to get the data from the lab bench to
      a central data library... and that is all to do with work habits.

      The key to the submission of data is user education. If the central
      data libraries would make clear statements about what they want
      from scientists, and provide them with simple, straightforward
      methods to accomplish these goals (submission forms, programmes
      for codifying data, network addresses to drop off data), then I
      think this would be the most important investment for the future.
      Then when your multi_channel_hyper_super_online_network_sequencer
      finally does arrive, people will have the knowledge and expertise
      to dump their stuff from one place to another. Technology in
      itself is not the complete answer... dare I say that the real
      battle is for the minds and the hearts of the sequencers :-)

      Rob "Precept upon precept" Harper

chen@gene.com (Ellson Chen) (12/05/89)

Dave,

Sorry for a slow response to this issue.
Frankly, I think it is a little too early to worry about this problem.
People will need to see some closer examples to get a true feeling about
this whole issue.   The large scale sequencing programs currently going
on will require at least 2 more years to reach a meaningful stage and I think
we need some feedback from those who have worked on these projects.

   I suggest that we keep this question in mind for a couple more years
without engraving any rules in stone.   Other questions such as sequencing
strategies, accuracy levels, interlab coordination, etc. are far more
important at this time.

Sincerely,

Ellson Chen
Genentech Inc.

[posted with the author's permission - D.K.]