[news.software.nntp] Distributing GenBank over the Internet

roy@phri.nyu.edu (Roy Smith) (12/08/89)

	Would it be possible to get rid of all this magtape and distribute
GenBank over the Internet?  One possibility is being able to ftp the tar
files.  Another more interesting possibility is using NNTP, or something
like it.  Make each locus into a netnews article.  Multiple links would
ensure connectivity in the case of some machine being down and NNTP's
IHAVE/SENDME processing would eliminate duplicates.  People could even
subscribe to just those parts of the database they are interested in by
just getting, for example, bionet.database.gb.bacterial.staph, or whatever.
Keep in mind that Usenet has progressed from magtape to NNTP for data
transport; why shouldn't the genetic databases follow suit?
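
To make the idea concrete, here is a rough sketch (in Python, with a made-up
peer host and message-id; an illustration of the NNTP IHAVE exchange, not
working distribution software) of how the duplicate suppression would work:
the sender offers each entry by message-id, and a peer that already holds
that entry answers 435, so nothing gets transferred twice.

import socket

def offer_article(peer_host, message_id, article_lines, port=119):
    """Offer one GenBank-entry-as-article to a peer; send it only if wanted."""
    with socket.create_connection((peer_host, port)) as sock:
        conn = sock.makefile("rw", newline="")
        conn.readline()                              # 200/201 greeting
        conn.write("IHAVE %s\r\n" % message_id); conn.flush()
        reply = conn.readline()
        if reply.startswith("435"):                  # peer already has it; skip
            return False
        if not reply.startswith("335"):              # peer can't take it right now
            raise RuntimeError(reply.strip())
        for line in article_lines:                   # dot-stuff and send the article
            if line.startswith("."):
                line = "." + line
            conn.write(line + "\r\n")
        conn.write(".\r\n"); conn.flush()
        return conn.readline().startswith("235")     # 235 = transferred OK

# e.g. offer_article("peer.example.edu", "<M12345.STAAU@genbank.bio.net>",
#                    open("M12345.entry").read().splitlines())
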
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

benton@presto.IG.COM (David Benton) (12/08/89)

Roy Smith wrote:

> Would it be possible to get rid of all this magtape and distribute
> GenBank over the Internet?  One possibility is being able to ftp the
> tar files.  Another more interesting possibility .......

We are pleased to announce that the current GenBank release is now
available for FTP from the GenBank On-Line Service FTP directory.

Release 61 can be found in the directory ftp/pub/db/gb-rel61; all
data and index files there are stored in compressed format.  The
data files alone comprise approximately 27 MB
(compressed) and the indexes and text files an additional 4 MB.

Anonymous access to the GenBank FTP directories (which include,
in addition to the most recent quarterly release of GenBank, weekly
update files for both the GenBank and EMBL nucleotide sequence
data banks and contributed software) is available by login to
genbank.bio.net.  Use the username "anonymous" and give your
surname as the password.
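
For those who would rather script the transfer, something along these lines
should work (a sketch using Python's ftplib; it assumes the announced
directory is pub/db/gb-rel61 relative to the anonymous login directory, and
"smith" stands in for your own surname):

from ftplib import FTP

ftp = FTP("genbank.bio.net")
ftp.login(user="anonymous", passwd="smith")   # give your surname as the password
ftp.cwd("pub/db/gb-rel61")                    # quarterly release directory
for name in ftp.nlst():                       # compressed data, index, and text files
    with open(name, "wb") as out:
        ftp.retrbinary("RETR " + name, out.write)
ftp.quit()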

While we have not tested the recent Bitnet FTP on large files, we
are urging Bitnet users to exercise restraint until the impact on the
network of transferring large databases can be determined.


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com

cavrak@uvm-gen.UUCP (Steve Cavrak,Waterman 113,656-1483,) (12/11/89)

From article <1989Dec7.213027.8591@phri.nyu.edu>, by roy@phri.nyu.edu
(Roy Smith, of the Public Health Research Institute):

> 	Would it be possible to get rid of all this magtape and distribute
> GenBank over the Internet?  One possibility is being able to ftp the tar
> files.  Another more interesting possibility is using NNTP, or something
> like it.  Make each locus into a netnews article.  Multiple links would
> ensure connectivity in the case of some machine being down and NNTP's
> IHAVE/SENDME processing would eliminate duplicates.  People could even
> subscribe to just those parts of the database they are interested in by
> just getting, for example, bionet.database.gb.bacterial.staph, or whatever.
> Keep in mind that Usenet has progressed from magtape to NNTP for data
> transport; why shouldn't the genetic databases follow suit?
> -- 

Generally this is a good idea and makes a lot of sense --- especially
if the database could be broken up into small pieces.  The thought of
redistributing ALL of the database 4 times a year to ALL of the
subscribers should cause someone's teeth to grind, however.

The suggestion to distribute information on demand makes more sense
in terms of lowering the network traffic, but then how would the
individual user know whether her copy of a database was "up to date"?  The
"news" model would be almost essential on this point.

Taking the suggestion one step further, why "distribute" the database at
all?  Why not pursue a "server" model where queries against the
database could be directed to one (or several) "database servers"?

The other alternative is to just publish the database on CD-ROM and
distribute it that way.  

Dave Hill, the instructor of a networking course I took a few years
back, pointed out that the bandwidth of a 747 loaded with floppy disks
was nothing to yawn at, that somewhere along the line the 747 may be more
economical than a network, and that in our environment these costs
may change daily.

Just where the crossover point is today might be very interesting to
calculate.  Just how many installations out there receive copies of
the data?

Steve

  _______
||       | Stephen J. Cavrak, Jr.               BITNET:  sjc@uvmvm
 |*     |                                       CSNET :  cavrak@uvm
 |     /   Academic Computing Services          USENET:  cavrak@uvm-gen
 |    |    University of Vermont
 |   |     Burlington, Vermont 05405
 ----

roy@phri.nyu.edu (Roy Smith) (12/12/89)

[NOTE: this is only marginally related to nntp issues, so I've directed
followups to bionet.molbio.genbank only]

In  <1364@uvm-gen.UUCP> cavrak@uvm-gen.UUCP (Steve Cavrak) writes:
>Taking the suggestion one step further, why "distribute" the database at
>all?  Why not pursue a "server" model where queries against the
>database could be directed to one (or several) "database servers"?

	Several reasons.  First, the stupid one, but possibly the one which
will prove most significant: people want their own copy on their own disk.
Never mind that they don't really need it; they want it.  Similar arguments
recently surfaced in another forum on the "central 50 MIPS machine with an
X-terminal on each desk vs. lots of 2 MIPS Unix workstations" issue.

	That said, the real problem with a query server is that you limit
the types of queries you allow people to do.  I haven't used the currently
available servers, but I gather they allow you to retrieve an entry by locus
name or accession number, or do searches using one or another of the
fasta family of programs.  That's great, but what if you want to do
something different?

	One of the things we do is translate the whole genbank data base
into a protein data base using some code kindly provided by Jim Fickett
years ago which parses the feature tables.  True, with tfasta you get sort
of the same effect, but running tfasta against genbank is a lot slower than
running fasta against our ficketized database.  There are advantages and
disadvantages to both ways, but the point is that with just a query server we
would not have had the option to do it the way we do.  Sometimes people do
searches by just grepping the genbank files; the keyword indices don't
always have what you want and sometimes it's nice to just grep the
definition or comment lines.  Maybe it would be possible to make the
databases available via a publicly (read-only!) mountable NFS file system?
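
For what it's worth, the quick-and-dirty scan I have in mind amounts to no
more than the following sketch (Python; the file name and search pattern are
only examples, and it looks at just the first line of each DEFINITION or
COMMENT field):

import re

def scan(flatfile, pattern):
    """Print the LOCUS name of each entry whose DEFINITION or COMMENT matches."""
    want = re.compile(pattern, re.IGNORECASE)
    locus, matched = None, False
    with open(flatfile) as fh:
        for line in fh:
            if line.startswith("LOCUS"):
                locus, matched = line.split()[1], False
            elif line.startswith(("DEFINITION", "COMMENT")) and want.search(line):
                matched = True
            elif line.startswith("//") and matched:  # "//" ends a GenBank entry
                print(locus)

# e.g. scan("gbbct.seq", r"staphylococc")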

	On the other hand, we are able to devote significant amounts of disk
space to the databases (our /usr/database file system is something over 100
Mbytes) and have the CPU power and time to make use of the material.  I
would imagine that for people with a PC and a 40 Meg hard disk in their lab,
a query server might be exactly what they need.  I honestly don't know which
type of installation is more typical.

>The other alternative is to just publish the database on CD-ROM and
>distribute it that way.

	CD-ROM is nice, but it doesn't really solve the problems that tape has.
You still have to get a physical object from point A to point B, and you
still have to produce those objects.  How long does it take to press CDs
compared to the time it takes to cut tapes?  Also, from what I know of CDs,
they are much slower than magnetic hard disks.  And I'm not sure that
CD-ROM is really practical yet.  Maybe in a couple of years, but it's still
pretty much a specialty item today.

> the bandwidth of a 747 loaded with floppy disks was nothing to yawn at.

	I've always heard it expressed in terms of a station wagon full of
mag tapes, but the point is well taken.  In the best case, FedEx can get a
magtape from me to you in about 16 hours.  I usually figure you can fit
about 150 Mbytes on a 2400' reel at 6250 bpi with a large blocking factor.
Unless I did the math wrong, that works out to an effective bandwidth of
about 22 kbps.  Of course, both the magtape and the serial link can gain a
factor of 2-4 by using L-Z compression.  But then again, is it unreasonable
to assume that most of the people who want genbank have 1.5Mbps or (at the
very least) 56kbps connections to something connected to NSFNet, or will
have such within a couple of years (i.e. the same time scale I hypothesize
for the ubiquitization of CD-ROMs)?
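
For anyone who wants to check that figure, the arithmetic is just this (the
numbers are my rough estimates, not measurements):

tape_bytes = 150 * 2**20          # ~150 Mbytes on a 2400' reel at 6250 bpi
transit_s  = 16 * 3600            # best-case FedEx time, door to door
print("%.1f kbps" % (tape_bytes * 8 / transit_s / 1000))   # -> about 21.8 kbps
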
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

benton@presto.IG.COM (David Benton) (12/12/89)

I won't reply to all the points made in two recent postings (from
Stephen Cavrak and Roy Smith) on this topic, since I take no exception
to most of them.  I will try to fill in some gaps as to what GenBank
is doing to distribute the database and what we have planned.

> Generally this is a good idea and makes a lot of sense --- especially
> if the database could be broken up into small pieces.  The thought of
> redistributing ALL of the database 4 times a year to ALL of the
> subscribers should cause someone's teeth to grind, however.

The main reason I see for distributing the data by network is so
we can increase the frequency from 4 times to 52 times a year.  I
don't know how many sites will actually want to FTP the entire
quarterly release (compressed), but it's not much extra work for
us to provide it, so why begrudge those who want it that way?  I
might just point out that neither IntelliGenetics nor the NIH makes
any money from the distribution of GenBank on mag tapes or floppy
diskettes, so we certainly have no objection to anyone getting the
data off the net.

> The suggestion to distribute information on demand makes more sense
> in terms of lowering the network traffic, but then how would the
> individual user know whether her copy of a database was "up to date"?  The
> "news" model would be almost essential on this point.

In the simple system now operating, she knows by virtue of the file
and directory names she FTP'ed.  This, of course, puts the burden
of asking for the data on the recipient.

> Taking the suggestion one step further, why "distribute" the database at
> all?  Why not pursue a "server" model where queries against the
> database could be directed to one (or several) "database servers"?
> 
> The other alternative is to just publish the database on CD-ROM and
> distribute it that way.  

EMBL has distributed at least one release on CD-ROM and we plan to
start quarterly releases of GenBank on CD-ROM in the spring of '90.
Our main reason is to furnish the data to the large number of users
who are now ordering GenBank on floppy diskettes.  GenBank Rel 61
(Sep 1989) required 98 360-KB floppies (and the entries in the floppy
disk format are stripped of comments and reference titles).

> Just where the crossover point is today might be very interesting to
> calculate.  Just how many installations out there receive copies of
> the data?

GenBank ships about 125 copies of each quarterly release on magnetic
tape (3 file formats, 3 media types) and about 350 copies of each
semi-annual release on floppy diskette (1 file format on XT, AT, and
Mac disks).  Some of the mag tape recipients are secondary
distributors of the data (usually reformatted in some way).  From the
information they have sent us, it appears that something like 450
additional copies of the data are sent to individuals who do not
get the data directly from GenBank.

(Part 2 to follow.)


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com