[bionet.molbio.genbank] Distributing GenBank over the Internet

roy@phri.nyu.edu (Roy Smith) (12/08/89)

	Would it be possible to get rid of all this magtape and distribute
GenBank over the Internet?  One possibility is being able to ftp the tar
files.  Another more interesting possibility is using NNTP, or something
like it.  Make each locus into a netnews article.  Multiple links would
ensure connectivity in the case of some machine being down and NNTP's
IHAVE/SENDME processing would eliminate duplicates.  People could even
subscribe to just those parts of the database they are interested in by
just getting, for example, bionet.database.gb.bacterial.staph, or whatever.
Keep in mind that usenet has progressed from magtape to NNTP for data
transport; why shouldn't the genetic databases follow suit?
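
	Just to make the idea concrete, here is a rough sketch of how a locus
might be wrapped as a news-style article.  None of this is an existing GenBank
or BIONET convention; the group names, headers, and LOCUS-line parsing below
are assumptions for illustration only.

    # Hypothetical sketch: wrap one GenBank flat-file entry as a netnews-style
    # article so it could be flooded via NNTP as described above.  The group
    # hierarchy and header choices are made up for this example.
    def locus_to_article(entry_text):
        """entry_text is one GenBank entry, LOCUS line through the // line."""
        locus_line = entry_text.splitlines()[0]
        fields = locus_line.split()
        locus_name = fields[1]                 # e.g. "ECOALAS"
        division = fields[-2].lower()          # e.g. "bct" for bacterial entries
        headers = [
            "Newsgroups: bionet.database.gb." + division,
            "Message-ID: <" + locus_name + "@genbank.bio.net>",  # dedup key for IHAVE/SENDME
            "Subject: GenBank locus " + locus_name,
            "Expires: never",                  # database entries should not expire like news
        ]
        return "\n".join(headers) + "\n\n" + entry_text
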
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

benton@presto.IG.COM (David Benton) (12/08/89)

Roy Smith wrote:

> Would it be possible to get rid of all this magtape and distribute
> GenBank over the Internet?  One possibility is being able to ftp the
> tar files.  Another more interesting possibility .......

We are pleased to announce that the current GenBank release is now
available for FTP from the GenBank On-Line Service FTP directory.

Release 61 can be found in the directory ftp/pub/db/gb-rel61; all
data and index files there are in compressed format.  The data files
alone comprise approximately 27 MB (compressed) and the indexes and
text files an additional 4 MB.

Anonymous access to the GenBank FTP directories (which include,
in addition to the most recent quarterly release of GenBank, weekly
update files for both the GenBank and EMBL nucleotide sequence
data banks and contributed software) is available by login to
genbank.bio.net.  Use the username "anonymous" and give your
surname as the password.
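
For those who have not used anonymous FTP before, the whole transfer can be
scripted.  The sketch below (Python, using its standard ftplib) shows one way
to do it; the host, directory, and login convention come from the announcement
above, while the file handling is only illustrative.

    # Sketch of an automated anonymous-FTP fetch of the Release 61 directory.
    from ftplib import FTP

    ftp = FTP("genbank.bio.net")
    ftp.login(user="anonymous", passwd="smith")   # use your own surname
    ftp.cwd("pub/db/gb-rel61")                    # quarterly release directory
    for name in ftp.nlst():                       # compressed data and index files
        with open(name, "wb") as out:
            ftp.retrbinary("RETR " + name, out.write)
    ftp.quit()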

While we have not tested the recent Bitnet FTP on large files, we
are urging Bitnet users to exercise restraint until the impact on the
network of transferring such large databases can be determined.


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com

mike@pasteur.cvm.uiuc.edu (Michael Trogni) (12/10/89)

Great idea, Roy, but I don't think most sites are network-wise
enough to install NNTP and get it running smoothly.  Aren't most
of the biologists on these BIONET lists from BITNET?
Also, does the data have an Expires time of never?

ps: new updates of genbank *are* available now from FTP in tar files.

cavrak@uvm-gen.UUCP (Steve Cavrak,Waterman 113,656-1483,) (12/11/89)

From article <1989Dec7.213027.8591@phri.nyu.edu>, by roy@phri.nyu.edu
(Roy Smith, of the Public Health Research Institute):

> 	Would it be possible to get rid of all this magtape and distribute
> GenBank over the Internet?  One possibility is being able to ftp the tar
> files.  Another more interesting possibility is using NNTP, or something
> like it.  Make each locus into a netnews article.  Multiple links would
> ensure connectivity in the case of some machine being down and NNTP's
> IHAVE/SENDME processing would eliminate duplicates.  People could even
> subscribe to just those parts of the database they are interested in by
> just getting, for example, bionet.database.gb.bacterial.staph, or whatever.
> Keep in mind that usenet has progressed from magtape to NNTP for data
> transport, why shouldn't the genetic databases follow suit?
> -- 

Generally this is a good idea and makes a lot of sense --- especially
if the database could be broken up into small pieces.  The thought of
redistributing ALL of the database 4 times a year to ALL of the
subscribers should cause someone's teeth to grind, however.

The suggestion to distribute information on demand makes more sense
in terms of lowering the network traffic, but then how would the
individual user know whether her copy of the database was "up to date"?
The "news" model would be almost essential on this point.

Taking the suggestion one step further, why "distribute" the database at
all?  Why not pursue a "server" model where queries against the
database could be directed to one (or several) "database servers"?

The other alternative is to just publish the database on CD-ROM and
distribute it that way.  

Dave Hill, the instructor of a networking course I took a few years
back, pointed out that the bandwidth of a 747 loaded with floppy disks
was nothing to yawn at.  Somewhere along the line the 747 may be more
economical than a network, and in our environment these costs may
change daily.

Just where the crossover point is today might be very interesting to
calculate.  Just how many installations out there receive copies of
the data?

Steve

  _______
||       | Stephen J. Cavrak, Jr.               BITNET:  sjc@uvmvm
 |*     |                                       CSNET :  cavrak@uvm
 |     /   Academic Computing Services          USENET:  cavrak@uvm-gen
 |    |    University of Vermont
 |   |     Burlington, Vermont 05405
 ----

roy@phri.nyu.edu (Roy Smith) (12/12/89)

[NOTE: this is only marginally related to nntp issues, so I've directed
followups to bionet.molbio.genbank only]

In  <1364@uvm-gen.UUCP> cavrak@uvm-gen.UUCP (Steve Cavrak) writes:
>Taking the sugestion one step further, why "distribute" the database at
>all ?  Why not pursue a "server" model where queries against the
>database could be directed to one (or several) "database servers".

	Several reasons.  First, the stupid one, but possibly the one which
will prove most significant. People want their own copy on their own disk.
Never mind that they don't really need it, they want it.  Similar arguments
recently surfaced in another forum on the "central 50 MIPS machine with an
X-terminal on each desk vs. lots of 2 MIPS Unix workstations" issue.

	That said, the real problem with a query server is that you limit
the types of queries you allow people to do.  I haven't used the currently
available servers, but I gather they allow you to retrieve an entry by locus
name or accession number, or do searches using one or another of the
fasta family of programs.  That's great, but what if you want to do
something different?

	One of the things we do is translate the whole genbank data base
into a protein data base using some code kindly provided by Jim Fickett
years ago which parses the feature tables.  True, with tfasta you get sort
of the same effect, but running tfasta against genbank is a lot slower than
running fasta against our ficketized database.  There are advantages and
disadvantages to both ways, but the point is with just a query server, we
would not have had the option to do it the way we do.  Sometimes people do
searches by just grepping the genbank files; the keyword indices don't
always have what you want and sometimes it's nice to just grep the
definition or comment lines.  Maybe it would be possible to make the
databases available via a publicly (read-only!) mountable NFS file system?
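
	For what it's worth, the "just grep the flat files" style of search is
easy to script.  The sketch below scans one release file for entries whose
DEFINITION line matches a pattern and prints the locus names; the field names
are from the standard flat-file format, but the file name and pattern are only
examples, and multi-line DEFINITIONs are ignored for brevity.

    # Toy scan of a GenBank flat file: report loci whose DEFINITION matches.
    import re

    pattern = re.compile(r"alanine", re.IGNORECASE)   # example query
    locus = None
    with open("gbbct.seq") as db:                     # one division of the release
        for line in db:
            if line.startswith("LOCUS"):
                locus = line.split()[1]
            elif line.startswith("DEFINITION") and pattern.search(line):
                print(locus, line.rstrip())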

	On the other hand, we are able to devote significant amounts of disk
space to the databases (our /usr/database file system is something over 100
Mbytes) and have the CPU power and time to make use of the material.  I
would imagine that for people with a PC and a 40 Meg hard disk in their lab,
a query server might be exactly what they need.  I honestly don't know which
type of installation is more typical.

>The other alternative is to just publish the database on CD-ROM and
>distribute it that way.

	CD-ROM is nice, but doesn't really solve the problems that tape has.
You still have to get a physical object from point A to point B, and you
still have to produce those objects.  How long does it take to press CDs
compared to the time it takes to cut tapes?  Also, from what I know of CDs,
they are much slower than magnetic hard disks.  Also, I'm not sure that
CD-ROM is really practical yet.  Maybe in a couple of years, but it's still
pretty much of a specialty item today.

> the bandwidth of a 747 loaded with floppy disks, was nothing to yawn at.

	I've always heard it expressed in terms of a station wagon full of
mag tapes, but the point is well taken.  In the best case, FedEx can get a
magtape from me to you in about 16 hours.  I usually figure you can fit
about 150 Mbytes on a 2400' reel at 6250bpi with a large blocking factor.
Unless I did the math wrong, that works out to an effective bandwidth of
about 22 kbps.  Of course, both the magtape and the serial link can gain a
factor of 2-4 by using L-Z compression.  But then again, is it unreasonable
to assume that most of the people who want genbank have 1.5Mbps or (at the
very least) 56kbps connections to something connected to NSFNet, or will
have such within a couple of years (i.e. the same time scale I hypothesize
for the ubiquitization of CD-ROMs)?
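
	The arithmetic, for anyone who wants to check it or plug in their own
numbers (the figures are the same ones quoted above):

    # FedEx-bandwidth estimate: ~150 Mbytes per 2400-foot reel at 6250 bpi,
    # 16 hours door to door.
    reel_bytes   = 150 * 1024 * 1024
    transit_secs = 16 * 3600
    bits_per_sec = reel_bytes * 8 / transit_secs
    print("effective bandwidth: %.0f kbps" % (bits_per_sec / 1000))   # ~22 kbps
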
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

Mats.Sundvall@bio.embnet.se (12/12/89)

In article <1989Dec9.190611.10877@ux1.cso.uiuc.edu>, mike@pasteur.cvm.uiuc.edu (Michael Trogni) writes:
> Great idea, Roy, but I don't think most sites are network-wise
> enough to install NNTP and getting it running smoothly.  Aren't most
> of the biologists on these BIONET lists from BITNET?
> Also, does the data have a Expire time of never?
> 
> ps: new updates of genbank *are* available now from FTP in tar files.


Well, you could use the protocol in an installation that is not dependent
on news.  I discussed this with Roy Omond from EMBL at a meeting in Norway.
Maybe we will try it out one of these days.

Let me clarify what I mean by not being dependent on news.
If you write a program that works like an NNTP server but on another
TCP port, that is easy to install, and that picks entries off the network
and puts them into a database or wherever you want them, you could establish
a network for sequence distribution.
If you want to establish redistribution of sequences around the network,
you also write a feed program.

You have to keep a list of NNTP ids that you have in your database.  These ids
can be the accession number plus the revision date of the sequence.
Then you can implement ways of receiving updates of already existing sequences.
You can of course use some header to classify sequences if you are only
interested in parts of the database.
I also would like a checksum calculation in the protocol.
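
To show roughly what I mean, here is a small sketch of that bookkeeping: the
id is the accession number plus the revision date, duplicates are suppressed
IHAVE/SENDME-style, and a checksum travels with the entry.  The id format and
the checksum choice are only examples, not part of any real protocol.

    # Sketch of duplicate suppression keyed on accession + revision date,
    # with a checksum for verifying the transferred entry.
    import zlib

    have = set()                               # ids already in the local database

    def entry_id(accession, revision_date):
        return "<%s.%s@seqnet>" % (accession, revision_date)

    def offer(accession, revision_date, entry_text):
        """IHAVE-style offer: accept only revisions we do not already hold."""
        eid = entry_id(accession, revision_date)
        if eid in have:
            return None                        # duplicate; no SENDME issued
        checksum = zlib.crc32(entry_text.encode())
        have.add(eid)
        return eid, checksum                   # store the entry, keep the checksum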

Not very hard to implement, but will it be useful in the light of GenBank's
"new" transaction protocol?  Is that protocol better?  Probably not as simple.

	Mats Sundvall
	University of Uppsala
	Sweden

benton@presto.IG.COM (David Benton) (12/12/89)

I won't reply to all the points made in two recent postings (from
Stephen Cavrak and Roy Smith) on this topic, since I take no exception
with most of them.  I will try to fill in some gaps as to what GenBank
is doing to distribute the database and what we have planned.

> Generally this is a good idea and makes a lot of sense --- especially
> if the database could be broken up to small pieces.  The thought of
> redistributing ALL of the database 4 times a year to ALL of the
> subscribers should cause someone's teeth to grind, however.  

The main reason I see for distributing the data by network is so
we can increase the frequency from 4 times to 52 times a year.  I
don't know how many sites will actually want to FTP the entire
quarterly release (compressed), but it's not much extra work for
us to provide it, so why begrudge those who want it that way?  I
might just point out that neither IntelliGenetics nor the NIH makes
any money from the distribution of GenBank on mag tapes or floppy
diskettes, so we certainly have no objection to anyone getting the
data off the net.

> The suggestion to distribute information per demand makes more sense
> in terms of lowering the network traffic, but then how would the
> individual user know her copy of a database were "up to date"?  The
> "news" model would almost be essential to this point.

In the simple system now operating, she knows by virtue of the file
and directory names she FTP'ed.  This, of course, puts the burden
of asking for the data on the recipient.

> Taking the suggestion one step further, why "distribute" the database at
> all ?  Why not pursue a "server" model where queries against the
> database could be directed to one (or several) "database servers".  
> 
> The other alternative is to just publish the database on CD-ROM and
> distribute it that way.  

EMBL has distributed at least one release on CD ROM and we plan to
start quarterly releases of GenBank on CD ROM in the spring of '90.
Our main reason is to furnish the data to the large number of users
who are now ordering GenBank on floppy diskettes.  GenBank Rel 61
(Sep 1989) required 98 360-kb floppies (and the floppy disk format
entries are stripped of comments and reference titles).

> Just where the crossover point is today might be very interesting to
> calculate.   Just how many installations out there receive copies of
> the data ?

GenBank ships about 125 copies of each quarterly release on magnetic
tape (3 file formats, 3 media types) and about 350 copies of each
semi-annual release on floppy diskette (1 file format on XT, AT, and
Mac disks).  Some of the mag tape recipients are secondary
distributors of the data (usually reformatted in some way).  From the
information they have sent us, it appears that something like 450
additional copies of the data are sent to individuals who do not
get the data directly from GenBank.

(Part 2 to follow.)


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com

benton@presto.IG.COM (David Benton) (12/12/89)

>	CD-ROM is nice, but doesn't really solve the problems that tape has.
>You still have to get a physical object from point A to point B, and you
>still have to produce those objects.  How long does it take to press CD's
>compared to the time it takes to cut tapes?  Also, from what I know of CD's,
>they are much slower than magnetic hard disks.  Also, I'm not sure that
>CD-ROM is really practical yet.  Maybe in a couple of years, but it's still
>pretty much of a specialty item today.

No argument about getting the physical object from point A to point B,
but Release 62 (in the works) is going to require 5 reels @ 1600bpi
(yes, there are sites that want GenBank and cannot receive 6250) and
Rel 63 (March, the next release on floppies) will probably require
something like 125 360-kb floppies.  Since we are distributing more
than 100 copies in the 360-kb density, I don't think we can
morally stop the pain of mastering, packaging, and shipping those
floppy releases until we provide a viable alternative.  (And even
if our morals would let us, the GenBank Project Officer would remind
us of our contractual obligation to release on floppies.  We're hoping
the floppy release (at least on the LD disks) will die a natural
death after the CD ROMs are available.)

I'll be able to answer your questions about actual production of the
CD's after we've had the experience, but the CD ROM pressing
operations are quoting prices for 1, 3, and 5 day turn-around times.
The big difference between CD's and mag tapes is that the time to
create one release is really independent of both the amount of data
in the release and the number of copies being produced.  Right now,
producing one copy of the mag tape release at 1600 bpi takes over an
hour (even assuming the operator is waiting to remove the reel as soon
as it rewinds).  It is more economical to produce CD ROMs than to spin
tapes now and the cost differential will only increase as the size of
the database increases.  Since the GenBank contract requires that the
incremental costs of distributing the data be recovered from the
users, the price of a GenBank release must go up every time the size
of the database requires another reel of tape for the release or more
floppies (every release for floppies).  Even though there is a
significant set-up charge for pressing CD's, we are hoping that there
will be enough demand to keep the per-unit costs of the CD below or
very near the cost of the most economical release on mag tape.  So we
are hoping for a greater ubiquity of CD ROM's in a few months than you
are anticipating.

As a distribution medium, CD ROM's transfer rates are certainly in the
same ballpark as those of 1600-bpi mag tapes.  In timing tests I've
done on raw read rates, the CD ROM was about half the speed of a hard
disk (both running on a 386 machine) if one used block reads with a
large buffer (30-60 kbytes).  CD ROM seek times are notoriously slow,
but that simply means you must design the database file formats to
reduce the number of seeks (keeping files contiguous on CD ROM is no
problem).  We hope the database format we've designed for the CD ROM
will result in acceptable performance for those who want to use the
CD as a database medium, but I won't go into the details of that
format here.
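
A raw-read test of that sort is essentially sequential block reads with a
large buffer, along the lines of the sketch below; the file name and buffer
size here are illustrative, not the actual test program.

    # Sequential read with a large block size -- the access pattern that keeps
    # CD-ROM throughput respectable by avoiding seeks.
    import time

    BLOCK = 60 * 1024                          # 60-kbyte buffer

    def raw_read_rate(path):
        total = 0
        start = time.time()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                total += len(chunk)
        elapsed = max(time.time() - start, 1e-6)
        return total / elapsed                 # bytes per second

    print("%.0f kbytes/sec" % (raw_read_rate("gbpri.seq") / 1024))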

I hope my digression into the arcana of the GenBank contract's
requirements has not stifled all discussion on this newsgroup, but
has helped explain our motivation for the CD ROM release.  I also
hope that through the discussion, it's become clear that whether
by FTP or mail or mag tape or floppies or CD ROM or through the
interactive GenBank On-Line Service our goal is to make the data
available to the greatest number of researchers in the most useful
forms we can.  In that spirit, I'd just add that we are always
interested to learn how we can better serve the database's users
and are happy to propose (to the NIH) any reasonable request for
a modification in the service we provide.  And I mean that,

					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com

Mats.Sundvall@bio.embnet.se (12/13/89)

In article <Dec.11.22.42.24.1989.24946@presto.IG.COM>, benton@presto.IG.COM (David Benton) writes:
> 
> 
>> The suggestion to distribute information per demand makes more sense
>> in terms of lowering the network traffic, but then how would the
>> individual user know her copy of a database were "up to date"?  The
>> "news" model would almost be essential to this point.
> 
Distributing per demand does not necessarily lower the network traffic.
A distribution scheme with redistribution like NNTP will spread the database
around the world without multiple distributions through the same link, as is the
case with FTP.
  
	Mats Sundvall,
	University of Uppsala
	Sweden

kristoff@genbank.BIO.NET (David Kristofferson) (12/14/89)

> Distributing per demand does not necessarily lower the network traffic.
> A distribution scheme with redistribution like NNTP will spread the database
> around the world without multiple distributions through the same link as is the
> case with FTP.
>   
> 	Mats Sundvall,
> 	University of Uppsala
> 	Sweden

Yes, Mats, but as you and I both know from our joint efforts the
biological community is unfortunately often on the trailing edge of
the information technology revolution.  Here it is going on the 21st
century and one still hears talk of moving into the 20th century 8-)!
Too many people still have access only to BITNET capabilities and
would never receive the information through the scheme that you
proposed without auxiliary mailing lists.  As it is, I have had
queries about using the BITNET FTP servers to try and retrieve the
massive GenBank files and the thought of people attempting this on a
large scale resulting in massive network congestion brings tears to my
eyes 8-(.  When is the academic community going to wake up to the need
for better computing and networking facilities for biology?  In my
former incarnation as BIONET manager I spoke to heads of
departments who expressed the attitude, "There is always some guy in
every lab who would rather ____-around with a computer than work."
(Guess I always was rather lazy 8-)!!  I think that this attitude is
more widespread than many would like to admit.  Perhaps, to distort
Max Planck's famous dictum about scientific progress, there will be
hope for the next generation???!!!

In my humble opinion every biology department in any self-respecting
university should have a department committee on computing and
networking.  They should be actively investigating setting up local
networks, connecting these to larger scale, high speed networks like
the Internet, getting access to electronic communications capabilities
particularly newsreading software (which is in the public domain),
etc.  I've been on this soapbox for going on four years now, and I
must admit that things are starting to change.  However, the goal is
still a long way off and was set back somewhat with the hit taken by
BIONET.  Nonetheless both you and I are still smiling and forging
onward 8-)!  Just think of the stories we'll be able to tell our
grandkids...  

"What grandpa???  You mean that people used to work at 1200 baud???????????"  

"Yes, Virginia, but let me tell you about a group that I knew who only
used 300 baud!!!!!"

"Clunk"  (sound of fainting little body hitting the floor)

Merry Christmas and Happy New Year Everyone!!!
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

roy@phri.nyu.edu (Roy Smith) (12/14/89)

kristoff@genbank.BIO.NET (David Kristofferson) writes:
> the biological community is unfortunately often on the trailing edge of the
> information technology revolution. [...] every biology department in any
> self-respecting university should have a department committee on computing
> and networking.  They should be actively investigating setting up local
> networks, connecting these to larger scale, high speed networks like the
> Internet, getting access to electronic communications capabilities
> particularly newsreading software (which is in the public domain), etc

	Dave, I think you hit the nail directly on the head.  Less than two
years ago, after much planning, scheming, begging favors, and scrounging
spare equipment, I proposed setting the PHRI up with a 9600 baud SLIP
connection to the Internet which would have involved a total outlay of $300
to install a LADC circuit and $75 a month to maintain it.  I was floored when
the response came back that I was proposing an extremely expensive way to
send email (the value of which had not been demonstrated) and my request was
turned down!  I did keep at it though, and somehow (I'm still not sure how) I
managed to pry loose about $10k in capital costs to set up a 56kbps link
(with the same $75/month LADC rental charge).

	Little by little, people are making use of the infrastructure I
fought so hard to put in place.  I was somewhat surprised (pleasantly) when a
new person came to the Institute a few months ago and asked me how to send
email.  He didn't ask me *if* we had it, he took it for granted that we did
and just wanted to know the local juju.

	For the past week or two, we've been batting around ways to deal
with distributing genbank.  I think it's all pretty much agreed that some
sort of network file transfer over high-speed links would be the ideal way.
Unfortunately, that possibility has been dismissed as the standard
distribution channel because not enough genbank subscribers have the needed
links.  Perhaps the solution is to get those links in place, and the local
expertise in place to use those links effectively and keep them running.

	I am also constantly surprised when visiting scientists remark,
either directly to me, or to other scientists who pass it on, that they are
really impressed with our computer setup.  Every one of our scientists has at
worst an ASCII terminal in his or her office.  Some have diskless Sun
workstations.  Some have Macs.  A few poor misguided souls have PCs :-) They
are all somehow networked with everything else.  We have a random assortment
of sequence analysis programs and the Internet connection described above.
So why am I surprised when people are so impressed?  Because I consider what
we have to be just barely passable.  If most of the visitors here are
impressed, that must mean that what most people have available to them in the
way of computer resources is a disaster.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

smith@mcclb0.med.nyu.edu (Ross Smith: (212) 340-5356) (12/14/89)

In article <1989Dec7.213027.8591@phri.nyu.edu>, roy@phri.nyu.edu (Roy Smith) writes:

> 	Would it be possible to get rid of all this magtape and distribute
> GenBank over the Internet?  One possibility is being able to ftp the tar
> files.  Another more interesting possibility is using NNTP, or something
> like it.  

While this idea is not bad, I do not think that it is all that practical.  We
are at the end of a slow link, and FTPing all or most of a GENBANK release,
apart from the system manager time taken up, would be horribly slow.
Distribution via NNTP may not be all that reliable and would involve a major
commitment of manager time to organize.

We do not have this kind of time.  The 'system' we have here for dealing with
Genbank is essentially automatic.  Once it has run successfully we have a
working COMPLETE bank.  I think that the increasing size of the bank is a
strong argument for KEEPING the tape distributions, not abandoning them, since
these tapes serve as a complete 'backup' of the bank each quarter.

Having GenBank available for FTP is very helpful.  Having the updates
available in between releases is becoming more and more important.  But I
think it is a big mistake to think that the quarterly distribution of Genbank
via tape (1600/6250) can or should be abandoned any time soon.

kristoff@genbank.BIO.NET (David Kristofferson) (12/15/89)

>	 Little by little, people are making use of the infrastructure I
> fought so hard to put in place.  I was somewhat surprised (pleasantly) when a
> new person came to the Institute a few months ago and asked me how to send
> email.  He didn't ask me *if* we had it, he took it for granted that we did
> and just wanted to know the local juju.

Hurrah!  Keep the faith, Roy.  Remember the old Chinese adage, "A
journey of a thousand miles begins with a single step!"
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

roy@phri.nyu.edu (Roy Smith) (12/15/89)

In <573@mcclb0.med.nyu.edu> smith@mcclb0.med.nyu.edu (Ross Smith) writes:
> We are at the end of a slow link and FTPing all or most of a GENBANK
> release, appart from the system manager time taken up, would be horribly
> slow.

	Why should it take any system manager time?  To the contrary, I
envision one of the prime advantages of network access vs. physical media
being the savings in human time and effort.  I don't know about the braindead
VMS system you run [note to outsiders: Ross works about 1/2 kilometer from me
and we often abuse each other about our respective tastes in operating
systems] but under Unix it would be simple to set up a totally automated
system whereby each night the genbank ftp server was polled for new files,
those files downloaded, and our local customizations done, perhaps ending in
mailing a note to all interested parties saying that a new update was
installed.  With tape, no matter how much I automate the installation
process, somebody still has to open the box and mount the reel of tape on the
drive.  Not to mention the labor saved at the other end in making and mailing
all those tapes.
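
	To sketch what I mean by "totally automated" (the host and login come
from David's earlier announcement; the update directory, local paths, and
notification step are just placeholders):

    # Nightly cron job: poll the GenBank FTP area, fetch files we don't have,
    # then hand off to local reformatting and notification.
    import os
    from ftplib import FTP

    LOCAL_DIR = "/usr/database/genbank/updates"    # placeholder local path

    def fetch_new_updates():
        ftp = FTP("genbank.bio.net")
        ftp.login("anonymous", "smith")            # surname as password
        ftp.cwd("pub/db/update")                   # hypothetical weekly-update directory
        fetched = []
        for name in ftp.nlst():
            local = os.path.join(LOCAL_DIR, name)
            if not os.path.exists(local):          # only pull files we haven't seen
                with open(local, "wb") as out:
                    ftp.retrbinary("RETR " + name, out.write)
                fetched.append(name)
        ftp.quit()
        return fetched                             # then reformat and mail a note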

	By way of analogy, many people use news (and not even NNTP, just
plain old b-news over dialup 2400 bps uucp) to distribute the uucp map files
and process the raw data into a uucp path database totally without human
intervention.  I think the current size of the uucp map database is about 4
Mbytes.  An order of magnitude smaller than genbank, but still a substantial
amount of data.

	I'm also not sure why it should be slow.  I regularly get about 4
kbytes/sec to anywhere on NSFNet.  At that rate, even the absurdly large
10/30 update should only take about 15 minutes in its compressed form.
Remember, the idea is to just download the changes, not the whole database.
Of course, reality is determined to show me up; I just got gb1030.seq.Z (3.7
Mbytes) from genbank.bio.net using ftp and only got 2.2 kbytes/sec.  But then
again, it's 19:00 here, so it's 16:00 in California, still prime time.  My
guess is that if I tried it again in 6 or 8 hours I'd get double the
throughput.  It also looks like the NYSERNet gateway is down and we're going
via JVNC; I don't know how much, if any, that hurts performance.  Besides,
when our 56 kbps link is replaced by fiber...
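
	In other words, the estimate is just size over throughput, using the
figures quoted above:

    # Transfer-time estimate for a compressed update at a given FTP rate.
    def transfer_minutes(mbytes, kbytes_per_sec):
        return mbytes * 1024 / kbytes_per_sec / 60

    print("%.1f min" % transfer_minutes(3.7, 4.0))   # ~15.8 min at 4 kbytes/sec
    print("%.1f min" % transfer_minutes(3.7, 2.2))   # ~28.7 min at the observed rate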

> I think that the increasing size of the bank is a strong argument for
> KEEPING the tape distributions not abandoning them since these tapes
> serve as complete 'backup' of the bank each quarter.

	Why do you need (or want) a backup?  I recycle the genbank tapes for
other uses after a while.  Should anything ever happen and you lose your
on-line database, you can just ftp another copy from the source.  Certainly,
the data is so precious that the Keepers Of The Sacred Knowledge should be
making backups of the master copy of the database, and storing duplicate
tapes in fireproof vaults (they do, don't they?), but subscribers like us can
always just go back to the source when that rare disaster strikes.

> But I think it is a big mistake to think that the quarterly distribution
> of Genbank via tape (1600/6250) can or should be abandoned any time soon. 

	I probably agree with you that it can't be eliminated, but not that it
shouldn't.  Certainly, anybody who has decent Internet connectivity should be
getting it over the net.  And those who don't should be scheming to find a
way to get that connectivity.  Down with physical media!
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

kristoff@genbank.BIO.NET (David Kristofferson) (12/15/89)

> Certainly, the data is so precious that the Keepers Of The Sacred
> Knowledge should be making backups of the master copy of the database,
> and storing duplicate tapes in fireproof vaults (they do, don't
> they?), but subscribers like us can always just go back to the source
> when that rare disaster strikes.

All systems at IntelliGenetics including the GenBank database and
On-line Service computers are backed up nightly (partials of changed
or new files) and full system backups are performed weekly.  Complete
system backups are stored offsite in vaults as Roy indicates above.
Copies of the data are also maintained and backed up at Los Alamos.
In addition there is the EMBL databank and the DNA databank of Japan.
It would be hard to get rid of this data even if we tried!
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

sob@watson.bcm.tmc.edu (Stan Barber) (12/15/89)

In article <1989Dec11.160609.5436@phri.nyu.edu> roy@alanine.UUCP (Roy Smith) writes:
>	CD-ROM is nice, but doesn't really solve the problems that tape has.
>You still have to get a physical object from point A to point B, and you
>still have to produce those objects.  How long does it take to press CDs
>compared to the time it takes to cut tapes?  Also, from what I know of CDs,
>they are much slower than magnetic hard disks.  Also, I'm not sure that
>CD-ROM is really practical yet.  Maybe in a couple of years, but it's still
>pretty much of a specialty item today.
>
I don't agree with you. CD-ROM is VERY VERY VERY available for both Macintosh
and PC environments NOW. People use them NOW. In the Unix workstation world,
they are a coming thing (Sun CD, DEC has had CDs for a while). 

Also, making CDs is not that big of a deal if you are generating 100 of them
or more. I don't know the stats on how many sites get a GENBANK tape, but I'd
guess it is more than 100. I'd also guess that generating 100 tapes would take
about the same amount of time as generating 100 CDs.

--
Stan           internet: sob@bcm.tmc.edu         Director, Networking 
Olan           uucp: {rutgers,mailrus}!bcm!sob   and Systems Support
Barber         Opinions expressed are only mine. Baylor College of Medicine

benton@presto.IG.COM (David Benton) (12/16/89)

> Certainly, the data is so precious that the Keepers Of The Sacred
> Knowledge should be making backups of the master copy of the database,
> and storing duplicate tapes in fireproof vaults (they do, don't
> they?), but subscribers like us can always just go back to the source
> when that rare disaster strikes.

Of course, we do.  Although I'm sure we could regenerate any release
if necessary, we have not given that capability a high priority.  Does
anyone out there think that they are likely to want to request a
release older than the most current or the n-1 release?  


					Sincerely,

					David Benton
					GenBank Manager
					415-962-7360
					benton@genbank.ig.com

roy@phri.nyu.edu (Roy Smith) (12/16/89)

	Regarding wanting to regenerate old releases.  No, I can't imagine
anybody wanting to do that.  As long as the current release is safe, I
think that's all that's really critical.  Actually, I take that back.  On
occasion, locus names change, or things get corrected.  I could imagine
cases where you once found something of interest but can't reproduce the
search now.  In that case, you might want to load up an old version of the
database to make sure you aren't crazy, but I think that's stretching
things a little bit.  Your procedure of keeping two master copies of the
database at two different sites, with each maintaining its own set of
redundant backup tapes sounds appropriately paranoid.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

cherry@mgh-coffee.uucp (Mike Cherry) (12/17/89)

In article <1989Dec13.201432.20058@phri.nyu.edu> roy@alanine.UUCP (Roy Smith) writes:
>kristoff@genbank.BIO.NET (David Kristofferson) writes:
>I was somewhat surprised (pleasantly) when a
>new person came to the Institute a few months ago and asked me how to send
>email.  He didn't ask me *if* we had it, he took it for granted that we did
>and just wanted to know the local juju.
>
...
>So why am I surprised when people are so impressed?  Because I consider what
>we have to be just barely passable.  If most of the visitors here are
>impressed, that must mean that what most people have available to them in the
>way of computer resources is a disaster.
>--
>Roy Smith, Public Health Research Institute
>455 First Avenue, New York, NY 10016
>roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
>"My karma ran over my dogma"

I run a similarly equipped network of systems at Massachusetts General
Hospital.  We became connected to the Internet about a year and a half
ago.  I also receive compliments from scientists who are enjoying our
computer environment.  These scientists generally have at least a little
experience with sequence analysis or modelling.  More interestingly, the
younger students or post-docs who haven't actually used the computer
for much expect all these applications and networking to be there.  I
recently automated the transmission of the weekly genbank updates and
subsequent reformatting into our databases.  Many of my users didn't see
this as a great achievement.  This perception appears to be the result of
these users not realizing how the databases had been transferred to our
system in the past, or even where the databases were physically located.
Many thought we were always "connected to genbank" and that every time
they ran FASTA they were searching "the" database in Los Alamos.

Anyway, it seems that computing in molecular biology is starting to make
many advances.  I can't wait to see what will be possible in a couple of
years when all the fruits of the work going on at GenBank, NCBI and NCSA
(to name a few) become available.

J. Michael Cherry     		   Director of Computer Systems
Department of Molecular Biology, Massachusetts General Hospital
Boston, MA 02114  (office) 617-726-5955  (TeleFAX) 617-726-6893
cherry@mgh-coffee.harvard.EDU            cherry@harvunxu.BITNET

klong@pauling.bcm.tmc.edu (Kevin Long) (12/19/89)

In article <1855@gazette.bcm.tmc.edu> sob@watson.bcm.tmc.edu (Stan 
 Barber) writes (whose comments are preceded by ">"):
>In article <1989Dec11.160609.5436@phri.nyu.edu> roy@alanine.UUCP (Roy 
 Smith) writes (whose comments are preceded by ">>"):

>>CD-ROM is nice, but doesn't really solve the problems that tape has.
>>You still have to get a physical object from point A to point B, and you
>>still have to produce those objects.  

>>How long does it take to press CDs
>>compared to the time it takes to cut tapes?  

There are service bureaus in the US that will press CD-ROMs overnight.
They typically charge $325 for an overnight check disk.  To press a quantity
of 100 disks (the smallest quantity any existing pressing plant will
accommodate), plan on $200 for premastering, $1000 for mastering, then
$2-$2.50 per disk.  In other words, for 100 disks, total per-disk cost is
$32 to $37.  Five-day turnaround for the pressing plant.

A real alternative is to buy the premastering system, which, if equipped
with a Yamaha PDS drive, would let the agency cut CD-ROMs themselves,
one at a time.  These systems start at a little over $10,000, but the
blank disks are still rather expensive, which keeps the per-disk cost
higher than the service bureau method.

>>Also, from what I know of CDs,
>>they are much slower than magnetic hard disks.  

So are magnetic tapes. I'm confused by your argument.  Are you proposing 
we ship updates on hard disks?  You can copy off the contents of your slow 
CD-ROM to a fast hard disk as easily as you can copy a slow tape over to a 
fast hard disk. Besides, a typical CD-ROM drive has a transfer rate of 


>>Also, I'm not sure that CD-ROM is really practical yet.  Maybe in a 
>>couple of years, but it's still pretty much of a specialty item today.
>I don't agree with you. 

Neither do I.  CD-ROM drives are not only available, but they are generally
much less expensive than, and require less regular maintenance than,
tape drives.
     - No head cleaning, 
     - No threading tapes, 
     - No shipping heavy tapes back and forth, 
     - No protecting from magnetic fields, 
     - No special tape storage racks required  

There is every reason to move to CD-ROM:
     - It's newer technology, 
     - It's very reliable (error correction is built in to data stored
       in CD-ROM format)
     - CD-ROMs are as easy to use as floppy disks, 
     - A CD is extremely compact,
     - Drives are readily available at less than $2,000,
     - A single CD-ROM can hold up to 600MB of data.
     - Despite the price tag, I'd guess it's still cheaper than
       distributing the data on tapes.  Am I right?

I'd be happy to put together a proposal for the agency if they'd
like.

Regards,

    Kevin Long    
    IAIMS Development
    Baylor College of Medicine
    Office of the Vice President for Information Technology
    One Baylor Plaza
    Houston, TX 77030
    (713) 798-6116
    klong@bcm.tmc.edu