[bionet.molbio.genbank] ANONYMOUS FTP FROM BITNET

BCHTANTW%NUSVM@PUCC.PRINCETON.EDU (Tan Tin Wee) (09/29/90)

Recently somebody asked for help on getting random sequences from
GENBANK.  One of the options suggested was anonymous FTP with the
caveat that the netter must be on INTERNET.
Even if he is not, it is still possible to do anonymous FTP via
BITNET.
I frequently use the BITFTP Princeton BITNET FTP server
which provides a mail interface to the FTP portion of the
IBM TCP/IP running on the Princeton VM system.  It allows
BITNET (or NetNorth or EARN) users to ftp files from sites
on the INTERNET.  The load on BITFTP is "often very heavy".
For further information, send a mail message "HELP" to BITFTP@PUCC.BITNET
or ask M Varian at MAINT@PUCC.BITNET or MAINT@pucc.princeton.edu (INTERNET).
Hope it solves your problem if you are not on INTERNET and wish to
do anonymous FTP.
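
For readers who script their mail, the HELP request described above could be composed along these lines. This is only a sketch: it builds the message without sending it, the sender address is a placeholder, and it assumes the HELP command goes in the message body, which the post does not state explicitly.

```python
from email.message import EmailMessage

# Server address quoted in the post above.
BITFTP_ADDRESS = "BITFTP@PUCC.BITNET"


def build_help_request(sender: str) -> EmailMessage:
    """Build (but do not send) a mail message asking BITFTP for its help text."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = BITFTP_ADDRESS
    # Assumption: the HELP command belongs in the message body.
    msg.set_content("HELP")
    return msg


if __name__ == "__main__":
    # Hypothetical sender address, for illustration only.
    print(build_help_request("user@example.edu"))
```

Handing the message to a local mail system is left out on purpose, since that step depends entirely on the site.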

Sincerely,
Tin Wee TAN
Dept of Biochemistry
National University of Singapore

BCHTANTW@NUSVM.BITNET

PS. Many thanks to folks who run the show at BITFTP. Wouldn't know
what I'd do without this service.

kristoff@genbank.bio.net (David Kristofferson) (09/30/90)

A WORD OF EXTREME CAUTION HERE on trying anonymous FTP using these
BITNET servers for GenBank files!!!!!

GenBank data bank files are LARGE, i.e. many MEGABYTES.  I shudder to
think what would happen to BITNET if people all around the world on
BITNET started using this mechanism regularly to access GenBank.  You
would be much better off having us mail you a tape than trying to get
GenBank over BITNET as it would hog vital resources.  I will add the
caveat that this is based on second hand information about other
BITNET disasters of which I have heard, so perhaps someone with more
direct experience using these things should respond.  My understanding
is that this service may be great for getting small programs from
places like SIMTEL-20, but would be a disaster trying to retrieve 10+
megabyte files.

Anyone else like to comment?
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

gilbertd@silver.ucs.indiana.edu (Don Gilbert) (09/30/90)

I would guess trying to pump megabyte-sized files thru bitnet is asking
for trouble.  It is a strain on Internet traffic using FTP to transfer
the 50+ megabyte releases of Genbank, and the ability of Internet to
transfer data is orders of magnitude greater than Bitnet.

GenBank already provides two very handy services thru e-mail that should
suffice for most individual users' needs to have access to the most
recent GenBank data:  the FastA search and the retrieval of individual
sequences.  

I also suggest that the most economical and useable way to receive quarterly
updates of the full Genbank is to subscribe to the CD-Rom release.  CD-Rom
drives are inexpensive ($500-1000) and can be put on about any microcomputer or
workstation or vax.  GenBank now sells their CDs for an annual subscription of
$300 (for 4 releases).  This cost should drop as more users subscribe, as the
actual media cost of compact disks is quite low if enough copies are pressed.
For instance, Apple Computer now prefers to release software on a CD rather
than two or more floppies, as the cost to them is less (e.g., $1 or so).

While the cost of using even FTP thru Internet may not be seen by you
directly as an end user, we are paying for it thru federal taxes.

Don.Gilbert@Iubio.Bio.Indiana.Edu
biocomputing office, indiana univ., bloomington, in 47405, usa

OLIVER@calstate.bitnet (OLIVER SEELY) (09/30/90)

There is another problem associated with using FTP (help me out, Dave,
I'm not sure what anonymous FTP is) to retrieve MEGABYTE files.  I
tried to transfer the primate sequence file first to an ELXSI computer
and when I was unsuccessful, I tried to transfer it to my account on
our central Cyber 960.  Well, in both cases I exceeded my memory
allocation but in one (sorry, I cannot remember which) I was unable
to login again until one of the systems analysts "unfroze" the
problem.  FTP's advantage (in my mind at least) lies in its simplicity
of use, but as Dave writes there are some serious problems such as
tying up valuable resources.  Oh, yeah, I forgot to write that
the transfer process had gone on for about 20 minutes before it bombed.

Oliver Seely
CSU Dominguez Hills

usenet@nlm.nih.gov (usenet news poster) (09/30/90)

In article <9009290610.AA23614@genbank.bio.net> BCHTANTW%NUSVM@PUCC.PRINCETON.EDU (Tan Tin Wee) writes:
>
>Recently somebody asked for help on getting random sequences from
>GENBANK.  One of the options suggested was anonymous FTP with the
>caveat that the netter must be on INTERNET.
>Even if he is not, it is still possible to do anonymous FTP via
>BITNET. ...
>Tin Wee TAN

Even if it is possible to hack a path from E-mail into FTP, in
the long run your best solution is to get on INTERNET.  There 
is already a dramatic difference in transfer speed between
INTERNET and BITNET, and this is only going to get worse.
BITNET cannot handle files larger than 25kbytes without segmenting
them into chunks, but 25k is pretty small (even a 2 page paper
with some graphics will exceed this).  The network protocol used
on INTERNET, TCP/IP, is more reliable than modems, and once your
site is on line, much less hassle.  Perhaps most important,
FTP is only one of many services available via INTERNET.  Until
you have a network route which supports TCP/IP, you will not be
able to use remote shell, remote procedure call, socket communication,
and other services available through INTERNET.
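
To put the 25 kbyte figure in perspective, the segmenting mentioned above can be sketched as follows. The chunk size is read as 25,000 bytes, which is an assumption about what "25kbytes" means, and the function names are made up for this illustration. It does not reproduce how BITNET itself splits files.

```python
from typing import List

# Assumption: "25kbytes" is read as 25,000 bytes.
CHUNK_BYTES = 25 * 1000


def split_into_chunks(data: bytes, chunk_bytes: int = CHUNK_BYTES) -> List[bytes]:
    """Cut data into consecutive pieces of at most chunk_bytes each."""
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]


def chunks_needed(size_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of pieces required for a file of size_bytes (ceiling division)."""
    return -(-size_bytes // chunk_bytes)


if __name__ == "__main__":
    # A 50 megabyte release, taken here as 50,000,000 bytes.
    print(chunks_needed(50_000_000))
```

Under these assumptions a 50 megabyte release would arrive as 2000 separate pieces, each of which has to get through and be reassembled in order.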

The importance of these other services is that they give you much
greater flexibility to optimize your communications and computation.
For example, rather than attempt to maintain the current, most
up to date version of a database on your local machine, you could
use a relatively simple local program to send your search queries
via the net to an up to date server, perhaps in another city.
You save the expense and aggravation of attempting to maintain
an up to date local database copy, and the net saves the traffic
of sending the whole database.  With TCP/IP, the turn around is
still interactive rather than hours or days as in FASTA-mail.

The argument that "Internet is not free, we pay for it one way
or another through taxes" is true on some level, but, in my view,
misleading.  The communication of scientific data between academic
research groups is precisely what INTERNET was created to do.
It is one of the support mechanisms that the US government provides
to encourage research and development.  Saying "Don't use INTERNET
because we will all pay for it in our taxes" is like saying "Don't
write grant applications because ..."

David States				states@ncbi.nlm.nih.gov
National Center for Biotechnology Information
National Library of Medicine

BCHTANTW%NUSVM@PUCC.PRINCETON.EDU ("T.W.Tan") (09/30/90)

I would agree with Don Gilbert and Dr Kristofferson that pumping
megasize files through the network is not on.  I only use
BITFTP to get much smaller files eg public domain programs or
single sequence files and certainly would *NOT* recommend anyone
to try getting the whole lot by ftp through BITNET or otherwise.
Apologies if I have conveyed the wrong impression.

Tin Wee TAN
Dept of Biochemistry
Natl. U. of Singapore

kristoff@genbank.bio.net (David Kristofferson) (10/03/90)

David States gave an excellent description of the advantages of the
Internet over BITNET, and I would heartily second the fact that sites
should get on the Internet.  Unfortunately it often takes some time,
money, and effort for this to occur.  I suggest that if someone at
your campus is not already working on getting an Internet style
network connection, then they should begin ***immediately*** before
the data problem reaches overwhelming proportions.

However there was one statement made in Mr. States' message which was
less than accurate.

> With TCP/IP, the turn around is still interactive rather than hours or
> days as in FASTA-mail.

As many of our readers know, FASTA-MAIL is a GenBank service.  As our
readers who have USED THE SERVICE also know, the turnaround on
FASTA-MAIL, while not "interactive," is very fast, on the order of
minutes, not "hours or days."  I recently did a demonstration of the
service in Mr. States' back yard at the NIH and got the results of my
search back in about ten minutes.  During this time my terminal was
freed for other aspects of the demo, instead of sitting
"interactively" looking at a "Working ..." message.  Because many
biologists still do not have Internet connections, FASTA-MAIL provides
a needed service to them.  We are also working on providing access by
e-mail to the newer BLAST program which was developed at NCBI and
appears to be a faster search algorithm.

Another point that needs clarification:

> You save the expense and aggravation of attempting to maintain
> an up to date local database copy, and the net saves the traffic
> of sending the whole database.

GenBank's goal *is* to allow remote sites to have their own local
copies of the database in a relational database management system and
to have the local copies updated over the network, not by sending
megabyte size files, but instead by providing sites with an initial
copy of the database and then by sending "transactions" which
automatically update individual entities in the local copies every
time the master copy is changed.  The software to provide these
transactions to remote sites is currently undergoing testing and more
will be announced about this later.
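
The general idea of updating a local copy by transactions rather than by whole-file transfers can be sketched roughly as below. Everything about the format is hypothetical: the operation names, the entry identifiers and the use of a plain dictionary as the "local copy" are assumptions made for illustration, not a description of the GenBank software under test.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transaction shape: (operation, entry id, new sequence or None).
Transaction = Tuple[str, str, Optional[str]]


def apply_transactions(local: Dict[str, str],
                       transactions: List[Transaction]) -> Dict[str, str]:
    """Return a new copy of the local database with each transaction applied in order."""
    updated = dict(local)
    for operation, entry_id, sequence in transactions:
        if operation in ("insert", "update"):
            if sequence is None:
                raise ValueError(f"{operation} for {entry_id} needs a sequence")
            updated[entry_id] = sequence
        elif operation == "delete":
            # Deleting an entry that is already absent is treated as a no-op.
            updated.pop(entry_id, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return updated


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    local_copy = {"A1": "ACGT", "B2": "GGCC"}
    changes = [("update", "A1", "ACGA"), ("delete", "B2", None), ("insert", "C3", "TTAA")]
    print(apply_transactions(local_copy, changes))
```

Only the changed entries cross the network in such a scheme, which is the point of the approach.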

While it is true that for things like FASTA searches it is a waste to
maintain a local copy, I have heard enough comments from the community
over the last several years that indicate that the desired set-up is
to have a local copy for more specialized applications, but also to
have access to a powerful remote facility for offloading routine, but
CPU-intensive searches.  Although I have personally managed
centralized time-sharing services such as BIONET, it appears to be the
case that these systems are not the wave of the future except for
specialized applications.  Right now remote database searching can be
done for free on the GenBank On-line Service via FASTA-MAIL or
interactively over the Internet or SprintNet by GOS account holders.

So much for specifics, but now for a more general and much more
important statement about Mr. States' remark about FASTA-MAIL.

As I mentioned in a recent posting on BIONEWS, there are many
discussions going on right now in "high places" related to the future
of bio-computing, particularly as it impacts the Genome Project.  The
National Center for Biotechnology Information where David works is a
key player in these debates and will be the agency that oversees the
next GenBank contract which will start in 1992.  One would hope that,
given NCBI's important role, public statements by its employees should
be very carefully considered and based on fact, not on distortions.
If there is a better way of doing things, then it should be perfectly
possible to demonstrate it by setting up and successfully running a
service.  NCBI has already provided us with some fine software such as
IRX and BLAST, so I do not doubt their talents in software
development.  However, I sincerely hope that we will evolve into the
future in this fashion **** rather than by attempting to put down
existing systems through the spread of misinformation ****.

GenBank has unfortunately been an easy target to shoot at because the
first five year contract underestimated the size of the task, and the
resulting lack of funds led to a tremendous data backlog.  This
backlog has been largely eliminated during the second five year
contract and the NIH GenBank advisors commended both LANL and
IntelliGenetics for their progress at the last advisory meeting.  Word
of this progress is slow to get out unfortunately and complaints are
always remembered much longer than compliments.  One can also still
find responsible people quoting outdated GenBank backlog statistics in
print.

You have my solemn word that if flaws are pointed out we will OPENLY
either attempt to correct them to the best of our ability or step
aside if the system is so structurally flawed that an entirely new
attempt is needed.  However you may also rest assured that I will
vigorously respond to any attempt at distortion of the facts.  It is
always easy to tear down through distortion, but this is not the kind
of tactic that one would expect from those who are really professional
and who really have better ways of doing things.  Their results should
be able to speak for themselves.  

I also suggest that the community pay close attention to any services
offered and provide their feedback ** before ** decisions are made.

*** In the end, it will be the users who will be left with the results. ***

Given the amount of data projected to be generated by the Genome
Project a mistake made now would make the backlog of the initial
GenBank attempt appear minuscule by comparison.  Unfortunately the
users are often the last to react because they are not brought in to
the decision loop.  I have argued before, and will do so again, that
electronic newsgroups can be a new element in this review process.
Although the decision must ultimately be the responsibility of a
single person or small group, the technology now exists to easily
sample a wide range of opinion.  Why not take advantage of this,
particularly when so much is at stake?  Why not utilize the collective
experience residing on the net?  Currently we have "developers
meetings" where people are asked to digest a large amount of new
information in the course of a day.  Why not do this over the net so
that people can react more intelligently than in a one day jet-lagged
haze?

After all, scientists are supposed to be progressive, right?  ... right?
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

JS05STAF@MIAMIU.BITNET (Joe Simpson) (10/03/90)

BITNET usage guidelines suggest that no file should be larger than 3000
lines of 80 column records.  Many genbank data sets are much larger
than this.  Please use BITFTP only for files that approximate the BITNET
usage guidelines.
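
As a rough check against the guideline above, 3000 records of 80 columns come to 240,000 bytes. The sketch below assumes one byte per column and fixed 80-byte records. Both are assumptions for illustration, and the function names are invented here.

```python
MAX_RECORDS = 3000   # guideline quoted above
RECORD_WIDTH = 80    # columns per record; one byte per column is assumed


def records_for(size_bytes: int) -> int:
    """Number of 80-column records needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // RECORD_WIDTH)


def within_guideline(size_bytes: int) -> bool:
    """True if a file of size_bytes fits in MAX_RECORDS records."""
    return records_for(size_bytes) <= MAX_RECORDS


if __name__ == "__main__":
    # A 10 megabyte file, taken here as 10,000,000 bytes.
    print(records_for(10_000_000), within_guideline(10_000_000))
```

A 10 megabyte file works out to 125,000 records under these assumptions, more than forty times the suggested limit.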

usenet@nlm.nih.gov (usenet news poster) (10/04/90)

In response to my comment:

>> With TCP/IP, the turn around is still interactive rather than hours or
>> days as in FASTA-mail.

kristoff@genbank.bio.net (David Kristofferson) writes:

>[...] However there was one statement made in Mr. States' message 
>which was less than accurate.
>
>As many of our readers know, FASTA-MAIL is a GenBank service.  As our
>readers who have USED THE SERVICE also know, the turnaround on
>FASTA-MAIL, while not "interactive," is very fast, on the order of
>minutes, not "hours or days."  I recently did a demonstration of the
>service in Mr. States' back yard at the NIH and got the results of my
>search back in about ten minutes.  

The turn around time for FASTA-MAIL is dependent on both the response
time of the server itself and the network handling the electronic mail
transactions.  My comments were made in the context of a discussion of
the latter.  As Dave Kristofferson points out, at a site like NIH,
where E-mail is handled by INTERNET, the turn around time can be pretty
good.  There are, however, sites where E-mail processing is handled as
a low priority batch process.  If you are dependent on one of those
sites, you may see a dramatic improvement in E-mail mediated services
by getting onto INTERNET.

David Kristofferson then goes on to say:

>So much for specifics, but now for a more general and much more
>important statement about Mr. [sic] States' remark about FASTA-MAIL.
>
>As I mentioned in a recent posting on BIONEWS, there are many
>discussions going on right now in "high places" related to the future
>of bio-computing, particularly as it impacts the Genome Project.  The
>National Center for Biotechnology Information where David works is a
>key player in these debates and will be the agency that oversees the
>next GenBank contract which will start in 1992.  One would hope that,
>given NCBI's important role, public statements by its employees should
>be very carefully considered and based on fact, not on distortions.
>If there is a better way of doing things, then it should be perfectly
>possible to demonstrate it by setting up and successfully running a
>service.  NCBI has already provided us with some fine software such as
>IRX and BLAST, so I do not doubt their talents in software
>development.  However, I sincerely hope that we will evolve into the
>future in this fashion **** rather than by attempting to put down
>existing systems through the spread of misinformation ****.

We appreciate the compliment on our software, but I really don't feel
that a comment on the relative technological merits of electronic mail
handling should be construed as "putting down existing systems
through the spread of misinformation".  Readers of
this news group are quite able to assess response time at their own
sites, and the global handling of E-mail is beyond the control of GenBank
in any case.

>GenBank has unfortunately been an easy target to shoot at because the
>first five year contract underestimated the size of the task [...]
>One can also still find responsible people quoting outdated GenBank 
>backlog statistics in print.

I did not quote any statistics on the GenBank backlog in my posting 
(current or previous).

>You have my solemn word that if flaws are pointed out we will OPENLY
>either attempt to correct them to the best of our ability or step
>aside if the system is so structurally flawed that an entirely new
>attempt is needed.  However you may also rest assured that I will
>vigorously respond to any attempt at distortion of the facts.  It is
>always easy to tear down through distortion, but this is not the kind
>of tactic that one would expect from those who are really professional
>and who really have better ways of doing things.  Their results should
>be able to speak for themselves.  
>
>I also suggest that the community pay close attention to any services
>offered and provide their feedback ** before ** decisions are made.
>
>*** In the end, it will be the users who will be left with the results. ***
>
>Given the amount of data projected to be generated by the Genome
>Project a mistake made now would make the backlog of the initial
>GenBank attempt appear minuscule by comparison.  Unfortunately the
>users are often the last to react because they are not brought in to
>the decision loop.  I have argued before, and will do so again, that
>electronic newsgroups can be a new element in this review process.
>Although the decision must ultimately be the responsibility of a
>single person or small group, the technology now exists to easily
>sample a wide range of opinion.  Why not take advantage of this,
>particularly when so much is at stake?  Why not utilize the collective
>experience residing on the net?  Currently we have "developers
>meetings" where people are asked to digest a large amount of new
>information in the course of a day.  Why not do this over the net so
>that people can react more intelligently than in a one day jet-lagged
>haze?
>
>After all, scientists are supposed to be progressive, right?  ... right?

It is apparent that my posting elicited some rather strongly held
feelings.  I think both Dave and I agree that the biomedical research
community is going to face some quite significant information handling
challenges in the near future, and that we would all be best served by the
rational use of available technology.  Electronic mail and networks are
clearly a part of this so let's avoid unnecessary personal flames.  A
lively and open discussion depends on all parties feeling free to post
their opinions.

>-- 
>				Sincerely,
>
>				Dave Kristofferson
>				GenBank Manager
>
>				kristoff@genbank.bio.net

               David J. States, M.D.,Ph.D.
               Senior Staff Fellow

               National Center for Biotechnology Information
               National Library of Medicine

states@ncbi.nlm.nih.gov

kristoff@genbank.bio.net (David Kristofferson) (10/04/90)

My earlier response to Dr. States' message, while possibly appearing
to be personal, was actually the result of the straw being dropped on
the camel's back.  I apologize to David that he was in the unfortunate
position of having dropped that straw.  His reply was a model of
forbearance.

However, what I am concerned about is not the attitude of any one
individual, but a sequence of events which is continuing to occur,
mainly outside of these newsgroups, of which this one small incident
is just the latest manifestation.  If this was an isolated statement I
would never have taken the time to reply in the intensity or at the
length that I did.  I agree completely that "a lively and open
discussion depends on all parties feeling free to post their
opinions," but one can only compromise so far for the sake of
gentility.  I intend to be "gentlemanly" only as long as advantage is
not taken of that forbearance, as has occurred in my past
experiences.  I emphasize that this is not a personal accusation
against Dr. States, but a general pronouncement to all concerned.  In
the context of a simple comment about FASTA-MAIL, I realize that the
readership may think that these words may border on the absurd, but I
can assure you that much larger computing issues are on the agenda.
As I stated in my last message, our real concern here is with issues
to be decided in the next two years that will affect the future
direction of biological computing.

Now to return to details and comment on the following:

> The turn around time for FASTA-MAIL is dependent on both the response
> time of the server itself and the network handling the electronic mail
> transactions.  My comments were made in the context of a discussion of
> the latter.  As Dave Kristofferson points out, at a site like NIH,
> where E-mail is handled by INTERNET, the turn around time can be pretty
> good.  There are, however, sites where E-mail processing is handled as
> a low priority batch process.  If you are dependent on one of those
> sites, you may see a dramatic improvement in E-mail mediated services
> by getting onto INTERNET.
> 

There is no doubt that e-mail transfer would be faster from our site
to other Internet sites rather than to, e.g., JANET, but it is also
true that sites which are not on the Internet would be completely cut
off from using an interactive, TCP/IP based system.  FASTA-MAIL is
accessible to people on virtually any major network.

I will also suggest that the effect of e-mail delays on other networks
may not always be as great as suggested above.  While it is the case
that the transatlantic BITNET gateways have become extremely congested
at times, I have received reports of excellent FASTA-MAIL turnaround
times from many non-Internet sites, not only in the U.S. but overseas
as well.  For example, in the U.K. a user on JANET reported a 15
minute turnaround time to get the results back of a protein search.  I
invite others to tell us of their experiences, either positive or
negative.  Of course, if the network or our machine goes down, a delay
would occur until the access was restored, but note that this would
affect both e-mail and interactive access equally.

The facts are the following.  Our computer takes about ten minutes
(plus or minus one depending on the load) to do a search of 1000 bases
against all of Genbank 64.  It takes about 1.5 minutes to do a search
of 1000 amino acids against all of SWISS-PROT 14.  FASTA provides the
search time as part of its output.  Any additional time is due to
transit.  Users can utilize this information themselves to decide if
the network is unduly affecting them.  I have never received any
evidence that FASTA-MAIL took "days" unless there was an extreme
malfunction on the network or with the computer system.  I suggest
that in terms of turnaround time the difference is not a big issue.
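
The arithmetic suggested above, subtracting the search time that FASTA reports from the total turnaround, can be sketched as follows. The function name and the sample numbers are illustrative assumptions, not measurements.

```python
def transit_minutes(turnaround_minutes: float, search_minutes: float) -> float:
    """Time spent in network transit: total turnaround minus the reported search time."""
    if search_minutes > turnaround_minutes:
        raise ValueError("search time cannot exceed total turnaround")
    return turnaround_minutes - search_minutes


if __name__ == "__main__":
    # Illustrative figures only: a 15 minute turnaround with a 1.5 minute search.
    print(transit_minutes(15.0, 1.5))
```

If the difference is small, the network is not the bottleneck. If it is large, the delay lies in mail handling rather than in the search itself.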

On the other hand, if someone wanted to design some nice user-friendly
software that would reside on a PC or Mac (which were, e.g.,
etherneted in to a gateway to the network) and provide an interactive
interface to FASTA or BLAST on another machine with a comprehensive
up-to-date database somewhere else on the Internet (which is what I
assume Dr. States is alluding to), this would be a great help to
people who have PC's and Mac's with such a network connection.  Such a
suggestion could be made without having to mention FASTA-MAIL which is
not obsoleted by such software.  Had the suggestion been made in this
context, much of this discussion would have been kept on a lower key.
I would also venture to suggest that this idea has not occurred in
only one place.  As usual, the question of who does it usually gets
down to either who has the funding or who expects to be able to make
money by selling it.

	  No one place has the sole monopoly on the talent.
-- 
				Sincerely,

				David Kristofferson, Ph.D.
				GenBank Manager

				kristoff@genbank.bio.net

harper@csc.fi (Rob Harper (Supercomputer Centre Finland)) (10/04/90)

In article <Oct.3.16.58.11.1990.195@genbank.bio.net>, kristoff@genbank.bio.net (David Kristofferson) writes:
> There is no doubt that e-mail transfer would be faster from our site
> to other Internet sites rather than to, e.g., JANET, but it is also
> true that sites which are not on the Internet would be completely cut
> off from using an interactive, TCP/IP based system.  FASTA-MAIL is
> accessible to people on virtually any major network.

I would suggest that people in Europe could use the similar service from
EMBL. There is a very nice DCL script for VMS called FASTEMBL which can
take your sequence, convert it to UPPERCASE, modify it to STADEN format,
and prepare a file to be sent to EMBL. I have tried sending the same sequence
to EMBL with FASTEMBL and to GENBANK with FAMAIL... ( a similar script for
Genbank) the results I can get back from EMBL within an hour... the Genbank
results usually arrive the next day... I have never had to wait days for a
reply. Scientists should learn to utilize the resource that is closest to home.

Rob "it's not where you're from... it's where you're at..." Harper