kristoff@GENBANK.BIO.NET (Dave Kristofferson) (01/19/91)
We have had a request to keep both the current GenBank release AND the previous GenBank release in the FTP directory as standard policy. This places certain demands on disk space obviously. We would like to know how many users really think that this feature would be of use and why. Please feel free to post your responses publicly to genbank-bb@genbank.bio.net (this will not be applicable to most European BIOSCI readers except those with Internet connections). Sincerely, Dave Kristofferson GenBank Manager kristoff@genbank.bio.net
roy@ALANINE.PHRI.NYU.EDU (Roy Smith) (01/19/91)
Dave Kristofferson says: > We have had a request to keep both the current GenBank release AND the > previous GenBank release in the FTP directory as standard policy. To be honest, I'm hard pressed to find a reason why anybody would want to see release N-1 of GenBank. I certainly don't see a reason to use valuable public disk space for this purpose. What could possibly be in an older release that's not also in the current one? -- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy "Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
jkramer@molbio.med.miami.edu (Jack Kramer) (01/19/91)
In article <9101190214.AA17249@alanine.phri.nyu.edu> roy@ALANINE.PHRI.NYU.EDU (Roy Smith) writes: >Dave Kristofferson says: >> We have had a request to keep both the current GenBank release AND the >> previous GenBank release in the FTP directory as standard policy. > > To be honest, I'm hard pressed to find a reason why anybody would >want to see release N-1 of GenBank. I certainly don't see a reason to use >valuable public disk space for this purpose. What could possibly be in an >older release that's not also in the current one? >-- I am one of those who requested that the previous release UPDATE files be kept on line for some overlap period after a new release. My primary reason for this is that I maintain two major software packages (IG and GCG) that work with the databases. I make every attempt to keep all the databases as up to date as possible. Each of the packages uses a proprietary format for the data. Even though it is ultimately possible, it is extremely inconvenient to download the entire databases as soon as a new version is posted and reformat them to the proprietary formats. I therefore usually depend on the vendors' normal distributions for full releases. There is a delay from the GenBank release date until the vendors get the new release, reformat it, and distribute it to their customers. This can be a several week delay. My request was that the previous updates be kept online for a reasonable period to allow those dependent upon vendor distribution to get the new baseline release. This is mainly to prevent any confusion and mistakes which could affect the work of the database and software package users. Now that the new feature table fiasco is finally over, and the state of update files is well documented on-line this may all be moot. But I still feel a little uncomfortable about everything being deleted for the previous release as soon as a new release is available at GenBank. This is not a complaint about GenBank.
The anonymous ftp service is a real lifesaver for me and I really appreciate all the cooperation and service I have received from the GenBank staff.
BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) (01/19/91)
David K. recently asked us to comment on the idea of keeping both the current GenBank release AND the previous GenBank release in the FTP directory as standard policy. Personally I don't care since I do not download these. However, we have the GCG programs at our site and it would be very nice if we could ftp the latest GenBank release which has been converted to the GCG format. It seems that a lot of us in netland have to go through a lot of work to convert the databases from GenBank to GCG format before we can use them and it would be nice if we could just ftp the database in a format which we don't have to fool with before using. At present we (GCG sites) either have to pay $1600 for the tapes from GCG or download the databases from GenBank and convert them ourselves to GCG format. This is money and time we could spend on other, more productive work. Is there a site we can ftp the databases in GCG format out there? If not, why not? I await responses with bated breath.... Thanks Bruce A. Roe Professor of Chemistry and Biochemistry INTERNET: BROE@aardvark.ucs.uoknor.edu BITNET: BROE@uokucsvx AT&TNET: 405-325-4912 or 405-325-7610 SnailNet: Department of Chemistry and Biochemistry University of Oklahoma 620 Parrington Oval, Rm 208 Norman, Oklahoma 73019 FAXnet: 405-325-6111
reisner@ee.su.oz.au (Alex Reisner) (01/20/91)
>We have had a request to keep both the current GenBank release AND the >previous GenBank release in the FTP directory as standard policy. > Dave Kristofferson > GenBank Manager ======================================================================= For our part we download new releases as soon as they become available via the Internet. They are converted to PIR format for use by the packages we've purchased and the in house software we run. The previous release is then compressed and moved to an 8 mm Exabyte cartridge. Therefore, we don't require holding the previous release on GOS discs. One inexpensive option that may be open to GOS is to place on-line the previous version which will be on CD-ROM starting this quarter via a symbolic link. That should be a fairly cheap solution. It wouldn't be in compressed format but at least it would be available. Alex Reisner (Australian Genomic Information Service)
jkramer@molbio.med.miami.edu (Jack Kramer) (01/21/91)
In article <9101191404.AA18024@genbank.bio.net> BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) writes: > >go through a lot of work to convert the databases from GenBank to >GCG format before we can use them and it would be nice if we could >just ftp the database in a format which we don't have to fool with >before using. > >At present we (GCG sites) either have to pay $1600 for the tapes >from GCG or download the databases from GenBank and convert them >ourselves to GCG format. This is money and time we could spend on >other, more productive work. Is there a site we can ftp the databases >in GCG format out there? If not, why not? > My correspondence with major vendors over the past few years on the ftp availability of proprietary formatted versions of the government-provided sequence databases would fill a small book. After lots of haggling over all the details it seems to all boil down to the fact that the databases will never be available for "free" as long as these commercial vendors can make additional profit by reformatting and distributing the databases. I have never seen any argument which "yet" justifies the reformatting other than the profit motive.
roy@phri.nyu.edu (Roy Smith) (01/21/91)
jkramer@molbio.med.miami.edu (Jack Kramer) writes: > I am one of those who requested that the previous release UPDATE files > be kept on line for some overlap period after a new release. My primary > reason for this is that I maintain two major software packages [...] Each > of the packages uses a proprietary format for the data. Perhaps I misunderstood the original posting; I thought the request was to keep the *entire* previous release on line. Just the updates sounds more reasonable. But, the real reason I'm following up to Jack's posting is to flame the software vendors. The idea of each vendor having a proprietary format for GenBank is nuts. Do vendors really think it's a good idea for people who use two or more packages to have to keep two or more complete copies of the database on-line? Or do they just think that their package is so wonderful, so complete, and so able to fulfill the needs of every user at every site that nobody might ever want to possibly run any software other than their own? I could see how you could make a point for reformatting the database to be in some drastically better format (a relational data base, for example), but many of the reformats I've seen have been nothing more than trivial textual changes that don't make it better, they just make it different. For example, Ross Smith and I both maintain complete copies of GenBank (and other databases) on different machines on the same LAN. For a while, we've been talking about just having a single copy which one of us would NFS mount from the other's disk. A couple of days ago, I got to look at his copy of GenBank. It's still formatted as plain old ascii flat files, but his software vendor decided it was important to insert lines starting with >'s to delimit loci, instead of the "//" delimiter that the files have coming off the tape from IG.
There were a couple of other textual differences which I didn't study too closely, but it was obvious that none of them were fundamental changes; they didn't make the file substantially better than it was before, just different. Enough so that in order for us to share a single copy of the database, one of us would have to re-write a lot of our software to know about the format of the other's database. Assuming the only difference is purely reformatting the text, then there is no excuse. If there is some added information, then it seems to me that the best thing would have been to create a parallel flat file with the extra info; the vendor's programs could read both files and other programs that wanted to see a virgin GB file could see that too. If the vendor wanted some sort of index into the file, they could have made an index that pointed into the original file; again, programs that wanted the virgin file could just ignore the index. > This is not a complaint about GenBank. The anonymous ftp service is > a real lifesaver for me and I really appreciate all the cooperation > and service I have received from the GenBank staff. I'll go along with that. I've had some minor disagreements with the GenBank folks, but even the closest long-term collaborators don't always agree 100%. By and large, the GB people (both at IG and LANL) have gone out of their way to service every request we have made of them, even when those requests haven't been entirely reasonable. -- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy "Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
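[Editorial illustration, not part of the original posting.] The idea of an index that points into the untouched GenBank file can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the function names `build_locus_index` and `fetch_entry` are hypothetical, and the only format facts relied on are that each entry begins with a LOCUS line and ends with a "//" line. It does not represent how any vendor actually built its indexes.

```python
def build_locus_index(handle):
    """Map each LOCUS name to the byte offset where its entry starts.

    `handle` is assumed to be a binary file object positioned at the
    start of a GenBank-format flat file. The file is only read, never
    rewritten, so other programs can keep using the original copy.
    """
    index = {}
    while True:
        offset = handle.tell()          # position before reading the line
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b"LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                index[fields[1].decode("ascii")] = offset
    return index


def fetch_entry(handle, index, name):
    """Return the raw text of one entry, up to and including its "//" line."""
    handle.seek(index[name])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.rstrip() == b"//":      # GenBank end-of-entry marker
            break
    return b"".join(lines).decode("ascii")
```

The index can be stored in a separate file next to the database; programs that want the plain flat file simply ignore it.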
Cherry@Frodo.MGH.Harvard.EDU (J. Michael Cherry) (01/21/91)
In article <9101191404.AA18024@genbank.bio.net> BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) writes: > David K. recently asked us to comment on the idea of keeping both > the current GenBank release AND the previous GenBank release in the > FTP directory as standard policy. I see no reason for GenBank.Bio.Net to keep the old release files online. Each release exists on GenBank.Bio.Net for about three months before it's deleted and replaced with the next release. That seems like plenty of time for anyone to retrieve the files if GenBank is important to their site. If someone really needs an easy archive of old versions they should subscribe to the CD-ROM distributions. > Is there a site we can ftp the databases in GCG format out there? If > not, why not? I know of no sites that make the GCG format of the database open to the public. GCG's format once was just NBRF's format but they have been moving away from that in recent years. I don't think GenBank.Bio.Net should provide anything but the GenBank format files. I would be willing to provide the GCG formatted database to the net but I really don't want to be the only site in the world providing access. If other sites around the world, or at least the US, are interested in being regional ftp access sites for the GCG formatted database please let me know. As a closing note the reformatting from GenBank to GCG is quite simple, involving two commands to rebuild the entire database. Transferring the database via the Internet can take longer in real time than the GCG reformatting process. Mike Cherry cherry@frodo.mgh.harvard.edu Department of Molecular Biology Massachusetts General Hospital, Boston 617-726-5955
kristoff@genbank.bio.net (David Kristofferson) (01/22/91)
Bruce, As I am sure you are aware, it is not in GenBank's charter to supply the databank in any commercial format. Reformatting costs money regardless of who does it. If we were required to reformat the database as you suggest, we would be obligated to provide it for *every* commercial vendor. This is clearly impractical. Also since many users do not have access to FTP, they would still have to rely on tape or CDROM distributions. The net effect of this would be to delay the production of GenBank tremendously. Reformatting GenBank clearly belongs where it is right now, in the hands of the commercial vendors. Sincerely, Dave Kristofferson GenBank Manager kristoff@genbank.bio.net
kristoff@genbank.bio.net (David Kristofferson) (01/22/91)
The issue of why commercial vendors choose their own format goes way back. A common complaint in the "good old days" was that GenBank was "always changing their format." Commercial vendors did not feel that they could reliably support their users if the format of the data that they were receiving was not consistent. A considerable investment in time, money, and accumulated data has been made in the interim by vendors and the users of their software. Note, however, that when GenBank changed the features table format recently, there was still a lot of controversy despite the fact that many attempts had been made to alert users in advance. Having been on both sides of the fence there is undoubtedly blame to go around everywhere. I do not think that one can allege any kind of commercial conspiracy here, because it also costs the companies a significant amount of money to fiddle with these conversions. IBM may be able to "lock people in" with proprietary products because of their size, but this is not a significant consideration in this rather little arena. When people buy commercial software and pay a not insignificant sum, they expect to get something for their money. I can understand it if, having been burned in the past, most vendors still use their own formats. Remember that GenBank has been based on five year contracts, the second of which will end in another year and 9 months. Each change brings potential uncertainty although it appears that the GenBank format will continue to be produced after the end of the current contract. Whether it is more cost effective for vendors to change formats is a decision which is up to them since each faces their own market conditions with their own set of resources. As you are well aware, the National Center for Biotechnology Information is trying to establish another format using ASN.1 to try to develop a new standard for this area.
If this is well thought out and well received by the user community, perhaps this will eventually put an end to some of these issues. Until some reliable degree of stability is assured to any format, others will undoubtedly continue to exist.
jkramer@molbio.med.miami.edu (Jack Kramer) (01/22/91)
In article <Jan.21.09.32.02.1991.7956@genbank.bio.net> kristoff@genbank.bio.net (David Kristofferson) writes: >Bruce, > > As I am sure you are aware, it is not in GenBank's charter to >supply the databank in any commercial format. Reformatting costs I think all the comments here have been directed to the commercial distributors. At least my intent was completely in that direction. GenBank is to be commended on providing the original databases for access via Internet. This is definitely not true for all government sponsored sequence databases. Try getting PIR from the NBRF. The number of price lists and order forms they have sent me in response to requests for network access must by now exceed the volume of the actual database. >the production of GenBank tremendously. Reformatting GenBank clearly >belongs where it is right now, in the hands of the commercial vendors. I am completely satisfied with current and planned GenBank formats. And if all the vendors standardized on the GenBank format it would certainly make life much easier. No information is added by any of the proprietary vendors in any of the proprietary formats. And indexing schemes which seem to be the most common justification, work just as well with the original GenBank format as any other.
BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) (01/23/91)
Hi, Obviously the problem with the databases, their formats, and programs to access the databases continues. As with most things in life there are no simple solutions until, of course, the solution is found and then everyone says: "My, that solution was so simple, why didn't we think of it before". The solution is rather simple, a common, stable, database format. Without this, vendors of software have 2 choices: 1. Reformat the databases to fit their software 2. Change their software to read the distributed databases. Until now the choice of the vendors has been the former, mainly because the format of the databases was (and still is) in a state of change. It is more efficient to write a program to change the database format than it is to change the multitude of code for dealing with the databases. John and the folks at GCG have provided tools for converting GenBank to GCG format and for inter-converting individual sequences from one format to another. The Staden programs read the databases stored in the PIR format but can analyze individual sequences stored in any of several formats. I do not know what IG does in their package but am sure they have some similar approaches or do they use GenBank without reformatting? David K. has written: > As I am sure you are aware, it is not in GenBank's charter to >supply the databank in any commercial format. Reformatting costs >money regardless of who does it. If we were required to reformat the >database as you suggest, we would be obligated to provide it for >*every* commercial vendor. This is clearly impractical. Also since >many users do not have access to FTP, they would still have to rely on >tape or CDROM distributions. The net effect of this would be to delay >the production of GenBank tremendously. Reformatting GenBank clearly >belongs where it is right now, in the hands of the commercial vendors. Give me a break. How many many vendors is *every* ? Do folks really search the entire GenBank from their pc's?
Some search the protein databases on their pc's/macs but the entire GenBank? Could we at least concentrate our discussion on MainFrame computer programs and databases on these. Maybe I'm mistaken but I count three MainFrame program sets as the vast majority used, GCG, IG, and NBRF/PIR. A few sites have the Staden programs but most of us who use the Staden programs use them for purposes other than database searching. In reality, Bill Pearson's FASTA and companion programs probably are used the most and they handle the GCG formatted databases. I think what we need here is a survey of what's out there. If we limit our discussion to Main Frame programs and FTP sites and not deal with individual users but rather with sites. I also do not think we should consider other forms of the databases, such as those which require pre-processing for the NLM BLAST programs or GCG's QUICKSEARCH. The problem is time and money. If GCG supplies users with tapes for $1600 they make money but they sure save me lots of time and I get ALL the databases we want and need in a format we can use. I also do not have to worry about transmission error which may corrupt an ftp-ed database. If I get the GenBank tapes I still have to pay (although less) but then I have to spend time re-formatting databases and also get additional tapes from PIR and maybe others which could bring the cost in tapes and effort to a figure greater than the cost from GCG. No matter what it looks like the NIH is going to pay the bills, either from individual grants or from contracts to GenBank/IG. I'd like to hear from the funding agencies and also like comments from those who supply databases to the rest of us. My overall conclusions are: (1) pay the money to GCG and get quarterly database updates on tape as it is the least hassle for me and our system folks. (2) encourage users to search the latest databases using FASTA-Mail,etc. 
(3) continue to join with others to encourage discussions which will result in a common, stable database format. Best to one and all, Bruce A. Roe Professor of Chemistry and Biochemistry INTERNET: BROE@aardvark.ucs.uoknor.edu BITNET: BROE@uokucsvx AT&TNET: 405-325-4912 or 405-325-7610 SnailNet: Department of Chemistry and Biochemistry University of Oklahoma 620 Parrington Oval, Rm 208 Norman, Oklahoma 73019 FAXnet: 405-325-6111 ICBMnet: 35 deg 14 min North, 97 deg 27 min West
kristoff@GENBANK.BIO.NET (Dave Kristofferson) (01/24/91)
Bruce, One "break" coming up 8-)! I may respond in more detail later, but one quick note now. I think that you *will* find people who search the entire database on PC's. They just turn the thing on and walk away for the evening. There's a sizable crowd that doesn't want anything to do with "mainframes" if they can avoid them. I have to leave official statements as to what the government can and can't do to government officials, but there are many people that use commercial PC and Mac programs, and it is not clear to me that the government can provide the data in a format specifically tailored to a subset of commercial programs. Another thing that comes to mind is what happens if tapes sent out in XYZ format by GenBank cause a problem. Does GenBank get access to the commercial software and run tests first to ensure that the stuff works? Does this task remain in the hands of the vendors and result in constant exchanges between the vendor and GenBank? It strikes me that GenBank would have to accept some degree of responsibility for this, but why should a public agency get involved in support for a commercial company? Because people out there can't come up with $1600 a year for a commercial tape subscription (* see below)? It strikes me as being a lot cleaner for GenBank simply to provide its own format and for the vendors to adapt. We both agree that this will only occur if there is continued stability in the format, but this has not happened yet and may still not happen for some time unfortunately. Technological progress in its own right has a nasty habit of requiring format changes. Dave * - When I used to work at UCSF I never saw people cringe at spending several hundred dollars on radionucleotides, GTP, etc. However, there is some kind of psychological barrier when it comes to spending money on computers and software (maybe because it's easier to copy software than it is to make reagents!).
kristoff@GENBANK.BIO.NET (Dave Kristofferson) (01/24/91)
Bruce, P.S. - The people at NCBI will be running the next GenBank contract and have it in their charter to develop standards. I would hope that they would comment on these issues. NCBI has already held developers meetings on their proposed ASN.1 standard which I hope people are anticipating. I am all for minimizing disruptions to people's software (I'm still suffering from some of these problems myself), but the longer term future is in the hands of the NIH. They can dictate what the contractors are required to provide. Dave
JREES@vax.oxford.ac.uk (01/24/91)
Time I guess to say my piece also... It shouldn't even need questioning that reformatting the databases from one format to another is inappropriate, and I won't reiterate the arguments made before here, as that has been done already. But I am going to add my voice to those who would see ALL the software packages accept the format distributed by the databases as the one to use for access to the databases in straight ascii format. Will Gilbert articulated this whole area very clearly on INFO-GCG last August, and perhaps he can be persuaded to repost to this discussion, but in essence it is very simple for everyone using the flat format files to interface to them in "native" format and to provide utilities to index the access in a fashion which facilitates access for their own package. Some programmers seem willing to do this (Rodger Staden for one has stated a willingness to use whatever format is chosen by PIR, and has no objection to using "native" format if they do), others seem very determined to go their own way in the face of opposition (perhaps rather silent opposition until now) from those who are actually on the receiving end of the effect of this dogma. Since there can be no programming advantage that I can see for the reformatting the question is why is no one willing to standardise? It is clear that the standard HAS to be that created by the database in question, and that software can be written to meet whatever format it is presented with, and that all the packages COULD use whatever format the database was presented to them in (EMBL, Genbank, PIR, Codata, whatever) by setting the appropriate parameter to the package at the start. Perhaps if this were done then there would be less time and effort wasted making multiple copies for everyday use.
The problems the Genbank/IG have with disk space probably apply to most of us - my own operations running software and databases in Oxford and at MIT run 700MB on each machine, reformatting databases generally means finding 150 MB of spare space at a minimum, more if the active version is not deleted first, and this is getting worse as the databases get larger each year. Clearly the change we could have all hoped for in the construction of the relational format Genbank and Embl has not yet gained the active interest of the programming community as an option, it is my hope that this will be the way forward in the long term, and that those in a position to advance this will do so on this forum. Finally to avoid the point being missed, I am fully in support of the use for total reformatting where it achieves a significant change in the response to the user, the preprocessing required to run BLAST or GCG's Quick software is an investment well worth making - even when the overall cost in resource has been higher, but the problem under discussion here does NOT achieve that end. Jasper Rees Jrees@Vax.Oxford.ac.uk (%nsfnet-relay.ac.uk) Seqtest@Wccf.MIT.edu "One World, One database format ?"
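[Editorial illustration, not part of the original posting.] The suggestion that a package could read whatever flat-file format it is handed "by setting the appropriate parameter" might look something like the sketch below. Everything here is an assumption made for illustration: the function name `split_records` and the format labels "genbank" and "marker" are hypothetical, and only two record conventions mentioned in this thread are covered (entries terminated by a "//" line, and entries introduced by a line starting with ">"). Real EMBL, PIR, or GCG files have further details that are not modelled.

```python
def split_records(lines, fmt):
    """Yield each record as a list of lines with newlines stripped.

    fmt="genbank": a record ends at a line containing only "//".
    fmt="marker":  a record begins at a line starting with ">".
    Both labels are hypothetical names chosen for this sketch.
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if fmt == "genbank":
            if line.strip() == "//":
                if record:
                    yield record
                record = []
            else:
                record.append(line)
        elif fmt == "marker":
            if line.startswith(">") and record:
                yield record            # previous record is complete
                record = []
            record.append(line)
        else:
            raise ValueError(f"unknown format: {fmt!r}")
    if record:                          # last record without a terminator
        yield record
```

The rest of a program would then work on the yielded records and would not need to know which on-disk convention produced them, so a single copy of the database could serve more than one package.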
droufa@MATT.KSU.KSU.EDU (Donald J Roufa) (01/24/91)
Bruce Roe writes: > > Give me a break. How many many vendors is *every* ? Do folks really > search the entire GenBank from their pc's? Some search the protein > databases on their pc's/macs but the entire GenBank? > > Could we at least concentrate our discussion on MainFrame computer > programs and databases on these. Maybe I'm mistaken but I count > three MainFrame program sets as the vast majority used, GCG, IG, and > NBRF/PIR. A few sites have the Staden programs but most of us who use > the Staden programs use them for purposes other than database searching. > > In reality, Bill Pearson's FASTA and companion programs probably are > used the most and they handle the GCG formatted databases. > > I think what we need here is a survey of what's out there. If we limit > our discussion to Main Frame programs and FTP sites and not deal with > individual users but rather with sites. I also do not think we should > consider other forms of the databases, such as those which require > pre-processing for the NLM BLAST programs or GCG's QUICKSEARCH. > > Although I agree that users carrying out sophisticated analyses of GenBank on PCs are severely handicapped by their machines' command of memory and speed, as a 'bench' molecular biologist who frequently uses GenBank information for experiments, I have found that most of my searches are not sophisticated ones. They simply are requests to retrieve a single locus within the database, or, as Bruce asserts, are TFASTA sequence comparisons. The former are most conveniently done on our laboratory PCs using the CD-ROM or floppy report release. The latter are best done, as suggested, on mainframes or via e-mail directly to GenBank, but they can also be carried out in just a few minutes (20 minutes to be precise) for TFASTA search on a 80386 PC. Since the vast majority of working laboratories have access to PC's, whereas only a subset of them have close ties to mainframes, I think that Dr. 
Roe's suggestion would not be in the best interests of the entire population of working molecular biologists. In addition, it has been my experience that, despite the fact that I do use our university's mainframe and unix network for GenBank work, as a research scientist I have little influence over our institution's allocation of mainframe computing resources. In contrast, I have complete control over our local microcomputing resources, and can tailor our database needs for research quite specifically at that level. Inasmuch as GenBank is, in fact, a research resource, it is important that we not lose sight of its use by people who are depositing data in the database. -- Don Roufa Division of Biology Kansas State Univ. Manhattan, KS 66506 E-Mail: DROUFA@MATT.KSU.KSU.EDU DROUFA@KSUVM.KSU.EDU Tel: (913) 532-6641 Fax: (913) 532-6653
Cherry@Frodo.MGH.Harvard.EDU (J. Michael Cherry) (01/25/91)
In article <CMM.0.88.664670183.kristoff@genbank.bio.net> kristoff@GENBANK.BIO.NET (Dave Kristofferson) writes: > The people at NCBI will be running the next GenBank contract > and have it in their charter to develop standards. I would hope that > they would comment on these issues. NCBI has already held developers > meetings on their proposed ASN.1 standard which I hope people are > anticipating. Please forgive my nitpicking that follows but I'd hate to see things get more confused. This is not directed to Dave's posting; I just quoted it so you would see where things start. The NCBI proposed database standard is built using a transaction/notation standard called ASN.1. ASN.1 has been adopted by several commercial computer and software companies for a variety of applications. ASN.1 is not the name of the NCBI standard. I believe the NCBI refers to the database format by the name of their nascent database - GenInfo Backbone. You can retrieve a copy of the GenInfo Backbone format version 0.5 via anonymous ftp from ncbi.nlm.nih.gov. Look in the toolbox/asn_0.5 directory. One more little point if I may. Several people have referred to "mainframe" computers in this discussion of formats. A mainframe computer is a very large computer typically produced by IBM. There are few if any dedicated mainframe computers run for molecular biologists. However Digital's VAX computers - generally called mini computers are everywhere. However, currently almost all the computers being sold by Digital, Sun Microsystems, HP, Apple and even IBM are microcomputers. Sun and others may call them supermicros - but that is just marketing. Mike Cherry cherry@frodo.mgh.harvard.edu Department of Molecular Biology Massachusetts General Hospital, Boston 617-726-5955