maulik@PRESTO.IG.COM (Sunil Maulik) (05/09/91)
In response to questions from our users about how to utilize the daily database updates available on the GenBank On-line Service, I would like to point out that the TOIGSF program in the IG-Suite solves this problem. This (long) posting explains how to create a data bank for use with the IG-Suite, and then provides step-by-step instructions on updating the data bank with the nightly GenBank update files.

Please feel free to address any questions concerning this posting to me at the address below.

Sincerely,

Sunil Maulik, Ph.D.
Manager, Customer Services
IntelliGenetics, Inc.
(415) 962-7342
FAX: (415) 962-7302
Technical E-mail: ig-consultant@presto.ig.com (Internet)
Personal E-mail:  maulik@presto.ig.com (Internet)
                  ames!ig.com!presto!maulik (Uucp)

---------------------------- cut here -----------------------------------------

A. CREATING AN UPDATE DATABASE with TOIGSF and IG-SUITE

You can add your own amino acid, nucleic acid, or key data bank to the list of available data banks. You may wish to create a data bank that holds the sequences added to GenBank, EMBL, PIR, or SWISS-PROT between regular releases. The TOIGSF program is now available to convert the raw data bank sequence files to IntelliGenetics format sequence files. When you want to search all of the sequences in a data bank, you should then select both the data bank of the current release and the data bank with the new sequences that you created. You only need to create the data bank once, and you can continue to add new files to it with TOIGSF. The newly created data bank will appear on the list of data banks available for searches in the QUEST, FASTDB, BIFIND, and IFIND programs.

1. Make an "igdatabanks.par" File

Each data bank must have an entry in an "igdatabanks.par" file in the "/usr/igsw/igv54/runtimes" directory (Sun) or the IGRUNTIME: directory (VAX). (An example of such a file is in the Sun "/usr/igsw/igv54/examples/igdatabanks.par" file or the VAX "[IG]igdatabanks.vms" file.)
Copy this file, rename it if necessary, and edit it. The contents of this file must be in the following format:

    <NAME>,<ABB>,<TYPE>,SEQUENCE FILE,<DIRECTORY>

Example:

    New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,/pr0/joe/        (Sun)
    New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,$DISK1:[JOE]     (VAX)

<NAME> is the name of the data bank; this is the name displayed in the list of available data banks (New-GenBank in the example) for the data bank searching programs.

<ABB> is the one-, two-, or three-letter abbreviation of the data bank name (ng in the example). The data bank consists of all of the entries in all of the files in the directory you specified that have the ABB file name extension (all ".ng" files in the "/pr0/joe/" directory (Sun) or $DISK1:[JOE] directory (VAX) in this example). The ABB file name extension must be in lowercase letters.

<TYPE> is NUCLEIC ACID, AMINO ACID, or KEY, depending on what kind of sequence is in the data bank (NUCLEIC ACID in the example); all entries in the data bank must be of the same type.

SEQUENCE FILE indicates that the sequences are in IntelliGenetics Suite sequence file format.

<DIRECTORY> is the computer directory in which the sequence files or key files are located ("/pr0/joe/" (Sun) or $DISK1:[JOE] (VAX) in this example); you must set the permissions on this directory and its files so that they are readable by the users of the data bank.

2. Make an ".IDB" File

Each data bank you make must also have an ".IDB" file, which must have the name <ABB>.IDB (ng.IDB in this example); IDB must be in uppercase letters. (An example of such a file is in the Sun "/usr/igsw/igv54/examples/SEQ.IDB" file or the VAX "[IG]SEQ.IDB" file.) Copy this file and edit it, or make this file with a text editor. The file must be in the same directory as the sequence files for the data bank to be recognized (in this example, the file is "/pr0/joe/ng.IDB" (Sun) or $DISK1:[JOE]ng.IDB (VAX)).
Entries in the ".IDB" file have the format:

    HEADER INFORMATION
    -----------------------------------------------------------------
    RELEASE <#>
    DESCRIPTION <phrase>
    TYPE <type>
    TOTAL <##>

HEADER INFORMATION is one or more lines of free format information; it may give the data bank name, the release number, the date, etc. The line after the HEADER INFORMATION consists of 65 hyphens.

<#> is the release number that you assign to this data bank; it will appear after the data bank name in the list of data banks.

<phrase> is the short description of the data bank; it will appear after the data bank name and release number in the list of data banks. The short description should have no more than 50 characters.

<type> is AMINO ACID, NUCLEIC ACID, or KEY.

<##> is the number of sequences in the data bank.

For example, the ng.IDB file in our example is:

    New-GenBank Data Bank Release 1 April 1991
    -----------------------------------------------------------------
    RELEASE 1
    DESCRIPTION New GenBank Sequences
    TYPE NUCLEIC ACID
    TOTAL 87

3. Check Your Work

When you have created a proper "igdatabanks.par" file, a proper ".IDB" file, and IntelliGenetics format sequence or key files with the proper file name extension, the data bank will appear in the list of available data banks the next time you run a data bank searching program. In this example, the data bank list would be:

    File          - User-specified or indirect file
    A-GENESEQ 2   - Patented amino acid sequences data bank
    New-GenBank 1 - New GenBank Sequences
    GenBank 67    - GenBank nucleic acid sequences data bank
    PIR 23        - Protein Identification Resource sequence data bank
    SWISS-PROT 17 - University of Geneva protein sequence data bank

B. ADDING NIGHTLY UPDATES TO THE DATA BANK

1. Log in as the root or system user:

    su to root                  (Sun)
    login as SYSTEM             (VAX)

2. Make sure you are in the directory where you want the data files containing the nightly updates to reside:

    cd /pr0/joe                 (Sun)
    SET DEF $DISK1:[JOE]        (VAX)

3. Connect to the GenBank Online Service over the Internet using the FTP program:

    ftp genbank.bio.net         (134.172.1.160)
    login: anonymous
    password: your-last-name

4. Change to the directory containing the nightly update data:

    cd /pub/db/gb-newdata

5. Get the README file and determine whether the new sequences replace the previous gbupdate or must be appended to the previous file. (They will replace it only if you have just obtained a new GenBank update from IntelliGenetics.)

6. Set the FTP file transfer mode to binary:

    binary

7. Use the GET command to obtain the file containing the cumulative nightly update:

    get gbseq.all.Z

8. Once the file has been completely transferred, terminate the FTP connection:

    bye (or exit)

9. Uncompress the file containing the updates:

    zcat gbseq.all.Z > gbseq.all        (Sun)

   On VAX systems, use the appropriate VMS decompression routine.

10. Remove the original compressed file:

    rm gbseq.all.Z              (Sun)
    DEL gbseq.all.Z.*           (VAX)

11. Run the TOIGSF program and convert the uncompressed file to IntelliGenetics sequence file format:

    toigsf
    input file      gbseq.all
    input format    genbank
    nucleic acids
    run the conversion

12. TOIGSF will make one file (or more, if there are over 100 sequence entries) with the name xxx_toigsfn.seq. Once these files have been created, delete the original uncompressed file:

    rm gbseq.all                (Sun)
    DEL gbseq.all;*             (VAX)

13. If the sequences are to replace the previous update, then:

    rm gbupdate.ng
    cat *.seq > gbupdate.ng     (Sun)

    DEL gbupdate.ng;*
    APPEND *.seq gbupdate.ng    (VAX)

14. If the sequences are to be added to the previous update, then:

    cat *.seq >> gbupdate.ng    (Sun)
    APPEND *.seq gbupdate.ng    (VAX)

15. Remove the .seq files created by TOIGSF:

    rm *.seq                    (Sun)
    DEL *.seq;*                 (VAX)

16. Edit ng.IDB to reflect the new update date AND the appropriate number of sequences. If the update was replaced, use the number of sequences reported by TOIGSF. If the update was appended, add the old number to the number TOIGSF reports.

17. Alert users to the new data by editing /etc/motd (Sun) or SY$LOGIN.COM (VAX) to reflect the new update.
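
The bookkeeping in the steps above (checking an "igdatabanks.par" entry and adjusting the TOTAL count in the ".IDB" file after a replace or an append) could be scripted. What follows is only a rough sketch in Python, not part of the IG-Suite: the function names parse_databank_entry and update_idb_total are made up for illustration, and the code assumes the TOTAL keyword is followed by a plain integer, as in the ng.IDB example. It does not touch the date in the header, which still has to be edited by hand.

```python
import re

# Sequence types allowed in an igdatabanks.par entry, per the posting.
VALID_TYPES = {"NUCLEIC ACID", "AMINO ACID", "KEY"}


def parse_databank_entry(line):
    """Split one igdatabanks.par entry into its five comma-separated fields.

    Expected layout: <NAME>,<ABB>,<TYPE>,SEQUENCE FILE,<DIRECTORY>
    (hypothetical helper; raises ValueError on a malformed entry).
    """
    fields = [field.strip() for field in line.split(",", 4)]
    if len(fields) != 5:
        raise ValueError("expected 5 comma-separated fields")
    name, abb, seq_type, file_kind, directory = fields
    # The abbreviation is one to three lowercase letters.
    if not (1 <= len(abb) <= 3 and abb.isalpha() and abb.islower()):
        raise ValueError("ABB must be 1-3 lowercase letters: %r" % abb)
    if seq_type not in VALID_TYPES:
        raise ValueError("unknown TYPE: %r" % seq_type)
    if file_kind != "SEQUENCE FILE":
        raise ValueError("fourth field must be SEQUENCE FILE")
    return {"name": name, "abb": abb, "type": seq_type, "directory": directory}


def update_idb_total(idb_text, toigsf_count, replaced):
    """Return the .IDB text with its TOTAL count adjusted.

    replaced=True  -> the update file was replaced: use TOIGSF's count.
    replaced=False -> the update file was appended to: old count + TOIGSF's.
    """
    matches = list(re.finditer(r"\bTOTAL\s+(\d+)", idb_text))
    if not matches:
        raise ValueError("no TOTAL field found in .IDB text")
    last = matches[-1]  # use the last TOTAL in case the header mentions one
    old_total = int(last.group(1))
    new_total = toigsf_count if replaced else old_total + toigsf_count
    return idb_text[:last.start(1)] + str(new_total) + idb_text[last.end(1):]


if __name__ == "__main__":
    entry = parse_databank_entry(
        "New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,/pr0/joe/")
    print(entry["directory"] + entry["abb"] + ".IDB")

    example_idb = (
        "New-GenBank Data Bank Release 1 April 1991\n"
        + "-" * 65 + "\n"
        "RELEASE 1\n"
        "DESCRIPTION New GenBank Sequences\n"
        "TYPE NUCLEIC ACID\n"
        "TOTAL 87\n"
    )
    # Suppose TOIGSF reported 12 new sequences and the update was appended.
    print(update_idb_total(example_idb, 12, replaced=False))
```

Working on the text rather than on the file directly keeps the sketch independent of Sun versus VAX path conventions; reading and writing the actual ng.IDB file is left to whoever adapts it.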
smith@mcclb0.med.nyu.edu (05/11/91)
In article <CMM.0.90.2.673772500.maulik@presto.ig.com>, maulik@PRESTO.IG.COM (Sunil Maulik) writes:
>
> In response to questions from our users about how to utilize the daily
> database updates available on the GenBank On-line Service, I would
> like to point out that the TOIGSF program in the IG-Suite solves this
> problem.

From a probably erroneous reading of this posting it looks as though you are suggesting daily FTPing of the new stuff from GenBank. The USENET distribution was aimed at avoiding this by allowing GenBank to send it all out just ONCE (via USENET), and then have people pick it up from the local USENET delivery. Do you have instructions as to how to use the local files created by NEWS from the nightly delivery of the sequence data, thereby avoiding having to FTP it?

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology, NYU Medical Center, 550 First Ave., NYC, 10016   |
|E-Mail: SMITH@NYUMED.BITNET (BITNET), SMITH@MCCLB0.MED.NYU.EDU (Internet)  |
+---------------------------------------------------------------------------+