[bionet.software] Using the GenBank nightly updates with the IG-Suite

maulik@PRESTO.IG.COM (Sunil Maulik) (05/09/91)

In response to questions from our users about how to utilize the daily
database updates available on the GenBank On-line Service, I would
like to point out that the TOIGSF program in the IG-Suite solves this
problem.

This posting (long) explains how to create a data bank for use with
the IG-Suite, and then provides step-by-step instructions on updating
the data bank with the nightly GenBank update files.

Please feel free to send any questions about this posting to me at
the address below.

Sincerely,

Sunil Maulik, Ph.D.
Manager
Customer Services
IntelliGenetics, Inc.
(415) 962-7342
FAX: (415) 962-7302

Technical E-mail: ig-consultant@presto.ig.com (Internet)
Personal  E-mail: maulik@presto.ig.com (Internet)
		  ames!ig.com!presto!maulik (Uucp)

---------------------------- cut here -----------------------------------------

A. CREATING AN UPDATE DATABASE with TOIGSF and IG-SUITE

You can add your own amino acid, nucleic acid, or key data bank to the
list of available data banks.  You may wish to create a data bank that
holds the sequences added to GenBank, EMBL, PIR, or SWISS-PROT between
regular releases.  The TOIGSF program is now available to convert the
raw data bank sequence files to IntelliGenetics format sequence files.
When you want to search all of the sequences in a data bank, you
should then select both the data bank of the current release and the
data bank with the new sequences that you created.  You only need to
create the data bank once, and you can continue to add new files to it
with TOIGSF.  The newly created data bank will appear on the list of
data banks available for searches in the QUEST, FASTDB, BIFIND, and
IFIND programs.

1.  Make an "igdatabanks.par" File

Each data bank must have an entry in an "igdatabanks.par" file in the
"/usr/igsw/igv54/runtimes" directory (Sun) or the IGRUNTIME: directory
(VAX).  (An example of such a file is in the Sun
"/usr/igsw/igv54/examples/igdatabanks.par" file or the VAX
"[IG]igdatabanks.vms" file.) Copy this file, rename it if necessary,
and edit it.  The contents of this file must be in the following
format:
	<NAME>,<ABB>,<TYPE>,SEQUENCE FILE,<DIRECTORY>

Example: New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,/pr0/joe/	(Sun)
         New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,$DISK1:[JOE]	(VAX)

<NAME> is the name of the data bank; this is the name displayed in the
list of available data banks (New-GenBank in the example) for the
data bank searching programs.

<ABB> is the one-, two-, or three-letter abbreviation of the data bank
name (ng in the example). The data bank consists of all of the entries
in all of the files in the directory you specified that have the ABB
file name extension (all ".ng" files in the "/pr0/joe/" directory
(Sun) or $DISK1:[JOE] directory (VAX) in this example).  The ABB
filename extension must be in lower case letters. 

<TYPE> is NUCLEIC ACID, AMINO ACID, or KEY, depending on what kind of
sequence is in the data bank (NUCLEIC ACID in the example); all entries
in the data bank must be of the same type.  SEQUENCE FILE indicates
that the sequences are in IntelliGenetics Suite sequence file format.

<DIRECTORY> is the computer directory in which the sequence files or
key files are located ("/pr0/joe/" (Sun) or $DISK1:[JOE] (VAX) in this
example); you must set the permissions on this directory and its files
so that they are readable by the users of the data bank.  
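
The entry format above can be checked mechanically before the file is
installed.  The following is only a minimal sketch in Python, not part
of the IG-Suite; the names parse_par_line and DataBankEntry are
hypothetical, and it assumes that no field other than <DIRECTORY>
contains a comma.

```python
from typing import NamedTuple

# The three data bank types named in the posting.
VALID_TYPES = {"NUCLEIC ACID", "AMINO ACID", "KEY"}


class DataBankEntry(NamedTuple):
    """One line of an igdatabanks.par file (hypothetical helper type)."""
    name: str
    abbreviation: str
    seq_type: str
    file_kind: str
    directory: str


def parse_par_line(line: str) -> DataBankEntry:
    """Split '<NAME>,<ABB>,<TYPE>,SEQUENCE FILE,<DIRECTORY>' and validate it."""
    # Split on the first four commas only, so the directory is kept whole.
    parts = [p.strip() for p in line.strip().split(",", 4)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated fields, got {len(parts)}")
    name, abb, seq_type, file_kind, directory = parts
    if not (1 <= len(abb) <= 3 and abb.isalpha()):
        raise ValueError(f"abbreviation must be 1 to 3 letters: {abb!r}")
    if seq_type not in VALID_TYPES:
        raise ValueError(f"unknown data bank type: {seq_type!r}")
    if file_kind != "SEQUENCE FILE":
        raise ValueError(f"fourth field must be 'SEQUENCE FILE': {file_kind!r}")
    return DataBankEntry(name, abb, seq_type, file_kind, directory)


if __name__ == "__main__":
    print(parse_par_line("New-GenBank,ng,NUCLEIC ACID,SEQUENCE FILE,/pr0/joe/"))
```

The same function accepts the VAX form of the example, since the
directory field is taken verbatim.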

2.  Make an ".IDB" File

Each data bank you make must also have an ".IDB" file, which must have
the name <ABB>.IDB (ng.IDB in this example); IDB must be in uppercase
letters. (An example of such a file is in the Sun
"/usr/igsw/igv54/examples/SEQ.IDB" file or the VAX "[IG]SEQ.IDB"
file.) Copy this file and edit it or make this file with a text
editor. The file must be in the same directory as the sequence files
for the data bank to be recognized (in this example, the file is
"/pr0/joe/ng.IDB" (Sun) or $DISK1:[JOE]ng.IDB (VAX) ).  Entries in the
".IDB" file have the format:

HEADER INFORMATION
---------------------------------------------------------------
RELEASE <#>
DESCRIPTION <phrase>
TYPE <type>
TOTAL <##>

HEADER INFORMATION is one or more lines of free format information; it
may give the data bank name, the release number, the date, etc.  The
line after the HEADER INFORMATION consists of 65 hyphens.  <#> is the
release number that you assign to this data bank; it will appear after
the data bank name in the list of data banks.  <phrase> is the short
description of the data bank; it will appear after the data bank name
and release number in the list of data banks. The short description
should have no more than 50 characters.  <type> is AMINO ACID, NUCLEIC
ACID, or KEY.  <##> is the number of sequences in the data bank.  The
ng.IDB file in our example is:

                     New-GenBank Data Bank
                             Release 1
                             April 1991
---------------------------------------------------------------
RELEASE 1
DESCRIPTION New GenBank Sequences
TYPE NUCLEIC ACID
TOTAL 87
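
If you prefer to generate the ".IDB" file rather than type it, the
layout above is simple enough to produce from a short script.  This is
a sketch under stated assumptions: build_idb_text is a hypothetical
name, the separator length of 65 is taken from the description above,
and the header lines are written exactly as passed in.

```python
VALID_TYPES = ("AMINO ACID", "NUCLEIC ACID", "KEY")


def build_idb_text(header_lines, release, description, seq_type, total):
    """Return the text of an .IDB file in the layout described above."""
    if len(description) > 50:
        raise ValueError("description should have no more than 50 characters")
    if seq_type not in VALID_TYPES:
        raise ValueError(f"unknown data bank type: {seq_type!r}")
    lines = list(header_lines)          # free-format header information
    lines.append("-" * 65)              # separator line of 65 hyphens
    lines.append(f"RELEASE {release}")
    lines.append(f"DESCRIPTION {description}")
    lines.append(f"TYPE {seq_type}")
    lines.append(f"TOTAL {total}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    header = [
        "                     New-GenBank Data Bank",
        "                             Release 1",
        "                             April 1991",
    ]
    # Print the file text; redirect it to ng.IDB in the data bank directory.
    print(build_idb_text(header, 1, "New GenBank Sequences",
                         "NUCLEIC ACID", 87), end="")
```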

3.  Check Your Work

When you have created a proper "igdatabanks.par" file, a proper ".IDB"
file, and IntelliGenetics format sequence or key files with the proper
file name extension, the next time you run a data bank searching
program, the data bank will appear in the list of available data
banks. In this example, the data bank list would be:

File		- User-specified or indirect file
A-GENESEQ 2	- Patented amino acid sequences data bank 
New-GenBank 1	- New GenBank Sequences
GenBank 67	- GenBank nucleic acid sequences data bank
PIR 23		- Protein Identification Resource sequence data bank 
SWISS-PROT 17 	- University of Geneva protein sequence data bank
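
If the data bank does not appear in the list, the two file-level
requirements from steps 1 and 2 can be checked with a few lines of
Python.  This is only a sketch: check_data_bank is a hypothetical name,
it does not read "igdatabanks.par", and it does not test whether other
users have read permission on the directory.

```python
from pathlib import Path


def check_data_bank(directory, abbreviation):
    """Return a list of problems found in a data bank directory."""
    d = Path(directory)
    if not d.is_dir():
        return [f"{d} is not a directory"]
    problems = []
    # Step 2: the file must be named <ABB>.IDB, with IDB in uppercase.
    idb = d / f"{abbreviation}.IDB"
    if not idb.is_file():
        problems.append(f"missing {idb.name}")
    # Step 1: the data bank is every file with the lowercase ABB extension.
    suffix = f".{abbreviation}"
    seq_files = [p for p in d.iterdir() if p.is_file() and p.suffix == suffix]
    if not seq_files:
        problems.append(f"no files with extension {suffix}")
    return problems


if __name__ == "__main__":
    for problem in check_data_bank("/pr0/joe", "ng"):
        print(problem)
```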


B. ADDING NIGHTLY UPDATES TO THE DATA BANK


1. Log in as the root or system user:

	su to root (Suns)
	login as SYSTEM (VAX)

2. Make sure you are in the directory where you want the data files
containing the nightly updates to reside:

	cd /pr0/joe (Suns)
	SET DEF $DISK1:[JOE] (VAX)

3. Connect to the GenBank Online Service over the Internet using the
FTP program:

	ftp genbank.bio.net (134.172.1.160)
	login: anonymous
	password: your-last-name

4. Change directories to the directory containing the nightly update data:

	cd /pub/db/gb-newdata

5. Get the README file and determine whether the new sequences replace
the previous gbupdate file or must be appended to it.  (They will
replace it only if you have just obtained a new GenBank update from
IntelliGenetics.)  Then set the FTP file transfer mode to binary,
since the update file is compressed:

	binary

6. Use the GET command to obtain the file containing the cumulative
nightly update:

	get gbseq.all.Z

7. Once the file has been completely transferred, terminate the FTP connection:

	bye (or exit)

8. Uncompress the file containing the updates:

	zcat gbseq.all.Z > gbseq.all
	 on VAX systems, use the appropriate VMS decompression routine

9. Remove the original compressed file:

	 rm gbseq.all.Z (Sun) 
	 DEL gbseq.all.Z.* (VAX)

10. Run the TOIGSF program and convert the uncompressed file to
IntelliGenetics sequence file format:

	toigsf
		input file gbseq.all
		input format genbank nucleic acids
		run the conversion


11. TOIGSF will make one file (or more, if there are over 100 sequence
entries) with the name xxx_toigsfn.seq. Once these files have been
created, delete the original uncompressed file:

	rm gbseq.all (Sun)
	DEL gbseq.all;* (VAX)

12. If the sequences are to replace the previous update then:

	rm gbupdate.ng
	cat *.seq > gbupdate.ng (Sun)

	DEL gbupdate.ng;*
	APPEND *.seq gbupdate.ng (VAX)


13. If the sequences are to be added to the previous update then:
	
	cat *.seq >> gbupdate.ng (Sun)
	APPEND *.seq gbupdate.ng (VAX)
	
14. Remove the .seq files created by TOIGSF:

	rm *.seq (Sun)
	DEL *.seq;* (VAX)

15. Edit ng.IDB to reflect the new update date AND the appropriate
number of sequences.  If the update was replaced, use the number of
sequences reported by TOIGSF.  If the update was appended, add the
old number to the number TOIGSF reports.

16. Alert users to the new data by editing /etc/motd (Sun) or
SYLOGIN.COM (VAX) to reflect the new update.
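
Steps 12 through 15 are the part most easily gotten wrong by hand, so
here is a sketch of them as one Python function.  It is an
illustration, not an IntelliGenetics tool: merge_update is a
hypothetical name, the update file is assumed to be called gbupdate
with the data bank's extension, and new_count is the number of
sequences that TOIGSF reported.  The update date in the header of the
".IDB" file is still left for you to edit.

```python
import re
from pathlib import Path


def merge_update(directory, abbreviation, new_count, replace,
                 update_name="gbupdate"):
    """Fold TOIGSF's *.seq files into the update file and fix TOTAL."""
    d = Path(directory)
    target = d / f"{update_name}.{abbreviation}"
    seq_files = sorted(d.glob("*.seq"))

    # Steps 12 and 13: overwrite when replacing, otherwise append.
    mode = "wb" if replace else "ab"
    with open(target, mode) as out:
        for seq in seq_files:
            out.write(seq.read_bytes())

    # Step 14: remove the .seq files now that they have been merged.
    for seq in seq_files:
        seq.unlink()

    # Step 15: rewrite the TOTAL line of <ABB>.IDB.
    idb = d / f"{abbreviation}.IDB"
    text = idb.read_text()
    match = re.search(r"^TOTAL[ \t]+(\d+)[ \t]*$", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"no TOTAL line found in {idb}")
    old_total = int(match.group(1))
    total = new_count if replace else old_total + new_count
    idb.write_text(text[:match.start()] + f"TOTAL {total}" + text[match.end():])
    return total


if __name__ == "__main__":
    # Example: append an update for which TOIGSF reported 12 sequences.
    print(merge_update("/pr0/joe", "ng", 12, replace=False))
```

Note that the function deletes the .seq files, so try it on a copy of
the directory first.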

smith@mcclb0.med.nyu.edu (05/11/91)

In article <CMM.0.90.2.673772500.maulik@presto.ig.com>, maulik@PRESTO.IG.COM (Sunil Maulik) writes:
> 
> In response to questions from our users about how to utilize the daily
> database updates available on the GenBank On-line Service, I would
> like to point out that the TOIGSF program in the IG-Suite solves this
> problem.

From a probably erroneous reading of this posting it looks as though you are 
suggesting daily FTPing of the new stuff from GenBank.  The USENET 
distribution was aimed at avoiding this by allowing GenBank to send it all 
out just ONCE (via USENET), and then have people pick it up from the local 
USENET delivery.

Do you have instructions as to how to use the local files created by NEWS 
from the nightly delivery of the sequence data, thereby avoiding having to 
FTP it?

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology,  NYU Medical Center,  550 First Ave.,  NYC, 10016|
|E-Mail:  SMITH@NYUMED.BITNET (BITNET),  SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+