[bionet.molbio.bio-matrix] DCL script for EMBL's FastA mail service

ERIK@VAXH.NKI.NL ("Erik L. Sonnhammer") (07/17/90)

For VMS people who have enjoyed the shells of Steve Clark for using
the FASTA mail service of Genbank, but want to use the EMBL FASTA mail service
instead, I have modified his FAMAIL.COM to work with EMBL.
The auxiliary program SEQINFO.FOR is still required, so I appended it to
this FASTEMBL.COM script. If you're already running FAMAIL, you can use the
SEQINFO that you've already compiled.
Once you have an executable SEQINFO.EXE, all you have to do is to edit the
search_address of the extracted FASTEMBL.COM to something you mailer accepts
and then execute the script with @FASTEMBL.

Any bug-reports and questions are welcome.

What you will discover when using the Genbank and EMBL FastA machines
alternatingly, is that even with identical parameters you will receive
quite different results. I wonder if someone could explain why this is the
case and which results are 'better'.

Yours,

Erik L. Sonnhammer

NKI, Amsterdam, The Netherlands
E-Mail: erik@vaxh.nki.nl

-----------------------------------X----------------------------------------

$       ! FASTEMBL.COM
$
$       ! By Steve Clark - Modified for EMBL by Erik Sonnhammer in May 1990
$
$       ! Command procedure to send a sequence to EMBL to have a FASTA
$       ! search performed on it. This procedure asks all the relevant
$       ! questions, constructs a text file with the sequence in the right
$       ! format, and mails it to EMBL. It accepts the name of the query
$       ! sequence on the command line as P1.
$       ! Note: the symbol SEARCH_ADDRESS is the network address for the EMBL
$       ! Search service. This will have to be changed to accomodate local
$       ! gateways, etc.
$
$       On control_y then goto terminate
$       On warning then continue
$       bell[0,7] = 7
$       ws := "write sys$output"
$       iq := inquire/nopunctuation
$
$       ! The Internet address for sending the search file is
$       ! FASTA@EMBL.BITNET
$
$       search_address := """"IN%"""""FASTA@EMBL.BITNET""""""
$
$       ws ""
$       ws "This procedure initiates a FASTA search for similarity between"
$       ws "your query sequence and one of the databases maintained by EMBL."
$       ws "The information required for executing the search is sent to"
$       ws "EMBL via electronic mail and is executed by the EMBL people"
$       ws "themselves. Their databases are much more current than our local"
$       ws "ones, and their computer is very fast."
$       ws "The results of the search will be returned to you via e-mail."
$       ws ""
$
$       seqname = "''p1'"
$       if(seqname.NES."") then goto no_query
$
$get_query:
$
$       ws ""
$       iq seqname "FASTEMBL with what query sequence? "
$       if(seqname.EQS."") then goto get_query
$
$no_query:
$
$       ! See if the sequence exists
$
$       assign/user_mode nl: sys$output
$       seqinfo/infile='seqname'
$       if(seqinfotype.NES."NONE") then goto check_gcg
$       ws ""
$       ws "''bell'''seqname' doesn't exist. Please try again."
$       goto get_query
$
$check_gcg:
$
$       ! Check if the sequence is in GCG format.
$
$       if(seqinfotype.NES."NOGCG") then goto get_start
$       ws ""
$       ws "''bell'''seqname' is not a legitimate GCG sequence file!"
$       ws ""
$       ws "Select option by number -"
$       ws ""
$       ws "1) Specify another sequence"
$       ws "2) Quit"
$       ws ""
$       iq choice "Choice (* 1 *) ? "
$       if(choice.EQS."2") then exit
$       goto get_query
$
$get_start:
$
$       ws ""
$       iq begin "Begin (* 1 *) ? "
$       if (begin.EQS."") then begin := 1
$       ibegin = f$integer(begin)
$       if((ibegin.GE.1).AND.(ibegin.LT.f$integer(seqinfolength))) then -
                        goto get_end
$       ws "''bell'The start must be between 1 and ''seqinfolength'."
$       goto get_start
$
$get_end:
$
$       iq end "End (* ''seqinfolength' *) ? "
$       if (end.EQS."") then end := 'seqinfolength'
$       iend = f$integer(end)
$       if((iend.GT.ibegin).AND.(iend.LE.f$integer(seqinfolength))) then -
                        goto get_database
$       ws "''bell'The start must be between 1 and ''seqinfolength'."
$       goto get_end
$
$get_database:
$
$       ! Find out which database to search. If the sequence is DNA, the
$       ! default is the Genbank database. The default for proteins is
$       ! the SWISS-PROT database
$
$       ws ""
$       ws "Database to search:"
$       ws ""
$       if(seqinfotype.EQS."PROTEIN") then goto get_pepdatabase
$       ws "  1) new EMBL entries since latest release"
$       ws "  2) all EMBL entries (latest release + new ones)"
$       ws "  3) EMBL fungi division only "
$       ws "  4) EMBL invertebrates division only"
$       ws "  5) EMBL mammals division only"
$       ws "  6) EMBL organelles division only "
$       ws "  7) EMBL phages division only"
$       ws "  8) EMBL plants division only"
$       ws "  9) EMBL primates division only"
$       ws " 10) EMBL prokaryotes division only"
$       ws " 11) EMBL rodents division only"
$       ws " 12) EMBL synthetic sequences division only"
$       ws " 13) EMBL unannotated division only"
$       ws " 14) EMBL viruses division only"
$       ws " 15) EMBL vertebrates division only"
$       ws " 16) New GenBank entries since latest release"
$       ws " 17) all GenBank entries (latest release + new ones)"
$       ws " 18) Entries only in GenBank, not in EMBL"
$       ws " 19) GenBank and EMBL entries (latest releases)"
$       ws " 20) new EMBL and GenBank entries since latest release"
$       ws ""
$       iq choice "Please enter choice (* 2 *): "
$       if(choice.EQS."") then choice := 2
$       database := ""
$       if(choice.EQS."1") then database := EMNEW
$       if(choice.EQS."2") then database := EMALL
$       if(choice.EQS."3") then database := EFUN
$       if(choice.EQS."4") then database := EINV
$       if(choice.EQS."5") then database := EMAM
$       if(choice.EQS."6") then database := EORG
$       if(choice.EQS."7") then database := EPHG
$       if(choice.EQS."8") then database := EPLN
$       if(choice.EQS."9") then database := EPRI
$       if(choice.EQS."10") then database := EPRO
$       if(choice.EQS."11") then database := EROD
$       if(choice.EQS."12") then database := ESYN
$       if(choice.EQS."13") then database := EUNA
$       if(choice.EQS."14") then database := EVRL
$       if(choice.EQS."15") then database := EVRT
$       if(choice.EQS."16") then database := GBNEW
$       if(choice.EQS."17") then database := GBALL
$       if(choice.EQS."18") then database := GBONLY
$       if(choice.EQS."19") then database := GENEMBL
$       if(choice.EQS."20") then database := GENNEW
$       !
$       if(database.NES."") then goto get_wordsize
$       ws "''bell'Valid responses are 1 - 17, inclusive."
$       goto get_database
$
$get_pepdatabase:
$
$       ws " 1) latest release of SwissProt protein database"
$       ws " 2) latest release of NBRF/PIR protein database"
$       ws " 3) Entries only in NBRF/PIR, not in SwissProt"
$       ws ""
$       iq choice "Please enter choice (* 1 *): "
$       if(choice.EQS."") then choice := 1
$       database = ""
$       if(choice.EQS."1") then database := SW
$       if(choice.EQS."2") then database := NBRF
$       if(choice.EQS."3") then database := PIRONLY
$       if(database.NES."") then goto get_wordsize
$       ws "''bell'Valid responses are 1 - 3, inclusive."
$       goto get_database
$
$get_wordsize:
$
$       ! Find out how long the word should be
$
$       if(seqinfotype.EQS."PROTEIN") then goto get_pwordsize
$
$       ws ""
$       iq wordsize "What word size (* 6 *)? "
$       if(wordsize.EQS."") then wordsize := 6
$       if((f$integer(wordsize).GT.2).AND.(f$integer(wordsize).LT.7)) -
                        then goto get_nscores
$       ws "''bell'Word size must be in the range from 3 to 6."
$       goto get_wordsize
$
$get_pwordsize:
$
$       ws ""
$       iq wordsize "What word size (* 2 *)? "
$       if(wordsize.EQS."") then wordsize := 2
$       if((f$integer(wordsize).GT.0).AND.(f$integer(wordsize).LT.3)) -
                        then goto get_nscores
$       ws "''bell'Word size must be in the range from 1 to 2."
$       goto get_pwordsize
$
$get_nscores:
$
$       ws ""
$       iq nscores "List how many best scores (* 50 *)? "
$       if(nscores.EQS."") then nscores := 50
$       if(f$integer(nscores).GE.10) then goto get_nalign
$       ws "''bell'At least 10 scores should be listed."
$       goto get_nscores
$
$get_nalign:
$
$       ws ""
$       iq nalign "Align how many matches (* 10 *)? "
$       if(nalign.EQS."") then nalign := 10
$       if(nalign.GT.4) then goto summarize
$       ws "''bell'At least 5 alignments should be shown."
$       goto get_nalign
$
$summarize:
$
$       ws ""
$       ws ""
$       ws "The following FASTA search will be executed:"
$       ws ""
$       ws "Query sequence: ''seqname' from ", -
                        "''begin' to ''end' (''seqinfotype')"
$       ws "Database to be searched: ''database'"
$       ws "Word size: ''wordsize'"
$       ws "Scores to list: ''nscores'"
$       ws "Alignments to show: ''nalign'"
$       ws ""
$       iq choice "Are these parameters correct (* Yes *)? "
$       choice = f$extract(0, 1, choice)
$       if(choice.EQS."") then goto do_it
$       if(choice.EQS."Y") then goto do_it
$
$       ! Something is wrong. Give the chance to correct it, or give up.
$
$ask_repeat:
$
$       ws ""
$       ws "Do you want to"
$       ws ""
$       ws "1) Try again"
$       ws "2) Give up"
$       ws ""
$       iq choice "Please enter the number of your choice (* 1 *): "
$       if(choice.eqs."") then goto get_query
$       if(choice.eqs."1") then goto get_query
$       if(choice.eqs."2") then exit
$       ws "''bell'Wasn't the question simple enough for you?"
$       goto ask_repeat
$
$do_it:
$
$       ! Determine the root name of the sequence for comparison. This in used
$       ! in specifying the output file name, which has the extension .FAM.
$
$       ! Check if the sequence is a file. If not, assume it is a database
$       ! entry and get the locus name to use as a root.
$
$       seqroot = seqname
$       root = f$parse(seqroot,,,"NAME") ! root name of sequence
$       if(root.NES."") then goto set_outname
$
$       ! Remove database name
$
$       pos = 'f$locate(":", seqroot)'
$       len = 'f$length(seqroot)'
$       if(pos.NE.len) then root = f$extract(pos+1, len-pos, seqroot)
$
$ set_outname:
$
$       outname := "''root'.FAM"
$
$       ! Convert the sequence to Staden format. Since ToStaden does
$       ! not allow command line input, need to construct a little command
$       ! procedure to substitute in the appropriate sequence names.
$
$       ws ""
$       ws "Converting ''seqname' to Staden format..."
$       ws ""
$
$       ! ToStaden only accepts UpperCase - convert
$
$       OneCase /In='seqname' /ext=.seq /Menu=U
$       seqname = "''root'.seq"
$
$       open/write comfile fam.cmd
$       wc := "write comfile"
$
$       wc "$ set noon"
$       wc "$ set verify"
$       wc "$ ToStaden"
$       wc "''seqname'"
$       wc "''begin'"
$       wc "''end'"
$       wc "''root'.SDN"        ! Output file name
$       wc "$ set noverify"
$       close comfile
$
$       @fam.cmd
$       del fam.cmd.
$       del 'seqname'.
$
$       ws "Creating the file to be mailed to EMBL..."
$       ws ""
$
$       Create 'outname'
$       open/append comfile 'outname'
$       wc "TITLE ''seqname' FASTEMBL SEARCH"
$       wc "LIB ''database'"
$       wc "WORD ''wordsize'"
$       wc "LIST ''nscores'"
$       wc "ALIGN ''nalign'"
$       wc "SEQ"
$       wc ""
$       close comfile
$
$       Append 'root'.SDN 'outname'
$       delete 'root'.SDN.
$
$       ws "Mailing the file to EMBL..."
$       ws ""
$
$       mail 'outname' 'search_address'
$
$       ws "The file ''outname' has been sent to EMBL."
$       ws "The results will be mailed back to you shortly."
$       ws ""
$
$       delete 'outname'.
$
$ terminate:  ! Jump here on ^y. Don't do any cleanup
$
$       exit

----------------------------------X--------------------------------------------

!***  SEQINFO  ***********************************************************
!*
!* This program reads the sequence specified on the command line with the
!* qualifier /INfile=... and sets two DCL symbols: SEQINFOTYPE ("PROTEIN",
!* "DNA", "NONE" (if the sequence doesn't exist) or "NOGCG" (if the sequence
!* is not in the GCG format)), and SEQINFOLENGTH (the length of the sequence).
!* It is intended to be executed from a command procedure, so the
!* command procedure can get this information about any specified sequence.
!*
!* To be used with the command procedure shells, the symbol
!* SEQINFO :== $LOGICAL:SEQINFO
!* has to be set, where LOGICAL is of course replaced with the logical name
!* of the appropriate device and directory.
!*
!* Written by Steve Clark November 15, 1989
!*
!*************************************************************************

        program seqinfo

        implicit none

        character seqfname(256), strand(100001), lengthstr(64)

        integer infile, checksum, length, inttostr, istatus

        logical isprotein, readsq, clgetoldfname, dclsetsymbol
        logical openf, logstatus

! Check for the existance of the sequence file name on the command line

        if( .not. clgetoldfname('INfile', 1, seqfname)) then
                logstatus = dclsetsymbol('seqinfotype', 'NONE')
                stop ' '
        endif

! Open the file.

        if( .not. openf(infile, seqfname, 'rdb')) then
                logstatus = dclsetsymbol('seqinfotype', 'NONE')
                stop ' '
        endif

! Read the sequence to determine its length.

        if( .not. readsq(infile, strand, length, checksum)) then
                logstatus = dclsetsymbol('seqinfotype', 'NOGCG')
                stop ' '
        endif
        istatus = inttostr(length, lengthstr)
        logstatus = dclsetsymbol('seqinfolength', lengthstr)

! Find out if it is protein or DNA

        if(isprotein(strand)) then
                logstatus = dclsetsymbol('seqinfotype', 'PROTEIN')
        else
                logstatus = dclsetsymbol('seqinfotype', 'DNA')
        endif

        stop ' '
        end