[bionet.molbio.genbank] Software for automated subsequence extraction

eesnyder@boulder.Colorado.EDU (Eric E. Snyder) (04/28/91)

I am looking for some software that will allow me to extract subsequences
from genbank or PIR.

For example, I would like to be able to provide a keyword such as 'splice
site' and have the program search genbank and return with a list of sequence
names and the subsequence from each entry corresponding to my keyword.

Any leads would be appreciated....
Thanks,
---------------------------------------------------------------------------
TTGATTGCTAAACACTGGGCGGCGAATCAGGGTTGGGATCTGAACAAAGACGGTCAGATTCAGTTCGTACTGCTG
Eric E. Snyder                            
Department of MCD Biology              ...making feet for childrens' shoes.
University of Colorado, Boulder   
Boulder, Colorado 80309-0347
LeuIleAlaLysHisTrpAlaAlaAsnGlnGlyTrpAspLeuAsnLysAspGlyGlnIleGlnPheValLeuLeu
---------------------------------------------------------------------------

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (04/30/91)

> I am looking for some software that will allow me to extract subsequences
> from genbank or PIR.
> 
> For example, I would like to be able to provide a keyword such as 'splice
> site' and have the program search genbank and return with a list of sequence
> names and the subsequence from each entry corresponding to my keyword.

The most expeditious way of doing this is through the free GenBank IRX
account.  Instructions are included in the information below.

				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

----------------------------------------------------------------------

The GenBank On-Line Service

The GenBank On-Line Service (GOS) provides access to the most recent
quarterly releases of the GenBank and EMBL nucleic acid sequence databases, 
as well as the data added to each of these since their most recent releases 
(in the New Data databases).  In addition, the Swiss-Prot protein sequence 
database and GenPept, a database of peptide sequences derived by the 
automatic translation of annotated coding regions of entries in the 
GenBank databases, are available.  Users can query the databases by 
annotation keywords, search for sequence similarity, and retrieve entries 
of interest.  The GOS is available through e-mail servers, anonymous FTP, 
anonymous interactive login, and login to established, password-protected, 
individual accounts.  Access to all GOS services is available to both 
commercial and non-commercial users at the same cost.  On-line help is 
available for all aspects of this Service.  User manuals, information
on costs, and application forms may be requested from GenBank at
GENBANK@GENBANK.BIO.NET.

INTERACTIVE ACCESS

Interactive access to the GOS databases is provided through the SprintNet
public data network and via remote login over the Internet.  At present,
the IRX (Information Retrieval Experimental Workbench) program is the
primary interactive database retrieval program.  Three usage classes are
available for the GOS; these classes are described below.

Class 0 Accounts

Anonymous users of the interactive system are provided with 20 minute
sessions using the IRX retrieval program.  With this program, entries
in any of the on-line databases can be located by searching for a
keyword or combination of keywords appearing in any of the fields of
the entries' annotations.  Located entries can be displayed on the
terminal or downloaded to the user's computer with the Kermit
file-transfer program.  (The Kermit program is available for a wide
variety of computers from numerous software bulletin boards, user
groups, and from Columbia University.  MS-DOS and Macintosh versions
are available from GenBank on request.)  New users of the IRX program
should read the on-line introduction which can be displayed by
answering 'Y' to the first question the program asks ("Do you want
help?").

To use the GOS Class 0 account, one must have a supported terminal or
a computer with software for emulating one of those terminals (see the
list in the Example at the end of this message) and a modem capable of
communicating at 300, 1200, 2400, or 9600 baud.  Instructions for
dialing to access the GenBank computer are shown in the example below.
After completing the login procedure shown in the example, the IRX
database query program is immediately started.

Class 1 Accounts

To gain access to additional services, users of the GOS may wish to
establish accounts on the GOS computer.  These accounts provide access
to the GOS computer, 1 Mbyte of disk space for user files, access to
IRX, the GenBank relational database management system, and
interactive and batch mode use of FASTA and TFASTA (a version of FASTA
that compares a peptide sequence with a nucleic acid sequence database
by translating the database sequences in up to six reading frames "on
the fly").  Class 1 accounts also provide electronic mail access for
contacting other users of the GOS and users of computers connected to
the Internet and other computer networks. Access to a wide variety of
electronic bulletin boards is also provided.  Newsgroups that may be
of special interest are the bionet.journals.contents newsgroup which
provides on-line versions of the tables of contents of several
important journals before publication and bionet.sci-resources which
provides on-line copies of the NIH Guides to Grants and Contracts.
Several other newsgroups are available for exchange of information on
experimental protocols and other areas of scientific interest.
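
As a rough illustration of the "six reading frames" mentioned for TFASTA
above, the following Python sketch lists the frames of a nucleotide
sequence: three offsets on the given strand and three on its reverse
complement.  The function names are hypothetical and this is not TFASTA's
own code; the real program also translates each frame to protein before
comparing it with the query.

```python
# Hypothetical helper names; this only enumerates the six reading frames.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Yield (frame_label, codons) for the six reading frames of seq."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Complete codons only; trailing bases that do not fill a codon are dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield f"{strand}{offset + 1}", codons

for label, codons in six_frames("TTGATTGCTAAACAC"):
    print(label, " ".join(codons))
```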

Class 2 Accounts

For an additional fee, Class 2 users are provided with access to the
IntelliGenetics Suite of sequence analysis programs and databases
formatted for those programs.  Additional databases (e.g., the PIR Protein
Sequence Database, KeyBank(TM), and VectorBank(TM)) are also available to
Class 2 users.  Class 2 users also have access to all the facilities
available to Class 1 users.

E-MAIL SERVERS

In addition to providing interactive access, GenBank currently offers two
electronic mail servers, one for sequence similarity searching and one for
database entry retrieval.  These are freely available to anyone who can
send mail to an Internet address.  The following networks have gateways to
the Internet: BITNET, EARN, NETNORTH and JANET.  Users of computers on
these networks may need to change the format of the addresses given below
to send the message through a forwarding gateway.  Users should consult
their computer system managers or administrators to determine the proper
forwarding gateway and address form.  Questions regarding the use of the
e-mail servers (or other aspects of the GOS) may be addressed to:
CONSULTANT@GENBANK.BIO.NET.

FASTA Server

The GenBank FASTA Server receives mail messages containing a nucleic acid
or protein query sequence with instructions for the search. The server
then performs a FASTA sequence similarity search against the specified
database, and returns the results by electronic mail.

To use the FASTA Server, send an electronic mail message containing the
formatted query sequence to the following Internet address:
SEARCH@GENBANK.BIO.NET.  To receive instructions for formatting the query
sequence, send a mail message to this address containing the word "Help"
as the only line of the message.

Entry Server

E-mail access to sequence database entries is provided for three reasons:
1) to enable users of the FASTA Server to retrieve entries identified by
sequence similarity searches; 2) to enable users of the Class 0
interactive system described above, who access it by network remote login
(e.g., telnet) to retrieve copies of entries of interest; and 3) to enable
readers of journals that identify published sequences by accession number
to retrieve computer-readable versions of those sequences.  To retrieve a
database entry, send a mail message containing only the entry name or the
accession number (not both) to the address: RETRIEVE@GENBANK.BIO.NET.  The
on-line databases are searched and the entry (if any) which corresponds to
the supplied entry name or accession number will be returned by electronic
mail.  To receive instructions on using the Entry Server, send a mail
message to the RETRIEVE address (above) containing the word "Help" as the
only line of the message.  Because of the order in which the databases are
searched, if both GenBank and EMBL data banks contain entries with the
same primary accession number (the usual case), a query on the accession
number will result in the GenBank version of the entry being returned.  If
the EMBL-format version of the entry is required, it can be retrieved from
the EMBL file server at NETSERV@EMBL.BITNET.
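
A request to the Entry Server, as described above, is simply a mail message
whose body holds one entry name or one accession number.  The Python sketch
below only builds such a message; the sender address and the accession
number are placeholders chosen for illustration, and the function name is
hypothetical.  Sending the message is left to whatever mail system is at
hand.

```python
from email.message import EmailMessage

RETRIEVE_ADDRESS = "RETRIEVE@GENBANK.BIO.NET"  # address given in the text

def build_retrieve_request(sender, identifier):
    """Build a message whose body is a single entry name or accession number."""
    identifier = identifier.strip()
    if len(identifier.split()) != 1:
        raise ValueError("send exactly one entry name or accession number")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = RETRIEVE_ADDRESS
    msg.set_content(identifier)
    return msg

# Placeholder sender and accession number, for illustration only.
request = build_retrieve_request("user@example.edu", "J01636")
print(request)
```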

ANONYMOUS FTP

In addition to interactive access and electronic mail servers, GenBank
also provides files for anonymous FTP (File Transfer Protocol),
including GenBank and EMBL new data and contributed software.  Each week
the new entries created in the GenBank database are collected into an
update file.  The file has a name in the form of gbMMDD.seq, where MM is the
number of the month and DD is the date of file creation.  Likewise, new
EMBL entries are collected into files with names in the form of emMMDD.seq.
The weekly update files are kept in the new data directories until they
are superseded by a new quarterly release of the database.
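
The naming scheme for the weekly update files can be sketched in a few
lines of Python.  This assumes that MM and DD are zero-padded to two
digits, which the text implies but does not state; the function name is
hypothetical.

```python
from datetime import date

def update_file_name(created, prefix="gb"):
    """Return the weekly update file name for a creation date.

    Assumes two-digit, zero-padded month and day (gbMMDD.seq / emMMDD.seq).
    """
    return f"{prefix}{created.month:02d}{created.day:02d}.seq"

print(update_file_name(date(1991, 4, 30)))        # gb0430.seq
print(update_file_name(date(1991, 4, 30), "em"))  # em0430.seq
```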

To access any of the files available for anonymous FTP, one should use
the FTP protocol to connect to GENBANK.BIO.NET [134.172.1.160], using
"anonymous" as the Username and one's surname as the Password.

===============================================
Example.   Login to the free GOS IRX account

ATDT14159616860			Use ATDP for pulse dialing phone.
  
CONNECT 2400			Connect to 2400 baud modem
Trying GENBANK.BIO.NET (134.172.1.160)...Open


SunOS/BSD UNIX (genbank.bio.net) 
login:genbank		Typing 'genbank' allows you to access the GenBank
			computer.
Password:4nigms		This is the password for the GenBank computer; it
			MUST be entered in lowercase characters.  
Last login...		This message includes a date showing the last 
			anonymous login, as well as other system
			information.  
SunOS Release 4.0.3 (GENBANK)

The following is a list of commonly used terminals 
Designation 	Terminal Type 
adm3a 		Lear Siegler (ADM) 
aaa-48 		Ann-Arbor Ambassador in 48 line mode
aaa-60 		Ann-Arbor Ambassador in 60 line mode 
dm3025 		Datamedia 3025a 
h19		Heath H19 or Zenith 
hp2621 		Hewlett Packard HP2621 
hp2648-iv 	Hewlett Packard HP2648A 
sun 		Sun Microsystems Workstation console 
tvi912 		Televideo 912, 920 
tvi950 		Televideo 950 
vi200 		Visual 200 
vt100 		Digital Equipment VT100 (default) 
vt102 		Digital Equipment VT102 
vt200 		Digital Equipment VT200 
Press Return to select vt100, or enter the appropriate terminal

(type the designation of the appropriate terminal type followed by <CR>)

After completing the login procedure shown above, the IRX sequence entry
searching program is immediately started.

===============================================

Further information about the GenBank On-line Service may be obtained by
contacting GenBank at:

      GenBank
      c/o IntelliGenetics Inc.
      700 East El Camino Real
      Mt. View, CA  94040
      (415) 962-7364
      genbank@genbank.bio.net

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (04/30/91)

In article <eesnyder.672776972@beagle> eesnyder@boulder.Colorado.EDU (Eric E. Snyder)
writes:
>I am looking for some software that will allow me to extract subsequences
>from genbank or PIR.

The Delila system, old and senile as it is, was designed to extract large
sets of subsequences (DNA only).

>For example, I would like to be able to provide a keyword such as 'splice
>site' and have the program search genbank and return with a list of sequence
>names and the subsequence from each entry corresponding to my keyword.

Because Delila was designed before GenBank, and GenBank structure is STILL not
up to snuff, one must convert from GenBank to Delila format.  This is a simple
program called dbbk (written by Matt Yarus, son of Mike Yarus, you may be
interested to know!).  The Delila viewpoint is that the database consists of a
set of organisms and their chromosomes.  You must specify these, and then the
piece of DNA you are interested in.  The piece corresponds roughly to a GenBank
entry.  The idea is that Delila is a 'librarian' and you give 'her'
instructions that define the fragments you want.  She reaches into the library
and pulls out -- what else? -- a book.  Instructions might look like:

    title 'Demonstration of Delila instructions';
    (* the title is required to name the resulting book *)
    (* this is a comment, just as in the computer language Pascal *)

    organism H.sapiens; (* define the organism *)
    chromosome 3; (* I made this name up; unfortunately GenBank hasn't
                     stored this information consistently *)
    piece x253; (* I made this name up also *)

    get from 536 -24 to 536 +30;

The last instruction, 'get' says to Delila that you want the fragment
that starts 24 bases before coordinate 536 and ends 30 bases after.
By having the instructions written in a file, one can handle many of them.
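
The arithmetic behind the 'get' instruction above can be checked with a
small Python sketch.  It handles only the one form shown in the example
(a coordinate with an optional signed offset on each side) and is not the
Delila parser; the names are hypothetical.

```python
import re

# Handles only the form "get from A -x to B +y;" used in the example above.
GET_PATTERN = re.compile(
    r"get\s+from\s+(\d+)\s*([+-]\s*\d+)?\s+to\s+(\d+)\s*([+-]\s*\d+)?\s*;"
)

def get_range(instruction):
    """Return (first, last) coordinates named by a simple 'get' instruction."""
    match = GET_PATTERN.fullmatch(instruction.strip())
    if match is None:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    base1, off1, base2, off2 = match.groups()
    first = int(base1) + int((off1 or "0").replace(" ", ""))
    last = int(base2) + int((off2 or "0").replace(" ", ""))
    return first, last

first, last = get_range("get from 536 -24 to 536 +30;")
print(first, last, last - first + 1)   # 512 566 55
```

So the example fragment runs from coordinate 512 to 566 and is 55 bases long.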

There is now a program that automatically creates Delila instructions from the
GenBank features.  This has allowed us to create hundreds to thousands of
fragments for statistical analysis.
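
The program itself is not shown here, but the idea of turning feature
coordinates into instructions can be sketched in Python as follows.  The
function name, the default flanking distances and the coordinates are all
made up for illustration; only the form of the 'get' line follows the
example above.

```python
def make_get_instructions(positions, before=24, after=30):
    """Turn feature coordinates into Delila-style 'get' instructions."""
    return [f"get from {p} -{before} to {p} +{after};" for p in positions]

# Hypothetical splice-junction coordinates, not taken from any real entry.
for line in make_get_instructions([536, 1210, 1875]):
    print(line)
```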

Parts of the Delila system are available by anonymous ftp from
ncifcrf.gov in pub/delila.  See the README files.  I will place more
programs in the archive if you request them.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

roy@phri.nyu.edu (Roy Smith) (05/01/91)

toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
> The idea is that Delila is a 'librarian' and you give 'her' instructions
> that define the fragments you want.  She reaches into the library and
> pulls out -- what else? -- a book.

Her?  Why is a librarian automatically assumed to be female?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

donnel@helix.nih.gov (Donald A. Lehn) (05/01/91)

In article <1991May1.114219.25483@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
->toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
->> The idea is that Delila is a 'librarian' and you give 'her' instructions
->> that define the fragments you want.  She reaches into the library and
->> pulls out -- what else? -- a book.
->
->Her?  Why is a librarian automatically assumed to be female?
->--

	It's probably wrong to make such an assumption.  However, being a
frequent user of libraries and noticing how neat and ordered they tend to
be,  I find it difficult to imagine how a "him" could be responsible.  If
you don't grasp what I'm talking about, take a look at any little boy's
bedroom and compare it to his sister's.  :)

Don

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (05/01/91)

In article <CMM.0.88.672964473.kristoff@genbank.bio.net> kristoff@GENBANK.BIO.NET
(Dave Kristofferson) writes:
>> I am looking for some software that will allow me to extract subsequences
>> from genbank or PIR.
>> For example, I would like to be able to provide a keyword such as 'splice
>> site' and have the program search genbank and return with a list of sequence
>> names and the subsequence from each entry corresponding to my keyword.

>The most expeditious way of doing this is through the free GenBank IRX
>account.  Instructions are included in the information below.

This person wanted PARTS of genbank entries, not whole entries!

Since GenBank entries do not carry a coordinate system with them, it is
not possible to extract subsequences without losing the location of the
sequences.  One must add a new feature to the entries: a coordinate system.
Do you understand the situation Dave?  Genbank does not serve the needs
of this user.  He needs software that can manipulate portions of entries.

>				Dave Kristofferson
>				GenBank Manager
>				kristoff@genbank.bio.net
  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

kristoff@genbank.bio.net (David Kristofferson) (05/02/91)

Whoops, you definitely caught me on that one, Tom (blush)!!  I also
must admit to not understanding your comment about the lack of a
coordinate system.  For example, coding sequences are clearly
annotated in the features table and one can extract these subsequences
from an entry while also carrying along the annotations which refer to
their position in the original sequence.  What do you mean by the lack
of a coordinate system???  As to writing software for the user's
request, GenBank is not in a position to produce analysis software
(except for the production of the GenPept database).
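
As a minimal Python sketch of the extraction described above, assume the
features table has already been read into a list of (key, start, end)
tuples with 1-based, inclusive positions.  That representation, the
function name and the data are assumptions made for illustration; parsing
a real entry is not shown.

```python
def extract_features(sequence, features, key):
    """Return (key, start, end, subsequence) for each feature of the given key.

    Positions are 1-based and inclusive, and are kept in the output so the
    location in the original entry is carried along with the subsequence.
    """
    return [(k, s, e, sequence[s - 1:e]) for k, s, e in features if k == key]

# Made-up entry and feature table, for illustration only.
sequence = "GGATGAAACCCTAAGG"
features = [("CDS", 3, 14), ("misc_feature", 1, 2)]
print(extract_features(sequence, features, "CDS"))
```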

Dave

overt@antony (Christian Overton) (05/02/91)

Dave,

One problem with using GenBank IRX in its current form is that
retrieval of information can only be done via Kermit.  This means, in
particular, that data cannot be retrieved over the Internet and that
instead, one has to dial up the GenBank server directly (too expensive
from East Coast) or use SprintNet (never tried this before).  It would
be nice if disk space was set aside for temporary files (like files
that can exist for no more than an hour) that could be created and
ftp'd by anonymous users.

Chris
-- 
+-------------------------------------------------------------------------------+
| G. Christian Overton                        || Telephone: (215) 648-2420      |
| Center for Advanced Information Technology  || Internet: overt@prc.unisys.com |
| Unisys                                      || FAX: (215) 648-2288            |

kristoff@genbank.bio.net (David Kristofferson) (05/02/91)

> One problem with using GenBank IRX in its current form is that
> retrieval of information can only be done via Kermit.  This means, in
> particular, that data cannot be retrieved over the Internet and that
> instead, one has to dial up the GenBank server directly (too expensive
> from East Coast) or use SprintNet (never tried this before).  It would
> be nice if disk space was set aside for temporary files (like files
> that can exist for no more than an hour) that could be created and
> ftp'd by anonymous users.

Chris,

	We'll consider this proposal at our next GOS meeting.

Dave

P.S. - Your mail header overt@antony needs fixing.  I can't reply to
your mail directly.

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (05/03/91)

In article <1991May1.114219.25483@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>> The idea is that Delila is a 'librarian' and you give 'her' instructions
>> that define the fragments you want.  She reaches into the library and
>> pulls out -- what else? -- a book.

>Her?  Why is a librarian automatically assumed to be female?

Because 'she' is named Delila (DEoxyribonucleic acid LIbrary LAnguage), which
is similar to the famous Delilah, who cut Samson's hair.  Like Delilah,
Delila cuts DNA.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

roy@phri.nyu.edu (Roy Smith) (05/04/91)

overt@antony (Christian Overton) writes:
> One problem with using GenBank IRX in its current form is that retrieval
> of information can only be done via Kermit.  This means, in particular,
> that data cannot be retrieved over the Internet

	IRX isn't restricted to kermit; you can save to a disk file instead,
but selecting "save to file" instead of "download with kermit" is one of the
few serious mis-features in the IRX user interface (IMHO).  You have to do
something like hit the space bar to cycle around the choice; yes, the
directions to do this are right there on the screen, but 1) who reads
directions, 2) it's counter-intuitive, and 3) the directions are confusing.
It took us a few times to figure out how to do it.

	Anyway, once you've got the entry saved as a disk file, it's easy to
ftp it back to your machine.  I do this on a regular basis.  If I was
maintaining IRX, I'd make Save To Disk the default.  Really neato would be
"Save to remote host via ftp", but if building a whole ftp client into a
GenBank retrieval program isn't feeping creatureism, I don't know what is.

/roy

--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

roy@phri.nyu.edu (Roy Smith) (05/05/91)

overt@antony (Christian Overton) writes:
> One problem with using GenBank IRX in its current form is that 
> [...] data cannot be retrieved over the Internet

To which I replied:
> IRX isn't restricted to kermit; you can save to a disk file instead,

	It would appear that I spoke without fully comprehending the
situation.  Apparently, there are different types of accounts; the account
I use allows full access to the Unix shell, so I can save disk files and
manipulate them later.  Apparently there are other accounts in which you
can not do this.  Please excuse the confusion I may have caused.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (05/07/91)

In article <May.1.10.08.02.1991.16403@genbank.bio.net> kristoff@genbank.bio.net
(David Kristofferson) writes:
>I
...
>must admit to not understanding your comment about the lack of a
>coordinate system.  For example, coding sequences are clearly
>annotated in the features table and one can extract these subsequences
>from an entry while also carrying along the annotations which refer to
>their position in the original sequence.  What do you mean by the lack
>of a coordinate system???

There are many ways to use a genetic sequence database.  Most people are
interested in a single sequence, and for this the current methods work
reasonably well.  However, more and more people are interested in studying
collections of sequences.  For example, we have a huge collection of splice
junctions.  To analyze these statistically, we would like to extract only a
minimum region around the junctions.  If we were to do this by hand, then we
would be likely to make errors, and the process would be very tedious.  To
avoid errors, we create a set of instructions that define the regions we want
to study.  We used the feature table to make the instructions.  But what should
the output of such an extraction look like?

Many years ago, Jeff Haemer and I realized that the best form for the output
extraction should be identical to the input!  Thus if I want bases 57 through
89 of a particular GenBank entry, the most useful output would look like a
GenBank entry, but would only contain bases 57 through 89.

The power of this is that it allows one to use the same search or other
analysis program on GenBank as one uses on a subset.  Using written sets of
instructions (instead of interactive input), one can automatically create
sub-databases and sub-sub databases.  The subsets would be equivalent to the
main database.  For example, we created a subset of E. coli sequences that were
the transcribed RNA.  Further extractions of ribosome binding sites to create a
sub-sub-database, were therefore guaranteed to give us sequences that were
always RNA.  These were the initial steps toward creating a database which we
used to train the Perceptron (a neural net) to locate ribosome binding sites
(Stormo, NAR 10: 2997, 1982).

In the present GenBank scheme, this means that the numbering of the extracted
fragment 57 through 89 would implicitly become 1 through 89-57+1 = 33.  If we
made a nice printed listing of open reading frames of the original entry, then
we would have to keep doing subtractions to find things in our sub-sequence.
If you have ever had to do this, you know how painful it is.

So the idea came up that the extracted entry should carry a coordinate system.
This is a set of numbers that defines the original number of each base in the
extracted sequence.
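
The difference between implicit renumbering and a carried coordinate
system can be sketched in Python.  This is only an illustration of the
idea, not the Delila implementation; the class, its fields and the 120-base
entry are made up.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """A piece of sequence plus the original coordinate of its first base."""
    bases: str
    first: int = 1  # coordinate of bases[0] in the parent numbering

    @property
    def last(self):
        return self.first + len(self.bases) - 1

    def extract(self, start, end):
        """Return bases start..end (inclusive, original coordinates)."""
        if start < self.first or end > self.last or start > end:
            raise ValueError("requested range lies outside this fragment")
        lo = start - self.first
        return Fragment(self.bases[lo:lo + (end - start + 1)], first=start)

    def base_at(self, coordinate):
        """Look up a base by its original coordinate."""
        if not self.first <= coordinate <= self.last:
            raise IndexError(coordinate)
        return self.bases[coordinate - self.first]

entry = Fragment("ACGT" * 30)           # made-up 120-base entry, numbered 1..120
piece = entry.extract(57, 89)
print(len(piece.bases), piece.first, piece.last)   # 33 57 89
print(piece.base_at(60) == entry.base_at(60))      # True
sub = piece.extract(60, 70)             # a sub-sub-fragment keeps the numbering
print(sub.first, sub.last)                         # 60 70
```

Because the fragment has the same form as the entry it came from, the same
extract call works on both, and no subtraction is needed to find base 60.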

But if the extracted entries have coordinate systems, then so too should the
main library, in keeping with the principle of equivalence between database
and sub-databases.

To implement such a scheme today, we would have to add a coordinate system to
the extracted GenBank entries.  This is equivalent to carrying along the
annotations, but makes it more explicit.  A true coordinate system does not
depend on any 'features'.  With today's GenBank, we would also have to have
each analysis program check for a coordinate system, and if it is not found,
assume that the numbering is 1 to n.  This is possible, but is obviously a
messy design, forced by the lack of an explicitly defined coordinate system in
the main database.
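
The check described above might look like this in Python.  The dictionary
layout and the field name "first_coordinate" are hypothetical; the point is
only that every analysis program would need the same fallback to 1..n.

```python
def coordinates(entry):
    """Return the coordinate of every base; assume 1..n when none is stored."""
    n = len(entry["sequence"])
    first = entry.get("first_coordinate", 1)   # hypothetical field name
    return range(first, first + n)

whole = {"sequence": "ACGTACGTAC"}                      # no coordinate system
piece = {"sequence": "GTAC", "first_coordinate": 57}    # extracted fragment
print(list(coordinates(whole)))   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(list(coordinates(piece)))   # [57, 58, 59, 60]
```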

You might ask: why not simply implement this program check and be done with
it?  Well, if nothing else, having a coordinate system would allow GenBank to
extend an old sequence before base 1 and not modify any other coordinates.
(There is nothing wrong with having a zero coordinate.)
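
A short Python sketch of that extension, using a made-up six-base entry and
a hypothetical function name: new bases are added in front, the first
coordinate moves below 1, and every existing base keeps its number.

```python
def extend_upstream(bases, first, new_bases):
    """Prepend new_bases; the coordinates of the existing bases do not change."""
    return new_bases + bases, first - len(new_bases)

bases, first = "ATGACC", 1                       # made-up entry numbered 1..6
bases, first = extend_upstream(bases, first, "GGAGG")
print(first)                # -4: the entry now runs from -4 through 0 to 6
print(bases[1 - first])     # A: the base at coordinate 1 is unchanged
```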

These ideas were implemented in the Delila system before GenBank came
into existence (NAR 10:3013, 1982; 12:129, 1984).

I don't expect GenBank to write software, since that goal of GenBank was
dropped for political/funding (?) reasons many years ago.  However, GenBank
should be creating a database which is useable for many purposes.  The ability
to automatically create specialized databases is becoming more and more
important.  Unfortunately it often means the creation of a completely new
database, rather than one extracted from the original database.

The trouble with absolute coordinate systems is that if two GenBank entries
fuse together, the numbering of at least one sequence must change.  Any
instructions become out of date.  The way to avoid this is to have landmarks on
the sequence which do not change.  For this reason I urged that every feature
in GenBank have a name.  I see that at least the latest entry I extracted does
have a name, but I don't know if this is true of all features (I suspect it
isn't).  If each feature had a unique name, then the instructions for
extracting fragments would remain the same.

For example, I could say:

organism 'E. coli';  chromosome 'main';
gene lacZ;
get from gene beginning -20 to gene beginning + 10;

This is pseudo-delila code since the names don't exist and the use of quote
marks is not implemented yet.  However, with the right database, these
instructions would last forever since the names E. coli and lacZ are universal
and not likely to change.  The best names to use are the currently accepted
genetic names (since they are the most stable), but provision must be made for
using alternative names.

The fragment defined by these instructions would, of course, have whatever
numbering (coordinate system) the current database allowed, so that one could
compare the results from several different analyses.
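
To make the point about landmarks concrete, here is a Python sketch in
which the same named instruction is resolved against two versions of a
database.  The coordinates are made up, the second version standing for an
entry that has been fused with 1000 upstream bases, and the function name
is hypothetical.

```python
def resolve(landmarks, gene, before=20, after=10):
    """Return (first, last) for 'gene beginning -before to gene beginning +after'."""
    start = landmarks[gene]
    return start - before, start + after

# Made-up coordinates for the beginning of lacZ in two database versions.
version_a = {"lacZ": 1284}
version_b = {"lacZ": 2284}
print(resolve(version_a, "lacZ"))   # (1264, 1294)
print(resolve(version_b, "lacZ"))   # (2264, 2294)
```

The instruction text is the same in both cases; only the lookup table
supplied by the database changes.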

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

jlong@uhunix1.uhcc.Hawaii.Edu (John Long) (05/08/91)

In article <1991May1.114219.25483@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:

>Her?  Why is a librarian automatically assumed to be female?


With a name like 'Delila' I think it's safe to assume that he/she/ye/it is a 
female. Maybe the creator named it after herself. Call it artistic license.
BFD.

Besides, doesn't it just make sense that software would be female and hardware
be male?

Aloha,
-LongJohn

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (05/10/91)

In article <12911@uhccux.uhcc.Hawaii.Edu> jlong@uhunix1.uhcc.Hawaii.Edu (John Long) writes:
>In article <1991May1.114219.25483@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>>toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>>Her?  Why is a librarian automatically assumed to be female?
>With a name like 'Delila' I think it's safe to assume that he/she/ye/it is a 
>female. Maybe the creator named it after herself. Call it artistic license.
>BFD.

>Besides, doesn't it just make sense that software would be female and hardware
>be male?

I was designing a computer language with which one can extract portions
of a DNA sequence.  I needed a name, and one morning woke up and wrote down:
  DEoxyribonucleic acid
    LIbrary
      LAnguage
  DELILA
hence the name.  See 

@article{Schneider1982,
author = "T. D. Schneider
 and G. D. Stormo
 and J. S. Haemer
 and L. Gold",
title = "A design for computer nucleic-acid sequence storage, retrieval and
manipulation",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "3013-3024",
year = "1982"}

"She"'s available by anonymous ftp from ncifcrf.gov in pub/delila.

>Aloha,
>-LongJohn

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov