[sci.bio] What would you like to see in DNA sequencing tools

papowell@umn-cs.cs.umn.edu (06/20/88)

I have been looking at the problem of building a general purpose
set of tools for doing some of the computation associated with
the DNA/RNA sequencing work.   One of the problems that I have is getting
a wider perspective on the TYPES of tools that users want.

1.  User interfaces:
	Graphical?  Iconic?  Simple line oriented?  Menu driven?
2.  Database Searching:
	What level of presentation?  Single entry?  Multiple entries (these two
	are the most common)?
	Save/add the entries found to a new database (private database creation)?
3.	What types of searching do you want?
	-- This is the most interesting part.  Many of the programs that I have seen
	have been fairly special purpose.  I am trying to see if there are more
	general "meta searches" available.   I am aware of the computational
	restrictions/requirements for building different types of tools,
	and the need for high performance on large databases almost demands
	specialized implementations.
4.  If you had a wish list,  what NEW facilities,  tools, commands, etc.,
	would you want?

Please mail me replies;  I will summarize in 2 weeks' time.
Prof. Patrick Powell, Dept. Computer Science, 136 Lind Hall, 207 Church St. SE,
University of Minnesota,  Minneapolis, MN 55455 (612)625-3543/625-4002

wrp@biochsn.acc.virginia.edu (William R. Pearson) (06/21/88)

In article <5916@umn-cs.cs.umn.edu> papowell@umn-cs.cs.umn.edu (Patrick Powell, Dept. CS. U-Minn) writes:
]I have been looking at the problem of building a general purpose
]set of tools for doing some of the computation associated with
]the DNA/RNA sequencing work.   One of the problems that I have is getting
]a wider perspective on the TYPES of tools that users want.
]
]1.  User interfaces:
]2.  Database Searching:
]3.	What types of searching do you want?
]4.  If you had a wish list,  what NEW facilities,  tools, commands, etc.,
]	would you want?
]
]Please mail me replies;  I will summarize in 2 weeks' time.
]Prof. Patrick Powell, Dept. Computer Science, 136 Lind Hall, 207 Church St. SE,
]University of Minnesota,  Minneapolis, MN 55455 (612)625-3543/625-4002

	I believe that there are two deficiencies in current sequence
analysis software.

	(1) a powerful sequence editor/assembly program for building
DNA sequences from multiple sequence determinations.  Although the
computational tools are available for rapid sequence assembly, as are the
database tools for recording changes to a sequence assembly database and the
editing tools required to make corrections, these have not been put together
to build a high-quality, rapid, easy-to-use sequence assembly program.

	(2) a general regular expression matching program for looking
at DNA or protein sequences.  Although regular expression matching
programs are common, a program appropriate for biological sequence analysis
should have the ability to specify a distance range over which matches
must take place.  For example, the pattern ABC{6-10}NDEF would
match ABCNNNNNNDEF and ABCNNNNNNNNNNDEF but not ABCNNNNNDEF.  In addition,
such a program should recognize DNA or protein sequences, not be
confused by newlines, and provide answers that refer to residue
numbers, not line numbers.
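
The requirements above can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions, not an existing tool:
the function name find_spaced_motif and its parameters are hypothetical,
the gap is allowed to be any residue, and every non-letter character
(newlines, spaces, position numbers) is treated as layout and dropped
before matching.

```python
import re

def find_spaced_motif(raw, left, right, min_gap, max_gap):
    """Find `left`, then min_gap..max_gap residues, then `right`.

    Returns a list of (start, end) pairs as 1-based residue numbers.
    Hypothetical helper for illustration only.
    """
    # Keep letters only, so newlines and digits in the input cannot
    # split a match and positions count residues, not characters.
    seq = "".join(ch for ch in raw.upper() if ch.isalpha())
    # The lookahead lets overlapping matches be reported; the lazy gap
    # reports the shortest match beginning at each start position.
    pattern = re.compile(
        "(?=(%s.{%d,%d}?%s))"
        % (re.escape(left.upper()), min_gap, max_gap, re.escape(right.upper()))
    )
    return [(m.start(1) + 1, m.end(1)) for m in pattern.finditer(seq)]
```

With a gap of 6 to 10, find_spaced_motif("ABCNNNNNNDEF", "ABC", "DEF", 6, 10)
reports one match spanning residues 1 to 12, while the five-residue gap
in "ABCNNNNNDEF" yields no match.  A line break inside the sequence does
not change the result.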

	Bill Pearson
	wrp@virginia.EDU

noordewi@speedy.cs.wisc.edu (Mick Noordewier) (06/21/88)

In article <434@hudson.acc.virginia.edu>, wrp@biochsn.acc.virginia.edu (William R. Pearson) writes:
> In article <5916@umn-cs.cs.umn.edu> papowell@umn-cs.cs.umn.edu (Patrick Powell, Dept. CS. U-Minn) writes:
> ]I have been looking at the problem of building a general purpose
> ]set of tools for doing some of the computation associated with
> ]the DNA/RNA sequencing work.   One of the problems that I have is getting
> ]a wider perspective on the TYPES of tools that users want.
> ]
> ]1.  User interfaces:
> ]2.  Database Searching:
> ]3.	What types of searching do you want?
> ]4.  If you had a wish list,  what NEW facilities,  tools, commands, etc.,
> ]	would you want?
> ]
> 	I believe that there are two deficiencies in current sequence
> analysis software.
> 
> 	(2) a general regular expression matching program for looking
> at DNA or protein sequences.  Although regular expression matching
> programs are common, a program appropriate for biological sequence analysis
> should have the ability to specify a distance range over which matches
> must take place.  For example, the pattern ABC{6-10}NDEF would
> match ABCNNNNNNDEF and ABCNNNNNNNNNNDEF but not ABCNNNNNDEF.  In addition,
> such a program should recognize DNA or protein sequences, not be
> confused by newlines, and provide answers that refer to residue
> numbers, not line numbers.
> 
Here at Wisconsin, a set of VERY low level tools has been developed for use 
in analysis and manipulation of characterized nucleic acid sequences.
Although the level of analysis is somewhat primitive, source code is
readily available and might serve as a starting point for the development
of more sophisticated tools.

You also might take a look at a paper bound for an upcoming issue of AAAI,
"Representing Genetic Information with Formal Grammars", by David Searls.
He has used a Prolog implementation of a definite clause grammar to
organize an analytical approach.  A definite clause grammar is more
powerful than regular expressions in the Chomsky hierarchy, and allows
very flexible descriptions of nucleic acid constructs.
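
As a rough illustration of what "more powerful than regular
expressions" buys, here is a small Python sketch (not Prolog, and not
Searls's grammar) that recognizes a stem-loop: a stem, a short loop,
and then the reverse complement of the stem.  Pairing the two ends to
an arbitrary depth needs a recursive rule, which a regular expression
in the formal sense cannot express.  The function name is_stem_loop
and the length limits are assumptions made for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def is_stem_loop(seq, min_stem=3, min_loop=3, max_loop=8):
    """Return True if seq is stem + loop + reverse complement of stem.

    Hypothetical recognizer for illustration; the default limits are
    arbitrary, not biologically calibrated.
    """
    def parse(s, paired):
        # Rule 1: hairpin -> x hairpin x', where x' is the complement of x.
        if len(s) >= 2 and COMPLEMENT.get(s[0]) == s[-1]:
            if parse(s[1:-1], paired + 1):
                return True
        # Rule 2: hairpin -> loop, allowed once the stem is long enough.
        return paired >= min_stem and min_loop <= len(s) <= max_loop

    return parse(seq.upper(), 0)
```

For instance, is_stem_loop("GGCAAAAGCC") is true (stem GGC, loop AAAA,
closing GCC), while changing the last base to A breaks the outermost
pair and the sequence is rejected.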

The distance-range match that Bill Pearson describes is straightforward
to express in such a grammar, as are hierarchical features, overlapping
features, and alternative structures (e.g. alternative splice patterns).

The initial intent of the grammar is to "parse" uncharted DNA, such as
that which will become available from the human genome sequencing project.
It has other potential uses, however.  My own interest is to use it as the 
basis for a planner for molecular genetic constructions.  For example, if
one wanted to create a shuttle vector, such a planner might return 
several strategies by which such a vector might be created from existing 
vectors utilizing available techniques (an appraisal of restriction sites,
mutagenesis techniques, etc., as these techniques apply to the given
molecular precursors available in one's own refrigerator).