[bionet.molbio.evolution] needs for phylogeny programs

gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") (11/24/90)

Although a few comments of my own are listed at the end, most of this
message is really a reply to Don Gilbert's message re: what is needed in
PHYLIP.  Not really what was asked for, but I disagreed strongly enough
with what he said that I felt I should reply:

Don says:

>I would much prefer to see Phylip read (and if needed, write) sequence data
>using the IntelliGenetics format.  This format is widely used now, and some
>multiple aligners already produce this output.  

	I disagree.  I think that most people acquire their sequences from
	the sequence databases, and that these should be the primary formats
	supported.  Other formats are really just gravy.  Of course,
	there is no database standard for interleaved multiple sequences at
	this time.

>                                                The current Phylip 3.3
>sequence input format, with its interleaving of species, is a pain to
>translate to/from.  

	Again, I disagree.  If you intend to look at the sequences as an entire 
	group, interleaved format is the only one that lets you do it.

>                   While many multi-aligners produce some sort of interleaved
>output for _display_, all of these require extensive hand editing to
>fit into Phylip format.  I am working with output from another program now
>that uses an interleaved format: about 150 species with 2000+ bases each.

	I have found it pretty trivial to convert the formats, usually
	using the GCG program PRETTY.  All that is needed is to remove the
	sequence names from the blocks after the first one, which is easily
	done with the replace function of an editor.
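
	As a rough sketch of that edit, assuming a simplified layout in which
	blocks are separated by blank lines and every line is "name residues"
	(this is not the exact output of PRETTY, and the function name is
	made up for illustration), something like this would do it:

```python
import re


def strip_repeated_names(text):
    """Keep sequence names in the first block only; drop them in later blocks.

    Assumes blocks are separated by blank lines and each line has the form
    'name residues' (a simplified, hypothetical layout).
    """
    blocks = [b.splitlines() for b in re.split(r"\n\s*\n", text.strip())]
    out = [blocks[0]]  # first block keeps its names
    for block in blocks[1:]:
        # Split off the leading name and keep only the residues.
        out.append([line.split(None, 1)[1] for line in block])
    return "\n\n".join("\n".join(b) for b in out)


example = "human ACGT\nmouse ACGA\n\nhuman TTGA\nmouse TTGC\n"
print(strip_repeated_names(example))
```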

>Normally programs read 1 sequence at a time.  This means 150 passes thru
>a 600 kilobyte file ... it takes a while.  A format with one sequence after
>another can be read in one gulp.

	If the overall length of the sequences is noted at the beginning of
	the file, a sequence reading program requires only one pass to read all
	of the sequences.  In the worst case, where the length of the sequences
	is unknown, only two passes are required: the first to find the lengths
	of the sequences, the second to read them all.  This assumes that the
	sequences will all be stored in memory, which doesn't seem too
	prohibitive, although 300K might be too much on a PC.  However,
	whatever size of sequence buffer you can afford, it takes at
	worst (sum_of_sequence_lengths)/(sequence_buffer) + 1 passes to read the
	sequences.  For Don's case of 300K of sequence, even a small buffer
	such as 30K reduces the passes to 10.
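
	The arithmetic can be checked with a few lines.  This is only a sketch;
	the function names are made up, and K is assumed to mean 1024 bytes
	(the ratio is the same either way).  The "+ 1" in the bound covers a
	partly filled final buffer; when the total divides evenly, as in Don's
	case, the exact count is just the quotient:

```python
def worst_case_passes(total_sequence_bytes, buffer_bytes):
    """Upper bound from the text: total / buffer + 1, using integer division."""
    return total_sequence_bytes // buffer_bytes + 1


def exact_passes(total_sequence_bytes, buffer_bytes):
    """Exact number of buffer fills needed (ceiling of total / buffer)."""
    return -(-total_sequence_bytes // buffer_bytes)


K = 1024  # assumption: "K" means 1024 bytes
total = 300 * K   # 150 species x 2000 bases, roughly
buffer = 30 * K

print(worst_case_passes(total, buffer))  # bound: 11
print(exact_passes(total, buffer))       # exact: 10
```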
-------------------------------------------------------------------------------
My suggestions:

I think that a filter to convert sequences to and from interleaved
format would be useful.  From Don's message, it is clear that there is a
demand, and I think it would solve his problems.
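
A minimal sketch of such a filter follows.  It assumes a simplified
interleaved layout: a header line giving the number of species and the
number of sites, names on the first block only, and whitespace-delimited
names.  That is an assumption for illustration, not the exact PHYLIP 3.3
format, and the function names are hypothetical.  Because the header gives
the species count, the input is read in a single pass:

```python
def deinterleave(lines):
    """Read interleaved lines in one pass; return a list of (name, sequence)."""
    n_species, n_sites = map(int, lines[0].split())
    rows = [ln for ln in lines[1:] if ln.strip()]  # ignore blank separators
    names, chunks = [], []
    for i, row in enumerate(rows):
        if i < n_species:
            # First block: split the name off the front of the line.
            name, row = row.split(None, 1)
            names.append(name)
            chunks.append([])
        # Row i belongs to species i mod n_species.
        chunks[i % n_species].append("".join(row.split()))
    seqs = ["".join(c) for c in chunks]
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequence length does not match header")
    return list(zip(names, seqs))


def interleave(records, width=60):
    """Write (name, sequence) records as interleaved lines, names in block 1."""
    n_sites = len(records[0][1])
    out = [f"{len(records)} {n_sites}"]
    for start in range(0, n_sites, width):
        if start:
            out.append("")  # blank line between blocks
        for name, seq in records:
            piece = seq[start:start + width]
            out.append(f"{name} {piece}" if start == 0 else piece)
    return out


records = [("human", "ACGTTTGA"), ("mouse", "ACGATTGC")]
lines = interleave(records, width=4)
print("\n".join(lines))
print(deinterleave(lines) == records)
```

Going through a list of (name, sequence) pairs as the intermediate form
means the same two functions serve both directions of the conversion.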

It would be nice to have a way to compare the generated tree with a tree
of specified topology and calculate whether they are significantly
different.  I'm thinking of the case where the sequence-based phylogeny
disagrees with the known or currently approved phylogeny, and you would
like to know how confident to be that the sequence-based tree is really
different.  Let me just add that I have only used the current
package a little bit, and maybe there is a way to do this already.

Michael Gribskov
gribskov@ncifcrf.gov