gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") (11/24/90)
Although a few comments of my own are listed at the end, most of this message is really a reply to Don Gilbert's message re: what is needed in PHYLIP. This is not really what was asked for, but I disagreed strongly enough with what he said that I felt I should reply.

Don says:

>I would much prefer to see Phylip read (and if needed, write) sequence data
>using the IntelliGenetics format. This format is widely used now, and some
>multiple aligners already produce this output.

I disagree. I think that most people acquire their sequences through the sequence databases, and that these should be the primary formats supported. Other formats are really just gravy. Of course, there is no database standard for interleaved multiple sequences at this time.

> The current Phylip 3.3
>sequence input format, with its interleaving of species, is a pain to
>translate to/from.

Again, I disagree. If you intend to look at the sequences as an entire group, interleaved format is the only one that lets you do it.

> While many multi-aligners produce some sort of interleaved
>output for _display_, all of these require extensive hand editting to
>fit into Phylip format. I am working with output from another program now
>that uses an interleaved format: about 150 species with 2000+ bases each.

I have found it pretty trivial to convert the formats, usually using the GCG program PRETTY. All that is needed is to remove the sequence names from the blocks after the first one, which is easily done with the replace function of an editor.

>Normally programs read 1 sequence at a time. This means 150 passes thru
>a 600 kilobyte file ... it takes a while. A format with one sequence after
>another can be read in one gulp.

If the overall length of the sequences is noted at the beginning of the file, a sequence-reading program requires only one pass to read all of the sequences.
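
To illustrate the one-pass point, here is a minimal Python sketch, not code from PHYLIP or GCG. It assumes a simplified interleaved layout: a header line giving the number of sequences and their length, a first block in which the leading 10 columns of each line hold the name, and later blocks that carry sequence only, separated by blank lines. The function name `read_interleaved` and the `name_width` parameter are made up for this example.

```python
import io


def read_interleaved(handle, name_width=10):
    """Read an interleaved alignment in a single pass over the file.

    Assumed layout (hypothetical, simplified): the first line holds
    "<number of sequences> <sequence length>"; names occupy the first
    name_width columns of each line in the first block only.
    """
    header = handle.readline().split()
    n_seqs, n_sites = int(header[0]), int(header[1])

    names = []
    chunks = [[] for _ in range(n_seqs)]
    row = 0  # which sequence the next non-blank line belongs to

    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines only separate blocks
        if len(names) < n_seqs:
            # Still in the first block: split off the name field.
            names.append(line[:name_width].strip())
            line = line[name_width:]
        # Drop any spacing inside the sequence text and keep the piece.
        chunks[row].append("".join(line.split()))
        row = (row + 1) % n_seqs

    seqs = ["".join(parts) for parts in chunks]
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))


# Small made-up alignment: 3 sequences of 12 sites in two blocks.
sample = (
    " 3 12\n"
    "Alpha     ACGTAC\n"
    "Beta      ACGTTC\n"
    "Gamma     ACGAAC\n"
    "\n"
    "GGTTAA\n"
    "GGTTAC\n"
    "GGATAA\n"
)

alignment = read_interleaved(io.StringIO(sample))
for name, seq in alignment.items():
    print(name, seq)
```

Because the header states how many sequences there are, each line can be assigned to its sequence as it is read, so the file is never rewound.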
In the worst case, where the length of the sequences is unknown, only two passes are required: the first to find the lengths of the sequences, the second to read them all. This assumes that the sequences will all be stored in memory, which doesn't seem too prohibitive, although 300K might be too much on a PC. However, whatever length of sequence buffer you can afford, it takes at worst (sum_of_sequence_lengths)/(sequence_buffer) + 1 passes to read the sequences. For Don's case of 300K of sequence, even a small buffer such as 30K reduces the passes to 10.

-------------------------------------------------------------------------------

My suggestions:

I think that a filter to convert sequences to and from interleaved format would be useful. From Don's message, it is clear there is a demand, and it would solve his problems, I think.

It would be nice to have a way to compare the generated tree with a tree of specified topology and calculate whether they are significantly different. I'm thinking of the case where the sequence-based phylogeny disagrees with the known or currently approved phylogeny, and you would like to know how confident to be that the sequence-based tree is really different. Let me just add that I have only used the current package a little bit, and maybe there is a way to do this already.

Michael Gribskov
gribskov@ncifcrf.gov