murphy@PHRI.NYU.EDU (Ellen Murphy) (12/09/89)
In article <8912080616.AA04908@genie.gene.com> botstein@gene.com (David Botstein) writes:
>... the level of accuracy required is dependent upon the use to
>which the sequence is to be put. More important ... is an agreement
> on a labeling scheme that associates with each sequenced base
> (or stretch of bases) an estimate of the expected reliability...

I agree completely. There is definitely a place in the databases for less than perfect sequences. I'm not sure what kind of accuracy we're talking about, though: 95% sounds a bit low, since one reading of one gel on one strand should produce better than that. (I would also be surprised if the databases were really only 99% accurate.)

But how do you define accuracy? What about the 1 kb of sequence that I have 99.8% confidence in (there are two uncertain bases) that is almost 100% single-strand generated? I had to walk that far to find the bit I was interested in, and I'm unlikely to sequence the other strand in the near future. In the meantime, that sequence could be useful to somebody else.

I don't think the search algorithms will be too much bothered; a frameshift would simply cause tfasta to score a hit against two reading frames in the same entry. It ought to be possible to determine what level of inaccuracy, randomly distributed (which is probably not usually the case), would be required to miss scoring a hit at a given level of similarity. This is the number we should be aiming for.

On the related subject of when people should be required to submit sequences to the databases, I think that those who are using "it's not perfect yet" as an excuse will just find themselves some other reason for delay. Submission of sequences upon acceptance of a manuscript is something that has to be enforced by the journals. I find the "it's not our responsibility" attitude of some respected journals quite offensive.

Ellen Murphy
murphy@phri.nyu.edu
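
A rough sketch of the calculation suggested above (how much randomly distributed error it would take to miss a hit at a given level of similarity). This is only an illustration under loud assumptions that are not in the post: errors are substitutions only (no frameshifts), they occur in one of the two sequences, each wrong base is equally likely to be any of the other three, and a "hit" is modelled as percent identity staying above a cutoff, which is not how tfasta actually scores. The function names and the example figures of 80% and 70% are hypothetical.

```python
def accuracy_from_uncertain(length, uncertain):
    """Fraction of bases treated as reliable, e.g. 2 doubtful bases in 1 kb."""
    return 1.0 - uncertain / length


def observed_identity(true_identity, error_rate):
    """Expected identity after random substitution errors in one sequence.

    A matching position stays a match only if it is read correctly.
    A mismatching position becomes a match only if the error happens to
    produce the partner's base (1 of the 3 alternatives).
    """
    s, e = true_identity, error_rate
    return s * (1.0 - e) + (1.0 - s) * e / 3.0


def max_tolerable_error(true_identity, threshold):
    """Error rate at which expected identity falls to the cutoff.

    Solves s*(1 - e) + (1 - s)*e/3 = t for e (assumed model, see above).
    """
    s, t = true_identity, threshold
    if s <= 0.25:
        raise ValueError("model only meaningful for identity above 25%")
    if t > s:
        raise ValueError("cutoff already exceeds the true identity")
    e = (s - t) / (s - (1.0 - s) / 3.0)
    return min(1.0, e)  # the cutoff may be unreachable even at 100% error


if __name__ == "__main__":
    print(accuracy_from_uncertain(1000, 2))
    print(max_tolerable_error(0.80, 0.70))
```

Under these assumptions, a pair that is truly 80% identical would need an error rate of roughly 14% before its expected identity dropped to a 70% cutoff, which is far above the 0.2% uncertainty in the 1 kb example. Non-random errors and frameshifts would change the picture and are not modelled here.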