GEORGE@gunbrf.bitnet (08/13/90)
I have been reading with some interest the discussions concerning the introduction of the GenBank/EMBL/DDBJ feature tables and GenBank's role in software development. I am sympathetic with the concerns expressed by James Cassett about avoiding the establishment of a large applications software component as part of the GenBank project. However, the question of software development in association with biological databases is one that has not been sufficiently addressed. One may recall that during the discussions that led to the original GenBank contract it was proposed that a center be established for the collection of sequence data and the development of software tools for their analysis. Subsequent to this, and because of concerns similar to those expressed by Jim, the NIH decided to release two RFPs, one for collection of the data and one for software development (the second of these was never released). This set the precedent for a strict separation of software development and data collection. In retrospect, was this a sensible decision? It seems quite clear that the development of applications software should be in the domain of the research scientists and that setting up a single, large, centralized center for such work would not only be wasteful but might actually hinder development in this field. On the other hand, we should consider whether the decision not to develop applications software should be interpreted to mean that there should be NO software development associated with the project. The first five years of the GenBank contract demonstrated quite nicely (and quite predictably, even in 1981) the difficulty of maintaining a large database effort without adequate software support for verification and management of the data.
One may go a step further and assert that if the GenBank project had attempted to develop even a very primitive data access and retrieval system, perhaps we would not have had to wait 7 years for the appearance of a format that allows a protein coding region to be unambiguously translated. The main problem that I see is that we as a community still do not appreciate the complexity of the data we are trying to compile. The prevailing view is still that there is some fundamental distinction between the data and the software. I think that, if we are to have any chance at all of approaching the 'Matrix of Biological Knowledge', this view must change. To continue with the protein translation example, we have learned over the past 10 years that this task is not quite as simple as we expected: there are numerous exceptions to the general rules, and many, such as RNA editing and translation of selenocysteine proteins, are specific to individual coding regions. The game here is to store the knowledge of HOW these coding regions are expressed. I assert that the only sensible way to do this is through a combination of both software and data. Unless the rules for manipulating the information are precisely defined, the information itself is not effectively defined. Indeed, these rules in and of themselves are information that must be considered as part of the database. If one likes to be fashionable, one could say this is an 'object-oriented' approach; however, the key point is that the software, data structures, and data must be considered as a whole and for efficiency must be closely integrated. What does this mean in real practice? It seems that the development of some minimal retrieval system capable of interpreting the feature table information is well within the scope of the GenBank project. Keep in mind that such software would be designed ONLY to provide access to the information. It should NOT be confused with applications software designed to analyse the information.
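To make the idea of storing rules together with data a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not part of any GenBank software: the names `CodingRegion`, `segments`, and `codon_overrides` are hypothetical, the use of 'U' for selenocysteine is an assumption of the sketch, and only forward-strand regions with the standard genetic code are handled. The point is only that the coding-region record carries its own exceptions, so that a retrieval program does not have to guess them.

```python
from dataclasses import dataclass, field

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


@dataclass
class CodingRegion:
    """Hypothetical record: where the region lies AND how to express it."""

    # (start, end) pairs, 1-based and inclusive, on the forward strand.
    segments: list
    # Region-specific exceptions: 0-based codon index -> one-letter residue.
    codon_overrides: dict = field(default_factory=dict)

    def spliced(self, sequence):
        # Join the exon segments in the order given.
        return "".join(sequence[start - 1:end] for start, end in self.segments)

    def translate(self, sequence):
        coding = self.spliced(sequence).upper()
        protein = []
        for index in range(len(coding) // 3):
            codon = coding[3 * index:3 * index + 3]
            # An override stored with the record wins over the general rule.
            residue = self.codon_overrides.get(
                index, STANDARD_CODE.get(codon, "X")
            )
            if residue == "*":
                break
            protein.append(residue)
        return "".join(protein)


# Invented toy sequence: exon 1..6, intron 7..12, exon 13..21.
sequence = "ATGGCCGTAAGTTGATTTTAA"

# General rules only: the TGA at codon index 2 is read as a stop.
plain = CodingRegion(segments=[(1, 6), (13, 21)])
# Same location, plus the stored knowledge that codon 2 is selenocysteine.
annotated = CodingRegion(segments=[(1, 6), (13, 21)], codon_overrides={2: "U"})

plain_protein = plain.translate(sequence)
annotated_protein = annotated.translate(sequence)
print(plain_protein, annotated_protein)
```

With the same sequence and the same location, the two records give different proteins. The difference lies entirely in the rule stored alongside the data.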
I quite agree that the data should not be 'buried' in a specialized software system; it must continue to be distributed in an easily accessible format. Such software, however, would serve to define the data more precisely and to provide an example of how to access it. Others should certainly be encouraged to develop alternative systems. However, those with minimal needs would NOT be required to hire a post-doc or purchase expensive commercial software to solve trivial problems. Moreover, and perhaps most importantly, it would 'force' the database staff to define and adhere to standard conventions for representing the information (i.e., they would have to ensure that THEIR software functioned effectively on the information). Keep in mind that although the new feature table definition is, in principle, capable of representing coding regions, it allows a seemingly infinite variety of ways of doing so, many of which could cause programming difficulties. In passing, I would also like to address another recurrent misunderstanding: the notion that the database designers do not always know how the database will be used. It seems to me rather incredible that anyone would collect anything without at least some notion of what use will be made of it. Certainly, the NIH should consider itself rather foolish for spending multi-millions of dollars on a data set that no one knows what to do with. The fact of the matter is that we have a very good idea of exactly what we want to do with a great deal of the information. Certainly, one cannot foresee all possible future uses, but we should at least make sure the data are formulated in a manner that allows for the operations that we know we want to carry out today. If formulated correctly, this does not preclude efforts to provide extensibility.
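As a rough illustration of what adhering to a standard convention could look like, here is a small Python sketch of a checker that accepts only one narrow way of writing a coding-region location and rejects everything else. It is not the feature table specification: the function name `parse_location` is hypothetical, and the restriction to plain `a..b` spans and a single `join(...)` with ascending, non-overlapping segments is an assumption made for the example.

```python
import re

# A single span such as "13..21".
SPAN = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_location(text):
    """Return a list of (start, end) pairs for a simple location string.

    Only 'a..b' and 'join(a..b,c..d,...)' are accepted in this sketch.
    Anything else raises ValueError instead of being guessed at.
    """
    text = text.replace(" ", "")
    if text.startswith("join(") and text.endswith(")"):
        parts = text[len("join("):-1].split(",")
    else:
        parts = [text]

    segments = []
    for part in parts:
        match = SPAN.match(part)
        if match is None:
            raise ValueError(f"unsupported location: {part!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise ValueError(f"start after end in {part!r}")
        segments.append((start, end))

    # Convention assumed here: segments ascend and do not overlap.
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if next_start <= prev_end:
            raise ValueError("segments overlap or are out of order")
    return segments


simple = parse_location("1..21")
joined = parse_location("join(1..6, 13..21)")
print(simple, joined)
```

If the database staff had to run something of this kind over every entry, each alternative way of writing the same region would surface as an error at their desk and not in someone else's program.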
I have had some experience in this area and I am convinced that if the biological database designers do not give careful thought to the functionality of the data, it will be impossible, irrespective of how artificially intelligent one is, to succeed in satisfying even the most minimal needs of the scientific community. The predictable result will be a scopeless project that flounders about helplessly looking for some direction. Please keep in mind that the data in scientific databases consist of information that has been distilled from the scientific literature. Whether one likes to admit it or not, what is represented in the database is an INTERPRETATION of the original work, and HOW these data are interpreted is entirely dependent on a PRECONCEIVED notion of what is important (these arguments hold even for submitted data; the only difference is in who is doing the interpretation). This results in a particular view of the data, and please understand that this is NOT a view that can be changed by a series of 'join' operations.