[bionet.molbio.genbank] Software development

GEORGE@gunbrf.bitnet (08/13/90)
I have been reading with some interest the discussions concerning the
introduction of the GenBank/EMBL/DDBJ feature tables and GenBank's role in
software development. I am sympathetic with the concerns expressed by James
Cassett of avoiding the establishment of a large applications software
component as part of the GenBank project. However, the question of software
development in association with biological databases is one that has not been
sufficiently addressed.

One may recall that during the discussions that led to the original GenBank
contract it was proposed that a center for the collection of sequence data and
the development of software tools for their analysis be established.
Subsequent to this, and because of concerns similar to those expressed by Jim,
the NIH decided to release two RFPs, one for collection of the data and one for
software development (the second of these was never released). This set the
precedent for a strict separation of software development and data collection.

In retrospect, was this a sensible decision?

It seems quite clear that the development of applications software should be in
the domain of the research scientists and that setting up a single, large,
centralized center for such work would not only be wasteful but might actually
hinder development in this field.

On the other hand, we should consider whether the decision not to develop
applications software should be interpreted to mean that there should be NO
software development associated with the project. The first five years of the
GenBank contract demonstrated quite nicely (and quite predictably, even in 1981)
the difficulty of maintaining a large database effort without adequate software
support for verification and management of the data. One may go a step further
and assert that if the GenBank project had attempted to develop even a very
primitive data access and retrieval system that perhaps we would not have had
to wait 7 years for the appearance of a format that allows a protein coding
region to be unambiguously translated.

The main problem that I see is that we as a community still do not appreciate
the complexity of the data we are trying to compile. The prevailing view is
still that there is some fundamental distinction between the data and the
software. I think that, if we are to have any chance at all of approaching the
'Matrix of Biological Knowledge', this view must change. To continue with the
protein translation example, we have learned over the past 10 years that this
task is not quite as simple as we expected: there are numerous exceptions to
the general rules, and many, such as RNA editing and the translation of
selenocysteine-containing proteins, are specific to individual coding regions.
The game here is to store
the knowledge of HOW these coding regions are expressed. I assert that the only
sensible way to do this is through a combination of both software and data.
Unless the rules for manipulating the information are precisely defined, the
information itself is not effectively defined. Indeed, these rules in and of
themselves are information that must be considered as part of the database. If
one likes to be fashionable, one could say this is an 'object-oriented'
approach; however, the key point is that the software, data structures, and
data must be considered as a whole and for efficiency must be closely
integrated.
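
As a rough illustration of what storing the rules together with the data
might look like, here is a minimal sketch in Python. Everything named in it
(CodingRegion, spans, exceptions, STANDARD_CODE) is hypothetical and is not
taken from any GenBank software. It assumes the standard genetic code,
1-based inclusive coordinates, and a per-codon exception table in which 'U'
stands for selenocysteine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD_CODE = dict(zip(CODONS, AMINO))


@dataclass
class CodingRegion:
    """A coding region that carries its own translation rules (hypothetical)."""

    # 1-based inclusive (start, end) pairs on the entry's sequence, in order.
    spans: List[Tuple[int, int]]
    # Exceptions specific to this region: 1-based codon number -> residue.
    exceptions: Dict[int, str] = field(default_factory=dict)

    def translate(self, sequence: str) -> str:
        # Splice the spans together, then read codon by codon.
        coding = "".join(sequence[s - 1:e] for s, e in self.spans)
        coding = coding.upper().replace("U", "T")
        protein = []
        for i in range(0, len(coding) - 2, 3):
            number = i // 3 + 1
            residue = self.exceptions.get(number)
            if residue is None:
                # Unknown or ambiguous codons are reported as 'X'.
                residue = STANDARD_CODE.get(coding[i:i + 3], "X")
            if residue == "*":
                break  # stop codon with no recorded exception
            protein.append(residue)
        return "".join(protein)


# Two exons (1-3 and 9-17) separated by a five-base intron.
entry = "ATGGTAAGTGAGGCTAA"
plain = CodingRegion(spans=[(1, 3), (9, 17)])
seleno = CodingRegion(spans=[(1, 3), (9, 17)], exceptions={2: "U"})
print(plain.translate(entry))   # M   (TGA read as a stop)
print(seleno.translate(entry))  # MUG (TGA read as selenocysteine)
```

The same bases give two different proteins depending on the rule attached to
the region, which is the sense in which the rule itself is part of the data.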

What does this mean in real practice?

It seems that the development of some minimal retrieval system capable of
interpreting the feature table information is well within the scope of the
GenBank project. Keep in mind that such software would be designed ONLY to
provide access to the information. It should NOT be confused with applications
software designed to analyse the information. I quite agree that the data should
not be 'buried' in a specialized software system; it must continue to be
distributed in an easily accessible format. Such software, however, would serve
to define the data more precisely and to provide an example of how to access
it. Others should certainly be encouraged to develop alternative systems.
However, those with minimal needs would NOT be required to hire a post-doc or
purchase expensive commercial software to solve trivial problems. Moreover, and
perhaps most importantly, it would 'force' the database staff to define and
adhere to standard conventions for representing the information (i.e., they
would have to ensure that THEIR software functioned effectively on the
information). Keep in mind that although the new feature table definition is,
in principle, capable of representing coding regions, it allows a seemingly
infinite variety of ways of doing so, many of which could cause programming
difficulties.
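
To make the point about conventions concrete, here is a minimal sketch in
Python of the kind of access routine meant here. It is an assumption-laden
illustration, not GenBank code: the function name parse_location is
hypothetical, and it deliberately accepts only two location forms, a plain
range 'a..b' and 'join(a..b,c..d,...)'. Any other way of writing a location
is rejected, which is exactly where a database staff running such software
would be forced to notice an unconventional entry.

```python
import re
from typing import List, Tuple

_RANGE = re.compile(r"(\d+)\.\.(\d+)")
_JOIN = re.compile(r"join\((.*)\)")


def parse_location(text: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) spans for a simple location.

    Only 'a..b' and 'join(a..b,c..d,...)' are understood (an assumed,
    deliberately narrow subset); anything else raises ValueError.
    """
    text = text.replace(" ", "")
    joined = _JOIN.fullmatch(text)
    parts = joined.group(1).split(",") if joined else [text]
    spans = []
    for part in parts:
        match = _RANGE.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported location element: {part!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise ValueError(f"start exceeds end in: {part!r}")
        spans.append((start, end))
    return spans


print(parse_location("join(1..3, 9..17)"))  # [(1, 3), (9, 17)]
print(parse_location("5..20"))              # [(5, 20)]
```

A stricter or more permissive subset is a design choice; the point is only
that the accepted forms are written down in something executable.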

In passing, I would also like to address another recurrent misunderstanding,
namely that

   the database designers do not always know how the database will be used.

It seems to me rather incredible that anyone would collect anything without at
least some notion of what use will be made of it. Certainly, the NIH should
consider itself rather foolish for spending multi-millions of dollars on a data
set that no one knows what to do with. The fact of the matter is that we have a
very good idea of exactly what we want to do with a great deal of the
information. Certainly, one cannot foresee all possible future uses, but we
should at least make sure the data are formulated in a manner that allows for
the operations that we know we want to carry out today. If formulated
correctly, this does not preclude efforts to provide extensibility.

I have had some experience in this area and I am convinced that if the
biological database designers do not give careful thought to the functionality
of the data, it will be impossible, irrespective of how artificially
intelligent one is, to succeed in satisfying even the most minimal needs of the
scientific community. The predictable result will be a scopeless project that
flounders about helplessly looking for some direction. Please keep in mind
that the data in scientific databases consist of information that has been
distilled from the scientific literature. Whether one likes to admit it or not,
what is represented in the database is an INTERPRETATION of the original work,
and HOW these data are interpreted is entirely dependent on a PRECONCEIVED
notion of what is important (these arguments hold even for submitted data; the
only difference is in who is doing the interpretation). This results in a
particular view of the data, and please understand that this is NOT a view that
can be changed by a series of 'join' operations.