[bionet.molbio.bio-matrix] Matrix meeting discussions

GEORGE@gunbrf.bitnet (08/16/90)
        On the Institutionalization of Databases and Other Threats to
                      the Matrix of Biological Knowledge

                               David G. George
                    Protein Identification Resource (PIR)
                   National Biomedical Research Foundation
                     Georgetown University Medical Center


While this treatise does not directly represent the material I presented at the
recent Matrix Meeting at George Mason University, it was in large part
stimulated by the discussions held at this meeting. In particular it was
influenced by the presentations of Mary Berlyn (I must admit to being the one
who suggested that most of us working on scientific databases have at one point
in our careers written a similar abstract) and other speakers concerning
current database efforts. It was also influenced by some of the comments I
overheard that resulted from these presentations: 'Ooo no, not these database
people again, they've been saying the same thing for the past five years!' If
one ascribes any intelligence at all to the database people, one might look for
a lesson here. Perhaps they feel that they must continually repeat themselves
because nobody is listening!

Before proceeding it is instructive to examine in more detail the problem of
reducing biological knowledge to computer form. As has been stated many times,
within the matrix community and elsewhere, biological data are not 'hard;' they
are built upon many layers of inference and one must be aware that when the
inference structure changes so does the information. Databases add an
additional level of interpretation or inference in an attempt to represent this
information symbolically. If these are data, they certainly are not data in the
traditional sense.

Typically in a real application, a group of scientists evolve a conceptual
model of the data they wish to compile and then examine the literature and
extract these data by interpreting the presented material. In the case of the
sequence databases, these data include not just the sequence itself but some
representation of regions or sites of biological interest within or composed of
the sequence, as well as other ancillary information that may describe the
properties and function of the molecule. One should note that in doing so these
groups are compiling a specific view of the scientific literature. Although the
criticism is often made that this is an inherently subjective process and that
such databases represent a scientific point of view, not data, this is
precisely the nature of the beast. Naturally, one tries to be as objective as
possible; however, the key issue is that the information is INTERPRETED. I have
yet to see a single example where this is not true to some degree! I would
challenge anyone to demonstrate that it could be done in any other way. Note
that even abstracting services such as MedLine, CAS, etc., interpret the
information to a certain degree; after all an abstract IS an interpretation of
the article. Moreover, the situation is not significantly changed when the data
are directly submitted by the authors. The database designers/compilers must
provide the contributors with a schema within which to represent their work.
This schema has profound influences on the nature of the compiled data.

This being the case, the database providers are continually frustrated by naive
assertions such as 'why are you wasting good scientists on a job best suited
for librarians?' I have yet to come across anyone skilled in the library
sciences who can fully comprehend a biological manuscript, let alone interpret
one! Please understand that this is the very fabric of the Matrix that we are
considering here. If you are satisfied to leave such work to unqualified staff
then you need not read any further. However, I am 100% certain that if we do
so, ten years from now we will still be sitting around dreaming about the
Matrix of Biological Knowledge and we will not be any closer to its realization
than we are today.

There is another aspect of databasing that recently has become more clearly
understood: the relationship between software and data. I am sure I do
not have to lecture most of you on this subject so I will be brief. Those
working with the information have come to realize that the interrelationships
between the data are even more complex than was first suspected. As an example,
consider the task of translating a genetic coding region to its protein
product. Over the past decade, it has become abundantly clear that there is
seemingly an infinite variety of ways of doing so; like biologists, many bugs
seem insistent on expressing their individuality by doing things in their own
way. The game here is to store the knowledge of HOW these coding regions are
expressed. This leads naturally to 'object-oriented' or 'knowledge base'
approaches. The rules for gene expression are a necessary part of the
description and must themselves become part of the database. Thus the sharp
dividing line between software and data becomes rather blurred and it is no
longer feasible to maintain the classic two-state model of software development
and data collection.
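
To make the point concrete, here is a minimal sketch (in Python, not drawn from
any existing database) of what it can mean for the expression rules to live in
the database alongside the sequence. The names EXPRESSION_RULES, translate, and
the entry layout are hypothetical and chosen only for illustration. The two
rule sets shown are the standard genetic code and the vertebrate mitochondrial
code, which differ in how four codons are read; real entries would need many
more kinds of rules than this.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest); '*' marks a stop.
BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Hypothetical rule table: each rule is stored as data, expressed as
# the codons it reads differently from the standard code.
EXPRESSION_RULES = {
    "standard": {},
    "vertebrate_mitochondrial": {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"},
}


def translate(entry):
    """Translate an entry's coding region using the rule named in the entry."""
    code = {**STANDARD_CODE, **EXPRESSION_RULES[entry["rule"]]}
    sequence = entry["sequence"].upper()
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = code[sequence[i:i + 3]]
        if amino_acid == "*":  # stop codon under this rule set
            break
        protein.append(amino_acid)
    return "".join(protein)


# The same coding region yields different products under different rules.
nuclear = {"sequence": "ATGTGAATATAA", "rule": "standard"}
mitochondrial = {"sequence": "ATGTGAATATAA", "rule": "vertebrate_mitochondrial"}
print(translate(nuclear))        # M
print(translate(mitochondrial))  # MWM
```

The sequence alone does not determine the product; the rule stored with the
entry does. Remove the rule from the record and the "data" can no longer be
interpreted correctly by any general-purpose program.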

If we are to be successful in compiling a Matrix of Biological Knowledge, we
must approach the problem as an interdisciplinary collaboration between
computer scientists and biologists and understand that database design and the
compiling of information are intimately intertwined; as the understanding of
the information evolves so must the design of the database. Attitudes such as
'we are not really interested in the data themselves, only in how to manipulate
them' are entirely inappropriate. The mechanisms for manipulation of the data
are the data! Biologists working alone will not be able to effectively draw on
the expertise in computer science and computer scientists working alone will
likely return to the pre-Copernican era of building epicycles upon epicycles
that have little to do with the real world. Successful systems will only be
developed by those intimately aware of the properties of the data and the
mechanisms and assumptions employed in their accumulation.

Therein lies one of the central problems with current support strategies. The
scientific community STILL does not appreciate that compiling and maintaining a
scientific database is a legitimate scientific enterprise and that database
compilation requires research both in computer science and in basic biology.
Although there have been some recent initiatives to fund research on database
design, this is considered to be distinctly different from actually compiling
the database and there are still no initiatives to support the actual
gathering of the data in concert with these design efforts. Indeed those groups
currently pursuing active programs of concurrent software development and data
collection are being pressured to stop developing software for fear that they
have an unfair advantage over other software developers! I fear we are rapidly
entering an age of empty shells; we are destined to create exquisite data
manipulation and analysis systems with nothing to analyse and that do not
effectively express real biological science.

So let's briefly review the current and prospective funding mechanisms
available to emerging and established database groups.


Research Grants:

In the past, many database groups have been successful in obtaining relatively
stable funding via traditional research grant mechanisms, but not without
considerable difficulties. Indeed throughout its 30-year history the PIR has
drawn its major support via this mechanism. Currently all support for the
project is through research grants, although the funding agencies are planning,
or have already preordained, to rectify this situation in the near future.
Given the nature of the task, I feel that this is still the most effective
mechanism. If such work cannot effectively compete on its own research merits
perhaps it is not a worthwhile scientific endeavor in the first place.

There are two serious failings with this mechanism. 1) Once databases reach a
certain threshold in size they become too costly to support under traditional
research grant mechanisms; the funding agencies become very uncomfortable about
supporting large research projects. 2) The second problem is essentially an
educational one: as long as applications in this area continue to be greeted by

'What is proposed has to do with the development and establishment of a
resource for important genetic information, of interest to a large community of
biological investigators ... and ... although the ideas are sound and the
investigative team seems well qualified, IT IS NOT REALLY A RESEARCH
PROPOSAL...'

such efforts will continue to encounter difficulties obtaining funding via this
mechanism.


Contracts:

This appears to be the favorite mechanism being pursued by the funding
agencies; apparently such mechanisms make it easier for them to appropriate
relatively large sums of money. Although in principle contracts can be handled
in much the same way as research grants, in practice there are often important
distinctions. Contract mechanisms work most effectively when there is a well
defined amount of work to be done within a limited time span and the solution
to the problem is straightforward. Where they fail is when the problem is not
clearly understood. From my previous discussion you should not be too surprised
when I assert that I don't believe anyone really knows how to solve the
database problem; therefore, conceptually this mechanism seems inappropriate.

In practice, a need is generally identified (either by investigator initiative
or some other form of recommendation) and the funding agencies convene a series
of committees to outline the essential nature of the problem. The project is
awarded to a selected applicant whose role is to carry out the prescribed work.
The direction of the project is placed in the hands of an advisory committee,
which meets one or two weeks a year and rotates every few years. Moreover, the
contract itself is reissued on a cyclic basis, typically on a 3 to 5 year
cycle, and may pass from group to group. There is a tendency for such projects
to lack focus and to meander about the cycle of advisory board and contractee
turnover.

There is a more fundamental problem with this mechanism; biological research
has never been effectively conducted by committee. Please understand this is
not a comment on the competency of the committee members or a criticism of the
involvement of the committee. When effectively employed, advisory committees
are extremely beneficial and should be considered essential for the development
of all relatively large database projects. They keep the database group from
developing 'tunnel vision' and often provide valuable insights into the project
itself. The question is whether they should be in the position to direct the
research activities. Of course, the primary assumption is that this is not
research in the first place.

There is a more subtle by-product of such mechanisms. There is a tendency to
fund only a single project in a particular area and to squash any existing
efforts that appear to be in any way overlapping. Given my previous description
of a database as a specific view of the literature, one might consider whether
or not this is in the best interest of science. Granted, given current
budgetary constraints, one would not recommend funding two competing efforts of
the size of the GenBank project. However, is it really wise not to permit the
establishment of alternative small-scale efforts? It is not clear whether these
policies emanate directly from the agencies or from the general scientific
community, but there is a clear mandate to eliminate any and all so-called
'redundant' efforts. None of the various genetic sequence database initiatives
around in the late 70's and early 80's have survived GenBank.

I might add that these policies have directly contributed to the lack of, or at
least reluctance toward, collaboration that has been witnessed among some
database groups. When there is only one prize, the stakes become much higher
and it is human nature for mutual trust to decrease. I now hear the echo of a
recent observation made to me: 'everyone is talking about collaboration but all
I see is competition.'


Institutes:

Centralized institutes for compiling biological data have been proposed, and
there appears to be significant movement toward establishing them. This appears
to be the legacy of the human genome sequencing initiative. Although there is
a tendency away from this idea for the experimental sequencing efforts, the
tendency has not been reversed for database efforts and indeed appears to be
gaining momentum.

These mechanisms presumably alleviate some of the problems of contract
mechanisms, i.e., a stable group of scientists can be assembled and there is
provision for long-term support for this group. The down side of course is that
the mechanism necessarily creates a monolithic view of science. As the model
allows for absolutely no alternative approaches, this most certainly will
hinder any real scientific advancement in the field of bioinformatics. Perhaps
more importantly, by limiting opportunities, it will decrease the interest or
ability of emerging scientists to participate in this field. If such
consolidated approaches were really good for the biological sciences, then the
funding agencies could save themselves considerable effort by simply selecting
a Nobel Laureate every year to give all the money to.


                             So What Shall We Do

I am not sure what the answers to these problems are, but I do know that the
current and projected policies are seriously flawed. The future prospects for
existing and emerging database groups are not good. It is certainly not a very
healthy environment in which to cultivate a newly emerging scientific
discipline.

Inasmuch as database groups are already given second class status, in an era of
tight budgets and decreasing numbers of grant awards they will be among the
first to go. The other avenues of support available have become increasingly
political. In an ideal world, politics would have no place in science. Granted
we are not living in an ideal world, but we should certainly strive to minimize
these effects. The current policies seem to be accentuating them. In this field
it is no longer sufficient to be a good scientist; one must also be a skilled
politician.

If we are to proceed with the development of the Matrix of Biological
Knowledge, it seems that we have a massive educational task at hand. If we
cannot successfully convince the scientific community that the accumulation of
biological knowledge is a worthwhile, legitimate, and necessary scientific
endeavor, then the Matrix concept surely has no hope for the future.