GEORGE@gunbrf.bitnet (08/16/90)
On the Institutionalization of Databases and Other Threats to the Matrix of Biological Knowledge

David G. George
Protein Identification Resource (PIR)
National Biomedical Research Foundation
Georgetown University Medical Center

While this treatise does not directly represent the material I presented at the recent Matrix Meeting at George Mason University, it was in large part stimulated by the discussions held at this meeting. In particular it was influenced by the presentations of Mary Berlyn (I must admit to being the one who suggested that most of us working on scientific databases have at one point in our careers written a similar abstract) and other speakers concerning current database efforts. It was also influenced by some of the comments I overheard that resulted from these presentations: 'Ooo no, not these database people again, they've been saying the same thing for the past five years!' If one ascribes any intelligence at all to the database people, one might look for a lesson here. Perhaps they feel that they must continually repeat themselves because nobody is listening!

Before proceeding it is instructive to examine in more detail the problem of reducing biological knowledge to computer form. As has been stated many times, within the matrix community and elsewhere, biological data are not 'hard'; they are built upon many layers of inference and one must be aware that when the inference structure changes so does the information. Databases add an additional level of interpretation or inference in an attempt to represent this information symbolically. If these are data, they certainly are not data in the traditional sense. Typically in a real application, a group of scientists evolves a conceptual model of the data they wish to compile and then examines the literature and extracts these data by interpreting the presented material.
In the case of the sequence databases, these data include not just the sequence itself but some representation of regions or sites of biological interest within or composed of the sequence, as well as other ancillary information that may describe the properties and function of the molecule. One should note that in doing so these groups are compiling a specific view of the scientific literature. Although the criticism is often made that this is an inherently subjective process and that such databases represent a scientific point of view, not data, this is precisely the nature of the beast. Naturally, one tries to be as objective as possible; however, the key issue is that the information is INTERPRETED. I have yet to see a single example where this is not true to some degree! I would challenge anyone to demonstrate that it could be done in any other way. Note that even abstracting services such as MedLine, CAS, etc., interpret the information to a certain degree; after all, an abstract IS an interpretation of the article. Moreover, the situation is not significantly changed when the data are directly submitted by the authors. The database designers/compilers must provide the contributors with a schema within which to represent their work. This schema has a profound influence on the nature of the compiled data.

This being the case, the database providers are continually frustrated by naive assertions such as 'why are you wasting good scientists on a job best suited for librarians?' I have yet to come across anyone skilled in the library sciences who can fully comprehend a biological manuscript, let alone interpret one! Please understand that this is the very fabric of the Matrix that we are considering here. If you are satisfied to leave such work to unqualified staff then you need not read any further.
However, I am 100% certain that if we do so, ten years from now we will still be sitting around dreaming about the Matrix of Biological Knowledge and we will not be any closer to its realization than we are today.

There is another aspect of databasing that recently has become more clearly understood: the relationship between software and data. I am sure I do not have to lecture most of you on this subject so I will be brief. Those working with the information have come to realize that the interrelationships between the data are even more complex than was first suspected. As an example, consider the task of translating a genetic coding region to its protein product. Over the past decade, it has become abundantly clear that there is a seemingly infinite variety of ways of doing so; like biologists, many bugs seem insistent on expressing their individuality by doing things in their own way. The game here is to store the knowledge of HOW these coding regions are expressed. This leads naturally to 'object-oriented' or 'knowledge base' approaches. The rules for gene expression are a necessary part of the description and must themselves become part of the database. Thus the sharp dividing line between software and data becomes rather blurred and it is no longer feasible to maintain the classic two-state model of software development and data collection.

If we are to be successful in compiling a Matrix of Biological Knowledge, we must approach the problem as an interdisciplinary collaboration between computer scientists and biologists and understand that database design and the compiling of information are intimately intertwined; as the understanding of the information evolves so must the design of the database. Attitudes such as 'we are not really interested in the data themselves, only in how to manipulate them' are entirely inappropriate. The mechanisms for manipulation of the data are the data!
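
To make the point about expression rules being part of the data concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of how PIR or any other database actually stores this knowledge: the names `CODES`, `translate`, and the `entry` record are hypothetical. The two codon tables themselves are the well-known standard genetic code and the vertebrate mitochondrial code. The translation rule is kept as data next to the sequence, and the software merely looks it up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids listed in TCAG order for codon positions 1, 2, 3.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# Translation rules stored as data (hypothetical layout). The vertebrate
# mitochondrial code differs from the standard code at four codons.
CODES = {
    "standard": STANDARD,
    "vertebrate_mitochondrial": {
        **STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*",
    },
}


def translate(dna, code_name="standard"):
    """Translate a coding region using the named rule set; stop at the first stop codon."""
    code = CODES[code_name]
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = code[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# A hypothetical database entry carries the name of its own expression rule.
entry = {"sequence": "ATGATATGATAA", "genetic_code": "vertebrate_mitochondrial"}
print(translate(entry["sequence"]))                         # MI
print(translate(entry["sequence"], entry["genetic_code"]))  # MMW
```

The same sequence yields two different protein products depending on which rule is recorded with it, so an entry that omits the rule is incomplete as data.
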
Biologists working alone will not be able to effectively draw on the expertise in computer science and computer scientists working alone will likely return to the pre-Copernican era of building epicycles upon epicycles that have little to do with the real world. Successful systems will only be developed by those intimately aware of the properties of the data and the mechanisms and assumptions employed in their accumulation.

Therein lies one of the central problems with current support strategies. The scientific community STILL does not appreciate that compiling and maintaining a scientific database is a legitimate scientific enterprise and that database compilation requires research both in computer science and in basic biology. Although there have been some recent initiatives to fund research on database design, this is considered to be distinctly different from actually compiling the database and there are still no initiatives to support the actual gathering of the data in concert with these design efforts. Indeed those groups currently pursuing active programs of concurrent software development and data collection are being pressured to stop developing software for fear that they have an unfair advantage over other software developers! I fear we are rapidly entering an age of empty shells; we are destined to create exquisite data manipulation and analysis systems that have nothing to analyse and that do not effectively express real biological science.

So let's briefly review the current and prospective funding mechanisms available to emerging and established database groups.

Research Grants: In the past, many database groups have been successful in obtaining relatively stable funding via traditional research grant mechanisms, but not without considerable difficulties. Indeed throughout its 30-year history the PIR has drawn its major support via this mechanism.
Currently all support for the project is through research grants, although the funding agencies are planning, or have already preordained, to rectify this situation in the near future. Given the nature of the task, I feel that this is still the most effective mechanism. If such work cannot effectively compete on its own research merits perhaps it is not a worthwhile scientific endeavor in the first place. There are, however, two serious failings with this mechanism.

1) Once databases reach a certain threshold in size they become too costly to support under traditional research grant mechanisms; the funding agencies become very uncomfortable about supporting large research projects.

2) The second problem is essentially an educational one: as long as applications in this area continue to be greeted by 'What is proposed has to do with the development and establishment of a resource for important genetic information, of interest to a large community of biological investigators ... and ... although the ideas are sound and the investigative team seems well qualified, IT IS NOT REALLY A RESEARCH PROPOSAL...' such efforts will continue to encounter difficulties obtaining funding via this mechanism.

Contracts: This appears to be the favorite mechanism being pursued by the funding agencies; apparently such mechanisms make it easier for them to appropriate relatively large sums of money. Although in principle contracts can be handled in much the same way as research grants, in practice there are often important distinctions. Contract mechanisms work most effectively when there is a well defined amount of work to be done within a limited time span and the solution to the problem is straightforward. Where they fail is when the problem is not clearly understood. From my previous discussion you should not be too surprised when I assert that I don't believe anyone really knows how to solve the database problem; therefore, conceptually this mechanism seems inappropriate.
In practice, a need is generally identified (either by investigator initiative or some other form of recommendation) and the funding agencies convene a series of committees to outline the essential nature of the problem. The project is awarded to a selected applicant whose role is to carry out the prescribed work. The direction of the project is placed in the hands of an advisory committee, which meets one or two weeks a year and rotates every few years. Moreover, the contract itself is reissued on a cyclic basis, typically on a 3 to 5 year cycle, and may pass from group to group. There is a tendency for such projects to lack focus and to meander about the cycle of advisory board and contractee turnover.

There is a more fundamental problem with this mechanism; biological research has never been effectively conducted by committee. Please understand this is not a comment on the competency of the committee members or a criticism of the involvement of the committee. When effectively employed, advisory committees are extremely beneficial and should be considered essential for the development of all relatively large database projects. They keep the database group from developing 'tunnel vision' and often provide valuable insights into the project itself. The question is whether they should be in the position to direct the research activities. Of course, the primary assumption is that this is not research in the first place.

There is a more subtle by-product of such mechanisms. There is a tendency to fund only a single project in a particular area and to squash any existing efforts that appear to be in any way overlapping. Given my previous description of a database as a specific view of the literature, one might consider whether or not this is in the best interest of science. Granted, given current budgetary constraints, one would not recommend funding two competing efforts of the size of the GenBank project.
However, is it really wise not to permit the establishment of alternative small-scale efforts? It is not clear whether these policies emanate directly from the agencies or from the general scientific community, but there is a clear mandate to eliminate any and all so-called 'redundant' efforts. None of the various genetic sequence database initiatives around in the late 70's and early 80's have survived GenBank. I might add that these policies have directly contributed to the lack of, or at least reluctance toward, collaboration that has been witnessed among some database groups. When there is only one prize, the stakes become much higher and it is human nature for mutual trust to decrease. I now hear the echo of a recent observation made to me: 'everyone is talking about collaboration but all I see is competition.'

Institutes: It has been proposed that centralized institutes be established for compiling biological data, and there appears to be significant movement in this direction. This appears to be the legacy of the human genome sequencing initiative. Although there is a tendency away from this idea for the experimental sequencing efforts, the tendency has not been reversed for database efforts and indeed appears to be gaining momentum. These mechanisms presumably alleviate some of the problems of contract mechanisms, i.e., a stable group of scientists can be assembled and there is provision for long-term support for this group. The down side of course is that the mechanism necessarily creates a monolithic view of science. As the model allows for absolutely no alternative approaches, this most certainly will hinder any real scientific advancement in the field of bioinformatics. Perhaps more importantly, by limiting opportunities, it will decrease the interest or ability of emerging scientists to participate in this field.
If such consolidated approaches were really good for the biological sciences, then the funding agencies could save themselves considerable effort by simply selecting a Nobel Laureate every year to give all the money to.

So What Shall We Do?

I am not sure what the answers to these problems are, but I do know that the current and projected policies are seriously flawed. The future prospects for existing and emerging database groups are not good. It is certainly not a very healthy environment in which to cultivate a newly emerging scientific discipline. Inasmuch as database groups are already given second class status, in an era of tight budgets and decreasing numbers of grant awards they will be among the first to go. The other avenues of support available have become increasingly political. In an ideal world, politics would have no place in science. Granted we are not living in an ideal world, but we should certainly strive to minimize these effects. The current policies seem to be accentuating them. In this field it is no longer sufficient to be a good scientist; one must also be a skilled politician.

If we are to proceed with the development of the Matrix of Biological Knowledge, it seems that we have a massive educational task at hand. If we cannot successfully convince the scientific community that the accumulation of biological knowledge is a worthwhile, legitimate, and necessary scientific endeavor, then the Matrix concept surely has no hope for the future.