walker@GENBANK.BIO.NET (Michael Walker) (02/25/90)
WORKSHOP ANNOUNCEMENT: Artificial Intelligence and Modern Computer Methods in Systematic Biology (ARTISYST Workshop)

The Systematic Biology Program of the National Science Foundation is sponsoring a Workshop on Artificial Intelligence, Expert Systems, and Modern Computer Methods in Systematic Biology, to be held September 9 to 14, 1990, at the University of California, Davis. There will be about 45 participants representing an even mixture of biologists and computer scientists. Expenses for participants will be paid, including hotel (paid directly by the workshop organizers), food (per diem of US $35), and travel (with a maximum of US $500 for travel expenses). Attendance at the workshop is by invitation only.

These are the subject areas for the workshop:

  1. Scientific workstations for systematics;
  2. Expert systems, expert workstations and other tools for identification;
  3. Phylogenetic inference and mapping characters onto tree topologies;
  4. Literature data extraction and geographical data;
  5. Machine vision and feature extraction applied to systematics.

The workshop will examine state-of-the-art computing methods and particularly Artificial Intelligence methods and the possibilities they offer for applications in systematics. Methods for knowledge representation as they apply to systematics will be a central focus of the workshop. This meeting will provide systematists the opportunity to make productive contacts with computer scientists interested in these applications. It will consist of tutorials, lectures on problems and approaches in each area, working groups and discussion periods, and demonstrations of relevant software.

Participants will present their previous or proposed research in a lecture, in a poster session, or in a software demonstration session. In addition, some participants will present tutorials in their area of expertise. Preference will be given to applicants who are most likely to continue active research and teaching in this area.
The Workshop organizers welcome applications from all qualified biologists and computer scientists, and strongly encourage women, minorities, and persons with disabilities to apply.

If you are interested in participating, please apply by sending to the workshop organizers the information suggested below:

  1) your name, address, telephone number, and your electronic mail address, if you have one;
  2) whether you apply as a computer scientist or as a biologist;
  3) a short resume;
  4) a description of your previous work related to the workshop topic;
  5) a description of your planned research and how it relates to the workshop;
  6) whether you, as a biologist (or as a computer scientist), have taken or would like to take steps to establish permanent collaboration with computer scientists (or biologists).

A total of two pages or less is preferred. This material will be the primary basis for selecting workshop participants. If you have software that you would like to demonstrate at the workshop, please give a brief description, and indicate the hardware that you need to run the program. Several PCs and workstations will be available at the workshop.

Mail your completed application to:

    Renaud Fortuner, ARTISYST Workshop Chairman
    California Department of Food and Agriculture
    Analysis & Identification, room 340
    P.O. Box 942871
    Sacramento, CA 94271-0001
    (916) 445-4521
    E-mail: rfortuner@ucdavis.edu

APPLICATIONS RECEIVED AFTER APRIL 15, 1990 WILL NOT BE ACCEPTED

Notification of acceptance of proposal will be made before May 31, 1990.

For further information, contact Renaud Fortuner; Michael Walker, Program Chairman (Walker@sumex-aim.stanford.edu); or a member of the steering committee:

    Jim Diederich, U.C. Davis (dieder@ernie.berkeley.edu)
    Jack Milton, U.C. Davis (milton@eclipse.stanford.edu)
    Peter Cheeseman, NASA AMES (cheeseman@pluto.arc.nasa.gov)
    Eric Horvitz, Stanford University (horvitz@sumex-aim.stanford.edu)
    Julian Humphries, Cornell University (lqyy@crnlvax5.bitnet)
    George Lauder, U.C. Irvine (glauder@UCIvmsa.bitnet)
    F. James Rohlf, SUNY (rohlf@sbbiovm.bitnet)
    James Woolley, Texas A&M University (woolley@tamento.bitnet)

The five subject areas selected for the workshop are described in more detail below.

1 SCIENTIFIC WORKSTATIONS FOR SYSTEMATICS

James Diederich and Jack Milton
Department of Mathematics
University of California
Davis, CA 95616

Recent advances in computing technology are bringing greatly increased computing power to the desk top of the practicing systematist for prices that were unheard of only a few years ago. For example, in mid 1989 one could expect to purchase a 10 to 15 MIPS (million instructions per second) workstation for about $15,000. This machine might have eight megabytes of memory and possess a several hundred megabyte hard disk of its own or be networked to a large file server. Currently workstation manufacturers seem committed to doubling the performance of their workstations and halving the price each year. The continuing dramatic increase in computing power is making compute-intensive software, much of which was until recently only in the domain of mainframe computer users, available and working in a responsive manner on the desk top.

We believe that in the near future this newly available computing power will bring such capabilities to the systematist as networked heterogeneous databases, sophisticated three dimensional modeling capabilities and corresponding databases, and semi-automated assistance in tasks requiring specialized expertise. It seems clear that these advances will change the way the systematist works.
One approach in the area of semi-automated problem solving using modern computing methods involves capturing domain expertise in the form of rules, definitions, classifications, and the like. In conjunction with this, a mechanism for carrying out some form of reasoning (inference) is provided. In some cases the goal is to mimic the behavior of an expert, while in other cases the goal is more modest in that the system does not try to behave exactly as an expert would. Systems implemented under this philosophy are called expert systems and are typically used only by experts.

A different approach, which we take in our work, is to provide a set of tools to assist the scientist (possibly a non-expert in the field and in computer expertise) in carrying out his/her activities. We call such a collection of tools an expert workstation. Some tools may, and often will, be based on knowledge of the domain, and certainly some tools could be expert systems themselves. However, the expert workstation approach does not try to mimic expertise and usually will not be considered as a replacement for expertise. For example, a saw, a hammer, and a chisel form a set of tools to be used by an expert carpenter, but they do not in themselves replace the carpenter's expertise. Exactly how the tools are used will depend on the expertise of the user within the domain as well as on the user's expertise with the tools. Again, the analogy can be drawn between the expert carpenter's use of tools vis-a-vis an apprentice's use.

The "set of tools" approach to handling knowledge representation and inference on specialized problems on a workstation lends itself well to incorporating a broad set of tools, the collection of which is very flexible, with interactions not limited to the vision of an expert system designer. The challenge for systematics seems to be to determine how to best exploit the new technology within the constraints of the resources available.
In particular, how do we go about coordinating a diverse set of tasks on the workstation and making a rich scientific computing environment available and productive to scientists who vary widely both in their knowledge of the domain and in their interest in modern workstations? Also, some disciplines, such as computer aided design (CAD), have used computer tools for many years, and the existence of a large set of disparate tools that do not work well together is a particularly vexing problem. It is an advantage for systematics that we are not saddled with the problem of resolving many different existing standards, but the experience in areas such as CAD indicates that it is of critical importance for other areas to pay careful attention to tool coordination and standards from the outset.

We consider it important to address the question of which tools need to be developed that provide support for fundamental activities in systematics research, have reasonably wide appeal, have long lifetimes, and form the basis for future developments. During the workshop we will have an early panel on "biological tool frameworks". During this panel and continuing into other sessions we will ask participants about the systematist's requirements -- what do you want and need to be able to do at a workstation, what is a tool and what characteristics of tools emanate from the potential uses, how should tools work together, and what are reasonable standards for linking and managing possibly disparate tools?

2 EXPERT SYSTEMS, EXPERT WORKSTATIONS AND OTHER TOOLS FOR IDENTIFICATION

James B. Woolley
Texas A&M University
Department of Entomology
College Station, Texas 77843

Biologists use the terms identification and classification somewhat differently than computer scientists.
Classification to biologists is the process of constructing taxonomies (or the ordering of organisms into groups based on their relationships), and identification is the process of assigning an unknown specimen a place in an existing classification. Given this distinction, the general areas of identification and diagnosis are certainly familiar to workers in artificial intelligence. Expert systems for diagnosis of diseases, for example, were among the first applications of AI and this remains an active field of research and commercial development. However, expert system technology has been little used by biologists for identification of specimens. This may be surprising to AI workers, since at first glance, the problem of identifying biological specimens might seem to be little different from the diagnosis of any other classes of things. Certainly, many of the same difficulties are encountered in identifying biological material, for example, missing, imprecise or ambiguous data, scarcity of special expertise, and so forth.

However, there are some subtle differences between the identification of biological specimens and the identification of other classes of things. With other types of objects, taxonomies can be erected for particular purposes, and they are often constructed with identification in mind. Biological taxonomies are generally based on criteria that may be quite external to identification processes; commonly, they are based on perceived relationships between taxa (groups of organisms). Identifying an unknown specimen involves determining to some level its placement in such a classification. Characters or attributes of organisms that are useful in inferring relationships may not be very suitable for the purposes of identification, and often characters are used for the special purpose of identification that are known to be unreliable indicators of relationship.
Biological classifications are rigidly hierarchical and non-overlapping (that is, at a given level, an organism belongs to only one taxon). Existing tools for identification may or may not use the structure and logic of biological classifications to advantage.

The study of the relationships between organisms is the primary research activity of systematic biologists, and classifications are perhaps the primary product of this research. People are often surprised to learn that this task has not been completed. Far from it: in many plant and animal groups only a small proportion of the species in nature have been formally described and classified. For example, about 750,000 insect species have already been described, but estimates of the number of undescribed species range from another million or so up to 30 million. Obviously, with this many objects the methods used to organize information are critical to the ability to store and retrieve data. Although various approaches exist for classifying organisms, classifications based on evolutionary relationships (phylogenetic history) are generally preferred because they are more informative and robust. The development of explicit methods for the inference of phylogenetic relationships given various kinds of data is an extremely active area of biological research, with wide implications for other fields of biology.

The point is that classifications of organisms provide our only means of organizing the immense amounts of information about organisms, and that the identification of specimens is the critical first step in accessing this information. In many situations, for example interception of potential pests at border stations, precise identifications are critically important. At present, biological identifications are performed by a very small number of people, many of whom also have research and teaching interests.
Because identifications per se are often among the less interesting of one's potential activities, there is widespread interest in developing more efficient methods.

There has been some implementation of expert system tools for biological identification, and examples of these will be presented at the Workshop. It will be of interest to biologists to see examples of successful expert systems now used for identification or diagnosis in other fields. Certainly, workers experienced in artificial intelligence can provide guidance on the types of problems that are suited (and perhaps more importantly, the problems that are not suited) to expert system techniques. Specifically, the following areas are clearly relevant, and would seem to be a common starting point for discussions between systematists and AI workers.

1- An exploration of the methods now available for representation of biological knowledge domains, specifically biological classifications and supporting information, will serve as a foundation for much of the workshop. In this particular context, the ability to incorporate the structure and logic of biological classifications into identification devices should provide means to make them more powerful and robust. Systematists will no doubt find various methods for representing structure in a knowledge domain interesting (frames, semantic networks, etc.), and AI researchers will probably find that these knowledge domains have interesting and perhaps unique properties.

2- Methods for dealing with uncertainty in the identification process are clearly of interest. There are several sources of uncertainty in this context: damaged or incomplete specimens, natural variability among individuals of a species (or other taxon), user uncertainty as to the interpretation of attribute states, etc.
We are aware that fundamentally different methods exist for representing uncertainty in AI (probabilistic methods, decision theory, fuzzy set theory and so forth) and we would like to see their potential in this context explored.

3- There will be concern about the practicality of implementing AI methods in systematics. Because many systematists now use database techniques of some sort (although not always computer-based), an exploration of techniques for rule induction would be useful. Because biological classifications are dynamic and research in many of these taxa is ongoing, methods for revision and update of knowledge domains are of interest. Critical comparisons of commercially available shells for expert systems and related issues (operating systems, hardware) would be useful.

3 PHYLOGENETIC INFERENCE AND MAPPING CHARACTERS ONTO TREE TOPOLOGIES

George V. Lauder
School of Biological Sciences
University of California, Irvine
BITNET: GLauder@UCIvmsa

Two of the key interests of evolutionary biologists are (1) reconstructing evolutionary pathways or sequences of change in particular features of organisms, and (2) reconstructing genealogical relationships among organisms (also called phylogenies or evolutionary trees).

As an example of the first interest, an evolutionary biologist might wish to understand the origin of flight in birds. What was the historical sequence of modifications in the muscles and the skeleton that occurred to allow early birds to fly? One might be able to make several a priori predictions about necessary morphological changes for flight (such as lightening of bones, reorientation of muscles to make the upstroke and downstroke of the wings possible, and lengthening of the arm bones to increase surface area). But how exactly did the sequence of modifications occur in evolution to produce early birds with flight? Did skeletal lightening occur before arm elongation, or was muscle reorientation the first change that occurred?
An example of the second interest would be an evolutionary biologist who simply wanted to reconstruct the genealogical relationships among 20 species of birds. How is each species genealogically related to each other species? In other words, the biologist is interested in reconstructing the tree that describes the genetic relationships among the species.

Both areas might benefit greatly from input from AI specialists. Most of the data sets gathered by workers in evolutionary biology generate either large numbers of trees or produce ambiguous reconstructions of evolutionary history. The number of trees may be so great or the ambiguity so extensive that assistance is needed in summarizing significant features of the trees and key aspects of character evolution.

A Reconstructing historical pathways of change in individual characters.

Perhaps it would be most useful if the concrete (albeit hypothetical) example of bird flight is used to examine the potential application of artificial intelligence to systematic and evolutionary biology and the difficulties now faced by systematic biologists in trying to interpret the evolution of characters. If we are given a tree that represents the genealogical relationships of a group of four bird species and wish to understand how the evolution of particular characters has occurred, we may want to reconstruct the evolution of those characters on the tree. Typically, such analyses begin with a "taxon-by-character" data matrix:

                     Morphological feature
                     1     2     3     4
    Taxon:
      species 1      A'    B     C'    D
      species 2      A     B'    C     D'
      species 3      A     B'    C     D
      species 4      A'    B     C'    D'

where there are four species of birds, each one a row of the data matrix, and four morphological features (1 to 4), each indicated by a different letter. Feature 1 could be the length of the arm bones, and a short arm could be represented by the letter A and a long arm by the letter A'.
Feature 2 could represent the weight of the skeleton with B indicating a light skeleton and B' a heavy skeleton. These characters would be determined by an examination of each species and weighing and measuring of the muscles and bones. How might we reconstruct the evolution of these characters in birds given this distribution of species and morphological features? If we have available a phylogenetic tree that indicates genealogical relationships as follows:

    species   4     3     2     1
              |     |     |     |
              |     |     |     |
              |     |     -------
              |     |        | (Z)
              |     |        |
              |     ----------
              |          | (Y)
              |          |
              ------------
                    | (X)
                    |

(note that time runs up the page, and that nodes are indicated by the letters X, Y, and Z) and we accept this tree for the moment as a true depiction of the genealogical relationships of the four species of birds, then it is possible to reconstruct the evolution of any specific character on this tree.

Unfortunately, it can easily be seen that there are two ways to reconstruct the evolution of morphological feature 1. Bird species 1 and 4 share feature A' while species 2 and 3 share feature A. Nodes Y and Z could be reconstructed as having A' and a total of 2 evolutionary steps (A' to A) would occur in species 2 and 3, or one could reconstruct nodes Y and Z as having feature A and 2 evolutionary steps would be required (A' to A from node X to Y, and A to A' from node Z to species 1). The key dilemma is that it is possible to reconstruct the evolution of this character in two very different ways, both of which require the same number of steps. This is of course true for each character that is not completely consistent with the given tree. When forty or fifty morphological features are used in a study it is clear that any attempt to understand the evolution of these characters is made extremely difficult by our inability to examine the patterns of variation in reconstructed character evolution and exactly how each character has changed.
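The two reconstructions of feature 1 described above can be checked by brute force. The following Python sketch is only an illustration, not a method proposed for the workshop: the names EDGES, TIPS and reconstructions are hypothetical, the tree is the one drawn above, and node X is fixed to A' because the text describes the change from X to Y as "A' to A". It tries every assignment of A or A' to the internal nodes and keeps those needing the fewest steps.

```python
from itertools import product

# Tree from the text: X is the root, X -> (species 4, Y),
# Y -> (species 3, Z), Z -> (species 2, species 1).
EDGES = [("X", "sp4"), ("X", "Y"), ("Y", "sp3"),
         ("Y", "Z"), ("Z", "sp2"), ("Z", "sp1")]
INTERNAL = ["X", "Y", "Z"]

# Feature 1 from the data matrix: A' (long arm) in species 1 and 4.
TIPS = {"sp1": "A'", "sp2": "A", "sp3": "A", "sp4": "A'"}


def reconstructions(tips, root_state=None, states=("A", "A'")):
    """Return (fewest steps, internal-node assignments that reach it).

    Exhaustive search over all assignments; fine for a toy tree, not
    meant to scale. If root_state is given, node X is fixed to it
    (an assumption of this sketch).
    """
    scored = []
    for combo in product(states, repeat=len(INTERNAL)):
        assign = dict(zip(INTERNAL, combo))
        if root_state is not None and assign["X"] != root_state:
            continue
        full = {**tips, **assign}
        # One step for every branch whose two ends differ.
        steps = sum(full[a] != full[b] for a, b in EDGES)
        scored.append((steps, assign))
    best = min(steps for steps, _ in scored)
    return best, [assign for steps, assign in scored if steps == best]


if __name__ == "__main__":
    steps, recs = reconstructions(TIPS, root_state="A'")
    print("fewest steps:", steps)
    for rec in recs:
        print(rec)
```

With X fixed to A', the search finds the two 2-step reconstructions named in the text (Y and Z both A', or Y and Z both A). If the state at X is left free (root_state=None), a third 2-step reconstruction appears in which X, Y and Z all have A.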
For a biologist interested in the evolution of a structure such as the bird wing, this ambiguity in reconstruction and our inability to conveniently summarize divergent results makes interpreting the history of morphological changes extremely difficult. It would be of considerable interest to know, for example, if the ancestor of each of the four bird species above possessed the character A', a long arm, or if the process of morphological evolution is reconstructed as having involved an ancestor with a long arm that became shorter in the ancestors of species 2 and 3 and then became long again in species 1. It seems likely that artificial intelligence techniques could be used to summarize the possible reconstructions of characters on a given tree and present a summary of the major patterns of variation. Ideally, information on the nature of the characteristics of the species could be added so that changes in arm bone characters could be evaluated independently of changes in skull characters.

The central problem in character reconstruction is that too many possibilities exist to allow an easy understanding or visualization of the major patterns of character evolution. Evolutionary biologists need approaches and techniques that will permit visualization and an overall perspective on the transformation of characters on trees.

B Reconstructing genealogical relationships

The above discussion has assumed that a particular phylogenetic tree is given to work with. But a similar general problem to that encountered above, and one that could greatly benefit from contributions of artificial intelligence, occurs when we attempt to reconstruct such trees in the first place. Given a particular taxon-by-character data matrix, there may be many trees at the shortest length. The most commonly accepted criterion for choosing a tree is that the tree should involve the smallest number of evolutionary steps (i.e., select the tree that has the shortest total length).
But a given data matrix may produce many (hundreds) of equally short trees. Any contribution that artificial intelligence techniques could make to summarizing this variation in tree topologies would be of great help to evolutionary biologists in their attempts to reconstruct genealogical patterns among organisms.

4 LITERATURE DATA EXTRACTION AND GEOGRAPHICAL DATA

Julian Humphries
Cornell University
Ithaca, NY 14850

There are two general areas where we anticipate that artificial intelligence techniques will be useful in this arena. There exists in the natural history museums an enormous reservoir of information about the distributions of organisms. Much of this information is being transferred to computerized databases, potentially enhancing the process of understanding species distributions. Unfortunately, much (if not most) of these data are accompanied not by precise descriptors of where the specimens were collected but by anecdotal descriptions of how the collector got to the site (e.g., 12 airmiles NNW of the intersection of State Hwy 12 and US 1). Such locality depictions may actually refer to a very precise place, yet without first hand knowledge of the area most researchers will have to resort to maps to actually determine the location. There are literally millions of collections which under our current system would need anywhere from 1-30 minutes each to determine a latitude and longitude (or other standardized coordinate system). Such a daunting task means that it will only be attempted when an individual researcher needs data for a particular taxon.

It is hoped that sufficient rules about places (i.e., cities, towns, counties, highways, geographic features, etc.) on earth could be tabulated to at least semi-automate this process. The basic task would be one of parsing the anecdotal descriptions, deciphering an approximate or actual latitude and longitude from the data, determining a level of reliability or resolution (i.e.,
localities where the data consist solely of "New World" should translate into an equally unresolved 'exact' description), perhaps showing a plot of the locality to a human operator on a bit-mapped screen, and finally storing the result into an extensible database. Although this still requires human intervention, the process should be orders of magnitude faster and ultimately more accurate.

There are a number of confounding factors. Our data have a significant temporal component, having been accumulated over more than 100 years. As such, the underlying knowledge base will need to know what "Germany" means in all its various historical contexts. Because we are dealing with biological objects, this too will taint the descriptions. As an example, many collections will be precisely located, but refer to a transect rather than a point (e.g., an oceanic trawl). The need for a measure of the resolution achieved cannot be stressed too strongly. Unless we have some indication of when the process failed we will have little faith in the translation.

The other related subject also concerns data extraction, but in a more abstract context and certainly not restricted to systematics. The process of accumulating knowledge is complicated, but at some level we acquire pieces of information from some larger construct. As the scientific knowledge base expands we are forced to spend larger amounts of time in the process of simply acquiring information (prior to any processing). Most of us have at times wished someone (or something) could scan our journals for us and tell us which articles we need to read. It seems to me that I could, with time, build up a framework describing my preferences, needs, priorities and other individual aspects of my research program. Such a framework could be used as a set of rules to guide an "AIde" program which would glean the "important" information from the original literature. Note that I am begging the question of how these data will be represented.
I presume that in coming years a greater proportion of our source material will be in some machine readable form; we need to have the tools ready to take advantage of that situation.

5 MACHINE VISION AND FEATURE EXTRACTION APPLIED TO SYSTEMATICS

F. James Rohlf
Department of Ecology and Evolution
State University of New York
Stony Brook, NY 11794-5245

Most data presently used in systematics are collected through the visual examination of specimens. Features are usually found by the visual comparison of specimens and most measurements are taken visually. These activities can be quite time consuming. Thus there is the potential for saving a systematist's time if appropriate hardware and software were available that would enable routine measurements to be made automatically. This would permit more extensive large-scale quantitative studies.

But automation is difficult in systematics since the features to be measured are usually not easily separated from the background, i.e., the visual scene is often cluttered, and the structures of interest may not have distinct colors or intensities as in many industrial applications of image analysis. The problem is especially difficult for certain groups of organisms. The problem is further complicated by biological variability. One usually cannot depend upon homologous structures having consistent geometrical features that can be used to automatically identify landmarks of interest. Other important complications are that most structures of interest are 3-dimensional and that the "texture" of surfaces often contains taxonomically useful information. Both aspects are difficult to capture with presently available hardware and software.

For these reasons present applications of image analysis in systematics have been quite modest. In studies where data are recorded automatically, time is spent simplifying the image.
For example, structures of interest are physically separated from the rest of the specimen and placed upon a contrasting plain background so the outline can be found with little error. Alternatively, an investigator can identify structures of interest by pointing to them with a mouse, watching how a program finds an outline, and then editing the trace if necessary. Working from this outline, additional landmarks can be identified by the operator. In some cases these landmarks can be associated with geometrical features of the outline and it will be possible for the software to help the operator to accurately locate these points.

Due to the difficulty of solving the general problems of the automatic analysis of complex biological scenes, a more immediate goal should be to develop powerful tools that a systematist can interact with to isolate structures, locate landmarks, and compute various measurements. In addition, it would be desirable for the software to "learn" how to recognize the structures so that the process will go faster as both the software and the systematist become more experienced.

Once the structures and landmarks have been found they are usually recorded so that, if necessary, additional measurements can be made without having to go back to the original image. These are usually in the form of x,y-coordinates of landmarks or chain-coded outlines. For very large studies, methods to compress this raw descriptive information need to be used.

The features that are measured are usually the same types of features that would have been measured by hand -- 2-dimensional distances between landmarks or angles between pairs of landmarks. In some studies the features used are parameters from functions (such as Fourier series, cubic splines, or Bezier curves) fitted to the shapes of structures or of entire outlines of organisms. More work is needed to develop new types of features and to evaluate the implications of their use relative to traditional methods.
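As a rough illustration of the recorded data mentioned above, the following Python sketch expands a chain-coded outline into x,y-coordinates and computes a distance and an angle from landmark coordinates. It rests on assumptions not stated in the text: the 8-direction Freeman convention (code 0 is one step in +x, with codes increasing counterclockwise), and an angle measured at one landmark between the directions to two others. The names decode_chain, distance and angle_at are hypothetical.

```python
import math

# Assumed 8-direction Freeman convention: code 0 = +x, counterclockwise.
MOVES = [(1, 0), (1, 1), (0, 1), (-1, 1),
         (-1, 0), (-1, -1), (0, -1), (1, -1)]


def decode_chain(start, codes):
    """Expand a chain-coded outline into a list of (x, y) points."""
    x, y = start
    points = [(x, y)]
    for code in codes:
        dx, dy = MOVES[code]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def distance(p, q):
    """2-dimensional Euclidean distance between two landmarks."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the directions to p and q."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    deg = abs(math.degrees(a1 - a2)) % 360
    return min(deg, 360 - deg)


if __name__ == "__main__":
    # A 2 x 2 square outline traced counterclockwise from the origin.
    outline = decode_chain((0, 0), [0, 0, 2, 2, 4, 4, 6, 6])
    corner_a, corner_b, corner_c = outline[0], outline[2], outline[6]
    print(outline)
    print(distance(corner_a, outline[4]))
    print(angle_at(corner_a, corner_b, corner_c))
```

Storing the start point and the list of codes is enough to rebuild the outline later, which is what allows additional measurements to be made without returning to the original image.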