B_FOLEY%UVMVAX@PUCC.PRINCETON.EDU (03/19/91)
The UNIX/MAC/DOS/VMS/GUI debate has been quite interesting and somewhat informative. I would like to bring up another topic that won't be solved by open discussion, but that people should give some thought to: the data we work with. From the length of the discussions I have seen on minor changes in GenBank data format, I am sure that opinions will run strongly in many directions on database issues. This list should be a good conduit for exposing some of those directions.

IMHO we need to be more concerned with getting good data into machine-readable form than we are about what format the data is in. Reformatting IS a major pain, but having a database in a bad format that can be reformatted by machine is better than not having a database at all.

EMBL/GenBank/PIR are wonderful resources that are tremendously needed and used by many biologists/biochemists, but they are only the tip of an iceberg that is rapidly coming into view. These databases are quite up to date at storing the data that they set out to store, but one can imagine storing much more information in database form. Examples:

1) It might be useful to be able to search for all genes known to be expressed in liver in response to insulin. A lot of this type of information is known, but it may never get cross-indexed to the existing database entries for those genes.

2) It would be useful to be able to find out what mutagenic agent (if known) was responsible for each mutation site noted in EMBL/GenBank. This data is not stored.

3) It would be nice to have a database of known secondary structures of RNAs such as tRNAs and self-splicing introns. But it would be another leap to have this information "mapped" onto the existing database entries for those structural RNAs and the genes that encode them. Looking at a GenBank entry for a tRNA right now gives you no clue whether the secondary structure is known and stored somewhere.

4) A look through LiMB shows that hundreds of micro-databases are springing up. Is anyone thinking about a grand scheme to plan them so that they can be cross-indexed or linked at some future point? I know it took GenBank quite some effort just to put the E. coli K12 map positions onto each E. coli gene in GenBank.

5) The Human Genome Project will generate enough raw data that we may not have time to enhance the existing data we have. Should some sort of survey be taken to see what priority gets put on the various types of data?

6) We are making big progress in Artificial Intelligence fields. Are current journal publications being edged into a format that might facilitate machine reading or at least machine-searching of journals? I know that what I can do with MEDLINE now is a vast improvement over scanning Current Contents, but I think even MEDLINE is quite crude compared to what is potentially possible.

7) Should anyone be trained in Bio-Information? Should we make database design and maintenance a science in itself? Should graduate students in biological sciences be forced to take a computer science class?

I could go on, but I think I have said too much already. I would like to end with a round of applause for GenBank, EMBL, PIR and all of the databases; for Don Gilbert of IUBIO, Dan Davison of UH-gene-server, Rob Harper of FINFUN, and all the other very helpful people in NET-LAND; for John Devereux of UW-GCG, Jim Ostell of IBI/NBRF, Amos Bairoch of Pro-Site and all the other programmers!!!!! I hope to be of help someday soon too.

Naturally;
Brian Foley B_FOLEY@UVMVAX.UVM.EDU

gilbertd@cricket.bio.indiana.edu (Don Gilbert) (03/20/91)
Perhaps Jim Ostell or someone from NCBI will comment on what they are working on, as it addresses several of your points on advancing the organization of molecular biology data. There is also an RNA database group working toward one of your points. You can expect to see some of these projects forthcoming in the next few months, I believe.

-- Don
--
Don Gilbert gilbert@bio.indiana.edu
biocomputing office, biology dept., indiana univ., bloomington, in 47405