[bionet.software] WAIS

jones@THINK.COM (Robert Jones) (06/25/91)

As a follow up to Rob Harper's messages about WAIS (Wide Area Information
Servers) here is the scoop on the two molecular biology sources that are
currently available.


NIH Guide (source NIH)

   This is a chunk of the NIH Guide to RFAs and Announcements covering a
   few months.

GenBank (source MOLBIO)

   This is the TEXT component of the Bacterial Division of GenBank 
   (Release 65 if I remember correctly). Sequences are not included as
   doing text search on sequences isn't too useful.


I'm responsible for preparing these. Note that they are currently NOT
supported, so use them to try out the system, but don't expect them to be
kept up to date for a while.

WAIS provides a consistent mechanism for a variety of interfaces (X11,
gmacs, Mac, Unix shell) on various machines to access a variety of
information servers on various machines (Macs, Unix boxes, Connection
Machines) - transparency is the key - you don't need to know where the
database is or what it's running on.

Currently the search engines that are available for different machines vary
in one important feature. The UNIX and Mac search engines that we provide
(public domain) simply index a file and search for keywords. The Connection
Machine Text Retrieval software is a commercial package that is quite
sophisticated. In particular, it provides Relevance Feedback.

By way of a simplistic example of the value of Relevance Feedback
consider searching a text database for the keyword HIV. You would miss
those articles that referred to AIDS but not HIV. With relevance feedback
the user has the option of examining the articles found by the 'HIV'
search, marking some of these as being "What I'm interested in" and then
asking for more articles "like the ones I've marked". The software examines
the full text of the marked articles and extracts common terms including,
one might expect, the term 'AIDS'. This new set of terms is used to rescreen
the database and any new matches are presented to the user. As biology
abounds with nomenclature issues like this I feel that this approach has a
lot of value for our community. 
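
For anyone who wants to see the idea in miniature, here is a toy sketch of
the HIV/AIDS example in Python. It is NOT how the Connection Machine
software works internally; the function names, the stopword list, the
sample articles and the "term must appear in at least two marked articles"
rule are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "of", "in", "and", "to", "is", "a"}

# Made-up sample articles, keyed by a made-up article id.
DOCS = {
    "a1": "HIV infection is the cause of AIDS in patients",
    "a2": "Replication of HIV in T cells and progression to AIDS",
    "a3": "AIDS epidemiology in urban populations",
    "a4": "Lambda phage replication in E. coli",
}

def tokenize(text):
    """Lowercase the text and split it into words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def keyword_search(docs, keyword):
    """Plain keyword search: ids of articles that contain the keyword."""
    kw = keyword.lower()
    return [doc_id for doc_id, text in docs.items() if kw in tokenize(text)]

def common_terms(docs, marked_ids, min_docs=2):
    """Terms that occur in at least min_docs of the marked articles."""
    counts = Counter()
    for doc_id in marked_ids:
        counts.update(set(tokenize(docs[doc_id])))
    return sorted(t for t, n in counts.items() if n >= min_docs)

def more_like_marked(docs, marked_ids, already_seen, min_docs=2):
    """Rescreen the database with the common terms; return new matches only."""
    terms = set(common_terms(docs, marked_ids, min_docs))
    hits = []
    for doc_id, text in docs.items():
        if doc_id in already_seen:
            continue
        overlap = terms & set(tokenize(text))
        if overlap:
            hits.append((len(overlap), doc_id))
    # Best overlap first, ties broken by article id.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits]

first_pass = keyword_search(DOCS, "HIV")                    # a1 and a2
extra = more_like_marked(DOCS, first_pass, set(first_pass))  # a3
```

The first pass finds only the two articles that mention HIV. Marking both
yields the common terms 'aids' and 'hiv', and the rescreen then picks up the
article that mentions AIDS alone while still skipping the phage article.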

Most of the user interfaces that are around provide the option of relevance
feedback even if the information server does not support it. This can be
confusing - check the source that you want to work with. Both the biology
sources reside on a Connection Machine here in Cambridge.

I hope to be able to spend more time on this project over the summer.
Specifically I want to set things up so that the NIH Guide is automatically
updated. Doing the same for the GenBank server is somewhat more complicated
and it takes up a lot of space (I used to have all of GenBank and PIR on
there) but I'll endeavour to do the same. Text-rich databases like Amos
Bairoch's are excellent candidates for WAIS databases.

Play with it ... try out the variety of databases that are available on all
sorts of topics ... let us know what you think 

regards

--Robert Jones   Thinking Machines Corporation    jones@think.com