Willard McCarty <MCCARTY@vm.epas.utoronto.ca> (02/26/90)
Humanist Discussion Group, Vol. 3, No. 1095. Monday, 26 Feb 1990.

(1)   Date: Thu, 22 Feb 90 17:11 EST (59 lines)
      From: <NEUMAN@GUVAX>
      Subject: Georgetown Catalog of Projects in Electronic Text

(2)   Date: Fri, 23 Feb 90 11:56:21 CST (23 lines)
      From: "Michael S. Hart" <HART@UIUCVMD>
      Subject: Books on disk

(3)   Date: Fri, 23 Feb 90 10:00 CST (65 lines)
      From: John Baima <D024JKB@UTARLG>
      Subject: RE: Annotated e-texts, retrieval

(1) --------------------------------------------------------------------
Date: Thu, 22 Feb 90 17:11 EST
From: <NEUMAN@GUVAX>
Subject: Georgetown Catalog of Projects in Electronic Text

In a recent posting on HUMANIST, Bob Kraft generously mentioned Georgetown's project of maintaining a catalogue of archives and projects in machine-readable text. Because he suggested that a progress report would be welcome, I've compiled the following brief sketch.

Since April of 1989 we have gathered information (in varying degrees of completeness) on 274 projects in twenty-five different countries. Of these projects, 82 emphasize linguistics and language study, while 192 focus on other disciplines in the humanities. Arranged in geographical order, the entries contain ten categories of information:

     1. Identifying Acronym
     2. Name and Affiliation of Operation
     3. Contact Person
     4. Disciplinary Interests
     5. Focus (period, location, individual, or genre)
     6. Language(s) Encoded
     7. Intended Use
     8. Format
     9. Forms of Access
    10. Source(s) of Archival Holdings

Because of the flow of correspondence and the lag time in updating entries, the information is always in a state of flux; therefore, we have been reluctant to distribute obsolescent drafts of the catalogue. Nevertheless, Jean Feerick, our Project Coordinator, responds directly to inquiries about archives or disciplines on which we have information, and we are constructing a database that will support dial-in, on-line access.
We're grateful for the on-going support we've received from Bob Kraft (who provided the initial vision and the original data for the project), Marianne Gaunt and Bob Hollander of the Rutgers-Princeton Project (a major source of information about specific texts in electronic form), Lou Burnard of the Oxford Text Archive (the primary repository of etexts in the humanities), Ian Lancashire and Willard McCarty for the valuable information in the Humanities Computing Yearbook, and the many project directors who have responded to our surveys and follow-up letters. A complete account of our indebtedness would require a separate file on the listserv.

Michael Neuman
Georgetown Center for Text and Technology
Georgetown University
Washington, DC 20057
(202) 687-6096
neuman@guvax.bitnet
neuman@guvax.georgetown.edu

(2) --------------------------------------------------------------31----
Date: Fri, 23 Feb 90 11:56:21 CST
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Books on disk

I note most of the discussion involves books with accompanying disks. It does not seem clear whether the disks are an extension of the books or an exact copy.

At the libraries I work with, the books are available on disk, and all the students and staff have to do is bring a floppy and copy the files to take back to their own machines for research. The licenses include use by all members of the college and the price breaks down to between a penny and a dime per student for the complete works of Shakespeare.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

BITNET: HART@UIUCVMD
INTERNET: HART@VMD.CSO.UIUC.EDU
(*ADDRESS CHANGE FROM *VME* TO *VMD* AS OF DECEMBER 18!!**)
(THE GUTNBERG SERVER IS LOCATED AT GUTNBERG@UIUCVMD.BITNET)

(3) --------------------------------------------------------------69----
Date: Fri, 23 Feb 90 10:00 CST
From: John Baima <D024JKB@UTARLG>
Subject: RE: Annotated e-texts, retrieval

In response to Pieter C.
Masereeuw and Steven DeRose:

Steven DeRose states:

"The features you described are basically the extensions of everyday search tools to ***hierarchical*** documents. For example, in most texts sentences and words are demarcated, but not discourse units above the sentence, nor elements smaller than words, such as morphemes. Any scheme which represents these levels should allow annotations at all levels."

This is precisely what Lbase allows as a search retrieval engine. Lbase supports hierarchical, recursive, multilingual tagged texts. Recursion is a necessary feature for a retrieval engine because recursion is a common feature of such texts. Tags can range from a single character to about 4,000 characters. Lbase allows regular-expression-like searches on the tags, including specifying agreement between tags on different elements (e.g., give me all instances of an infinitive followed by an indicative, but they have to have the same dictionary form). So far, only Greek, Hebrew and Roman alphabets have been supported, although others could be added and probably will be for the next release.

Besides searches, Lbase can also make word concordances based on tags at the word or morpheme level. For example, if one of the tags at the word or morpheme level is the dictionary form, Lbase can make a word concordance based on that dictionary form.

While the search engine of Lbase supports all this, there are a couple of problems with making this practical today. One problem is that Lbase runs under MS-DOS and I am limited by 64k segments. Since a search must often backtrack, this size limitation makes it impractical to search an element that is larger than 64k. Thus it is not practical at this time to allow paragraph or larger elements because they could exceed that limit, although there is no built-in limitation in the search engine.

The second and main problem is that there has never been a standard for encoding such texts.
Thus I have several different "drivers" for the different texts that Lbase knows about, but even the format of these texts changes from time to time without warning. Since I am not on any of the TEI committees, I am eagerly waiting to see what they recommend. Hopefully, they will provide us all with a usable standard. I have had many requests to support brand X file format, but it is simply not economically feasible to support a new format to make one sale. (Lbase has never received any outside funding.)

One other note. While the Summer Institute of Linguistics' "IT" program helps in creating a text that is tagged at the word or morpheme level, it lacks a search engine. Lbase can search these files also. Lbase also supports the TLG and PHI/CCAT CD-ROMs.

If anyone wants more information, please write and I will try to answer.

John Baima
email:silver@utafll.lonestar.org
Silver Mountain Software
7246 Cloverglen Dr.
Dallas, TX 75249
(214) 709-6364
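
The two operations described in message (3), a search that requires agreement between tags on adjacent elements and a concordance keyed on the dictionary form, can be illustrated with a minimal Python sketch. This is not Lbase's implementation, query syntax, or file format, none of which are given in the message. The `Token` record, its field names, the function names, and the toy English data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word-level element with its tags (hypothetical layout)."""
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary form
    mood: str   # e.g. "infinitive", "indicative", or "none"


def find_agreeing_pairs(tokens, first_mood, second_mood):
    """Return (index, first, second) for each adjacent pair whose moods
    match the pattern and whose dictionary forms agree."""
    hits = []
    for i in range(len(tokens) - 1):
        a, b = tokens[i], tokens[i + 1]
        if a.mood == first_mood and b.mood == second_mood and a.lemma == b.lemma:
            hits.append((i, a, b))
    return hits


def lemma_concordance(tokens, width=2):
    """Group every occurrence under its dictionary form, keeping up to
    `width` surface forms of context on each side."""
    concordance = defaultdict(list)
    for i, tok in enumerate(tokens):
        left = " ".join(t.form for t in tokens[max(0, i - width):i])
        right = " ".join(t.form for t in tokens[i + 1:i + 1 + width])
        concordance[tok.lemma].append((left, tok.form, right))
    return dict(concordance)


# Invented toy data; a real tagged text would come from a format-specific driver.
TOKENS = [
    Token("surely", "surely", "none"),
    Token("dying", "die", "infinitive"),
    Token("dies", "die", "indicative"),
    Token("seeing", "see", "infinitive"),
    Token("hears", "hear", "indicative"),
]

if __name__ == "__main__":
    for index, first, second in find_agreeing_pairs(TOKENS, "infinitive", "indicative"):
        print(index, first.form, second.form)
    for lemma, lines in lemma_concordance(TOKENS, width=1).items():
        print(lemma, lines)
```

On the toy data, the pair "dying dies" matches because both tokens share the dictionary form "die", while "seeing hears" has the right moods but fails the agreement test. The sketch handles only a flat sequence of word-level tokens; the hierarchical and recursive structures mentioned in the message would need a tree representation instead.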