bcs@PRC.Unisys.COM (Barry Silk) (11/07/90)
Can anyone point me to existing software, preferably freeware or
shareware, that can examine a free-text file and determine its
components?  That is, I'm looking for software that can build indexes
into a free-text file for words, sentences, and paragraphs -- perhaps
with options to give a user control of how delimiters are defined and
handled.

I'm also interested in locating freeware for noun phrase recognition
and/or part-of-speech tagging.

Any help would be greatly appreciated!!  Thanks in advance!

-----------------------------------------------------------------------
Barry Silk                                        bcs@prc.unisys.com
Unisys - Center for Advanced Information Technology   (215) 648-2509
-----------------------------------------------------------------------
bcs@PRC.Unisys.COM (Barry Silk) (11/15/90)
Thanks to everyone who responded to my query!  Several people had
requested that I share the results of my posting with them.  This
posting summarizes the responses I received.

In article <15508@burdvax.PRC.Unisys.COM> I wrote:
>Can anyone point me to existing software, preferably freeware or
>shareware, that can examine a free-text file and determine its
>components?  That is, I'm looking for software that can build indexes
>into a free-text file for words, sentences, and paragraphs -- perhaps
>with options to give a user control of how delimiters are defined and
>handled.
>
>I'm also interested in locating freeware for noun phrase recognition
>and/or part-of-speech tagging.
>
>Any help would be greatly appreciated!!  Thanks in advance!

Several people gave interesting suggestions for how they would
approach the problem of identifying the components of free-text files.
Their responses are included below.  It looks like we might go with
Kimball Collins' Emacs suggestion for identifying text components,
since Emacs already has definitions for sentences and paragraphs.
There was also a response from a commercial software vendor.

The part-of-speech tagging problem was not addressed by anyone.

--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 10:45:24 EST
From: djm@caen.engin.umich.edu (David Musliner)
To: bcs@prc.unisys.com
Subject: file indexing

If you are working on a UNIX-based machine and no ideal software seems
to be out there, I would suggest you consider hacking your own in Perl.
Perl is a very powerful language that combines the string handling
powers of awk and sed with the language features of C and the C shell.
I have found it ideal for hacking up quick parsers, translators, etc.
It also supports associative arrays, which might be ideal for the
indexes you desire.

I just mention this b/c I don't think too many AI folk know/use Perl.
-->Dave

--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 10:52:23 -0600
From: Robert Goldman <rpg@rex.cs.tulane.edu>
Subject: Looking for IR/Keyword Software
Reply-To: rpg@cs.tulane.edu (Robert Goldman)

Will you please let me know if you get any responses about freeware
for p-o-s tagging?  I know of some stuff done at Xerox which does
tagging, but haven't been able to pry a copy out of them.  Am just
about to give up trying and recode it myself...

There is also something called CLAWS that's been done at the
University of Lancaster, and there may be some work at Bell Labs.

Best,
Robert

--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 12:21:39 cst
From: Sanjiv K. Bhatia <sanjiv@hoss.unl.edu>
Subject: Re: Looking for IR/Keyword Software
Organization: Computing Resource Center, University of Nebraska

Assuming that you have access to BSD Unix, try the commands indxbib.
Works well if your documents are in refer format and you can attach
your own keywords.

Sanjiv

--------------------------------------------------------------------------
Date: Thu, 8 Nov 90 17:05 PST
From: ames!juts.ccc.amdahl.com!kpc00@garp.MIT.EDU (Kimball P Collins)
Subject: Looking for IR/Keyword Software

For simple text components, try the Emacs paradigm: pretend that you
are editing a file, and go back or forward to the beginning or end of
a text component.

>I'm also interested in locating freeware for noun phrase recognition
>and/or part-of-speech tagging.

I'd love to see this also.  Let me know if you find anything.

--------------------------------------------------------------------------
From: jcg27%CAS@pucc.PRINCETON.EDU (Jon Gilliam)
Date: Fri, 09 Nov 1990 07:36 EDT
Subject: Re: Looking for IR/Keyword Software

Hi!

I've worked on two filters: one to recognize sentences (and eventually
other structures), and another, more general-purpose program to
recognize structures (such as headers, cross references, etc.) and
mark them with SGML tags in free text.  All recognition is done from
the format of the text ... i.e., capitalization, indentation, ending
punctuation marks, etc.

The sentence recognition program was never completely debugged (that
part of the work was dropped), but the Structures Recognition program
was completed.  I'll include the technical doc for it here ...

I'm not sure what the status of these programs is with my company,
but if you're interested, I can see what it would take to allow you to
use them.

Talk with you later!

:jon

[interesting technical documentation deleted - please contact author
for details]

--------------------------------------------------------------------------
Date: Fri, 9 Nov 90 18:04:04 EST
From: Tim Bray <tbray@watsol.waterloo.edu>
Subject: Text search software

Saw your note on comp.ai.  This is more or less a commercial message,
so stop now if you're going to be offended.

If you have no luck with shareware/freeware, we're in the business of
selling search/retrieval software that does more or less exactly what
you describe.  If you're interested, drop me a line with your address
and I'll send you some more info on it.

Cheers,
Tim Bray, Open Text Systems

-----------------------------------------------------------------------
Barry Silk                                        bcs@prc.unisys.com
Unisys - Center for Advanced Information Technology   (215) 648-2509
-----------------------------------------------------------------------
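
As a rough illustration of the associative-array indexing idea
suggested in the responses above, here is a minimal sketch.  It is
written in Python rather than Perl and is not code from any of the
respondents.  The function names (split_spans, build_index), the
default delimiter patterns, and the layout of the returned dictionary
are all assumptions made for illustration.  Delimiters are passed in
as regular expressions, so the caller controls how words, sentence
endings, and paragraph breaks are defined.

```python
import re
from collections import defaultdict


def split_spans(text, delimiter, start=0, end=None, keep_delimiter=False):
    """Return (start, end) offsets of the non-blank chunks between delimiters.

    Hypothetical helper: `delimiter` is a regular expression.  With
    keep_delimiter=True the delimiter text (e.g. a full stop) stays
    attached to the chunk that precedes it.
    """
    end = len(text) if end is None else end
    spans = []
    pos = start
    for m in re.compile(delimiter).finditer(text, start, end):
        spans.append((pos, m.end() if keep_delimiter else m.start()))
        pos = m.end()
    spans.append((pos, end))

    # Trim surrounding whitespace; drop chunks that end up empty.
    trimmed = []
    for s, e in spans:
        while s < e and text[s].isspace():
            s += 1
        while e > s and text[e - 1].isspace():
            e -= 1
        if s < e:
            trimmed.append((s, e))
    return trimmed


def build_index(text,
                word=r"[A-Za-z0-9']+",
                sentence_end=r"[.!?]+(?=\s|$)",
                paragraph_break=r"\n[ \t]*\n"):
    """Index paragraphs, sentences, and words by character offset.

    The default patterns are naive assumptions (e.g. "Dr. Smith" would
    be split after "Dr."); pass different patterns to change them.
    """
    paragraphs = split_spans(text, paragraph_break)

    # Sentences are found inside each paragraph, so none spans a break.
    sentences = []
    for p_start, p_end in paragraphs:
        sentences.extend(
            split_spans(text, sentence_end, p_start, p_end, keep_delimiter=True)
        )

    # Dictionary (associative array) from lower-cased word to start offsets.
    words = defaultdict(list)
    for m in re.finditer(word, text):
        words[m.group().lower()].append(m.start())

    return {"paragraphs": paragraphs, "sentences": sentences, "words": dict(words)}


sample = "Perl is handy. Emacs knows sentences!\n\nA second paragraph here."
index = build_index(sample)
for s, e in index["sentences"]:
    print(s, e, repr(sample[s:e]))
```

Storing offsets rather than copies of the text keeps the index small
and lets a caller jump straight to a component in the original file,
which is roughly what moving by sentence or paragraph in an editor
does.  Real text would need better sentence rules than the defaults
shown here.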