[comp.ai] Looking for IR/Keyword Software

bcs@PRC.Unisys.COM (Barry Silk) (11/07/90)

Can anyone point me to existing software, preferably freeware or
shareware, that can examine a free-text file and determine its
components?  That is, I'm looking for software that can build indexes
into a free-text file for words, sentences, and paragraphs -- perhaps
with options to give a user control of how delimiters are defined and
handled.

I'm also interested in locating freeware for noun phrase recognition
and/or part-of-speech tagging.

Any help would be greatly appreciated!!  Thanks in advance!

-----------------------------------------------------------------------
Barry Silk				             bcs@prc.unisys.com
Unisys - Center for Advanced Information Technology  (215) 648-2509
-----------------------------------------------------------------------

bcs@PRC.Unisys.COM (Barry Silk) (11/15/90)

Thanks to everyone who responded to my query!  Several people had
requested that I share the results of my posting with them.  This
posting summarizes the responses I received.

In article <15508@burdvax.PRC.Unisys.COM> I wrote:
>Can anyone point me to existing software, preferably freeware or
>shareware, that can examine a free-text file and determine its
>components?  That is, I'm looking for software that can build indexes
>into a free-text file for words, sentences, and paragraphs -- perhaps
>with options to give a user control of how delimiters are defined and
>handled.
>
>I'm also interested in locating freeware for noun phrase recognition
>and/or part-of-speech tagging.
>
>Any help would be greatly appreciated!!  Thanks in advance!

Several people gave interesting suggestions for how they would
approach the problem of identifying the components of free-text files.
Their responses are included below.  Looks like we might go with
Kimball Collins' Emacs suggestion for identifying text components,
since Emacs already has definitions for sentences and paragraphs.

There was also a response from a commercial software vendor.  The
part-of-speech tagging problem was not addressed by anyone.

--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 10:45:24 EST
From: djm@caen.engin.umich.edu (David Musliner)
To: bcs@prc.unisys.com
Subject: file indexing


If you are working on a UNIX based machine and no ideal software seems
to be out there, I would suggest you consider hacking your own in
Perl.  Perl is a very powerful language that combines the string handling
powers of awk and sed with the language features of C and the C shell. 
I have found it ideal for hacking up quick parsers, translators, etc.
It also supports associative arrays, which might be ideal for the indexes
you desire.

I just mention this b/c I don't think too many AI folk know/use Perl.

-->Dave
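
The associative-array idea can be sketched as follows.  This is only an
illustration, written in Python (a dict plays the role of Perl's
associative array) and not code from Dave's message.  The function name
build_word_index and the default delimiter pattern are assumptions made
for the example; the delimiter pattern is a parameter so that a user can
control how delimiters are defined.

```python
import re
from collections import defaultdict

def build_word_index(text, delimiters=r"[\s.,;:!?()\"]+"):
    """Map each lowercased word to the character offsets where it starts.

    `delimiters` is a regular expression describing what separates words;
    the default here is an assumption, not a standard.
    """
    index = defaultdict(list)
    start = 0
    for match in re.finditer(delimiters, text):
        # Anything between the previous delimiter and this one is a word.
        if match.start() > start:
            index[text[start:match.start()].lower()].append(start)
        start = match.end()
    # Pick up a final word that is not followed by a delimiter.
    if start < len(text):
        index[text[start:].lower()].append(start)
    return dict(index)

if __name__ == "__main__":
    sample = "The cat saw the dog. The dog ran."
    for word, offsets in sorted(build_word_index(sample).items()):
        print(word, offsets)
```

Passing a different pattern, for example delimiters=r"\s+", keeps
hyphenated or punctuated tokens together.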
--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 10:52:23 -0600
From: Robert Goldman <rpg@rex.cs.tulane.edu>
Subject: Looking for IR/Keyword Software
Reply-To: rpg@cs.tulane.edu (Robert Goldman)

Will you please let me know if you get any responses about freeware
for p-o-s tagging?  I know of some stuff done at Xerox which does
tagging, but haven't been able to pry a copy out of them.  Am just
about to give up trying and recode it myself...

There is also something called CLAWS that's been done at the
University of Lancaster, and there may be some work at Bell Labs.

Best,

Robert
--------------------------------------------------------------------------
Date: Wed, 7 Nov 90 12:21:39 cst
From: Sanjiv K. Bhatia <sanjiv@hoss.unl.edu>
Subject: Re: Looking for IR/Keyword Software
Organization: Computing Resource Center, University of Nebraska

Assuming that you have access to BSD Unix, try the indxbib command.  It
works well if your documents are in refer format and you can attach your
own keywords.

Sanjiv
--------------------------------------------------------------------------
Date: Thu, 8 Nov 90 17:05 PST
From: ames!juts.ccc.amdahl.com!kpc00@garp.MIT.EDU (Kimball P Collins)
Subject: Looking for IR/Keyword Software

For simple text components, try the emacs paradigm: pretend that you
are editing a file, and go back or forward to the beginning or end of
a text component.

   I'm also interested in locating freeware for noun phrase recognition
   and/or part-of-speech tagging.

I'd love to see this also.  Let me know if you find anything.
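
A rough sketch of the Emacs-style approach follows.  It is an
approximation written in Python, not Emacs code: it assumes that a
sentence ends at ".", "?" or "!" (optionally followed by closing quotes
or brackets) and then two spaces, a tab, or a line end, and that
paragraphs are separated by blank lines.  The names paragraph_spans and
sentence_spans are made up for the example.

```python
import re

# Assumed rules, loosely modeled on the Emacs defaults described above.
SENTENCE_END = re.compile(r"[.?!][\])}\"']*(?=  |\t|\n|$)")
PARAGRAPH_BREAK = re.compile(r"\n[ \t]*\n\s*")

def paragraph_spans(text):
    """Return (start, end) offsets of blank-line-separated paragraphs."""
    spans = []
    start = 0
    for brk in PARAGRAPH_BREAK.finditer(text):
        if brk.start() > start:
            spans.append((start, brk.start()))
        start = brk.end()
    # Last paragraph, ignoring trailing whitespace.
    if text[start:].strip():
        spans.append((start, len(text.rstrip())))
    return spans

def sentence_spans(text, start=0, end=None):
    """Return (start, end) offsets of sentences within text[start:end]."""
    end = len(text) if end is None else end
    spans = []
    pos = start
    for match in SENTENCE_END.finditer(text, start, end):
        spans.append((pos, match.end()))
        # Move to the first non-blank character of the next sentence.
        pos = match.end()
        while pos < end and text[pos].isspace():
            pos += 1
    # Trailing text with no sentence-ending punctuation.
    if pos < end:
        spans.append((pos, end))
    return spans

if __name__ == "__main__":
    sample = "First one.  Second one?\nThird.\n\nNew paragraph here."
    for p_start, p_end in paragraph_spans(sample):
        for s_start, s_end in sentence_spans(sample, p_start, p_end):
            print((s_start, s_end), repr(sample[s_start:s_end]))
```

Requiring two spaces after the period means that an abbreviation such as
"Dr. Smith" does not end a sentence, at the cost of missing sentences in
text typed with single spaces.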
--------------------------------------------------------------------------
From: jcg27%CAS@pucc.PRINCETON.EDU (Jon Gilliam)
Date: Fri, 09 Nov 1990 07:36 EDT
Subject: Re:  Looking for IR/Keyword Software

Hi!

  I've worked on two filters, one to recognize sentences (and
eventually other structures) and another, more general purpose
program, to recognize structures (such as headers, cross references,
etc. ) and mark them with SGML tags in free text.  All recognition
is done from the format of the text ... i.e., capitalization,
indentation, ending punct marks, etc.

  The sentence recognition program was never completely debugged
(that part of the work was dropped), but the Structures Recognition
program was completed.  I'll include the technical doc for it here ...
I'm not sure what the status of these programs is with my company,
but if you're interested, I can see what it would take to allow you
to use them.

Talk with you later!
:jon

[interesting technical documentation deleted - please contact author
for details]
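
To give a flavor of format-based recognition, here is a toy sketch in
Python.  It is not Jon's program (whose documentation is omitted above)
and makes its own assumptions: a one-line block that is short, has no
ending punctuation, and is in all capitals or title case is treated as a
header; everything else is a paragraph.  The tag names <head> and <p> and
the 60-character limit are hypothetical choices for the example.

```python
import re

def looks_like_header(line, max_length=60):
    """Guess from format alone whether a line is a header (heuristic)."""
    stripped = line.strip()
    if not stripped or len(stripped) > max_length:
        return False
    if stripped[-1] in ".?!,;:":
        return False
    return stripped.isupper() or stripped.istitle()

def tag_structures(text):
    """Wrap blank-line-separated blocks in SGML-style tags."""
    tagged = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        if len(lines) == 1 and looks_like_header(lines[0]):
            tagged.append(f"<head>{lines[0].strip()}</head>")
        else:
            body = " ".join(line.strip() for line in lines)
            tagged.append(f"<p>{body}</p>")
    return "\n".join(tagged)

if __name__ == "__main__":
    print(tag_structures("INTRODUCTION\n\nThis is the first\nparagraph.\n"))
```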
--------------------------------------------------------------------------
Date: Fri, 9 Nov 90 18:04:04 EST
From: Tim Bray <tbray@watsol.waterloo.edu>
Subject: Text search software

Saw your note on comp.ai.  This is more or less a commercial message, so
stop now if you're going to be offended.  If you have no luck with
shareware/freeware, we're in the business of selling search/retrieval
software that does more or less exactly what you describe.  If you're
interested, drop me a line with your address and I'll send you some more
info on it.

Cheers, Tim Bray, Open Text Systems
-----------------------------------------------------------------------
Barry Silk				             bcs@prc.unisys.com
Unisys - Center for Advanced Information Technology  (215) 648-2509
-----------------------------------------------------------------------