[comp.society.futures] looking for work on text `interest score'

ST601716@BROWNVM.BITNET ("Seth R. Trotz") (07/28/89)

Re: Getting a neural network to read newspaper articles.

     It is certainly a good suggestion that a neural network is
well suited to the task of providing a numeric rating for some form of
input. However, you suggest that the network be fed an article as the
input vector. Generally, one uses an input vector with only a handful of
coordinates...Say 10..500 input units, 10..300 hidden units, and then one
output unit if the output is supposed to be a single real number. If the
entire text of an article were fed into the network as a vector of integers
whose length equals the length (in characters) of the article, where each
component of the vector is the ASCII value of the corresponding character,
then the network would have to be huge!! [i.e., the number of input units
equals the length of the largest article you want it to process.] What you
need to do, I would guess, is provide some form of hash function to reduce
the task. Perhaps create a dictionary of 10,000 of the most common words in
the English language. This would cover a good percentage of all words in any
given article. Then, process your article by substituting the number from
1..10,000 for each word. If 1=The and 2=dog and 3=ran, then "The dog ran"
--> (1,2,3). This three-component vector could easily be fed into the
network. You get the advantage that the size of the input layer of the
network is reduced to the length of the article in words. Arguably, you
might then want to examine word groups...to reduce things even further.
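
To make that concrete, here is a quick Python sketch of the word-to-number
encoding (the tiny word list, the padding length, and the lowercasing are
placeholder choices of mine, not anything from a real system):

    # Quick sketch only: encode an article as a vector of word numbers drawn
    # from a fixed dictionary.  The tiny dictionary and the padding length
    # below are made-up stand-ins for the real 10,000-word list.
    COMMON_WORDS = ["the", "dog", "ran", "a", "cat", "sat"]
    WORD_TO_NUMBER = {w: i + 1 for i, w in enumerate(COMMON_WORDS)}  # 0 = unknown
    MAX_WORDS = 8  # stand-in for the longest article (in words) the net accepts

    def encode(article):
        # Substitute each word's dictionary number; unknown words become 0.
        numbers = [WORD_TO_NUMBER.get(w, 0) for w in article.lower().split()]
        # Pad or truncate to a fixed length, since the input layer cannot
        # grow or shrink from one article to the next.
        numbers = numbers[:MAX_WORDS]
        return numbers + [0] * (MAX_WORDS - len(numbers))

    print(encode("The dog ran"))  # -> [1, 2, 3, 0, 0, 0, 0, 0]
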
     But more generally, I would suggest that you examine the standard text
in neural networks, Parallel Distributed Processing by Rumelhart and
McClelland (Chapter 9, I think) to see what kinds of tasks NNs have been
trained to perform successfully. Hope this helps!

                                  Seth Trotz

koreth@panarthea.sun.com (Steven Grimm) (07/28/89)

In article <8907271735.AA05275@multimax.encore.com> ST601716@BROWNVM.BITNET ("Seth R. Trotz") writes:
>input. However, you suggest that the network be fed an article as the
>input vector. Generally, one uses an input vector with only a handful of
>coordinates...Say 10..500 input units, 10..300 hidden units, and then one
>output unit if the output is supposed to be a single real number.

If I were doing this (and it does sound like an interesting project), I
would feed the network a few lines of the message header (subject, from,
and maybe keywords), the first few lines of the message, and the last few
lines, trying to filter out .signatures and quoted text.  Of course, the
network would still have to be rather large, and training time could be
long, but it would be an interesting experiment.
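
Something along these lines might do for that preprocessing step (a rough
Python sketch; the choice of headers, the three-line cutoff, and the
signature/quote heuristics are guesses of mine, not a tested recipe):

    # Rough sketch: keep the Subject:/From:/Keywords: headers plus the first
    # and last few body lines, dropping quoted text and everything after a
    # conventional "-- " signature delimiter.  All cutoffs are arbitrary.
    def extract_features(message, n_lines=3):
        header, _, body = message.partition("\n\n")
        wanted = ("subject:", "from:", "keywords:")
        header_lines = [l for l in header.splitlines()
                        if l.lower().startswith(wanted)]

        body_lines = []
        for line in body.splitlines():
            if line.rstrip() == "--":   # signature starts here; stop reading
                break
            if line.startswith(">"):    # quoted text; skip it
                continue
            body_lines.append(line)

        first = body_lines[:n_lines]
        last = body_lines[-n_lines:] if len(body_lines) > 2 * n_lines else []
        return header_lines + first + last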

---
This message is a figment of your imagination.  Any opinions are yours.
Steven Grimm		Moderator, comp.{sources,binaries}.atari.st
sgrimm@sun.com		...!sun!sgrimm

jps@cat.cmu.edu (James Salsman) (07/29/89)

If you want to evaluate netnews for interest level, just
build a rule-based expert system based on regexp matching.
Whenever you are shown something that you don't want to see,
tell the user interface why (bogus author, boring subject,
too many others with same topic, too long, too many
buzzwords, etc.) and have it store that data as a new or
modified rule.  Even better, after each article the
interface could ask for a critique of the message (Thumbs Up
or Down, and a reason why from a ~10 item menu; maybe a
keyword entry list if the menu item dictates), and the
newsreader's rule base would slowly mutate to suit your
choices.  At the end of the session you could be asked to
verify all of the mutations that you've selected, just in
case you changed your mind.
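
A bare-bones version of that might look like the Python sketch below (every
pattern, weight, and the size of the feedback step are invented for
illustration; they are not rules anyone actually uses):

    # Sketch of regexp-based scoring with a thumbs-up/down feedback step.
    import re

    rules = [
        # [name, regexp applied to the raw article, weight]
        ["boring_subject", re.compile(r"^Subject:.*\b(test|me too)\b", re.I | re.M), -2.0],
        ["bogus_author",   re.compile(r"^From:.*\bflamer\b", re.I | re.M),           -5.0],
        ["buzzwords",      re.compile(r"\b(paradigm|synergy)\b", re.I),              -1.0],
    ]

    def score(article):
        # Sum the weights of every rule whose pattern matches the article.
        return sum(w for _, pattern, w in rules if pattern.search(article))

    def critique(article, thumbs_up, step=0.5):
        # Crude "mutation": for every rule that fired on this article, a
        # thumbs down deepens its penalty and a thumbs up softens it.
        for rule in rules:
            if rule[1].search(article):
                rule[2] += step if thumbs_up else -step

    article = "From: flamer@example\nSubject: me too\n\nparadigm, synergy..."
    print(score(article))               # negative score -> probably skip it
    critique(article, thumbs_up=False)  # reader agreed; the penalties deepen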

In article <8907271735.AA05275@multimax.encore.com> ST601716@BROWNVM.BITNET ("Seth R. Trotz") writes:

>      It is certainly a good suggestion that a neural network is
> well suited to the task of providing a numeric rating for some form of
> input. If the entire text of an article were fed into the network ...
> the network would have to be huge!!

Right.  At CMU some of McClelland's students are working on
connectionist parsing algorithms.  IMHO, they have ignored
the comp-sci theory behind parsing, so not only are they
re-inventing the wheel, but they are taking plenty of time
in doing so, and making up new terms for things compiler
writers have almost standardized.  Talk about a
Tower-of-Babel effect!  There are a few good researchers
starting to emerge in the field of "symbolic connectionism."

> What you need to do, I would guess,
> is provide some form of hash function to reduce the task. Perhaps create
> a dictionary of 10,000 of the most common words in the English language. This
> would cover a good percentage of all words in any given article. 

It would also ignore morphological characteristics of
words, which convey much of the meaning.  Multilevel
parsing <--> planning is the way to go.
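
Even a crude suffix-stripping pass would keep some of that information
around as a separate feature. A throwaway Python sketch (the suffix list is
a made-up fragment, nothing like a real morphological analyzer):

    # Peel off a few common endings so the encoder sees both the stem and the
    # suffix, instead of treating "walk", "walks", and "walking" as three
    # unrelated dictionary entries.
    SUFFIXES = ["ing", "ed", "ly", "s"]

    def split_morph(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)], "+" + suffix
        return word, ""

    print(split_morph("walking"))  # -> ('walk', '+ing')
    print(split_morph("dog"))      # -> ('dog', '')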

:James

Disclaimer:  The University thinks I'm insane, or something.
-- 

:James P. Salsman (jps@CAT.CMU.EDU)

nazgul@apollo.HP.COM (Kee Hinckley) (08/01/89)

In article <33982@grapevine.uucp> koreth (Steven Grimm) writes:
>lines, trying to filter out .signatures and quoted text.  Of course, the

I don't know; I think there might be a heavy correlation with .signatures!
-- 
### User Environment, Apollo Computer Inc. ###  Public Access ProLine BBS   ###
###     {mit-eddie,yale}!apollo!nazgul     ###  nazgul@pro-angmar.cts.com   ###
###           nazgul@apollo.com            ### (617) 641-3722 300/1200/2400 ###
I'm not sure which upsets me more; that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.