[comp.databases] Fault-tolerant Information Recall

hargrove@bee.corp.sgi.com (Mark Hargrove) (04/07/90)

In article <1990Apr3.200220.9513@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:


   Fault-tolerant searches always succeed. FT searches return a 
   list of APPROXIMATE matches, of which the EXACT (or correct)
   match may or may not be present in this list.  A common example
   of fault-tolerant searching is a spelling checker.  Most spelling
   checkers will always return a list of APPROXIMATE matches, of which
   the EXACT (or correct) spelling may or may not be present in the list.

This is a powerful notion, and certainly non-trivial.  As a long-time
user of SQL and several other QBF applications, I've spent many an hour
constructing "approximate" searches using regex() (or similar) style
patterns.  A query language that did approximate matching for me
automatically is a bit scary.  How would you control the degree of
"approximation"?  I could argue that every row in a database was
approximately correct, since it contained Roman characters or Arabic
digits.  I might be able to conceive of a circumstance where this is
desirable, but for *my* work, it wouldn't be useful.

How do you intend to define "approximately"?

endrizzi@sctc.com (Michael Endrizzi ) (04/09/90)

hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

>In article <1990Apr3.200220.9513@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

>constructing "approximate" searches using regex() (or similar) style
>patterns.  A query language that did approximate matching for me
>automatically is a bit scary.  How would you control the degree of

You are not the only one to express this.  I cannot figure out why.
If I said that I have a spelling checker that will tell you which
word is wrong but WON'T offer alternatives that are "close" to the
word in question, would you find this useful???

This is what current database technology offers. If you are not
sure that the data or the query is perfect, there is no way to locate
information that is "approximate" to the incomplete and/or incorrect
information that you feed it.
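
For illustration only, here is a rough sketch (in Python) of the kind
of approximate lookup I mean.  The vocabulary and cutoff are invented,
and difflib is just a stand-in for the idea, not our actual matcher:

    import difflib

    # A toy vocabulary standing in for a database column.
    vocabulary = ["chrysler", "toyota", "reliant", "camry"]

    def suggest(word, cutoff=0.6):
        """Return APPROXIMATE matches; the EXACT word may or may
        not be among them, just like a spelling checker."""
        return difflib.get_close_matches(word.lower(), vocabulary,
                                         n=3, cutoff=cutoff)

    print(suggest("krysler"))   # -> ['chrysler']
    print(suggest("tayoto"))    # -> ['toyota']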

regex() searches are very positionally dependent; they cannot detect
errors such as transposed characters very well.  You bring up another
interesting point.  I feel that our model offers much superior recall
rates with none of the "*h()|\(+adfadf?\)" gibberish involved in
regexp searching.
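
To make the transposition point concrete, here is a small sketch of an
edit-distance comparison that counts an adjacent-character swap as a
single error (a Damerau-Levenshtein-style distance; again, only an
illustration, not our engine):

    def dl_distance(a, b):
        """Edit distance that counts insertion, deletion, substitution,
        and an adjacent transposition each as ONE error."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    # A fixed pattern match on "receive" misses "recieve" outright;
    # an edit-distance comparison sees it as a single transposition.
    print(dl_distance("recieve", "receive"))    # -> 1
    print(dl_distance("krysler", "chrysler"))   # -> 2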

Another question: you said you have spent hours performing regexp()
searches, but then turn around and say you don't find our model useful.
Please explain this anomaly.

>"approximation?"  I could argue that every row in a database was

And I would totally agree with you!!! 

>How do you intend to define "approximately"?

In our model it is a parameter to the search process.  There are
two standard metrics in information retrieval:

	1) Recall Rate
	2) Precision Rate

Recall rate is the number of records retrieved over all the
records in the database.

Precision rate is the percentage of records retrieved that are
relevant (the definition of "relevant" depends on the user, application,
and scenario).

We have developed a metric that combines these two into a single
measure called the Retrieval Rate.  This is a user-defined rate: the
user can adjust it so that the search process returns only 100% exact
matches, which makes it act like a traditional RDBMS, or lower it to
search for "approximate" matches.



					Thank you for responding,

						Dreez

=================================================================
=================================================================
               Michael J. Endrizzi
	Secure Computing Technology Corp.
	   1210 W. County Road E #100
	      Arden Hills, Mn. 55112
	        endrizzi@sctc.com
	          (612) 482-7425
	
*Disclaimer: The opinions expressed above are not of my employer
             but of the American people.
=================================================================
=================================================================

jkrueger@dgis.dtic.dla.mil (Jon) (04/10/90)

hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

> A query language that did approximate matching for me
> automatically is a bit scary.  How would you control the degree of
>"approximation?"

And more than a bit scary on updates.  "Give the employees who work
in the shoe department a 10% raise" is the safe example.  "Give the
employees who work in the more deserving departments a 10% raise" is
the getting-interesting-time example.  "Loan the customers with
acceptable credit histories the money" is the fear and loathing example.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!

endrizzi@sctc.com (Michael Endrizzi ) (04/10/90)

Hi Jon,

Thanks for taking the time to look at the examples.

>>And more than a bit scary on updates.  "Give the employees who work
>>in the shoe department a 10% raise" is the safe example.  "Give the

I would have to agree with you on updates.  If you told a spelling
checker to correct your document and shipped it to the customer without
any kind of manual review, you would lose your job awfully fast.
However, this does not mean that spelling checkers are worthless.

This is the only claim I am making for our model.  Errors exist in
queries and in databases, and currently there is no way of filtering
through this information (I do not consider regexp() a very powerful
tool).  If I gave you a document that is 10,000 pages long and told
you to manually find the errors, you never would.  If I gave you a
database with 10,000 records and told you to manually find the errors,
you never would.

>employees who work in the more deserving departments a 10% raise" is
>the getting-interesting-time example.  "Loan the customers with
>acceptable credit histories the money" is the fear and loathing example.

This example does not make sense in either traditional RDBMS
search methods or our model.  Our model searches only on syntactic
information.  Words like "acceptable" and "more deserving" are
filled with semantics that only the AI types could attempt to
figure out.


					Thanks Again Jon,
						Dreez

=================================================================
=================================================================
               Michael J. Endrizzi
	Secure Computing Technology Corp.
	   1210 W. County Road E #100
	      Arden Hills, Mn. 55112
	        endrizzi@sctc.com
	          (612) 482-7425
	
*Disclaimer: The opinions expressed above are not of my employer
             but of the American people.
=================================================================
=================================================================

hargrove@bee.corp.sgi.com (Mark Hargrove) (04/12/90)

In article <1990Apr8.195542.14214@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

 
   hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

   >In article <1990Apr3.200220.9513@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

   >constructing "approximate" searches using regex() (or similar) style
   >patterns.  A query language that did approximate matching for me
   >automatically is a bit scary.  How would you control the degree of

   You are not the only one to express this.  I cannot figure out why.
   If I said that I have a spelling checker that will tell you which
   word is wrong but WON'T offer alternatives that are "close" to the
   word in question, would you find this useful???

As a matter of fact, yes, I would (and do) find such a spelling checker
useful.  That doesn't imply that I wouldn't find one that made
suggestions useful as well.
 

   This is what current database technology offers. If you are not
   sure that the data or the query is perfect, there is no way to locate
   information that is "approximate" to the incomplete and/or incorrect
   information that you feed it.

   regex() searches are very positionally dependent; they cannot detect
   errors such as transposed characters very well.  You bring up another
   interesting point.  I feel that our model offers much superior recall
   rates with none of the "*h()|\(+adfadf?\)" gibberish involved in
   regexp searching.

Maybe I missed something earlier.  Just what the heck *is* your model
anyway?  Are you offering an extension to a query language (like SQL)
that will return "matches" that are "close" in the same sense that a
spelling checker suggests "close" words?  Or are you offering an extension
that would return "Palo Alto", when I searched on CITY_NAME="Menlo Park"?

   Another question: you said you have spent hours performing regexp()
   searches, but then turn around and say you don't find our model useful.
   Please explain this anomaly.

It only appears to be an anomaly when you take what I said out of
context. I didn't say that I didn't find your model useful.  On the
contrary, I *started* my posting by saying that I thought this
was a powerful concept.  I'm simply worried that you're oversimplifying 
the problems with defining "approximate" matches.

   >"approximation?"  I could argue that every row in a database was

   And I would totally agree with you!!! 

   >How do you intend to define "approximately"?

   In our model it is a parameter to the search process.  There are
   two standard metrics in information retrieval:

	   1) Recall Rate
	   2) Precision Rate

   Recall rate is the number of records retrieved over all the
   records in the database.

   Precision rate is the percentage of records retrieved that are
   relevant (the definition of "relevant" depends on the user, application,
   and scenario).

   We have developed a metric that combines these two into a single
   measure called the Retrieval Rate.  This is a user-defined rate: the
   user can adjust it so that the search process returns only 100% exact
   matches, which makes it act like a traditional RDBMS, or lower it to
   search for "approximate" matches.


I'm sorry, but I guess I'm too dumb to understand what this means.  How
do you decide *which* records to retrieve?  What is your approximation
function?  I simply don't believe you can define a general function which
will do "approximate" matching, UNLESS YOU'RE ONLY WORRYING ABOUT SPELLING
DIFFERENCES.  Is this all your "system" does?  If so, it's still a step
forward in retrieval problems, but not particularly revolutionary.

endrizzi@sctc.com (Michael Endrizzi ) (04/14/90)

hargrove@bee.corp.sgi.com (Mark Hargrove) writes:


>As a matter of fact, yes, I would (and do) find such a spelling checker
>useful.  That doesn't imply that I wouldn't find one that made
>suggestions useful as well.

Well ok, but you slipped out of my trap.  If you were using a spelling
checker that only located errors, wouldn't it be nice to have it also
suggest alternatives?  Currently databases only allow you to search
for "perfect" information.  Wouldn't it be nice to have them also suggest
alternatives if the data and/or query were imperfect??  Not revolutionary,
just nice.

>Maybe I missed something earlier.  Just what the heck *is* your model
>anyway?  Are you offering an extension to a query language (like SQL)
>that will return "matches" that are "close" in the same sense that a
>spelling checker suggests "close" words?  Or are you offering an extension
>that would return "Palo Alto", when I searched on CITY_NAME="Menlo Park"?

Bango!!! I think you've got it.  *Our* model is fault-tolerant in the
syntactic sense, not the semantic one.  Semantics are a TOUGH problem,
and current solutions to semantic interpretation are skimpy and require
much manual interpretation and classification.

*Our* model has no mandatory extensions to SQL.
The only parameters to adjust are the ones that relax or restrict the
strictness of the search process (how close is "close" when performing
comparisons).  The user does not have to format search strings
as in regexp().

>I'm sorry, but I guess I'm too dumb to understand what this means.  How
>do you decide *which* records to retrieve?  What is your approximation
>function?  I simply don't believe you can define a general function which
>will do "approximate" matching, UNLESS YOU'RE ONLY WORRYING ABOUT SPELLING
>DIFFERENCES.  Is this all your "system" does?  If so, it's still a step
>forward in retrieval problems, but not particularly revolutionary.

Don't feel bad.  It took me a while to understand the definitions of
recall, precision, and retrieval rates too.  Like I said all along: only
syntactic matches.  I never said it was revolutionary.  What is
impressive about it is how accurate the search process is, and that we
can search on complex SQL like

        SELECT WHERE (car="Krysler") and (make="Reeeliant") or
                     (car="Tayoto") and (make="Comry")
        JOIN....
        INTERSECT...
        SUBTRACT...
        etc...
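
As a toy illustration of what a relaxed comparison could do with
literals like those, here is a sketch that reuses Python's difflib as
a stand-in for our comparison function (the catalog rows are invented):

    import difflib

    # Invented rows standing in for a table of (car, make) pairs.
    cars = [("Chrysler", "Reliant"), ("Toyota", "Camry"), ("Ford", "Escort")]

    def fuzzy_eq(a, b, strictness=0.65):
        """Relaxed string equality.  Raising strictness toward 1.0
        degenerates to the exact match of a traditional RDBMS."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= strictness

    # SELECT WHERE (car="Krysler" and make="Reeeliant")
    #           or (car="Tayoto"  and make="Comry")
    hits = [(car, make) for car, make in cars
            if (fuzzy_eq(car, "Krysler") and fuzzy_eq(make, "Reeeliant"))
            or (fuzzy_eq(car, "Tayoto")  and fuzzy_eq(make, "Comry"))]
    print(hits)   # -> [('Chrysler', 'Reliant'), ('Toyota', 'Camry')]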

Thanks again for answering...


				Dreez

endrizzi@sctc.com (Michael Endrizzi ) (04/14/90)

I have been getting phone calls from around the country
about my questionnaire on fault-tolerant retrieval.  I would
like to share one interesting call with you, from Peter
Rowell at ThirdEye Software.

ThirdEye software does text-retrieval and classification
for large firms like Dialog and Nexus. ThirdEye attempts
to perform semantic classification and retrieval by looking
for keywords and prioritizing them according to the
context in which they appear.  There are other techniques, but
this is only an example.

Peter said that ThirdEye talked to a major RDBMS firm
about combining technologies, and the RDBMS firm went
critical when ThirdEye explained the text-retrieval 
process.  I won't speak for Peter, but I can only guess
that it is a combination of misunderstanding and being
set in one's ways.  Peter said that text-retrieval is considered
an orphan of Computer Science, perhaps because it is
not an exact science, so it found a home in Library Science.

So my question is this: OLTP databases have terabytes of information
stored in them.  The whole RDBMS community is operating on the
assumption that these terabytes and the associated queries contain
perfect and complete information, which is a crock.  Why is it so
threatening and revolutionary to increase the value of this information
by being able to process both the correct information and the corrupted
information?  I would really love to have an answer to this.


				Thank you people,

				Dreez


=================================================================
=================================================================
               Michael J. Endrizzi
	Secure Computing Technology Corp.
	   1210 W. County Road E #100
	      Arden Hills, Mn. 55112
	        endrizzi@sctc.com
	          (612) 482-7425
	
*Disclaimer: The opinions expressed above are not of my employer
             but of the American people.
=================================================================
=================================================================

pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)

In article <1990Apr4.204241.2249@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

   This depends on the application.  If I am using Unix and I type
   in the command "zgrep" instead of "grep", it would be great if
   the command parser could say "I ain't got no 'zgrep', but here
   are a list of commands that are approximately close to 'zgrep'".

This already exists!  Try the 'tcsh' shell with the '<ESC>$' command
(well, it actually corrects the spelling for you; you can then change
it if you don't like it).

   Some applications would consider this command parser fault-tolerant
   because it is able to determine that a fault occurs and not return
   the empty set, but instead attempt to recover.  Other applications
   would consider this a nuisance and not fault-tolerant. (Works great
   on Emacs though).

I totally disagree with calling this fault-tolerant: that term has a
very specific meaning in CS circles, namely that processing continues
in the presence of faults.  What you are discussing is known as 'fuzzy
retrieval'.

Actually, looking at the other articles in this thread, there are two
distinct, but interoperable, technologies that people are discussing: one
is fuzzy retrieval, the other is "deductive retrieval".

The first is often found in bibliographic databases, where for example
you want to ask fuzzy queries about a subject; the second is in a sense
the definition of an expert system.

Fuzzy retrieval can be based on a many-valued logic called fuzzy logic,
in which predicates are not true or false, but have a weight between 0
and 1. In bibliographic databases the predicate you want to use is
'relevant to my query', and then each document is indexed not just by a
set of keywords, but also by a weight/relevance factor for each subject
you may want to index on.
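
A toy sketch of that kind of index, with invented weights: each
document carries a relevance weight per subject, and a query ranks
documents by weight instead of giving a strict yes/no answer.

    # Invented weighted index: per-subject relevance in [0, 1].
    index = {
        "paper-1": {"fuzzy logic": 0.9, "databases": 0.4},
        "paper-2": {"databases": 0.8},
        "paper-3": {"fuzzy logic": 0.2, "expert systems": 0.7},
    }

    def fuzzy_query(subject, threshold=0.3):
        """Rank documents by relevance weight for the subject; the
        threshold relaxes or restricts how fuzzy the answer is."""
        hits = [(weights[subject], doc)
                for doc, weights in index.items()
                if weights.get(subject, 0.0) >= threshold]
        return sorted(hits, reverse=True)

    print(fuzzy_query("fuzzy logic"))        # -> [(0.9, 'paper-1')]
    print(fuzzy_query("fuzzy logic", 0.1))   # -> [(0.9, 'paper-1'), (0.2, 'paper-3')]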

Note that fuzzy databases are not the same as statistical databases,
even if there are obvious similarities.

Deductive retrieval is the idea that the sum of facts stored in a
database also contains the facts that can be derived, according to some
inference rules, from those explicitly stored in it.  Many so-called
expert systems are just deductive databases.

You can also have fuzzy deductive retrieval, in which the inference
rules use fuzzy instead of classical logic, and weights as truth values.
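
And a matching sketch of one fuzzy inference step, using the common
(but not the only) convention that AND is the minimum of the truth
values; the facts and the rule are invented for illustration:

    # Fuzzy facts: truth values in [0, 1] instead of plain true/false.
    facts = {
        "frequent_buyer(smith)": 0.8,
        "pays_on_time(smith)":   0.6,
    }

    def fuzzy_and(*truths):
        """One common fuzzy-logic convention: AND as minimum."""
        return min(truths)

    # Rule: good_customer(X) <- frequent_buyer(X) and pays_on_time(X).
    # The derived fact is not stored; its weight follows from its premises.
    derived = fuzzy_and(facts["frequent_buyer(smith)"],
                        facts["pays_on_time(smith)"])
    print("good_customer(smith) =", derived)   # -> 0.6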

An issue of 'The Economist' ventured to predict that one of the
biggest software businesses to come will be providing intelligent (i.e.
fuzzy, statistical, and deductive) database navigators: even now the
problem of extracting interesting data from huge, amorphous corporate,
commercial, or governmental databases is pretty tough, and as data
masses grow, things can only get worse.

There is a periodic conference on so-called 'intelligent databases',
and I think reading the proceedings is good sport.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

peter@thirdi.UUCP (Peter Rowell) (04/15/90)

Although I appreciate Michael's plug for Third Eye, I would like to
clarify a few things.

In article <1990Apr13.191658.10233@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:
>ThirdEye software does text-retrieval and classification
>for large firms like Dialog and Nexus.

We do not have any business relationship with either Knight-Ridder's
Dialog group or Mead Data Central (Nexis/Lexis). (Which is not to say
we wouldn't like to!)

>Peter said that ThirdEye talked to a major RDBMS firm
>about combining technologies, and the RDBMS firm went
>critical when ThirdEye explained the text-retrieval 
>process.

Actually, they had no problems with straight "keyword and boolean",
because it is "precise" (i.e.  there is an "exactly correct" set of
documents which will be returned for any given query).  What they
were not too wild about was our "content similarity" method, which
is based on Salton's vector-space model involving weighted attribute
vectors.  RDBMS people are not happy when you say that a document
"might" be retrieved.

>Peter said that text-retrieval is considered
>an orphan of Computer Science, perhaps because it is
>not an exact science, so it found a home in Library Science.

Actually, it isn't really in the library schools either.  It mostly
isn't anywhere (there are a few exceptions, like Cornell, Virginia Tech
and some others).  The problem seems to be that text retrieval
algorithms do not submit well to classic academic analysis and thus do
not make good thesis material.  Most of the existing algorithms have
been arrived at through almost purely empirical means - probably not
what an advisor wants to see.

----------------------------------------------------------------------
Peter Rowell				peter@thirdi.uucp
Third Eye Software, Inc.		...!{apple,pyramid,sun}!peter
535 Middlefield Road, Suite 170		(415) 321-0967
Menlo Park, CA  94025