hargrove@bee.corp.sgi.com (Mark Hargrove) (04/07/90)
In article <1990Apr3.200220.9513@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:
>Fault-tolerant searches always succeed.  FT searches return a
>list of APPROXIMATE matches, of which the EXACT (or correct)
>match may or may not be present in this list.  A common example
>of fault-tolerant searching is a spelling checker.  Most spelling
>checkers will always return a list of APPROXIMATE matches, of which
>the EXACT (or correct) spelling may or may not be present in the list.
This is a powerful notion, and certainly non-trivial.  As a long-time
user of SQL and several other QBF applications, I've spent many an hour
constructing "approximate" searches using regex() (or similar) style
patterns.  A query language that did approximate matching for me
automatically is a bit scary.  How would you control the degree of
"approximation?"  I could argue that every row in a database was
approximately correct, since it contained Roman characters, or Arabic
digits.  I might be able to conceive of a circumstance where this is
desirable, but for *my* work, it wouldn't be useful.
How do you intend to define "approximately?"
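The "close match" behavior the thread keeps comparing to a spelling
checker is classically implemented with an edit distance.  A minimal
sketch (Python, purely illustrative; none of this code is from the
posts), using the Damerau-Levenshtein variant that also counts
transposed characters:

```python
def edit_distance(a, b):
    """Damerau-Levenshtein distance: insertions, deletions,
    substitutions, and adjacent transpositions each cost 1."""
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # adjacent transposition ("teh" -> "the")
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def suggest(word, vocabulary, max_dist=2):
    """Return vocabulary entries within max_dist edits, closest first."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for dist, v in sorted(scored) if dist <= max_dist]
```

A spelling checker built on `suggest` always returns a list of
approximate matches, exactly as described above; the exact word may or
may not be in it, depending on `max_dist`.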
endrizzi@sctc.com (Michael Endrizzi ) (04/09/90)
hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

>In article <1990Apr3.200220.9513@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:
>constructing "approximate" searches using regex() (or similar) style
>patterns.  A query language that did approximate matching for me
>automatically is a bit scary.  How would you control the degree of

You are not the only one to express this, and I cannot figure out why.
If I said that I have a spelling checker, and it will tell you which
word is wrong but WON'T offer alternatives that are "close" to the word
in question, would you find this useful?  This is what current database
technology offers.  If you are not sure that the data or the query is
perfect, there is no way to locate information that is "approximate" to
the incomplete and/or incorrect information that you feed it.

regex() searches are very positionally dependent; they cannot detect
errors such as transposed characters very well.  You bring up another
interesting point: I feel that our model offers much superior recall
rates with none of the "*h()|\(+adfadf?\)" gibberish involved in
regexp searching.

Another question.  You said you have spent hours performing regexp()
searches, but then turn around and say you don't find our model useful.
Please explain this anomaly.

>"approximation?"  I could argue that every row in a database was

And I would totally agree with you!!!

>How do you intend to define "approximately?"

In our model it is a parameter to the search process.  There are two
metrics of retrieval quality in information retrieval:

	1) Recall rate
	2) Precision rate

Recall rate is the fraction of the relevant records in the database
that are actually retrieved.  Precision rate is the percentage of
records retrieved that are relevant (the definition of relevant depends
on the user, application, and scenario).  We have developed a metric
that combines the two into a single, user-defined metric called the
Retrieval Rate.  The user can adjust this rate to make the search
process return only 100% exact matches, in which case it acts like a
traditional RDBMS, or can lower it to search for "approximate" matches.

Thank you for responding,
Dreez

=================================================================
Michael J. Endrizzi		Secure Computing Technology Corp.
1210 W. County Road E #100	Arden Hills, Mn. 55112
endrizzi@sctc.com		(612) 482-7425

*Disclaimer: The opinions expressed above are not of my employer
 but of the American people.
=================================================================
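Precision and recall can be computed directly from a query result.  The
posts do not say how the two are combined into the "Retrieval Rate", so
the sketch below uses the standard F-measure (harmonic mean) from the
IR literature as a hypothetical stand-in, not as SCTC's actual metric:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved records that are relevant.
    Recall: fraction of relevant records that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision, recall):
    """Harmonic mean of precision and recall: one combined score
    that is high only when both components are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

At a threshold demanding a combined score of 1.0, only exact retrieval
(every relevant record, nothing else) qualifies; lowering the threshold
admits progressively fuzzier result sets.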
jkrueger@dgis.dtic.dla.mil (Jon) (04/10/90)
hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

>A query language that did approximate matching for me
>automatically is a bit scary.  How would you control the degree of
>"approximation?"

And more than a bit scary on updates.  "Give the employees who work in
the shoe department a 10% raise" is the safe example.  "Give the
employees who work in the more deserving departments a 10% raise" is
the getting-interesting-time example.  "Loan the customers with
acceptable credit histories the money" is the fear and loathing
example.

-- Jon
--
Jonathan Krueger    jkrueger@dtic.dla.mil    uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!
endrizzi@sctc.com (Michael Endrizzi ) (04/10/90)
Hi Jon,

Thanks for taking the time to look at the examples.

>And more than a bit scary on updates.  "Give the employees who work
>in the shoe department a 10% raise" is the safe example.

I would have to agree with you on updates.  If you told a spelling
checker to correct your document and shipped it to the customer without
any manual review, you would lose your job awfully fast.  However, this
does not mean that spelling checkers are worthless, and that is the
only claim I am making for our model.  Errors exist in queries and in
databases, and currently there is no way of filtering through this
information (I do not consider regexp() a very powerful tool).  If I
gave you a document 10,000 pages long and told you to find the errors
manually, you never would.  If I gave you a database with 10,000
records and told you to find the errors manually, you never would.

>employees who work in the more deserving departments a 10% raise" is
>the getting-interesting-time example.  "Loan the customers with
>acceptable credit histories the money" is the fear and loathing example.

This example does not make sense in either traditional RDBMS search
methods or our model.  Our model searches only on syntactic
information.  Words like "acceptable" and "more deserving" are filled
with semantics that only the AI types could attempt to figure out.

Thanks again, Jon,
Dreez
hargrove@bee.corp.sgi.com (Mark Hargrove) (04/12/90)
In article <1990Apr8.195542.14214@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

>You are not the only one to express this, and I cannot figure out why.
>If I said that I have a spelling checker, and it will tell you which
>word is wrong but WON'T offer alternatives that are "close" to the
>word in question, would you find this useful?

As a matter of fact, yes, I would (and do) find such a spelling checker
useful.  That doesn't imply that I wouldn't find one that made
suggestions useful as well.

>This is what current database technology offers.  If you are not sure
>that the data or the query is perfect, there is no way to locate
>information that is "approximate" to the incomplete and/or incorrect
>information that you feed it.  regex() searches are very positionally
>dependent; they cannot detect errors such as transposed characters
>very well.  I feel that our model offers much superior recall rates
>with none of the "*h()|\(+adfadf?\)" gibberish involved in regexp
>searching.

Maybe I missed something earlier.  Just what the heck *is* your model,
anyway?  Are you offering an extension to a query language (like SQL)
that will return "matches" that are "close" in the same sense that a
spelling checker suggests "close" words?  Or are you offering an
extension that would return "Palo Alto" when I searched on
CITY_NAME="Menlo Park"?

>Another question.  You said you have spent hours performing regexp()
>searches, but then turn around and say you don't find our model
>useful.  Please explain this anomaly.

It only appears to be an anomaly when you take what I said out of
context.  I didn't say that I didn't find your model useful.  On the
contrary, I *started* my posting by saying that I thought this was a
powerful concept.  I'm simply worried that you're oversimplifying the
problems with defining "approximate" matches.

>In our model it is a parameter to the search process.  [...]  We have
>developed a metric that combines the two into a single, user-defined
>metric called the Retrieval Rate.  The user can adjust this rate to
>make the search process return only 100% exact matches [...] or can
>lower it to search for "approximate" matches.

I'm sorry, but I guess I'm too dumb to understand what this means.  How
do you decide *which* records to retrieve?  What is your approximation
function?  I simply don't believe you can define a general function
which will do "approximate" matching, UNLESS YOU'RE ONLY WORRYING ABOUT
SPELLING DIFFERENCES.  Is this all your "system" does?  If so, it's
still a step forward in retrieval problems, but not particularly
revolutionary.
endrizzi@sctc.com (Michael Endrizzi ) (04/14/90)
hargrove@bee.corp.sgi.com (Mark Hargrove) writes:

>As a matter of fact, yes, I would (and do) find such a spelling
>checker useful.  That doesn't imply that I wouldn't find one that
>made suggestions useful as well.

Well, OK, but you slipped out of my trap.  If you were using a spelling
checker that only located errors, wouldn't it be nice to have it also
suggest alternatives?  Currently databases only allow you to search for
"perfect" information.  Wouldn't it be nice to have them also suggest
alternatives if the data and/or query were imperfect?  Not
revolutionary, just nice.

>Maybe I missed something earlier.  Just what the heck *is* your model,
>anyway?  Are you offering an extension to a query language (like SQL)
>that will return "matches" that are "close" in the same sense that a
>spelling checker suggests "close" words?  Or are you offering an
>extension that would return "Palo Alto" when I searched on
>CITY_NAME="Menlo Park"?

Bango!!!  I think you've got it.  *Our* model is fault tolerant in the
syntactic sense, not the semantic.  Semantics are a TOUGH problem, and
current solutions to semantic interpretation are skimpy and require
much manual interpretation and classification.  *Our* model has no
mandatory extensions to SQL.  The only parameters to adjust are the
ones that relax or restrict the strictness of the search process (how
close is "close" when performing comparisons).  The user does not have
to format search strings as in regexp().

>I'm sorry, but I guess I'm too dumb to understand what this means.
>How do you decide *which* records to retrieve?  What is your
>approximation function?  I simply don't believe you can define a
>general function which will do "approximate" matching, UNLESS YOU'RE
>ONLY WORRYING ABOUT SPELLING DIFFERENCES.  Is this all your "system"
>does?  If so, it's still a step forward in retrieval problems, but
>not particularly revolutionary.

Don't feel bad.  It took me a while to understand the definitions of
recall, precision, and retrieval rates too.  Like I said all along,
only syntactic matches.  I never said it was revolutionary.  What is
impressive is how accurate the search process is, and that we can
search on complex SQL like

	SELECT WHERE (car="Krysler") and (make="Reeeliant")
	          or (car="Tayoto") and (make="Comry")
	JOIN ... INTERSECT ... SUBTRACT ... etc.

Thanks again for answering,
Dreez
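A toy version of the relaxed SELECT described above might look like the
following sketch.  The field names, sample rows, the choice of
similarity function, and the use of Python's difflib are all my
assumptions for illustration, not the actual SCTC system:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Syntactic similarity in [0, 1]; 1.0 is an exact match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_select(rows, conditions, retrieval_rate=1.0):
    """Return rows where every (field, value) condition matches with
    similarity >= retrieval_rate.  At 1.0 this behaves like a strict
    relational SELECT; lowering the rate admits approximate matches."""
    return [row for row in rows
            if all(similarity(row[field], value) >= retrieval_rate
                   for field, value in conditions)]

cars = [{"car": "Chrysler", "model": "Reliant"},
        {"car": "Toyota",   "model": "Camry"}]

# The misspelled query still finds the intended record at a
# relaxed retrieval rate; at rate 1.0 it returns the empty set.
matches = fuzzy_select(cars, [("car", "Krysler"),
                              ("model", "Reeeliant")],
                       retrieval_rate=0.7)
```

The single `retrieval_rate` knob is the point of the example: one
parameter moves the search continuously between a traditional RDBMS
exact match and a fault-tolerant approximate match.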
endrizzi@sctc.com (Michael Endrizzi ) (04/14/90)
I have been getting phone calls from around the country about my
questionnaire on fault-tolerant retrieval.  I would like to share one
interesting call with you, from Peter Rowell at ThirdEye Software.
ThirdEye software does text-retrieval and classification for large
firms like Dialog and Nexus.  ThirdEye attempts to perform semantic
classification and retrieval by looking for keywords and prioritizing
them according to the context they appear in.  There are other
techniques, but this is only an example.

Peter said that ThirdEye talked to a major RDBMS firm about combining
technologies, and the RDBMS firm went critical when ThirdEye explained
the text-retrieval process.  I won't speak for Peter, but I can only
guess that it is a combination of misunderstanding and being set in
your ways.  Peter said that text retrieval is considered an orphan from
Computer Science, perhaps because it is not an exact science, so it
found a home in Library Science.

So my question is this: OLTP databases have terabytes of information
stored in them.  The whole RDBMS community is operating on the
assumption that these terabytes and the associated queries contain
perfect and complete information, which is a crock.  Why is it so
threatening and revolutionary to increase the value of this information
by being able to process both the correct information and the corrupted
information?  I would really love to have an answer to this.

Thank you people,
Dreez
pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)
In article <1990Apr4.204241.2249@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:
>This depends on the application.  If I am using Unix and I type
>in the command "zgrep" instead of "grep", it would be great if
>the command parser could say "I ain't got no 'zgrep', but here
>is a list of commands that are approximately close to 'zgrep'".
Already Exists! Try the 'tcsh' shell with the '<ESC>$' command (well, it
actually corrects the spelling for you; you can then change it if you
don't like it).
>Some applications would consider this command parser fault-tolerant
>because it is able to determine that a fault occurs and, rather than
>return the empty set, attempt to recover.  Other applications
>would consider this a nuisance and not fault-tolerant.  (Works great
>in Emacs, though.)
I totally disagree with you calling this fault-tolerant: that term has
a very specific meaning in CS circles, namely that processing continues
in the presence of faults.  What you are discussing is known as 'fuzzy
retrieval'.
Actually, looking at the other articles in this thread, there are two
distinct, but interoperable, technologies that people are discussing;
one is fuzzy retrieval, the other is "deductive retrieval".
The first is often found in bibliographic databases, where for example
you want to ask fuzzy queries about a subject; the second is in a sense
the definition of an expert system.
Fuzzy retrieval can be based on a many-valued logic called fuzzy logic,
in which predicates are not true or false, but have a weight between 0
and 1.  In bibliographic databases the predicate you want to use is
'relevant to my query', and each document is then indexed not just by a
set of keywords, but also by a weight/relevance factor for each subject
you may want to index on.
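The weighted indexing just described can be sketched as follows.  The
document names, subjects, and weights are hypothetical, chosen only to
show how a weight in [0, 1] replaces a yes/no keyword flag:

```python
# Each document carries a relevance weight in [0, 1] per subject,
# rather than a plain boolean "has this keyword" flag.
index = {
    "paper1": {"databases": 0.9, "logic": 0.2},
    "paper2": {"databases": 0.4, "logic": 0.8},
}

def fuzzy_query(index, subject, threshold=0.0):
    """Rank documents by their relevance weight for a subject,
    dropping those with a weight below the threshold."""
    scored = [(weights.get(subject, 0.0), doc)
              for doc, weights in index.items()]
    return [(doc, w) for w, doc in sorted(scored, reverse=True)
            if w >= threshold and w > 0]
```

The threshold plays the same role as the retrieval-rate parameter
discussed earlier in the thread: raise it for precision, lower it for
recall.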
Note that fuzzy databases are not the same as statistical databases,
even if there are obvious similarities.
Deductive retrieval is the idea that the set of facts stored in a
database also contains the facts that can be derived, according to some
inference rules, from those explicitly stored in it.  Many so-called
expert systems are just deductive databases.
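As a minimal illustration of that idea, here is a forward-chaining
closure over ground facts.  Real deductive databases use rules with
variables (e.g. Datalog); this propositional sketch only shows that
derived facts become queryable alongside stored ones:

```python
def deductive_closure(facts, rules):
    """Repeatedly apply inference rules until no new facts appear.
    Each rule is a (premises, conclusion) pair: if every premise is
    in the fact set, the conclusion is added to it."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Stored facts and one hand-ground inference rule (hypothetical data).
facts = {("parent", "ann", "bob"), ("parent", "bob", "cal")}
rules = [({("parent", "ann", "bob"), ("parent", "bob", "cal")},
          ("grandparent", "ann", "cal"))]
# Querying the closure finds a fact that was never stored explicitly.
```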
You can also have fuzzy deductive retrieval, in which the inference
rules use fuzzy instead of classical logic, and weights as truth values.
An issue of 'The Economist' ventured to predict that one of the
biggest software businesses to come will be providing intelligent,
i.e. fuzzy, statistical and deductive, database navigators, as even
now the problem of extracting interesting data from huge, amorphous
corporate, commercial or governmental databases is pretty tough, and
as data masses grow, things can only get worse.
There is a periodic conference on so-called 'intelligent databases',
and I think reading the proceedings is good sport.
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
peter@thirdi.UUCP (Peter Rowell) (04/15/90)
Although I appreciate Michael's plug for Third Eye, I would like to
clarify a few things.

In article <1990Apr13.191658.10233@sctc.com> endrizzi@sctc.com (Michael Endrizzi ) writes:

>ThirdEye software does text-retrieval and classification
>for large firms like Dialog and Nexus.

We do not have any business relationship with either Knight-Ridder's
Dialog group or Mead Data Central (Nexis/Lexis).  (Which is not to say
we wouldn't like to!)

>Peter said that ThirdEye talked to a major RDBMS firm
>about combining technologies, and the RDBMS firm went
>critical when ThirdEye explained the text-retrieval
>process.

Actually, they had no problems with straight "keyword and boolean",
because it is "precise" (i.e. there is an "exactly correct" set of
documents which will be returned for any given query).  What they were
not too wild about was our "content similarity" method, which is based
on Salton's vector-space model involving weighted attribute vectors.
RDBMS people are not happy when you say that a document "might" be
retrieved.

>Peter said that text-retrieval is considered
>an orphan from Computer Science perhaps because it is
>not an exact science, so it found a home in Library Science.

Actually, it isn't really in the library schools either.  It mostly
isn't anywhere (there are a few exceptions, like Cornell, Virginia
Tech, and some others).  The problem seems to be that text-retrieval
algorithms do not lend themselves well to classic academic analysis and
thus do not make good thesis material.  Most of the existing algorithms
have been arrived at through almost purely empirical means - probably
not what an advisor wants to see.

----------------------------------------------------------------------
Peter Rowell				peter@thirdi.uucp
Third Eye Software, Inc.		...!{apple,pyramid,sun}!peter
535 Middlefield Road, Suite 170		(415) 321-0967
Menlo Park, CA 94025
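For reference, the Salton-style "content similarity" Peter mentions
ranks documents by the cosine of the angle between weighted term
vectors.  A minimal sketch with made-up weights (not Third Eye's actual
method):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts
    mapping term -> weight).  Returns a value in [0, 1] for
    non-negative weights; 1.0 means identical direction."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical weighted attribute vectors for a document and a query.
doc = {"database": 0.8, "retrieval": 0.6}
query = {"retrieval": 1.0}
# Ranking documents by cosine(doc, query) yields a graded list of
# documents that "might" match, rather than a precise result set.
```

This graded notion of matching is exactly what made the RDBMS people
uneasy: there is no single "exactly correct" answer set, only a ranking
by degree of similarity.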