mo@uunet.UU.NET (Mike O'Dell) (10/28/87)
Sequential text search machines have been around for a long time, but recently they have started to become affordable for many applications. The Gould Hypersearch, the new small GEScan, and the TRW FDF-II are just a few examples of this new generation of machines. They all suffer a few interesting problems however. Like everything highly specialized, they can win very big and lose very big. Their biggest assest is they can scan data with little or no preprocessing. This is tremendous for tasks like reading a bunch of wireservice news feeds looking for text matching a specific profile (contains certain terms, etc) because they can do it in real time by simply scanning the data as it goes by. For other tasks like searching large resident collections, however, their strength can become a great weakness because they must read through all the data for each and every search. If the collection is say one to two gigabytes, which is rather modest, dumping that much data off the filesystem and through the box for each query quickly gets pretty painful. More traditional search systems build some kind of index in one pass and then "save the work" by consulting the index instead of the data to process queries. Building these indicies is often computationally and I/O intensive, but if the system must answer many such queries with great frequency, the indices win because only a tenth of 1% as much data must be funnelled through the computer to perform the search. This is why large commercial systems like Lockheed DIALOG, BRS, NEXIS and MEDIS all use index-based systems. So, like everything, sequential search machines can be used and abused. They definitely have a place in the world, but they are not always an obvious replacement for index-based software systems. -Mike O'Dell
hollaar@utah-cs.UUCP (Lee Hollaar) (10/29/87)
In article <2667@uunet.UU.NET> mo@uunet.UU.NET (Mike O'Dell) writes:
Sequential text search machines have been around for a long
time, but recently they have started to become affordable
for many applications. ... They all suffer a few interesting
problems however. Like everything highly specialized, they
can win very big and lose very big.
... their strength can become a great
weakness because they must read through all the data for
each and every search. If the collection is say one to two
gigabytes, which is rather modest, dumping that much data
off the filesystem and through the box for each query quickly
gets pretty painful.
In fact, there is no reason why a reasonably contructed text search machine
can't use some form of index to reduce the amount of searching that must be
done. (The text search machine and its accompanying system, which we have had
running for a number of years with satisfied users, does exactly that.) In
fact, they work well together. The ability to rapidly search selected areas
on a disk means that the index does not need to be as complete, and therefore
takes less disk space (compared to the commercial systems, which have an index
at least as large as the database size, a real drag for gigabyte databases!).
We use a partially inverted file index, where the index indicates which
documents contain a particular term, but not its location. Phrases are
handled by determining all documents containing the constituent terms, and
then searching them to see if the actual phrase occurs. The index overhead is
about 15% - 20%, and the index eliminates 95% - 99% of the database from the
search, resulting in search times of under ten seconds.
We also have the search machine connected to its own disk, which stores the
text information in a contiguous file system. This eliminates the bandwidth
requirements on the normal file system, speeds up searching considerably
because a seek is not necessary before reading each block, and there is a
separate searcher for each disk, so that the natural parallelism in a big
database stored on many disks is exploited. For rapid searching of other than
the complete database (as is the case where an index is used), the seek time
of the disk plays a major role in overall system performance.
For a description of what we have done, see:
L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval
System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232.
R L Haskin and L A Hollaar, "Operational Characteristics of a Hardware-based
Pattern Matcher", ACM Transactions on Database Systems, Volume 8
Number 1, March 1983.
L A Hollaar, "The Utah Text Search Engine: Implementation Experiences and
Future Plans", Proceedings of the Fourth International Workshop on
Database Machines, March 1985.
L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for
Matching Encoded Characters", May 22, 1984.
bpendlet@esunix.UUCP (Bob Pendleton) (10/30/87)
in article <2667@uunet.UU.NET>, mo@uunet.UU.NET (Mike O'Dell) says: > > Sequential text search machines have been around for a long > time, but recently they have started to become affordable > for many applications. . . . > Like everything highly specialized, they can win very big and > lose very big. > > Their biggest assest is they can scan data with little or > no preprocessing. . . . > For other tasks like searching large resident > collections, however, their strength can become a great > weakness because they must read through all the data for > each and every search. If the collection is say one to two > gigabytes, which is rather modest, dumping that much data > off the filesystem and through the box for each query quickly > gets pretty painful. > More traditional search systems build some kind of index > in one pass and then "save the work" by consulting the index > instead of the data to process queries. > -Mike O'Dell An alternative is to build a seqeuntial text search processor on a chip, attach one to each drive in the storage system and use an index to select chunks of text to search. This can give you the flexibility of text search and the speed of an indexed system. You should be able to process an average query on a 50 gb textual database in less than a couple of seconds. Simulating a system like this was my masters thesis topic. I'd give more exact numbers but its been a few years since I've read the thing. Anyone still working on the project want to fill us in on the latest and greatest progress? Bob P. -- Bob Pendleton @ Evans & Sutherland UUCP Address: {decvax,ucbvax,ihnp4,allegra}!decwrl!esunix!bpendlet Alternate: {ihnp4,seismo}!utah-cs!utah-gr!uplherc!esunix!bpendlet I am solely responsible for what I say.
dan@srs.UUCP (Dan Kegel) (10/30/87)
> In fact, there is no reason why a reasonably contructed text search machine > can't use some form of index to reduce the amount of searching that must be > done. One of the big selling points of the TRW Fast Data Finder search engine is that it can search for extremely specific things. For instance, it can probably handle tasks such as 'find all four-word sequences involving a three-letter animal name and two numbers between $4,500 and $900,000'. I'm not sure how one would build an inverted index that would help this kind of search. On the other hand, I'm not sure who needs to run searches like that. - Dan Kegel
farber@udel.EDU (Dave Farber) (11/01/87)
One per drive is also older than most computers in service ============================================================= David J. Farber University of Delaware Dept of EE Newark De 19716 Office: 302-451-1163; Arpanet/CSNet: farber@udel.edu
dlee@tut.UUCP (11/02/87)
In article <5090@utah-cs.UUCP> hollaar@cs.utah.edu.UUCP (Lee Hollaar) writes: > >For a description of what we have done, see: >L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval > System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232. > ... ... ... > ... ... ... >L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for > Matching Encoded Characters", May 22, 1984. Full text serach engines were mainly motivated by the advanced of VLSI (making these machines cheap enough) and the elimination of expensive overhead in index maintenance. For large databases, it is in general recognized that exhaustive full text search alone is not the solution, but some sort of indexing must be emploited. In fact full text search and indexing are complimentary rather than competitive methods. In addtion to Lee Hollaar's method, other methods using a combination of coarse indexing (signature file or surrogate file) and text searching can be found in: Lee, D.L. The performance and evaluation of a text retrieval machine for large databases. CSRI-172, Computer Systems Research Institute, University of Toronto, Sep. 1985. --- ALTEP - A Cellular Processor for High Speed Pattern Matching. New Generation Computing, 4,3 (Sept 1986), 225-244. --- A Word-parallel, bit-serial signature processor for superimposed coding, Proc. 2nd Intl. Conf. Data Eng. Feb. 1986, 352-359. - Dik Lee -- Dept. Computer and Information Science dlee@tut.cis.ohio-state.edu The Ohio State University ..!cbosgd!tut.cis.ohio-state!dlee Columbus, OHIO 43210-1277 614-292-2568