[comp.arch] Serial Search Machines

mo@uunet.UU.NET (Mike O'Dell) (10/28/87)

Sequential text search machines have been around for a long
time, but recently they have started to become affordable
for many applications.  The Gould Hypersearch, the new small
GEScan, and the TRW FDF-II are just a few examples of this
new generation of machines.  They all suffer from a few interesting
problems, however.  Like everything highly specialized, they
can win very big and lose very big.

Their biggest asset is that they can scan data with little or
no preprocessing.  This is tremendous for tasks like reading
a bunch of wireservice news feeds looking for text matching
a specific profile (contains certain terms, etc) because
they can do it in real time by simply scanning the data as
it goes by. For other tasks like searching large resident
collections, however, their strength can become a great 
weakness because they must read through all the data for
each and every search.  If the collection is, say, one to two
gigabytes, which is rather modest, dumping that much data
off the filesystem and through the box for each query quickly
gets pretty painful.

More traditional search systems build some kind of index
in one pass and then "save the work" by consulting the index
instead of the data to process queries.  Building these 
indices is often computationally and I/O intensive, but
if the system must answer many such queries with great
frequency, the indices win because only a tenth of 1% as
much data must be funnelled through the computer to perform
the search.  This is why large commercial systems like
Lockheed DIALOG, BRS, NEXIS, and MEDIS all use index-based
systems.
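The index-once, query-many idea can be sketched in a few lines.  The documents, terms, and `lookup` helper below are invented for illustration; none of the commercial systems named above work exactly this way:

```python
# A minimal sketch of an inverted index: one pass over the data
# builds the index, and queries then consult the index instead of
# rescanning the documents themselves.
from collections import defaultdict

documents = {
    1: "gould announces new search machine",
    2: "trw fast data finder scans text streams",
    3: "index based search beats scanning for big collections",
}

# One pass over the collection builds the index.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def lookup(*terms):
    """Return ids of documents containing every query term,
    touching only the index, never the raw text."""
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)

print(lookup("search"))             # docs 1 and 3 mention "search"
print(lookup("search", "machine"))  # only doc 1 has both terms
```

Each query now reads a few postings lists rather than the whole collection, which is the "tenth of 1%" economy described above.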

So, like everything, sequential search machines can be used
and abused.  They definitely have a place in the world, but
they are not always an obvious replacement for index-based
software systems.

	-Mike O'Dell

hollaar@utah-cs.UUCP (Lee Hollaar) (10/29/87)

In article <2667@uunet.UU.NET> mo@uunet.UU.NET (Mike O'Dell) writes:
    Sequential text search machines have been around for a long
    time, but recently they have started to become affordable
    for many applications. ... They all suffer from a few interesting
    problems, however.  Like everything highly specialized, they
    can win very big and lose very big.
    
        ... their strength can become a great 
    weakness because they must read through all the data for
    each and every search.  If the collection is, say, one to two
    gigabytes, which is rather modest, dumping that much data
    off the filesystem and through the box for each query quickly
    gets pretty painful.

In fact, there is no reason why a reasonably constructed text search machine
can't use some form of index to reduce the amount of searching that must be
done.  (The text search machine and its accompanying system, which we have
had running for a number of years with satisfied users, do exactly that.)
Indeed, the two approaches work well together.  The ability to rapidly search
selected areas
on a disk means that the index does not need to be as complete, and therefore
takes less disk space (compared to the commercial systems, which have an index
at least as large as the database size, a real drag for gigabyte databases!).
We use a partially inverted file index, where the index indicates which
documents contain a particular term, but not its location.  Phrases are
handled by determining all documents containing the constituent terms, and
then searching them to see if the actual phrase occurs.  The index overhead is
about 15% - 20%, and the index eliminates 95% - 99% of the database from the
search, resulting in search times of under ten seconds.
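The partially inverted scheme just described can be sketched roughly as follows.  This is an illustration only, with invented documents; the Utah system's actual data structures are in the papers cited below:

```python
# Sketch of a partially inverted file: the index records which
# documents contain a term, but not its location.  A phrase query
# intersects the term postings (cheap) and then sequentially scans
# only the surviving candidates for the literal phrase (expensive,
# but run over a small fraction of the database).
from collections import defaultdict

documents = {
    1: "the text search engine reads the raw text",
    2: "search hardware reads text off its own disk",
    3: "an engine for search of large text collections",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def phrase_search(phrase):
    """Index pass eliminates documents; a scan verifies the phrase."""
    candidates = set(documents)
    for term in phrase.split():
        candidates &= index.get(term, set())   # cheap index pass
    # The sequential scan touches only the candidate documents.
    return sorted(d for d in candidates if phrase in documents[d])

print(phrase_search("text search"))
# all three docs contain both terms, but only doc 1 has the phrase
```

The index pass here plays the role of eliminating "95% - 99% of the database"; the scanner then settles the phrase question exactly.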

We also have the search machine connected to its own disk, which stores the
text information in a contiguous file system.  This eliminates the bandwidth
requirements on the normal file system, speeds up searching considerably
because a seek is not necessary before reading each block, and there is a
separate searcher for each disk, so that the natural parallelism in a big
database stored on many disks is exploited.  For rapid searching of other than
the complete database (as is the case where an index is used), the seek time
of the disk plays a major role in overall system performance.

For a description of what we have done, see:
L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval
        System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232.

R L Haskin and L A Hollaar, "Operational Characteristics of a Hardware-based
        Pattern Matcher", ACM Transactions on Database Systems, Volume 8
        Number 1, March 1983.

L A Hollaar, "The Utah Text Search Engine: Implementation Experiences and
        Future Plans", Proceedings of the Fourth International Workshop on
        Database Machines, March 1985.

L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for
        Matching Encoded Characters", May 22, 1984.

bpendlet@esunix.UUCP (Bob Pendleton) (10/30/87)

in article <2667@uunet.UU.NET>, mo@uunet.UU.NET (Mike O'Dell) says:
> 
> Sequential text search machines have been around for a long
> time, but recently they have started to become affordable
> for many applications.  
	.
	.
	.
> Like everything highly specialized, they can win very big and
> lose very big.
> 
> Their biggest asset is that they can scan data with little or
> no preprocessing.  
	.
	.
	.
> For other tasks like searching large resident
> collections, however, their strength can become a great 
> weakness because they must read through all the data for
> each and every search.  If the collection is, say, one to two
> gigabytes, which is rather modest, dumping that much data
> off the filesystem and through the box for each query quickly
> gets pretty painful.

> More traditional search systems build some kind of index
> in one pass and then "save the work" by consulting the index
> instead of the data to process queries.  

> 	-Mike O'Dell

An alternative is to build a sequential text search processor on a chip,
attach one to each drive in the storage system, and use an index to select
chunks of text to search.  This can give you the flexibility of text search
and the speed of an indexed system.  You should be able to process an
average query on a 50 GB textual database in less than a couple of seconds.
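A toy sketch of that arrangement, with threads standing in for the per-drive search chips (the drive contents, chunk layout, and precomputed candidate lists below are all invented for illustration):

```python
# One scanner per drive: an index narrows the search to a few
# candidate chunks, and each drive's scanner searches its own
# chunks in parallel with the others.
from concurrent.futures import ThreadPoolExecutor

drives = {                       # drive -> {chunk id: text}
    "drive0": {10: "wire feed mentions search hardware"},
    "drive1": {20: "quarterly report, nothing relevant",
               21: "search hardware on every drive"},
}
# Index pass (precomputed here): chunks that might contain the query.
candidate_chunks = {"drive0": [10], "drive1": [21]}

def scan_drive(drive):
    """Sequentially scan only this drive's candidate chunks."""
    return [(drive, c) for c in candidate_chunks.get(drive, [])
            if "search hardware" in drives[drive][c]]

with ThreadPoolExecutor() as pool:   # scanners run concurrently
    hits = [h for part in pool.map(scan_drive, drives) for h in part]

print(sorted(hits))
```

Because each scanner reads only its own drive, a database spread over many drives is searched in roughly the time of the slowest single drive, which is the natural parallelism Lee Hollaar's post also exploits.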

Simulating a system like this was my master's thesis topic.  I'd give more
exact numbers, but it's been a few years since I've read the thing.  Anyone
still working on the project want to fill us in on the latest and greatest
progress?

			Bob P.
-- 
Bob Pendleton @ Evans & Sutherland
UUCP Address:  {decvax,ucbvax,ihnp4,allegra}!decwrl!esunix!bpendlet
Alternate:     {ihnp4,seismo}!utah-cs!utah-gr!uplherc!esunix!bpendlet
        I am solely responsible for what I say.

dan@srs.UUCP (Dan Kegel) (10/30/87)

> In fact, there is no reason why a reasonably contructed text search machine
> can't use some form of index to reduce the amount of searching that must be
> done.

One of the big selling points of the TRW Fast Data Finder search engine
is that it can search for extremely specific things.  For instance, it
can probably handle tasks such as 'find all four-word sequences involving a 
three-letter animal name and two numbers between $4,500 and $900,000'.
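In software, that sort of query ends up as a windowed pattern test rather than an index lookup.  The sketch below is a rough approximation with an invented animal list and range check; the real Fast Data Finder compiles such patterns into hardware rather than running anything like this:

```python
# Slide a four-word window over the text and test each window for a
# three-letter animal name plus two dollar amounts in a given range.
import re

ANIMALS = {"cat", "dog", "fox", "owl", "pig"}   # toy word list

def money(word):
    """Parse '$5,000' -> 5000, or None if not a dollar amount."""
    m = re.fullmatch(r"\$([\d,]+)", word)
    return int(m.group(1).replace(",", "")) if m else None

def window_matches(window):
    amounts = [v for w in window if (v := money(w)) is not None
               and 4_500 <= v <= 900_000]
    animals = [w for w in window if w in ANIMALS]
    return len(amounts) >= 2 and len(animals) >= 1

def find_sequences(text):
    words = text.lower().split()
    return [tuple(words[i:i + 4]) for i in range(len(words) - 3)
            if window_matches(words[i:i + 4])]

print(find_sequences("bid $5,000 fox $10,000 today"))
```

Nothing in this test depends on a fixed vocabulary of search terms, which is exactly why a term-based inverted index offers so little help here.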

I'm not sure how one would build an inverted index that would help this 
kind of search. 
On the other hand, I'm not sure who needs to run searches like that.
- Dan Kegel

farber@udel.EDU (Dave Farber) (11/01/87)

One per drive is also older than most computers in service
=============================================================
David J. Farber
University of Delaware
Dept of EE Newark De 19716

Office: 302-451-1163;
Arpanet/CSNet: farber@udel.edu

dlee@tut.UUCP (11/02/87)

In article <5090@utah-cs.UUCP> hollaar@cs.utah.edu.UUCP (Lee Hollaar) writes:
>
>For a description of what we have done, see:
>L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval
>        System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232.
> ... ... ...
> ... ... ...
>L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for
>        Matching Encoded Characters", May 22, 1984.

Full text search engines were motivated mainly by advances in VLSI
(making these machines cheap enough) and by the elimination of expensive
overhead in index maintenance.  For large databases, it is generally
recognized that exhaustive full text search alone is not the solution;
some sort of indexing must be employed.  In fact, full text search
and indexing are complementary rather than competitive methods.  In
addition to Lee Hollaar's method, other methods using a combination of
coarse indexing (signature file or surrogate file) and text searching
can be found in:

Lee, D.L. The performance and evaluation of a text retrieval machine
for large databases. CSRI-172, Computer Systems Research Institute,
University of Toronto, Sep. 1985.

--- ALTEP - A Cellular Processor for High Speed Pattern Matching.
New Generation Computing, 4,3 (Sept 1986), 225-244.

--- A Word-parallel, bit-serial signature processor for superimposed
coding, Proc. 2nd Intl. Conf. Data Eng. Feb. 1986, 352-359.

- Dik Lee
-- 
Dept. Computer and Information Science	dlee@tut.cis.ohio-state.edu
The Ohio State University		..!cbosgd!tut.cis.ohio-state!dlee
Columbus, OHIO 43210-1277		614-292-2568