[comp.databases] How do you...

john@anasaz.UUCP (John Moore) (03/26/89)

Context: Relational Database in Transaction Processing (high performance)

I have a relation that consists of a large (>500,000) set of last and
first names, and an application that needs to do partial searches of
the list. For example, the query might be to find names starting with
"mo". A further restriction is that it must be in alphabetical order.
Finally (and the only tough one for me) is that I only want the first
20 names (for display on a screen). There may be 10,000 names that
meet the criteria. Also, the list is dynamic - names are continuously
being added and deleted by multiple processes.

If the relation is indexed on last name, it should be possible to
do this, but how do I (in SQL in general, and Informix-Turbo in particular):

(1) Make sure that the set of all tuples matching the query not be sorted
    before I get the records (i.e. it is prohibitively expensive if the
    database engine extracts a temporary file of the 10,000
    names, sorts them, and then presents them to my process).
(2) Limit the query to the first 20 names - it is also too expensive
    if I get 10,000 names and just throw away the last 9980 of them.
(3) Later on, without holding open a cursor, get the NEXT 20 names.

I should note that this will be implemented in SQL embedded in C.
I am interested in this at two levels: in theory (does SQL itself
and relational theory provide an answer to this), and in the
particular case of Informix-Turbo with C embedded SQL.

Thanks in advance. This is an important question to me and help is
gratefully accepted.
-- 
John Moore (NJ7E)           mcdphx!anasaz!john asuvax!anasaz!john
(602) 861-7607 (day or eve) 
The opinions expressed here are obviously not mine, so they must be
someone else's. :-)

jkrueger@daitc.daitc.mil (Jonathan Krueger) (03/29/89)

In article <1707@anasaz.UUCP>, john@anasaz (John Moore) writes:
>I have a relation that consists of a large (>500,000) set of last and
>first names, and an application that needs to do partial searches of
>the list. For example, the query might be to find names starting with
>"mo". A further restriction is that it must be in alphabetical order.
>Finally (and the only tough one for me) is that I only want the first
>20 names (for display on a screen). There may be 10,000 names that
>meet the criteria. Also, the list is dynamic - names are continuously
>being added and deleted by multiple processes.
>
>If the relation is indexed on last name, it should be possible to
>do this, but how do I (in SQL in general, and Informix-Turbo in particular):
>
>(1) Make sure that the set of all tuples matching the query not be sorted
>    before I get the records (i.e. it is prohibitively expensive if the
>    database engine extracts a temporary file of the 10,000
>    names, sorts them, and then presents them to my process).

No problem, just don't ask for the sort.

The relational model specifically does not specify top-to-bottom
ordering of rows.  If a particular implementation guarentees retrieval
in a default sorting as a side effect of a storage structure,
retrievals may return rows in a sort order with no extra overhead But
this is non-portable, of course.  But you said you didn't care if they
were sorted, you just wanted no extra cost for it.  BTW, the database
engine that can't sort 10K names efficiently is not worth bothering
about.

>(2) Limit the query to the first 20 names - it is also too expensive
>    if I get 10,000 names and just throw away the last 9980 of them.
>(3) Later on, without holding open a cursor, get the NEXT 20 names.

No way to express "first n matches" in relational calculus.
Practically, when embedded query languages define how retrievals
connect to the semantics of host language loops, there may be a safe
way to break off the query, or to perform other work for each row
retrieved, or for each group of rows retrieved.  For example:

#define	BUNCH	20	/* how many rows to get at a time */
#define	LIMIT	200	/* if this many or more returned, abort */
##	char	last[20], first[20];
	int	ctr = 0;
##	range of n is names
##	retrieve (last = n.lname, first = n.fname)
##	{
		if (++ctr > LIMIT)
##			endretrieve
		if (ctr%BUNCH == 0)
			/* do stuff */
##	}

Note that this works only if the "stuff" can be embedded inline.  If
you need instead the semantics of a function that gets called
externally and returns the next 20 rows each time, this syntax isn't
powerful enough.  You need something that "saves context".  In SQL
such context savers are known as cursors.

-- Jon
--

gupta@cullsj.UUCP (Yogesh Gupta) (03/29/89)

In article <1707@anasaz.UUCP>, john@anasaz.UUCP (John Moore) writes:
> Context: Relational Database in Transaction Processing (high performance)
> 
> I have a relation that consists of a large (>500,000) set of last and
> first names, and an application that needs to do partial searches of
> the list. For example, the query might be to find names starting with
> "mo". A further restriction is that it must be in alphabetical order.
> Finally (and the only tough one for me) is that I only want the first
> 20 names (for display on a screen). There may be 10,000 names that
> meet the criteria. Also, the list is dynamic - names are continuously
> being added and deleted by multiple processes.
> 
> If the relation is indexed on last name, it should be possible to
> do this, but how do I (in SQL in general, and Informix-Turbo in particular):
> 
> (1) Make sure that the set of all tuples matching the query not be sorted
>     before I get the records (i.e. it is prohibitively expensive if the
>     database engine extracts a temporary file of the 10,000
>     names, sorts them, and then presents them to my process).

If your partial search on names contains known prefixes* (like the example
you gave) and there is an index on that column, and the index is worth using**,
any good optimizer will figure out that the sort is unnecessary.  e.g.

	Table t1 has index on last_name.

	select * from t1 where last_name like 'smi%'
	    order by last_name;

should do the trick.

> (2) Limit the query to the first 20 names - it is also too expensive
>     if I get 10,000 names and just throw away the last 9980 of them.

In the C program, just declare a cursor for the above select, open it,
and fetch the first 20 rows only.

> (3) Later on, without holding open a cursor, get the NEXT 20 names.

Why not keep the cursor open?  After all, that is what lets the DBMS
know where you were in your select.

If you are concerned about locking the data that has been read, note that
not keeping the cursor open implies that the first 20 rows may have changed
by the time you look for the "next 20".  If you are willing to live with this,
you need a DBMS that supports the Cursor Stability mode of consistency.  Then,
only the last row fetched by this cursor will be locked (if the DBMS supports
row-level locking, otherwise the page containing the last row fetched will be
locked).

> particular case of Informix-Turbo with C embedded SQL.

Don't know whether Informix-Turbo optimizer eliminates the sort or
not when it is not necessary.  Also, do not know whether it supports
cursor stability, row-level locking ...

--
*  If the partial searches are of the form (name = "%son") (known suffixes),
I do not know of any system that will help you.
** An index may not be worth using given the statistics about the information,
the physical clustering of the data, and the given query.
-- 
Yogesh Gupta                    | If you think my company will let me
Cullinet Software, Inc.         | speak for them, you must be joking.

gupta@cullsj.UUCP (Yogesh Gupta) (03/29/89)

> In article <1707@anasaz.UUCP>, john@anasaz (John Moore) writes:
> >A further restriction is that it must be in alphabetical order.
>

To which, jkrueger@daitc.daitc.mil (Jonathan Krueger) replies:
> But you said you didn't care if they
> were sorted, you just wanted no extra cost for it.

???
-- 
Yogesh Gupta                    | If you think my company will let me
Cullinet Software, Inc.         | speak for them, you must be joking.

emuleomo@yes.rutgers.edu (Emuleomo) (03/30/89)

It seems that the answer to your problem will be to 

1) create an index on the last names 
2) Make sure you have a serial fld in your table
3) write the following queries


 declare cusrs1 cursor for
 select * from tab1 where last_name >= :last_name
 for update;

  /* 'for update' gaurantees row-level locking!! */

 open curs1;

 repeat
 fetch next curs1 into :serial_fld :last_name_fld  etc... ;
 until(enough_records_fetched || SQLNOTFOUND);

 Now, to find out if any particular record still exists, all you need to
 do is access it via its serial fld.

 -- Emuleomo O.O. 
 Systems Automation.
 Now to find out

jkrueger@daitc.daitc.mil (Jonathan Krueger) (04/03/89)

John Moore wrote:
>> >A further restriction is that it must be in alphabetical order.

I proved that I didn't read his article carefully:
>> But you said you didn't care if they
>> were sorted, you just wanted no extra cost for it.

Yogesh Gupta spotted my mistake.  Thanks to Yogesh, apologies to John.

As noted by others, your query optimizer should decide to perform the
select first, the sort on the selected rows second.  If it's really
good it might skip the sort if the storage structure maintains an
order.  Failing that, a single pass to verify that it's sorted should
cost little.  Failing even that, you can skip the explicit "order by"
requirement and depend on side effects of the storage structure.

-- Jon
--

leo@philmds.UUCP (Leo de Wit) (04/06/89)

In article <1707@anasaz.UUCP> john@anasaz.UUCP (John Moore) writes:
    []
|If the relation is indexed on last name, it should be possible to
|do this, but how do I (in SQL in general, and Informix-Turbo in particular):
|
|(1) Make sure that the set of all tuples matching the query not be sorted
|    before I get the records (i.e. it is prohibitively expensive if the
|    database engine extracts a temporary file of the 10,000
|    names, sorts them, and then presents them to my process).
|(2) Limit the query to the first 20 names - it is also too expensive
|    if I get 10,000 names and just throw away the last 9980 of them.
|(3) Later on, without holding open a cursor, get the NEXT 20 names.

Several people said some sensible things about this, so I'll just add
my 2 cents (no real solution for how to limit the query was presented):

Besides using an index on the attribute to be sorted upon (to avoid
ORDER BY), you can narrow down the number of rows to be retrieved
already in the WHERE clause. If you're opening a cursor to fetch the
rows, the 10,000 rows satisfying the query would have to be identified,
even if you don't actually fetch them (if I'm not mistaken). By
narrowing down the number of rows beforehand you should be able to
avoid even this identifying.

This restriction would go along the lines of:

    select ...
    from ...
    where rownum <= 20
    and ...

(rownum is a pseudo-column that contains the number of the tuple
retrieved.  Oracle has it, but I don't know whether this also is in
ANSI SQL).

The next 20 records are also easy:

    select ...
    from ...
    where rownum between 21 and 40
    and ...

Of course you should use host variables to make this 'portion of rows
selection' variable.

	 Leo.

john@anasaz.UUCP (John Moore) (04/08/89)

In article <513@cullsj.UUCP> gupta@cullsj.UUCP (Yogesh Gupta) writes:
>In article <1707@anasaz.UUCP>, john@anasaz.UUCP (John Moore) writes:
>> Context: Relational Database in Transaction Processing (high performance)
>> (3) Later on, without holding open a cursor, get the NEXT 20 names.
>
>Why not keep the cursor open?  After all, that is what lets the DBMS
>know where you were in your select.

The problem here (note the context) is that there are too many users
to keep one process per terminal. Most RDBMS' have cursors associated
with processes. We want to use a few processes as transaction servers
for lots (>1000) of terminals. Hence the question.
-- 
John Moore (NJ7E)           mcdphx!anasaz!john asuvax!anasaz!john
(602) 861-7607 (day or eve) long palladium, short petroleum
The opinions expressed here are obviously not mine, so they must be
someone else's. :-)