[comp.databases] need HELP with index problem

citron@cit-vax.Caltech.Edu (Mark C. Citron) (12/24/88)

Hope someone can understand what is going wrong here. I have a table in
a database that has three indices: 
a composite unique index made up of date, name1 and slotnumber
an index of name2
an index of unitnumber

there are about 50,000 rows in the table. In about 90% of the rows the
unitnumber is 0. In time these will be changed to other numbers (the
slots will be "filled"). 

Now here is the problem: When I update the unitnumber field
and that field already contains a number OTHER THAN 0 
the update goes very fast.  GREAT!
Updating the unitnumber field to 0 (independent of what it was) also
goes very fast (less than a sec).
HOWEVER, when I update the unitnumber field
and that field already contains a 0, the update takes forever (about 30
sec). There is a lot of disk accessing.

This is an Informix database. I might guess that the problem is
that although unitnumber is indexed, there are so many rows with
the same unit number that Informix has trouble locating the
record to update. Is this correct? How do I get around this?
This performance is awful!

Thanks for any (ANY!) suggestions. Write or post.
citron@csvax.caltech.edu
citron@cit-vax.UUCP

ben@calvin.sybase.com (ben ullrich) (12/26/88)

In article <8973@cit-vax.Caltech.Edu> citron@cit-vax.UUCP (Mark C. Citron) writes:
>Hope someone can understand what is going wrong here. I have a table in
>a database that has three indices: 
>a composite unique index made up of date, name1 and slotnumber
>an index of name2
>an index of unitnumber
>
>Now here is the problem: When I update the unitnumber field
>and that field already contains a number OTHER THAN 0 
>the update goes very fast.  GREAT!
>Updating the unitnumber field to 0 (independent of what it was) also
>goes very fast (less than a sec).

yes. the time goes into finding the row to update, not into what value it ends
up being.

>HOWEVER, when I update the unitnumber field
>and that field already contains a 0, the update takes forever (about 30
>sec). There is a lot of disk accessing.
>
>This is an Informix database. I might guess that the problem is
>that although unitnumber is indexed, there are so many rows with
>the same unit number that Informix has trouble locating the
>record to update. Is this correct? How do I get around this?
>This performance is awful!

your assumption about ``having trouble locating the record to update''
sounds correct to me. with all those rows having the same unit number, it
probably has to do something really time-consuming, like a table scan.

it also sounds to me like you're using only the unit number to qualify the row.
i'd suggest using other fields in the row (preferably ones contained in one
of the other indexes described above) to further qualify the row. hopefully
informix is smart enough to use one of the other indexes when you give it
fields within it instead of doing a table scan through all those 0 unit numbers.
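ben's suggestion can be sketched concretely. SQLite stands in for Informix
here, and the table layout is reconstructed from Mark's description, so treat
the names and the planner behavior as an illustration only:

```python
import sqlite3

# Stand-in schema based on the thread (SQLite, not Informix; the column
# and table names are reconstructed from Mark's post).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE slots
               (date TEXT, name1 TEXT, slotnumber INTEGER,
                name2 TEXT, unitnumber INTEGER)""")
con.execute("CREATE UNIQUE INDEX ix_main ON slots(date, name1, slotnumber)")
con.execute("CREATE INDEX ix_unit ON slots(unitnumber)")
con.executemany("INSERT INTO slots VALUES (?,?,?,?,?)",
                [("1988-12-24", "a", n, "b", 0) for n in range(100)])

# Qualifying by unitnumber alone forces a crawl through the duplicates.
# Adding columns covered by the unique composite index lets the planner
# pin down the single target row instead:
con.execute("""UPDATE slots SET unitnumber = 7
               WHERE date = '1988-12-24' AND name1 = 'a' AND slotnumber = 42""")
row = con.execute("SELECT unitnumber FROM slots WHERE slotnumber = 42").fetchone()
print(row[0])  # 7 -- only the targeted row was touched
```

The point is not the exact SQL but the qualification: give the optimizer a
selective index to work with and the duplicate chain never enters the picture.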


...ben
---
ben ullrich					consider my words disclaimed
sybase, inc.		"everybody gets so much information all day long that
emeryville, ca			they lose their common sense." -- gertrude stein
(415) 596 - 3654
ben%sybase.com@sun.com		{pyramid,pacbell,sun,lll-tis,capmkt}!sybase!ben

davek@infmx.UUCP (David Kosenko) (12/28/88)

In article <8973@cit-vax.Caltech.Edu>, 
	citron@cit-vax.Caltech.Edu (Mark C. Citron) writes:
> Hope someone can understand what is going wrong here. I have a table in
> a database that has three indices: 
> an index of unitnumber
> 
> there are about 50,000 rows in the table. In about 90% of the rows the
> unitnumber is 0. In time these will be changed to other numbers (the
> slots will be "filled"). 
 [ problem description deleted ]

	An index on a field that has 90% duplicate entries is only
useful when accessing (or updating) records where the value is other
than that field; the index can be used to get to the record quickly.
Thus, update where unitnumber = 1 (or any value other than 0) will
work quite nicely.  When the value is 0, however, the index becomes
not only useless, but detrimental; a sequential access would be better
in this case.  Unfortunately, the optimizer does not have the means to
detect this (at least not now).  Using the index to grab all those
records often means a disk read for each index entry, hence the poor
performance.  This is expected behavior.

	You may want to reconsider the index on unitnumber.  If most
of your accesses are where unitnumber = 0, the index is doing more
harm than good.  If, however, you tend to access where unitnumber != 0
most often, you probably want to keep it around, though at the cost
of the performance you are now seeing.

	Also, if you can provide values for the other indexed fields, the
queries may run faster (i.e., other indexes can be used when
unitnumber = 0).  Providing values that match the composite index is
best, since that index is unique.
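The arithmetic behind Dave's selectivity point, using the figures from the
thread (50,000 rows, ~90% of them with unitnumber = 0):

```python
rows = 50_000
dup_fraction = 0.90                   # ~90% of rows have unitnumber = 0

zero_rows = int(rows * dup_fraction)  # entries the index must walk for = 0
other_rows = rows - zero_rows         # rows spread over all other values

# Each index entry followed for unitnumber = 0 can cost a disk read, so
# the = 0 case touches ~45,000 entries; any given nonzero value touches
# only a handful out of the remaining 5,000.
print(zero_rows, other_rows)  # 45000 5000
```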

	Hope this helps.

Dave
-- 
Disclaimer:  The opinions expressed herein |"As we hang beneath the heavens
are by no means those of Informix Software | and we hover over hell
(though they make you wonder about the     | our hearts become the instruments
 strange people they hire).                | we learn to play so well"

davek@infmx.UUCP (David Kosenko) (12/28/88)

In article <710@infmx.UUCP>, davek@infmx.UUCP (David Kosenko) writes:
> 	An index on a field that has 90% duplicate entries is only
> useful when accessing (or updating) records where the value is other
> than that field; the index can be used to get to the record quickly.
            ^^^^^ this should be "value" (i.e., 0), not "field"

My apologies for the flub.

Dave
-- 
Disclaimer:  The opinions expressed herein |"As we hang beneath the heavens
are by no means those of Informix Software | and we hover over hell
(though they make you wonder about the     | our hearts become the instruments
 strange people they hire).                | we learn to play so well"

cjh@hpausla.HP.COM (Clifford Heath) (12/28/88)

Sorry, couldn't get to you by E-mail.  Anyhow, this is interesting to others.

> Hope someone can understand what is going wrong here. I have a table in
> a database that has three indices: 
> a composite unique index made up of date, name1 and slotnumber
> an index of name2
> an index of unitnumber

> there are about 50,000 rows in the table. In about 90% of the rows the
> unitnumber is 0. In time these will be changed to other numbers (the
> slots will be "filled"). 

> HOWEVER, when I update the unitnumber field
> and that field already contains a 0, the update takes forever (about 30
> sec). There is a lot of disk accessing.

This is because Informix uses a 16-bit "duplicate number" in order to
maintain the proper (ANSI COBOL required) historical sequencing of
duplicates.  In order to find the BTREE index entry to delete, it cannot
search for the appropriate duplicate by number, since that information
is only in the index, but rather must search the entire chain of
duplicates to find the index entry that matches the record number being
deleted.  In this case up to 45,000 entries (over 22,000 on average)
must be searched.  Because of the time-ordering, recent entries
are at the end of the list, resulting in worst-case searching.
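A toy model of the behavior Clifford describes (assumed from his description,
not taken from Informix source): duplicates sit in a time-ordered chain, and
deleting an index entry means walking the chain until the record number
matches.

```python
def chain_delete_cost(chain, recno):
    """Comparisons needed to find recno in a time-ordered duplicate chain."""
    for i, entry in enumerate(chain):
        if entry == recno:
            return i + 1
    raise KeyError(recno)

# ~45,000 duplicates for unitnumber = 0, appended in insertion order.
chain = list(range(45_000))

print(chain_delete_cost(chain, 0))       # 1 -- oldest entry, found at once
print(chain_delete_cost(chain, 44_999))  # 45000 -- newest entry, full scan
# The i-th entry costs i+1 comparisons, so the mean over the chain is:
print((1 + len(chain)) / 2)              # 22500.5
```

Since updates tend to hit recently inserted rows, the full-chain scan is the
common case, which matches the 30-second updates Mark reports.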

This is a good case for requesting Informix to add a new flag to be used
for creation of duplicate indices, saying "Don't bother time-ordering
duplicates".

BTW, if you get to the limit of 65K duplicates in a chain, bcheck will
*ALWAYS* report "index out of order", simply because the duplicate number
wraps around and bcheck notices it.  I don't know what the wrapping does
to time-ordering of duplicates, but it doesn't seem to corrupt the index
otherwise.

Disclaimer:
This information is based on version 2.10 of C/ISAM* source code, and so
may be out of date.  * C/ISAM is a trademark of Informix Corp.
This article contains only opinions that are my own, and may be wrong.

Clifford Heath, HP Australian Software Operation. ACSnet: cjh@hpausla.oz,
Internet: cjh%hpausla@hplabs.HP.COM, UUCP: hplabs!hpfcla!hpausla!cjh

aland@infmx.UUCP (Dr. Scump) (12/28/88)

In article <8973@cit-vax.Caltech.Edu>, citron@cit-vax.Caltech.Edu (Mark C. Citron) writes:
> Hope someone can understand what is going wrong here. I have a
> table in a database that has three indices: 
> a composite unique index made up of date, name1 and slotnumber
> an index of name2
> an index of unitnumber
>
> there are about 50,000 rows in the table. In about 90% of the rows the
> unitnumber is 0. In time these will be changed to other numbers (the
> slots will be "filled"). 

I am assuming that you query on this field often, looking for the 
"exceptions" (e.g. those rows with "filled" slots).  If you do *not*
search on this field often, then having an index on it is counter-
productive.

> Now here is the problem: When I update the unitnumber field
> and that field already contains a number OTHER THAN 0 
> the update goes very fast.  GREAT!

Well, that's nice to hear :-]

> Updating the unitnumber field to 0 (independent of what it was) also
> goes very fast (less than a sec).

Now I'm confused.  Well, maybe I'm not.  Lessee... in other words, 
as long as the field's value *before the update* was other than 0, it 
goes fast.  OK, makes sense...

> HOWEVER, when I update the unitnumber field
> and that field already contains a 0, the update takes forever (about 30
> sec). There is a lot of disk accessing.

Yikes!  30 seconds is forever?  I'm older than I thought!  :-]
Seriously, though, this makes sense.  When a field contains a lot of
duplication, those "identical" index entries (since you had just the
one field in that index) must be traversed sequentially, because the
engine can't tell whether any other WHERE clause criteria are
satisfied just by looking at that particular index.

When you run your update, the proper row(s) must be identified first.
*That* is what takes so much time; since you have an index on that
exact field already, the optimizer will use that index first to 
qualify the needed row(s).  The high degree of disk usage that you see
is the traversal of these identical entries in the B+ index tree,
checking the other search criteria for each of these rows which
have unitnumber = 0.

> This is an Informix database. I might guess that the problem is
> that although unitnumber is indexed, there are so many rows with
> the same unit number that Informix has trouble locating the
> record to update. Is this correct? How do I get around this?
> This performance is awful!

Exactly.  Not so much that there is "trouble locating the record" 
as "we have to check every record qualified by this value in the
index."  Same effect.  

To avoid this duplicate search, the best bet is to make that index
a composite index also.  I would guess that this update you run
probably uses the same criteria each time, just with different 
values; you probably search for something like:
    .. WHERE unitnumber = 0  AND slotnumber = 32768
or 
    .. WHERE unitnumber = 0  AND date = today
or whatever.  The solution to this is simple: create that index for
unitnumber as a *composite* index, tacking whatever other field is
your most frequent "additional" search criteria after the unitnumber.
For example, if the first criterion above is what you use, create
a composite of unitnumber and slotnumber, thus:
    CREATE INDEX unitnslot ON tablename(unitnumber, slotnumber)

or, if the second criterion above is what you use, create
a composite of unitnumber and date:
    CREATE INDEX unitndate ON tablename(unitnumber, curdate)

*instead of* the singleton index on unitnumber.  Now, there will be
very few duplicates (or none at all) in the index.  Also, for a compound
condition search (e.g. WHERE unitnumber = 0 AND slotnumber = 32768),
the composite index will be used directly to find what should be only
a few rows (and therefore find them quickly).  If a search is performed
on unitnumber alone, it will still be fast, since the optimizer can use
composite indexes for searches on the first column in the index as
easily as if it were a singleton index on that column.  (Of course, if
you search on unitnumber=0 alone, it will be slower because you will 
be qualifying some 45,000 rows).
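The composite-index fix and the prefix property can be checked against any
B-tree-backed SQL engine. A sketch with SQLite (not Informix, and planner
details differ, but the idea carries over; the table and index names follow
the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tablename
               (unitnumber INTEGER, slotnumber INTEGER, name2 TEXT)""")
# The composite index *instead of* the singleton on unitnumber:
con.execute("CREATE INDEX unitnslot ON tablename(unitnumber, slotnumber)")
con.executemany("INSERT INTO tablename VALUES (?,?,?)",
                [(0, n, "x") for n in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan text in the last column.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Compound search: both index columns narrow the lookup.
p1 = plan("SELECT * FROM tablename WHERE unitnumber = 0 AND slotnumber = 42")
# Prefix search: the composite still serves unitnumber alone.
p2 = plan("SELECT * FROM tablename WHERE unitnumber = 0")
print("unitnslot" in p1, "unitnslot" in p2)  # True True
```

Note that SQLite happily uses the index for the unitnumber = 0 prefix search
too, even though every row here matches: without statistics it has no way to
see the skew, which echoes Dave's point about the optimizer lacking the means
to detect this case.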

By the way, there is a limit of 64K occurrences of the same value
in a C-ISAM index (in the current and past releases, anyway).
If you had a column with, say, the value 0 in 70,000 of the rows and
had a singleton index on that column, you would have problems with
that index.  (BCHECK and CHECK TABLE would complain).

> Thanks for any (ANY!) suggestions. Write or post.
> citron@csvax.caltech.edu
> citron@cit-vax.UUCP

You're welcome.

-- 
 Alan S. Denney  |  Informix Software, Inc.  |  {pyramid|uunet}!infmx!aland
 Disclaimer: These opinions are mine alone.  If I am caught or killed,
             the secretary will disavow any knowledge of my actions.
 Santos' 4th Law: "Anything worth fighting for is worth fighting *dirty* for"