[comp.lang.smalltalk] performance of IdentityDictionary in st80 r4

huggins@ticipa.ti.com (Gray Huggins) (06/12/91)

Hi,
  We have noticed a big difference between populating an IdentityDictionary
with 15000 entries and one with 30000 entries.  

 - Why?
 - What can be done to improve performance?
 - Why do we get "a primitive has failed" when we do really big stuff without
   garbage collecting?

  Here is our test

1)
| a |
a := IdentityDictionary new.
Time millisecondsToRun: [
        1 to: 15000 do: [ :i |
                a at: DummyClass new put: i]]

Result => 11390 milliseconds

2)
| a |
a := IdentityDictionary new.
Time millisecondsToRun: [
        1 to: 30000 do: [ :i |
                a at: DummyClass new put: i]]

Result =>  1049351 milliseconds

Regards,
--
Gray Huggins                   Internet: huggins@ticipa.csc.ti.com
Texas Instruments                                          
PO Box 655012  M/S 3635        TI MSG:  GHUG
Dallas, TX 75265               Voice:   (214) 917-2202

hubert@cs.uni-sb.de (Hubert Baumeister) (06/17/91)

In article <HUGGINS.91Jun12092724@dino.ti.com>, huggins@ticipa.ti.com (Gray Huggins) writes:
|> Hi,
|>   We have noticed a big difference between populating an IdentityDictionary
|> with 15000 entries and one with 30000 entries.  
|> 
|>  - Why?
|>  - What can be done to improve performance?
|>  - Why do we get "a primitive has failed" when we do really big stuff without
|>    garbage collecting?
|> 
|>   Here is our test
|> 
|> 1)
|> | a |
|> a := IdentityDictionary new.
|> Time millisecondsToRun: [
|>         1 to: 15000 do: [ :i |
|>                 a at: DummyClass new put: i]]
|> 
|> Result => 11390 milliseconds
|> 
|> 2)
|> | a |
|> a := IdentityDictionary new.
|> Time millisecondsToRun: [
|>         1 to: 30000 do: [ :i |
|>                 a at: DummyClass new put: i]]
|> 
|> Result =>  1049351 milliseconds
|> 
|> Regards,
|> --
|> Gray Huggins                   Internet: huggins@ticipa.csc.ti.com
|> Texas Instruments                                          
|> PO Box 655012  M/S 3635        TI MSG:  GHUG
|> Dallas, TX 75265               Voice:   (214) 917-2202

When storing an association in an IdentityDictionary Smalltalk uses a hash function
on the key element. If there are collisions it looks for nearest free entry.
The result of the hash function defined in Object which is the one always used (except
for Characters and Smallintegers) is a 14 bit quantity. Thus the result of the
hash function is a value between 0 and 16383. If you create 15000 objects each object
gets a unique hash value. That happens in the first case. But if you create 30000 objects
then after 16384 objects the hash values repeat itself. Thus after 16384 objects there
are only collisions which are resolved by looking for the next free entry. But since
there are no 'holes' in the set of hash values, for each new entry about 7000 and more
elements have to be accessed.

A quick solution would be to redefine the method identityHash in a subclass of Object to
give a more suitable result. If you know that you will have 30000 elements in your
dictionary then modify the identityHash method of DummyClass as follows:
identityHash
	^super identityHash * 2
This will not avoid collisions after 16000 elements but in average an empty slot is found
after one search. In the example above the time increases only by a factor 3. But this
won't work for even higher numbers of elements.

Hubert

-- 
					Hubert Baumeister
					Max-Planck-Institut f"ur Informatik
					Im Stadtwald
					6600 Saarbr"ucken

					Telefon: (x49-681)302-5432
					e-mail: hubert@cs.uni-sb.de

objtch@extro.ucc.su.OZ.AU (Peter Goodall) (06/18/91)

huggins@ticipa.ti.com (Gray Huggins) writes:

>Hi,
>  We have noticed a big difference between populating an IdentityDictionary
>with 15000 entries and one with 30000 entries.  

> - Why?
> - What can be done to improve performance?
> - Why do we get "a primitive has failed" when we do really big stuff without
>   garbage collecting?

>  Here is our test

>1)
>| a |
>a := IdentityDictionary new.
>Time millisecondsToRun: [
>        1 to: 15000 do: [ :i |
>                a at: DummyClass new put: i]]

>Result => 11390 milliseconds

>2)
>| a |
>a := IdentityDictionary new.
>Time millisecondsToRun: [
>        1 to: 30000 do: [ :i |
>                a at: DummyClass new put: i]]

>Result =>  1049351 milliseconds

>Regards,
>--
>Gray Huggins                   Internet: huggins@ticipa.csc.ti.com
>Texas Instruments                                          
>PO Box 655012  M/S 3635        TI MSG:  GHUG
>Dallas, TX 75265               Voice:   (214) 917-2202

Greetings,

As another reply suggested the problem is probably in the hash table used 
to store the objects. I had a problem on a client site with Smalltalk/286.
I told them that it appeared that their perfomance was being eaten in 
SymbolTable hashing. Often, if you don't have a profiler, you can see what
the system is doing by creating a user-break in the system and looking at
the Smalltalk stack. If you see the same method sequence a lot, that's
likely where the time is going.

I eventually built an interrupt-driven profiler to prove a few points. We
looked at rewriting the hashing function, but after a little
experimentation, I changed the load-factor in the hash-tables from the
85% or so to 55% - and the system increased its speed by about 40%.

I did lots of tests to measure the number of collisions in the has function
and found it was horrific with about 7000 symbols, which was thereal size
of the SymbolTable.

I notice that Digitalk has dropped the loading to about 50% in Smalltalk/V
WIndows. 

Advice: Beg borrow or buy a profiling system, and read Knuth on hash
functions.

Bye,
-- 
Peter Goodall - Smalltalk Systems Consultant - objtch@extro.ucc.su.oz.au
      ObjecTech Pty. Ltd. - Software Tools, Training, and Advice
162 Burns Bay Rd, LANE COVE, NSW, AUSTRALIA. - Phone/Fax: +61 2 418-7433