[comp.lang.perl] Pattern matching of a large dbm file

rahardj@ccu.umanitoba.ca (Budi Rahardjo) (05/06/91)

How do I get a fast pattern matching in a large associative array ? 

I have a large associative array (approx. 17000), read from a dbm file.
I'd like to match a pattern with the contents of the array.
Assigning the index/keys to an array takes forever ... , never mind 
doing the matching. 
 
dbmopen(RECORD,"dbtest",0666) || die "Can't open dbm file";
@k = keys(%RECORD);  <====== this takes almost 9 seconds !!      
 
For comparison, doing a grep thru a flatfile (with the same contents)
takes 2 secs.
 
If I do several  "@k = keys(%RECORD);", this is going to be 
a problem. 
 
Is there a way I can iterate over dbm file without stepping thru
the index/keys ? (Basically doing a 'grep' on the contents).
Or maybe read the raw-dbm file (without the index), ie treat it
as a flat file ... 
 
Thanks, 
-- budi

merlyn@iwarp.intel.com (Randal L. Schwartz) (05/07/91)

In article <1991May6.163310.21805@ccu.umanitoba.ca>, rahardj@ccu (Budi Rahardjo) writes:
| For comparison, doing a grep thru a flatfile (with the same contents)
| takes 2 secs.
|  
| If I do several  "@k = keys(%RECORD);", this is going to be 
| a problem. 
|  
| Is there a way I can iterate over dbm file without stepping thru
| the index/keys ? (Basically doing a 'grep' on the contents).
| Or maybe read the raw-dbm file (without the index), ie treat it
| as a flat file ... 

You mean, like:

while (($key,$val) = each(%RECORD)) {
	if ($key =~ /pattern/) {
		"someaction";
	}
}

or do you mean something else?

@_{4,3,2,1}=("hacker,","Perl ","another ","Just ");print while ($_,$_)=each _;
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

allbery@NCoast.ORG (Brandon S. Allbery KB8JRR/AA) (05/07/91)

As quoted from <1991May6.163310.21805@ccu.umanitoba.ca> by rahardj@ccu.umanitoba.ca (Budi Rahardjo):
+---------------
| Assigning the index/keys to an array takes forever ... , never mind 
| doing the matching. 
+---------------

Standard dbm is slow.

+---------------
| Is there a way I can iterate over dbm file without stepping thru
| the index/keys ? (Basically doing a 'grep' on the contents).
| Or maybe read the raw-dbm file (without the index), ie treat it
| as a flat file ... 
+---------------

Dbm files aren't stored in any format that's meaningful to humans.

You can combine the search with the processing, if it will make things faster:
the "each" operator returns the next available value each time it is called,
instead of doing it all at once.  Check the manual.

++Brandon
-- 
Me: Brandon S. Allbery			  Ham: KB8JRR/AA  10m,6m,2m,220,440,1.2
Internet: allbery@NCoast.ORG		       (restricted HF at present)
Delphi: ALLBERY				 AMPR: kb8jrr.AmPR.ORG [44.70.4.88]
uunet!usenet.ins.cwru.edu!ncoast!allbery       KB8JRR @ WA8BXN.OH

rahardj@ccu.umanitoba.ca (Budi Rahardjo) (05/07/91)

Just another perl hacker (Randal L. Schwartz) writes:
...[deleted, read previous posting] ...

>You mean, like:
>while (($key,$val) = each(%RECORD)) {
>	if ($key =~ /pattern/) {
>		"someaction";
>	}
>}

actually it's more like
... 
    if ($val =~ /$pattern/) { 
...

It's so slow. I am tempted to dump the dbm file into a flatfile
(keep 2 files, dbm and flat file), and grep the flatfile.

-- budi

oz@yunexus.yorku.ca (Ozan Yigit) (05/07/91)

Brandon S. Allbery writes:

> Standard dbm is slow.

The dbm/ndbm package (as with any other external hashing scheme) is
designed for a particular type of data access, and the result is a single
[on the average] access to database for locating a key and its associated
data.  There may actually be a a slower re-implementation of dbm/ndbm out
there, but in general, the dbm/ndbm and their re-implementations exist
exactly because they work extremely well for the type of access they were
designed for.

So, I can only assume that when you say standard dbm is slow, you really
mean a sequential pass over a database, which dbm is *not* designed for,
is slow.

In tested this assertion for the average use and I found that the usual
sequential scan of keys in the entire dict/words-generated database with
firstkey/nextkey/strncmp (using sdbm) is about 4 times slower than a
fread/strncmp loop over the entire flat version of the database.  I have
not tested the regular ndbm. [2]

On the other hand, BSD HASH [1] with ndbm compatibility routines ran as
fast as the fread/strncmp loop.

I have no idea why perl would be "so slow" in its sequential access
to the database.

oz
---
[1] Seltzer, Margo and Ozan Yigit, ``A New Hash Package for UNIX''
    Proceedings of the USENIX Technical Conference, Dallas, Texas.
    January 1991

[2] sdbm reads the entire database block-by-block during a sequential scan,
but cannot avoid holes, ndbm uses a very clever hash value reconstruction
algorithm to access only the occupied blocks, but is not noticably faster.

---
SunOS: There are many edible ways to serve | internet: oz@nexus.yorku.ca
up scrambled eggs, but one cannot get them | uucp: utzoo/utai!yunexus!oz
back as over-easy or sunny side up.        | phone: [416] 736 5257x33976

allbery@NCoast.ORG (Brandon S. Allbery KB8JRR/AA) (05/08/91)

As quoted from <22673@yunexus.YorkU.CA> by oz@yunexus.yorku.ca (Ozan Yigit):
+---------------
| Brandon S. Allbery writes:
| 
| > Standard dbm is slow.
| 
| So, I can only assume that when you say standard dbm is slow, you really
| mean a sequential pass over a database, which dbm is *not* designed for,
| is slow.
+---------------

I mean that *general* database operations, such as the original poster is
asking for, are slow.  dbm/ndbm have their place, but that place isn't and
isn't intended to be as a replacement for a regular database manager.  Or for
sequential or relative access files.

++Brandon
-- 
Me: Brandon S. Allbery			  Ham: KB8JRR/AA  10m,6m,2m,220,440,1.2
Internet: allbery@NCoast.ORG		       (restricted HF at present)
Delphi: ALLBERY				 AMPR: kb8jrr.AmPR.ORG [44.70.4.88]
uunet!usenet.ins.cwru.edu!ncoast!allbery       KB8JRR @ WA8BXN.OH

oz@yunexus.yorku.ca (Ozan Yigit) (05/08/91)

Brandon S. Allbery writes:

>dbm/ndbm have their place, but that place isn't and
>isn't intended to be as a replacement for a regular database manager.

I know, and I thought I did say something like that. ;-) Maybe Larry may
want to look into Marcus J. Ranum's b+tree package as a more general-purpose
database package for more general-purpose uses that perl users seem to want.

oz