[comp.sys.mac.programmer] New Disk Scanning Algorithm

jln@accuvax.nwu.edu (John Norstad) (04/13/89)

In Tech Note #68, "Searching All Directories on an HFS Volume", Apple 
gives a very simple algorithm for disk scanning.  There's a problem with
this algorithm, however, which I discovered while working on my anti-virus 
program Disinfectant.  I've come up with an improved algorithm that 
solves the problem.  The new algorithm will be part of Disinfectant version
1.1, which we hope to release early next week.

I've wanted to "publish" this new algorithm so that everyone can
benefit from it.  comp.sys.mac.programmer seems as good a place as any!

Please understand that this problem is not a "bug" in Disinfectant 1.0,
despite what MacWeek has to say :-)  The "bug" is shared by any program
which uses the TN 68 algorithm to do disk scanning, which I suspect is all
programs which do disk scanning.

The basic idea outlined in Tech Note #68 is to make indexed calls to the
PBGetCatInfo file manager routine.  We'll use (abuse) the following
notation for these calls:

   r = PBGetCatInfo(d, i, o)
   
means "call PBGetCatInfo to get the i'th object o in directory d, with
result code r."  Note that r will be non-zero if there are no more objects 
in the directory.

The algorithm in TN 68, expressed in pseudo-c and stripped of all the
bells and whistles, is as follows:

   i = 1
   while (true) {
      if (PBGetCatInfo(d, i, o)) break
      if o is a subdirectory call ourselves recursively to scan o
      if o is a file scan it
      i++
   }
   
This algorithm seems quite simple and fool-proof at first glance, but it
only works if you assume that no other users or tasks are creating or
deleting files or directories while the scan is in progress.

As an extreme example, suppose we're scanning a server volume that contains
two files named A and B and a directory C that contains another 1000 files.
Suppose that while we're scanning file B some other user deletes file A.
Our index i in the above algorithm is 2 while we're scanning file B.  When
we finish scanning file B we increment i to 3 and loop, calling
PBGetCatInfo to get the third object in the directory.  But there are
now only two objects in the directory (B and C), so the PBGetCatInfo call
returns a non-zero result code and we break out of the loop and quit.  The
net result is that we end up scanning only 2 out of the 1002 total files
on the server!

This problem is most serious when scanning server volumes, where the 
probability of other users creating or deleting objects is often
significant.  The problem can also occur on local volumes under MultiFinder
if other tasks are creating or deleting objects during a scan, or if our
program itself creates or deletes objects on the volume during the scan.
(Disinfectant 1.0 suffers from all three problems, but only the server
problem is really serious.)

My solution is quite simple.  I simply recall PBGetCatInfo immediately
after scanning an object to see if it has changed its position in the
directory.  If the position has changed, I rescan the directory to attempt
to locate the new position.

The revised algorithm is:

   i = 1
   while (true) {      
      if (PBGetCatInfo(d, i, o)) break
      if o is a subdirectory call ourselves recursively to scan o
      if o is a file scan it
      n = the name of object o
      if (!PBGetCatInfo(d, i, o)) {  /* recall PBGetCatInfo */
         m = the name of object o
         if (n == m) {              /* usual case - no position change */
            i++                     /* continue scan with next object */
            continue
         }
      }
      oldi = i                     /* save our old location */
      i = 1                        /* start looking for our new location */
      while (true) {
         if (PBGetCatInfo(d, i, o)) {
            i = oldi               /* just in case we've been deleted in
            break                  /* the last few milliseconds */
         }
         m = the name of object o
         if (n == m) {             /* found new location */
            i++                    /* continue scan with next object */
            break
         }
         i++
      }
   }

There is still an unavoidable window in this algorithm where our 
PBGetCatInfo indices can get out of synch with reality, but it is now
only milliseconds wide instead of seconds or even minutes wide.  So the
new algorithm is still not perfect, but it's orders of magnitude better than
the old naive one.

In my first attempt to design this new algorithm I tried to be fancy -
I didn't rescan from the beginning of the directory, but I instead tried
to scan backwards or forwards from the current position.  This technique
was slightly faster, but assumed that the directory was maintained in 
alphabetical order using the RelString toolbox routine with caseSens=false
and diacSens=true.  This works OK on normal volumes, but with foreign file
systems and in other "non-standard" cases we can't assume that directories
are in any particular order.  The final algorithm presented above does
not depend on directories being maintained in any particular order.

Please note that my new algorithm hasn't yet been put to the acid test
of use by millions of real live users.  But I think it's reasonable and
it has worked just fine in my tests.  Apple, of course, knows nothing
about all this.  If they did they'd probably tell me that it would break
in system 7.0 :-)  So use it at your own risk, etc., etc.

It's interesting that this problem is not shared by UNIX and other operating
systems.  In UNIX once an entry is made in a directory its position never
changes.  When entries are deleted they're simply marked "unused".  The
system does not attempt to move all the following entries down to close up
the hole.  There is no attempt made to keep the directories in any 
particular order.

The new algorithm is part of the reusable module scan.c, which is part
of the "public" source code of Disinfectant.  Write to me at the address
below if you'd like a copy.

Please excuse the length of this posting.  I thought this was a nifty
trick, and there might be others who will find it useful.

John Norstad
Academic Computing and Network Services
Northwestern University

Bitnet:      jln@nuacc
Internet:    jln@acns.nwu.edu
AppleLink:   a0173
CompuServe:  76666,573