[comp.bugs.4bsd] stealth technology for find

dls@mentor.cc.purdue.edu (David L. Stevens) (08/08/89)

Index:	/usr/src/usr.bin/find/find.c 4.3BSD

Description:
	find(1) changes atimes for all directories it searches, which makes
the "-atime" predicate less useful. The find(1) itself changes the directory
access times so they never age.

Repeat-By:
	find / -atime ...

Fix:
	Save the utimes for directories before reading them and restore them
on the way back up.  (Works for root or directory owners, only) Diffs follow:

*** OLD find.c	Mon Aug  7 22:30:29 1989
--- NEW find.c	Mon Aug  7 22:33:48 1989
***************
*** 6,11 ****
--- 6,14 ----
  #include <sys/param.h>
  #include <sys/dir.h>
  #include <sys/stat.h>
+ #ifdef	STEALTH
+ #include <sys/time.h>
+ #endif	/* STEALTH */
  
  #define A_DAY	86400L /* a day full of seconds */
  #define EQ(x, y)	(strcmp(x, y)==0)
***************
*** 665,670 ****
--- 668,676 ----
  	char *endofname;
  	auto char sbkeep_dir[MAXPATHLEN+MAXNAMLEN+2];
  	struct stat lstatb;
+ #ifdef	STEALTH
+ 	struct	timeval	tvp[2];
+ #endif	/* STEALTH */
  
  	if ((follow?stat(fname, &Statb):lstat(fname, &Statb))<0) {
  		fprintf(stderr, "find: bad status < %s >\n", name);
***************
*** 699,704 ****
--- 705,714 ----
  
  	if (chdir(fname) == -1)
  		return(0);
+ #ifdef	STEALTH
+ 	tvp[0].tv_sec = Statb.st_atime;
+ 	tvp[1].tv_sec = Statb.st_mtime;
+ #endif	/* STEALTH */
  	if ((dir = opendir(".")) == NULL) {
  		fprintf(stderr, "find: cannot open < %s >\n", name);
  		rv = 0;
***************
*** 725,730 ****
--- 735,743 ----
  ret:
  	if(dir)
  		closedir(dir);
+ #ifdef	STEALTH
+ 	(void) utimes(".", tvp);
+ #endif	/* STEALTH */
  	if (chdir(*sbkeep_dir ? sbkeep_dir : "..") == -1) {
  		*endofname = '\0';
  		fprintf(stderr, "find: bad directory <%s>\n", name);

-- 
					+-DLS  (dls@mentor.cc.purdue.edu)

jeffc@soba.osf.org (Jeff Carter) (08/08/89)

In article <3584@mentor.cc.purdue.edu> dls@mentor.cc.purdue.edu (David L. Stevens) writes:
>Index:	/usr/src/usr.bin/find/find.c 4.3BSD
>
>Description:
>	find(1) changes atimes for all directories it searches, which makes
>the "-atime" predicate less useful. The find(1) itself changes the directory
>access times so they never age.
>
>Repeat-By:
>	find / -atime ...
>
>Fix:
>	Save the utimes for directories before reading them and restore them
>on the way back up.  (Works for root or directory owners, only) Diffs follow:

Are you sure this is really 
(A) a bug?
(B) a fix?

Consulting my 4.3BSD manual, for stat(2):
  st_atime: Time when _file_ _data_ was last read or modified. Changed by the 
	following system calls: mknod(2), utimes(2), read(2), and write(2).
	For reasons of efficiency, st_atime is _not_ set when a directory is
	searched, although this would be more logical. [emphasis mine]

And running a quickie experiment on ULTRIX 3.0 (a Berkeley derivative):
  % date
  Tue Aug  8 09:39:44 EDT 1989
  % ls -ldg TRC
  drwxr-x---  6 jeffc    osf           512 Apr 25 11:15 TRC/	[modify time]
  % ls -ldug TRC
  drwxr-x---  6 jeffc    osf           512 Mar 11 14:15 TRC/	[access time]
  % find ./TRC -print
  [much output deleted]
  % ls -ldg TRC
  drwxr-x---  6 jeffc    osf           512 Apr 25 11:15 TRC/	[modify time]
  % ls -ldgu TRC
  drwxr-x---  6 jeffc    osf           512 Mar 11 14:15 TRC/	[access time]
  % find ./TRC -atime -2 -print
  [no output. i.e., no file/directory under ./TRC accessed in less than 2 days]
Access time did not change.

Additionally, the "fix" has the following effect:
4.3BSD utimes(2):
  The utimes call uses the "accessed" and "updated" [ should be "modified" ]
  times in that order from the tvp vector to set the corresponding recorded 
  times for _file_.

  The caller must be the owner of the file or the super-user. The 
  "inode-changed" time of the file is set to the current time.

The effect of this is to change st_ctime on every directory that you do this
to. Why is this bad (OK, suboptimal)? Because dump(8) uses st_ctime as 
one of the criteria for whether or not an inode should be dumped. This will
make dump run slower and take more tape. There may be other side-effects
that I am not aware of.
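
The side effect described above can be seen in a few lines of C. This is
only a sketch under stated assumptions: restore_times() is a hypothetical
helper, not code from find(1), and it zeroes tv_usec, which the posted diff
leaves unset. After the call the atime and mtime are back at the saved
values, but st_ctime has been set by the kernel to the time of the
utimes(2) call.

```c
#include <sys/stat.h>
#include <sys/time.h>

/*
 * Put the access and modification times recorded in *saved back on path,
 * then re-stat path into *after.  Returns 0 on success, -1 on error.
 * restore_times() is a hypothetical helper, not part of find(1).
 */
int
restore_times(const char *path, const struct stat *saved, struct stat *after)
{
	struct timeval tv[2];

	tv[0].tv_sec = saved->st_atime;	/* "accessed" comes first */
	tv[0].tv_usec = 0;		/* assumption: zero the microseconds */
	tv[1].tv_sec = saved->st_mtime;	/* "modified" comes second */
	tv[1].tv_usec = 0;

	if (utimes(path, tv) == -1)	/* owner or super-user only */
		return -1;
	return stat(path, after);	/* after->st_ctime is now current */
}
```

Comparing after->st_ctime with the value seen before the call shows the
change that dump(8) would pick up.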

Jeff Carter

dls@mentor.cc.purdue.edu (David L Stevens) (08/08/89)

	ARRRRRRRRRRRRRRRRRRRGH.

	I tested the stock 4.3 find(1) and it does not update atimes.
Apparently a local change has this side effect and has allowed me to
gracefully insert my foot in my mouth.
	My apologies to all and thanks to Jeff Carter for not believing
everything he reads.

-- 
					+-DLS  (dls@mentor.cc.purdue.edu)

dupuy@cs.columbia.edu (Alexander Dupuy) (08/08/89)

The only problem with your fix is that by resetting the atime of the directory
to the old time, you also set the ctime of the directory to be the current
time.  While this is not always a problem, in some circumstances, you may be
more concerned with preserving the ctimes (e.g. for incremental backup
purposes) than you are with preserving the atime.

You could make your stealth code conditional on some sort of option flag for
find, but some will certainly argue that find already has too many options, and
that the subtleties of ctime/[am]time interactions are a bit too much for most
users of find to grasp.

For reference, the rules are:

Files:

	atime:	updated when created, read().

	mtime:	updated when created, write(), truncate().

	ctime:	updated when created, write(), truncate(), chown/chmod(),
		utime/utimes(), link/unlink/rename() of self.

Directories:

	atime:	updated when created, read(), getdents/getdirentries().

	mtime:	updated when created, link/unlink/rename/rmdir() of entries.

	ctime:	updated when created, link/unlink/rename/rmdir() of entries,
		chown/chmod(), utime/utimes(), link/unlink/rename() of self. 

In general, the atime is updated whenever the data in a file is read, the mtime
is updated whenever the data in a file is modified, and the ctime is updated
whenever the data associated with the inode is changed.
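
One row of the table above, the effect of creating an entry on its parent
directory, can be checked with a short C sketch. The names
times_around_create() and OLD_TIME are made up for illustration. The helper
back-dates the directory first so that any change stands out. Note that the
back-dating itself goes through utimes(2), so the "before" ctime is already
current.

```c
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

#define OLD_TIME 1000000000L	/* arbitrary timestamp in the past (2001) */

/*
 * Back-date dir's atime and mtime, create the directory `child` inside
 * it, and report dir's inode times before and after the creation.
 * Returns 0 on success, -1 on error.  Hypothetical helper.
 */
int
times_around_create(const char *dir, const char *child,
    struct stat *before, struct stat *after)
{
	struct timeval old[2];
	int rv;

	old[0].tv_sec = old[1].tv_sec = OLD_TIME;
	old[0].tv_usec = old[1].tv_usec = 0;
	if (utimes(dir, old) == -1 || stat(dir, before) == -1)
		return -1;
	if (mkdir(child, 0700) == -1)	/* adds an entry to dir */
		return -1;
	rv = stat(dir, after);		/* mtime and ctime have moved */
	(void) rmdir(child);		/* tidy up; touches them again */
	return rv;
}
```

On return, before->st_mtime is OLD_TIME while after->st_mtime is the time
of the mkdir(2) call.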

@alex
-- 
inet: dupuy@cs.columbia.edu
uucp: ...!rutgers!cs.columbia.edu!dupuy

dls@mentor.cc.purdue.edu (David L Stevens) (08/09/89)

	For what it's worth, I have further information and I'm removing
my foot from my mouth and replacing it for the premature retraction....

	The "local changes" that caused find(1) to suddenly start changing
the atimes on directories were in fact the Tahoe changes and not something
we did. It was in fact the 4.3 version, not the Tahoe version, that I tested
and that did not have the problem.
	The STEALTH code avoids that and allows, for example, /tmp to be
cleared based on access times, without leaving a tree of empty directories
for some other cleanup method. Some have suggested that it be a command line
option.
	I don't have a good feel for the dump/ctime argument; all of the
directories generally aren't much compared to all of the files, anyway. At
any rate, the code is there for you to use or not. :-)

	Another find(1) question that we're addressing locally is the
unintuitive meaning of numbers in the comparisons. As it is, there are
three forms ("n", "+n" and "-n"). However, fractions are completely truncated
so to match a "-mtime +1" requires a file to actually be *two* days or older.
Files anywhere from 1 day and 1 second to 1 day, 23 hours, 59 minutes and
59 seconds old are all considered to be one day old and fail the "greater
than a day" test.
	I propose:

	1) To match "+n", a file need be n days + 1 second or older.
		(current: n days + 24 hours)
	2) to match "-n", a file need be n days - 1 second or younger.
		(current: same)
	3) to match "n", a file should be +/- a reasonable epsilon.
		(current: n+1 sec to n+23 hours 59 mins 59 secs)
		I suggest an hour, so files 23.00.01-25.00.59 would be
		considered an "exact" day-old match, but a file that's
		1 day, 22 hours old would not. Could also use epsilon in
		1) and 2) to maintain a dichotomy.

	The most obtuse example is a file that's 1 second short of two days
old and won't match on "+1", even though the file is in fact 1.99998 days old.
Most people'd call that 2, but anyone'd call it >1.
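
The difference between the two rules is plain arithmetic on the age in
seconds. The sketch below is not the code from find.c: match_current() and
match_proposed() are hypothetical names, n is always passed as a
non-negative count with the sign as a separate character (0 meaning no
sign), and the inclusive one-hour bounds for the exact match are an
assumption about where the epsilon edges fall.

```c
#define A_DAY	86400L		/* a day full of seconds */
#define EPSILON	3600L		/* one hour, the suggested tolerance */

/* Current rule: truncate the age to whole days, then compare. */
int
match_current(long age, long n, char sign)
{
	long days = age / A_DAY;	/* the fraction is discarded */

	if (sign == '+')
		return days > n;
	if (sign == '-')
		return days < n;
	return days == n;
}

/* Proposed rule: compare the age in seconds against n days. */
int
match_proposed(long age, long n, char sign)
{
	long diff = age - n * A_DAY;

	if (sign == '+')
		return diff > 0;	/* n days + 1 second or older */
	if (sign == '-')
		return diff < 0;	/* n days - 1 second or younger */
	return diff >= -EPSILON && diff <= EPSILON;
}
```

For a file one second short of two days old (age 172799), "+1" fails under
match_current() because 172799 / 86400 truncates to 1, while
match_proposed() accepts it because the age exceeds 86400 seconds.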
-- 
					+-DLS  (dls@mentor.cc.purdue.edu)

jgreely@oz.cis.ohio-state.edu (J Greely) (08/09/89)

In article <3608@mentor.cc.purdue.edu> dls@mentor.cc.purdue.edu
 (David L Stevens) writes:
>	Another find(1) question that we're addressing locally is the
>unintuitive meaning of numbers in the comparisons. As it is, there are
>three forms ("n", "+n" and "-n").

What's unintuitive?  If I say "-mtime +1", I certainly hope it is
interpreted as "two or more days".  I never had a problem with the
current style: 0 is the past twenty-four hours, +0 is anything before
that, and -1 is 0.  As long as the units are full days, this behavior
is correct.  If you want a different behavior, extend the syntax to
floating point (+1.0 is what you seem to be asking for).  In the long
run, you'll probably find that more useful.
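
A floating-point extension along these lines might look like the sketch
below. It is an illustration only: match_fractional() is a hypothetical
name, no version of find(1) discussed here accepts such arguments, and the
unsigned form compares doubles for exact equality, which is rarely what
anyone wants in practice.

```c
#include <stdlib.h>

#define A_DAY 86400.0		/* seconds per day, as a double */

/*
 * age is the file's age in seconds; arg is the predicate argument,
 * e.g. "+1.0", "-0.5" or "2".  Returns nonzero on a match.
 */
int
match_fractional(long age, const char *arg)
{
	char sign = 0;
	double days = age / A_DAY;	/* fractional days, not truncated */
	double n;

	if (*arg == '+' || *arg == '-')
		sign = *arg++;
	n = strtod(arg, NULL);

	if (sign == '+')
		return days > n;
	if (sign == '-')
		return days < n;
	return days == n;		/* exact match; see caveat above */
}
```

With this rule "+1.0" matches anything older than exactly one day, which
is the behavior being asked for, and the integer forms keep their meaning
only if they continue to go through the truncating code path.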

  Of course, if you really want to have fun, check out the paper in
the Summer USENIX proceedings (whose title isn't handy at the moment,
unfortunately), detailing the implementation of a portable file system
tree walking library.  Their find replacement, "tw", has a very handy
awk-like language embedded into it, allowing truly fun things to be
done to files.

[suggested changes]
>	3) to match "n", a file should be +/- a reasonable epsilon.
>		(current: n+1 sec to n+23 hours 59 mins 59 secs)
>		I suggest an hour, so files 23.00.01-25.00.59 would be
>		considered an "exact" day-old match, but a file that's
>		1 day, 22 hours old would not.

Ugh.  You're better off going to floating point.  Magical fudging like
this from something that claims to work in "days" could make for
confusing results.

>	The most obtuse example is a file that's 1 second short of two days
>old and won't match on "+1", even though the file is in fact 1.99998 days old.
>Most people'd call that 2, but anyone'd call it >1.

But a computer will insist that, by the supplied definition of "day",
that file is one day old.  It ain't 0 days old, it ain't 2 days or
older, so it's 1.
-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)

dls@mentor.cc.purdue.edu (David L Stevens) (08/09/89)

	Just when you thought it was over...

	The 4.3 version and the 4.3 Tahoe version and, according to Keith
Bostic, a 1982 version, of find(1) all result in changed access times on
directories.
	I have no idea why Ultrix 3.0's version apparently does not and I
have no idea why my test of the 4.3 version the first time reproduced Jeff
Carter's results (i.e., no atime changes). Bostic suggests it might be a kernel
bug, but the behavior isn't reliable, whatever the cause.
	This means if you can live with dumping all your directories (NOT
all your files, just all directories), the STEALTH code I posted can be
applied to any version of find(1) back to 1982 with some effect. Unless
you're running Ultrix 3.0, apparently, though I haven't confirmed that.

	Note, too, that because find(1) does its work on the way down,
without some change to find(1), you can't get the functionality of "remove
all files that haven't been accessed by a user in the last day". Find(1)
will ensure that the directories are always accessed and further, the
directories won't be empty on the way down (only on the way up...).
	Perhaps find(1) in its present form isn't the best solution to this
problem.
-- 
					+-DLS  (dls@mentor.cc.purdue.edu)

bart@videovax.tv.Tek.com (Bart Massey) (08/10/89)

In article <3608@mentor.cc.purdue.edu> dls@mentor.cc.purdue.edu (David L Stevens) writes:
> 
> 	The "local changes" that caused find(1) to suddenly start changing
> the atimes on directories were in fact the Tahoe changes and not something
> we did. It was in fact the 4.3 version, not the Tahoe version, that I tested
> and that did not have the problem.

We're running 4.3 "tahoe" on a VAX750, and our man page still says

     st_atime    Time when file data was last accessed.  Changed
                 by the following system calls: mknod(2),
                 utimes(2), and read(2).  For reasons of effi-
                 ciency, st_atime is not set when a directory is
                 searched, although this would be more logical.

An experiment convinced me that either the manpage or the kernel is wrong.
As the manpage says, it was mainly an efficiency win to not do this before,
so maybe the kernel behavior was deliberately changed and not documented.
Or maybe it's a kernel bug.  Could somebody at Berkeley clarify this?

					Bart Massey
					
					Tektronix, Inc.
					TV Systems Engineering
					M.S. 58-639
					P.O. Box 500
					Beaverton, OR 97077
					(503) 627-5320

					..tektronix!videovax.tv.tek.com!bart

dls@mentor.cc.purdue.edu (David L Stevens) (08/10/89)

	I believe the man page is referring to namei() (kernel) directory
searches, and not read(2) (readdir()) "searches." The following does not
change the atime on "hose", even though the kernel searched it to find
"bag":
	cat /tmp/hose/bag

-- 
					+-DLS  (dls@mentor.cc.purdue.edu)

peter@ficc.uu.net (Peter da Silva) (08/24/89)

In article <5521@videovax.tv.Tek.com>, bart@videovax.tv.Tek.com (Bart Massey) writes:
>      st_atime    Time when file data was last accessed.  Changed
>                  by the following system calls: mknod(2),
>                  utimes(2), and read(2).  For reasons of effi-
>                  ciency, st_atime is not set when a directory is
>                  searched, although this would be more logical.

This means:

	% cat /usr/fred/project/wheaties/raisins
					 ^^^^^^^-- This file is read.
	      ^^^^^^^^^^^^^^^^^^^^^^^^^^-- These directories are *searched*.
					   for reasons of efficiency, atime
					   is not modified.
	% ls /usr/fred/project/wheaties
	     ^^^^^^^^^^^^^^^^^------------ These directories are searched.
			       ^^^^^^^^--- This directory is *read*. That is,
					   it is opened and the read(2) sys
					   call is performed (maybe multiple
					   times). This is of course hidden
					   in the directory access routines.

A directory being searched has a specific meaning in UNIX: it's what namei
does to resolve a path. Find actually opens and reads the directory.
-- 
Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation.
Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com. `-_-'
"export ENV='${Envfile[(_$-=1)+(_=0)-(_$-!=_${-%%*i*})]}'" -- Tom Neff     'U`
"I didn't know that ksh had a built-in APL interpreter!" -- Steve J. Friedl