[comp.unix.wizards] Anything faster than stat

jik@athena.mit.edu (Jonathan I. Kamens) (11/10/89)

In article <2586@unisoft.UUCP> greywolf@unisoft.UUCP (The Grey Wolf) writes:
>Portable, almost.  Usable by anyone, probably not.
>
>It would require being run as a super-user, and it would be fairly quick --
>whether it's faster than a stat or not, I'm not sure.  What you could
>do is, once you get the entry, try and chdir() to it.  If it works, it's
>a directory, otherwise it's a file.  CAVEAT (obvious): this is not fool-
>proof if you're running a system with symbolic links.

  Bad idea for several reasons.  First of all, after you've used
chdir() several times (or even once) to go down a directory tree, you
have to use chdir("..") to get back up to the top.  In general, I find
that it is a bad idea to change the current working directory of a
process unless you are *sure* that you can get back to where you
started.  You're not sure in this case.

  Now, I know that you said "it would require being run as a
super-user", by which I assume that you meant to imply (among other
things) that the program would have read access to all directories and
therefore be able to get back to where it started no matter what.
This is not necessarily true, now that we're in the age of remote
filesystems (NFS, AFS, etc.).  Root on my workstation does not have
root access to NFS filesystems I have mounted.
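
  One way to address the "can I get back?" worry, on systems that provide
fchdir(), is to hold an open descriptor on the starting directory before the
first chdir().  This is only a sketch: save_cwd() and restore_cwd() are
hypothetical helper names, and fchdir() is an assumption about the platform
(it is not something the posters above had everywhere).

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a handle on the current directory before leaving it.
 * Returns a file descriptor, or -1 if no way back can be guaranteed. */
int save_cwd(void)
{
    return open(".", O_RDONLY);
}

/* Return to the directory saved by save_cwd() and release the handle.
 * Returns 0 on success, -1 if the fchdir() failed. */
int restore_cwd(int fd)
{
    int rc = fchdir(fd);    /* go back via the handle, not via ".." */

    close(fd);
    return rc;
}
```

  If save_cwd() fails (for example because root is not allowed to read a
remote directory), the program finds out before it has moved anywhere and can
decline to descend.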

>I think, IMHO, you're better off going with stat().

  Yup, I think so too.  That's how I do it in the code I've written.

  One final note: an interesting question is whether it's faster to
(a) stat() a file and use opendir() on it only if it's a directory, or
(b) just do the opendir() on it and keep going if the opendir()
succeeds.  I've found that (a) is much faster because opendir() always
does an open() on the file, even when it's not a directory.
Therefore, when you try to opendir() a non-directory, it's got to do
the open, then realize that it's not a directory using fstat, then
close the file.

Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8495			      Home: 617-782-0710

dbin@norsat.UUCP (Dave Binette) (11/10/89)

The consensus for the question "Anything faster than stat(S)?"
is "No"

Now for plan "B" (ignore me if I am trying your patience)

"Is there anything faster than opendir(S), and readdir(S)"
I can't and don't want to use  ftw(S)

Or maybe I should just ask.... WHY does it take so LONG to count the number
of directories and files below "."

I've written a little program called "ldir" that counts files and subdirs
to the level you specify, as in "ldir -d2".  It's not THAT slow, but in my
application it is used a lot and speed is... well you know the story.

Tell ya what... I'll post it if you think you can do it faster, smaller
and/or better.


Here is what happens from vi when I do
!!(cd /usr/spool/news/comp; time ldir -d2 [ab]*; w)
------------------------------------------------------------
ai              (  30 Files   6 Dirs)
arch            (  64 Files   0 Dirs)
archives        (   0 Files   0 Dirs)
binaries        (+  9 Files   8 Dirs)
bugs            (+  5 Files   5 Dirs)

real         4.0
user         0.0
sys          0.5

11:05pm  up 2 days, 4 mins,  2 users,  load average: 0.19, 0.14, 0.01
------------------------------------------------------------

Subsequent invocations produce: (obviously cached)

...
real         0.8
user         0.0
sys          0.3

...
real         0.5
user         0.0
sys          0.3


Does the user time of 0.0 mean there is no hope?
Where do the other 3.5  0.5  and 0.2  seconds go?  (real - sys)

Oh yeah, we are running: (Compaq 386-20, 5Meg Ram, 2 28ms ST4096)
sysname=XENIX
nodename=norsat
release=2.3.1
version=SysV
machine=i80386

-- 
My girlfriend is a screamer..., my computer just HUMMMMs
uucp:  {uunet,ubc-cs}!van-bc!norsat!dbin | 302-12886 78th Ave
bbs:   (604)597-4361     24/12/PEP/3     | Surrey BC CANADA
voice: (604)597-6298     (Dave Binette)  | V3W 8E7

jfh@rpp386.cactus.org (John F. Haugh II) (11/10/89)

In article <15769@bloom-beacon.MIT.EDU> jik@athena.mit.edu (Jonathan I. Kamens) writes:
>In article <2586@unisoft.UUCP> greywolf@unisoft.UUCP (The Grey Wolf) writes:
>>I think, IMHO, you're better off going with stat().
>
>  Yup, I think so too.  That's how I do it in the code I've written.

Just for the sake of disagreeing, what about other system calls that
are able to distinguish between a file being a directory or not?

The error return entries for access(2) tell me something useful -

	A component of the path prefix is not a directory. [ENOTDIR]
	The named file does not exist. [ENOENT]

How about code like this -

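	#include <errno.h>
	#include <limits.h>	/* PATH_MAX */
	#include <string.h>	/* strlen(), strcpy(), strcat() */
	#include <unistd.h>	/* access() */

	int isadir (char *path)
	{
		char dir[PATH_MAX];

		/* leave room for "/x" and the terminating NUL */
		if (strlen (path) + 3 > sizeof dir)
			return 0;

		if (access (path, 0))		/* path itself must exist */
			return 0;

		strcpy (dir, path);
		strcat (dir, "/x");

		errno = 0;
		access (dir, 0);		/* probe a name beneath path */

		/* ENOTDIR here would mean path is not a directory */
		return errno == 0 || errno == ENOENT;
	}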

We know all of the initial path exists because of the first access()
call.  And with the second access() call we can discover if the
last component of `path' isn't a directory since errno would be ENOTDIR
rather than ENOENT.

Ain't perfect either, but maybe better?
-- 
John F. Haugh II                        +-Things you didn't want to know:------
VoiceNet: (512) 832-8832   Data: -8835  | The real meaning of EMACS is ...
InterNet: jfh@rpp386.cactus.org         |   ... EMACS makes a computer slow.
UUCPNet:  {texbell|bigtex}!rpp386!jfh   +--<><--<><--<><--<><--<><--<><--<><---

cpcahil@virtech.uucp (Conor P. Cahill) (11/11/89)

In article <17264@rpp386.cactus.org>, jfh@rpp386.cactus.org (John F. Haugh II) writes:
> Just for the sake of disagreeing, what about other system calls that
> are able to distinguish between a file being a directory or not?
> 
> How about code like this -
> 
	[sample of using access() deleted]
> 
> We know all of the initial path exists because of the first access()
> call.  And with the second access() call we can discover if the
> last component of `path' isn't a directory since errno would be ENOTDIR
> rather than ENOENT.

So you want to replace a single call to stat() with multiple calls to 
access().  That doesn't make any sense since the major overhead to both
the stat and access system calls is that the path must be traversed and since
you are calling access twice, you have to traverse the path twice.

stat() is the most efficient mechanism that can be used to obtain information
about a file system entry since it just looks up an inode and copies
the data to the user's data area. If you are stat()ing all entries in a
directory on a very heavily loaded system you could probably get some
performance gain by chdir()ing to the directory and then stat()ing the
entries with just the basename (thereby not having to parse the full path
every time).  This shouldn't have much of an effect on a lightly loaded
system due to caching.
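
A minimal sketch of the chdir()-then-basename idea, assuming a POSIX-style
<dirent.h> and S_ISDIR().  The name count_subdirs() is invented for this
example, and whether the approach actually pays off would have to be measured
on the system in question.

```c
#include <dirent.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Count the subdirectories of dir.  After the chdir(), each stat() is
 * given a bare basename, so only one path component is resolved per call.
 * Returns -1 on error.  On success the working directory is left at dir. */
long count_subdirs(const char *dir)
{
    DIR *dp;
    struct dirent *de;
    struct stat st;
    long n = 0;

    if (chdir(dir) != 0)
        return -1;
    if ((dp = opendir(".")) == NULL)
        return -1;

    while ((de = readdir(dp)) != NULL) {
        /* skip the directory itself and its parent */
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (stat(de->d_name, &st) == 0 && S_ISDIR(st.st_mode))
            n++;
    }
    closedir(dp);
    return n;
}
```

Because the function changes the working directory and does not change it
back, the earlier caution in this thread about being able to return to the
starting point applies to any caller.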

-- 
+-----------------------------------------------------------------------+
| Conor P. Cahill     uunet!virtech!cpcahil      	703-430-9247	!
| Virtual Technologies Inc.,    P. O. Box 876,   Sterling, VA 22170     |
+-----------------------------------------------------------------------+

jfh@rpp386.cactus.org (John F. Haugh II) (11/12/89)

In article <1989Nov11.154312.6675@virtech.uucp> cpcahil@virtech.uucp (Conor P. Cahill) writes:
>So you want to replace a single call to stat() with multiple calls to 
>access().  That doesn't make any sense since the major overhead to both
>the stat and access system calls is that the path must be traversed and since
>you are calling access twice, you have to traverse the path twice.

The objective was to take advantage of path-name caching on BSD systems.

Of course, if you know "/path/name" exists, you only need -one- call
to access() with "/path/name/foo" and you save the mumbo-jumbo required
to get data from kernel to user space.

>stat() is the most efficient mechanism that can be used to obtain information
>about a file system entry since it just looks up an inode and copies
>the data to the user's data area.

Probably true.  Now, go off and actually run the benchmarks.  =Always=
question everything.  On some machines copies from system to user address
space are cheap.  On others it can be =very= difficult.  There is a big
difference between a VAX, where the supervisor and user share one virtual
address space and the supervisor can address user memory directly, and a
PDP-11, where the supervisor and user occupy separate address spaces and
have no [ MTPD and MFPD aren't implemented on all PDP-11 CPUs! ] easy
way of communicating short of mapping memory all over God's creation.

Anyway, it was only meant to stimulate discussion.  The only portable
and clean solution =is= to use stat().  I can't stand clever hacks,
unless I write them myself ;-)