[comp.unix.questions] Detecting type of file in a program

po@volta.ece.utexas.edu (02/04/89)

In my program, I am using opendir() to read in the names of 
text files from a directory.
How can I tell whether a file is text or an object file ?
Is there a better way than using :
	system("file filename > /tmp/tempfile")

Thanks,
  Po 

gwyn@smoke.BRL.MIL (Doug Gwyn ) (02/04/89)

In article <192@henry.ece.utexas.edu> po@volta.ece.utexas.edu () writes:
>How can I tell whether a file is text or an object file ?
>Is there a better way than using :
>	system("file filename > /tmp/tempfile")

Anything you do will be pretty much what "file" does,
namely, inspect yay many bytes of the file to see if there is a
known "magic number" header present or if there are non-displayable
byte values present.

tale@pawl.rpi.edu (David C Lawrence) (02/04/89)

Just something to note here, too, since this is c.u.questions and not
c.u.wizards (wizards already know this).  `file' is a handy utility
for a lot of things, but alas, it is quite easy to trip it up.  For
one thing, it does not know lisp (as indicated on the manual page) and
for another, comments can really screw it up.  In fact, cpp directives
can screw it up too.  I have a programme that starts out with lots of
#defines.  `file' loves to think that it is just ascii text.  Another
programme has a large comment section at the beginning of the form
/*text\n *more text\n */.  This, alas, is just ascii text to
`file'.  (Aside: I am curious how it determines something is English
text rather than just ascii text.)
 
Another thing that can screw up `file' is short, tabular information,
as near as I can tell.  I have a README in my bin directory which
consists of the name of each programme, the language it is written in
and a message about what it does.  This is reported by `file' to be
*roff, et al, input.  Similar behaviour has been exhibited on other
files.

One thing I -love- file for is reading raster headers; it tells me
useful information in a nice compact way -- size of image, encoding
information (run-length encoded, standard format, etc).
 
For the purposes of the original poster, `file' is a great way to find
scripts and machine executables, but don't rely on it, especially for
source files.
 
Dave
--
      tale@rpitsmts.bitnet, tale%mts@rpitsgw.rpi.edu, tale@pawl.rpi.edu

guy@auspex.UUCP (Guy Harris) (02/05/89)

 >Anything you do will be pretty much what "file" does,
 >namely, inspect yay many bytes of the file to see if there is a
 >known "magic number" header present or if there are non-displayable
 >byte values present.

Note, of course, that:

	1) the set of known magic numbers is not "constant" in any
	   sense, so you have to pick a set and hope it's sufficient (or
	   let "file" do it, if your system supports an S5-style "file",
	   and make sure the "/etc/magic" file is up-to-date and
	   complete);

	2) you should use "isprint" to determine what is a
	   "non-displayable byte value"; do NOT assume that any byte
	   with its 8th bit set is necessarily non-printable (actually,
	   if you have files with character codes longer than 1 byte,
	   e.g. files containing Kanji, it gets even more complicated).
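
A minimal sketch of point 2 in C, under stated assumptions: the helper
names and the 512-byte sample size are made up for illustration, and the
caller is assumed to have called setlocale(LC_CTYPE, "") first if bytes
with the 8th bit set should be judged by the user's locale rather than the
"C" locale. It does nothing about multi-byte character codes such as Kanji.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define SAMPLE_SIZE 512   /* arbitrary choice of how many bytes to inspect */

/* Hypothetical helper: returns 1 if every byte is printable or whitespace
 * according to the current locale, else 0.  Empty input counts as text. */
int bytes_look_like_text(const unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* isprint() rejects newline and tab, so allow isspace() too.
         * buf[i] is an unsigned char, which is what <ctype.h> expects. */
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: inspects the first SAMPLE_SIZE bytes of a file.
 * Returns 1 for "probably text", 0 for "probably not", -1 on error. */
int file_looks_like_text(const char *path)
{
    unsigned char buf[SAMPLE_SIZE];
    size_t n;
    int failed;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : bytes_look_like_text(buf, n);
}
```

This is only a guess based on a sample of the file: a binary file whose
first SAMPLE_SIZE bytes happen to be printable will be reported as text.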

debra@alice.UUCP (Paul De Bra) (02/05/89)

In article <192@henry.ece.utexas.edu> po@volta.ece.utexas.edu () writes:
>In my program, I am using opendir() to read in the names of 
>text files from a directory.
>How can I tell whether a file is text or an object file ?
>Is there a better way than using :
>	system("file filename > /tmp/tempfile")

The only way is to look at the contents of the file, which is what the
utility "file" does too.
So read a number of bytes from the file, and then guess, depending on what
you see.

Paul.
-- 
------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     |
------------------------------------------------------

leo@philmds.UUCP (Leo de Wit) (02/07/89)

In article <192@henry.ece.utexas.edu> po@volta.ece.utexas.edu () writes:
|In my program, I am using opendir() to read in the names of 
|text files from a directory.
|How can I tell whether a file is text or an object file ?
|Is there a better way than using :
|	system("file filename > /tmp/tempfile")

Since Unix does not have the notion of a file type (at least not like
V..) you'll end up doing something equivalent to what 'file' does.
Using file perhaps enhances portability.

Try as a variant:
  pp = popen("exec file filename","r");
to get the lines of 'file' into your program (and use 'pclose', not
'fclose' to close the stream). You can even grep for 'text' in the 
output from file.
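
A fuller sketch of the popen() variant, with assumptions spelled out: the
function names are hypothetical, the output of 'file' is assumed to be one
line of the form "name: description", and a name containing a single quote
is refused instead of being escaped for the shell.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() and pclose() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given one output line from file(1), assumed to look
 * like "path: description", report whether the description mentions "text". */
int description_mentions_text(const char *line, const char *path)
{
    size_t len;

    assert(line != NULL && path != NULL);
    len = strlen(path);
    /* Skip the echoed file name so a name like "text.o" does not match. */
    if (strncmp(line, path, len) == 0 && line[len] == ':')
        line += len + 1;
    return strstr(line, "text") != NULL;
}

/* Returns 1 if file(1) calls the file some kind of text, 0 if not,
 * and -1 if the command could not be built or run. */
int file_says_text(const char *path)
{
    char cmd[1024], line[1024];
    int n, result = 0;
    FILE *pp;

    assert(path != NULL);
    /* The path is wrapped in single quotes for the shell, so a path that
     * itself contains a single quote is refused rather than escaped. */
    if (strchr(path, '\'') != NULL)
        return -1;
    n = snprintf(cmd, sizeof cmd, "exec file '%s'", path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;
    pp = popen(cmd, "r");
    if (pp == NULL)
        return -1;
    if (fgets(line, sizeof line, pp) != NULL)
        result = description_mentions_text(line, path);
    if (pclose(pp) == -1)   /* pclose(), not fclose() */
        return -1;
    return result;
}
```

Compared with system() and a temporary file, this leaves nothing behind in
/tmp and cannot collide with another process using the same temporary name.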

     Leo.

dg@lakart.UUCP (David Goodenough) (02/10/89)

tale@pawl.rpi.edu (David C Lawrence) sez:
Stuff about file(1) deleted
> (Aside: I am curious how it determines something is English
> text rather than just ascii text.)

I'd hazard a guess that it looks at the letter distributions. English
has well defined (well, fairly well defined) ratios of letters. So you
count how many E's, T's, etc. occur, and see how close you are to the
"standard". If you are close, say it's English, else say it's ascii.

This may be wrong - those in the know are welcome to correct me, but it's
one possibility that could be made to work.
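
To make the guess concrete, here is one way it could be sketched in C. None
of this is claimed to be what 'file' does. The function name is
hypothetical, the percentages are approximate, commonly quoted English
letter frequencies used here as an assumption, and the code assumes an
ASCII-style character set where 'a' to 'z' are contiguous.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Approximate English letter frequencies in percent, a..z (an assumption:
 * published tables differ slightly from one another). */
static const double english_pct[26] = {
    8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.8, 4.0, 2.4,
    6.7, 7.5, 1.9, 0.10, 6.0, 6.3, 9.1, 2.8, 1.0, 2.4, 0.15, 2.0, 0.07
};

/* Hypothetical helper: returns roughly 0 (letter mix matches the table)
 * up to roughly 100 (no overlap at all), or -1.0 if there are no letters.
 * The value is half the sum of absolute percentage differences. */
double english_letter_distance(const char *text)
{
    unsigned long counts[26] = {0};
    unsigned long total = 0;
    double distance = 0.0;
    int i;

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        int c = tolower((unsigned char)*text);
        if (c >= 'a' && c <= 'z') {
            counts[c - 'a']++;
            total++;
        }
    }
    if (total == 0)
        return -1.0;
    for (i = 0; i < 26; i++) {
        double observed = 100.0 * (double)counts[i] / (double)total;
        double diff = observed - english_pct[i];
        distance += diff < 0.0 ? -diff : diff;
    }
    return distance / 2.0;
}
```

Where to draw the line between "English" and "just ascii" would have to be
tuned by experiment; short inputs give noisy results.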
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com		  	  +---+

mchinni@ardec.arpa (Michael J. Chinni, SMCAR-CCS-E) (02/11/89)

From: po@volta.ece.utexas.edu:
> Is there a better way than using :
> 	system("file filename > /tmp/tempfile")

Better in what sense?  If better than using "file", not really (for reasons
mentioned in other responses).  If better than using "system" and
'fopen("/tmp/tempfile","r");', try using "popen" (i.e. popen("file
filename","r"); ).

	Michael J. Chinni (<mchinni@ardec.arpa>)
	US Army ARDEC
	Picatinny Arsenal, New Jersey  

bph@buengc.BU.EDU (Blair P. Houghton) (02/11/89)

In article <949@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>
>Note, of course, that:
>
>	1) the set of known magic numbers is not "constant" in any
>	   sense, so you have to pick a set and hope it's sufficient (or
>	   let "file" do it, if your system supports an S5-style "file",
>	   and make sure the "/etc/magic" file is up-to-date and
>	   complete);

Hokay, mah innorance is REALLY gonna show, now.

At least this is c.u.q and not c.u.w...

Just what are magic numbers for and of what significance is the /etc/magic
file.  Every once in a while I get an error saying something about bad
magic numbers or whatnot, and I usually just punt.  What's the scoop?

				--Blair
				  "Amaze your friends and coworkers:
				   alias Presto-chango 'mv -f * /bit/bucket'"

guy@auspex.UUCP (Guy Harris) (02/13/89)

>Just what are magic numbers for

Some software subsystems mark files of some format that they know about
with some particular number or set of characters, usually at the front
of the file.  The name, at least as I remember it being used in UNIX
systems, originally referred to stuff stuck at the front of object and
executable files that flagged the type of executable file.

Back in the dark ages, I think an executable file began with a PDP-11
branch instruction that jumped around a header that specified the size
of the code+data in the file; the instruction was a 16-bit number, value
octal 407.  When separate text&data, and split I&D, executables were
added, "407" as the first 16 bits of the executable file was treated as
a flag indicating that the image was a non-separate text&data image, and
410 was used for separate text&data and 411 (? - I haven't dealt with
PDP-11s in ages...) meant separate I&D.  (Also, the header was no longer
part of the image itself, since 410 and 411 would branch further than
407 did and I guess they thought it was silly to pad the header to cope
with this...).

Anyway, the term "magic number" then applied to other flags stuck at the
front of files by subsystems; e.g., 0177555 (word size) for a very old archive
format, 0177545 (16 bits) for an older archive format, "!<arch>\n" as a
string for the "modern" archive format (4.xBSD/S5R2 and later), etc.
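
As an illustration only, a check for the magic numbers named above might
look like the following in C. The enum and function names are hypothetical,
and the 16-bit values are assumed to be stored low byte first, as on the
PDP-11; a file written on a machine with the other byte order would need
the two bytes swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum magic_kind {
    MAGIC_UNKNOWN,
    MAGIC_AR_STRING,    /* "!<arch>\n" archive */
    MAGIC_AR_OLD,       /* 0177545 archive */
    MAGIC_AR_VERY_OLD,  /* 0177555 archive */
    MAGIC_EXEC_407,     /* non-separate text&data */
    MAGIC_EXEC_410,     /* separate text&data */
    MAGIC_EXEC_411      /* separate I&D */
};

/* Hypothetical helper: classify the first n bytes of a file by magic
 * number, checking only the values mentioned in the text above. */
enum magic_kind classify_magic(const unsigned char *buf, size_t n)
{
    unsigned int word;

    assert(buf != NULL || n == 0);
    if (n >= 8 && memcmp(buf, "!<arch>\n", 8) == 0)
        return MAGIC_AR_STRING;
    if (n < 2)
        return MAGIC_UNKNOWN;
    /* Assemble a 16-bit word, low byte first (assumed PDP-11 order). */
    word = (unsigned int)buf[0] | ((unsigned int)buf[1] << 8);
    switch (word) {
    case 0407:    return MAGIC_EXEC_407;
    case 0410:    return MAGIC_EXEC_410;
    case 0411:    return MAGIC_EXEC_411;
    case 0177545: return MAGIC_AR_OLD;
    case 0177555: return MAGIC_AR_VERY_OLD;
    default:      return MAGIC_UNKNOWN;
    }
}
```

A two-byte magic number is a weak test: plenty of unrelated files will
happen to begin with the same 16 bits, so a match is a hint, not proof.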

>and of what significance is the /etc/magic file.

The S5 version of the "file" command (S5R2 and later, anyway) has a file
"/etc/magic" that it can read to find out some of the magic
numbers/strings it should look for; that way, you can extend its
repertoire without having to hack the source code.  SunOS 3.2 and later
use a "file" based on this, with some extensions.

>Every once in a while I get an error saying something about bad
>magic numbers or whatnot, and I usually just punt.  What's the scoop?

That probably means you tried to link or execute a file that the linker
or the "exec" call didn't recognize as an object or executable file,
because it didn't have a proper magic number at the front of it; hence,
"bad magic number".