[comp.unix.programmer] Unix binary/text files: is there a difference?

jdubb@bucsf.bu.edu (jay dubb) (03/21/91)

    I've looked in a bunch of C and Unix books, and can't seem to find
a good explanation of this - maybe someone can help... Is there a way
to tell (from a C program) whether a given file contains text or data?
The reason I'd like to know, is that I've noticed that if you have a
file into which you have done something like
write(fid,&an_int,sizeof(int)) and then you take this file to another
machine via FTP (in binary mode), and try to read() the int back, it
doesn't work (because of byte-order differences, I assume). So, what
I'd like to know is, is there a difference (in terms of something
stat() could tell me, for example) between straight text files and
files which contain raw numbers (without searching through the whole
file to check, hopefully)? the 'file' command seems to be able to do
this - I've tried it on a text file, and on a file with raw ints and
floats, and it says "text" and "data" respectively. Does it really
know, or is it making a guess (and if so, how good is its method of
guessing?)? Hoping for an explanation of Unix binary/text files...

P.S. If this is the wrong group for this type of question, or you know
of a good book that describes this in detail, please let me know.

silverio@cass (Brian Silverio) (03/21/91)

In article <77384@bu.edu.bu.edu> jdubb@bucsf.bu.edu (jay dubb) writes:
>
>....... So, what
>I'd like to know is, is there a difference (in terms of something
>stat() could tell me, for example) between straight text files and
>files which contain raw numbers......

Try "man a.out"; the key to what you are looking for is in the file header.
f_magic indicates an executable file.  Values for ASCII data do not (or
should not) match valid magic numbers.

mwarren@rws1.ma30.bull.com (Mark Warren) (03/21/91)

jdubb@bucsf.bu.edu (jay dubb) writes:


>    I've looked in a bunch of C and Unix books, and can't seem to find
>a good explanation of this - maybe someone can help... Is there a way
>to tell (from a C program) whether a given file contains text or data?

There is no difference in Unix.  Other operating systems have some flags
associated with files that indicate contents, format, etc., but Unix
treats all files as uninterpreted byte streams.

Unix supplies a "file" utility that tries (usually successfully) to guess
what the file is, but it is nothing more than a guess that applies some
heuristics to the first few bytes read from the file.

Sorry, there's no guaranteed way to do it.
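
To illustrate the kind of heuristic described above, here is a minimal
sketch in C. It is not how the "file" utility is actually implemented;
the name looks_like_text and the choice of which bytes count as "text"
are assumptions made for this example only.

```c
#include <stddef.h>

/*
 * Hypothetical helper: guess whether the first `len` bytes of a file
 * look like plain ASCII text.  This is a guess, nothing more.
 */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        /* Printable ASCII is fine. */
        if (c >= 0x20 && c <= 0x7e)
            continue;
        /* So are the usual whitespace control characters. */
        if (c == '\n' || c == '\t' || c == '\r' || c == '\f')
            continue;
        /* NUL, other control bytes, or high-bit bytes: call it data. */
        return 0;
    }
    /* Nothing suspicious seen (an empty buffer also ends up here). */
    return 1;
}
```

A caller would typically fread() the first 1K or so of the file into a
buffer and pass that in. A file of raw ints can still pass this test by
coincidence, which is why it remains a guess.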
--

 == Mark Warren                      Bull HN Information Systems Inc. ==
 == (508) 294-3171 (FAX 294-3020)    300 Concord Road     MS852A      ==
 == Mark.Warren@bull.com             Billerica, MA 01821              ==

mjr@hussar.dco.dec.com (Marcus J. Ranum) (03/21/91)

jdubb@bucsf.bu.edu (jay dubb) writes:

>    I've looked in a bunch of C and Unix books, and can't seem to find
>a good explanation of this - maybe someone can help... Is there a way
>to tell (from a C program) whether a given file contains text or data?

	Lots of applications use a "magic number" scheme, whereby the
first long int (just for example) in a file is some value, or some
string. That's why we are stuck with stuff like:
%!PS-Adobe
	at the beginning of PostScript files, and so forth. It's a
tricky problem. Other systems (not to name names) make arbitrary
decisions based on naming conventions. If it ends in ".FOR" it is a
FORTRAN program, etc.

	The short answer to your question is, "no."

	Whenever I have to tackle this kind of problem, I usually write
a magic number in network byte order (see the man page on htonl()) at
some known offset into the file, and scream loudly if it's not there.
You can get arbitrarily insane trying to make sure that the file isn't
one that someone accidentally picked out of a hat with that value.
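
	A rough sketch of that approach in C follows. The magic value
MY_MAGIC, the function names, and the choice of offset 0 are all
made up for illustration; htonl() and ntohl() are the standard
byte-order routines, and uint32_t is assumed to be available from
<stdint.h>.

```c
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>          /* htonl(), ntohl() */

#define MY_MAGIC 0x4D4A5231u    /* hypothetical value, pick your own */

/* Write the magic number at the current position, in network byte order. */
int write_magic(FILE *fp)
{
    uint32_t m = htonl(MY_MAGIC);

    return fwrite(&m, sizeof m, 1, fp) == 1 ? 0 : -1;
}

/* Read four bytes back and complain (return -1) if they are not ours. */
int check_magic(FILE *fp)
{
    uint32_t m;

    if (fread(&m, sizeof m, 1, fp) != 1)
        return -1;              /* file too short, or read error */
    return ntohl(m) == MY_MAGIC ? 0 : -1;
}
```

	Because the value is always stored big-endian, the check
gives the same answer on either kind of machine. It still cannot
rule out a file that happens to start with those four bytes.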

mjr.
-- 
             The world is just backing store for virtual reality.

mouse@thunder.mcrcim.mcgill.edu (der Mouse) (03/26/91)

In article <77384@bu.edu.bu.edu>, jdubb@bucsf.bu.edu (jay dubb) writes:
> I've looked in a bunch of C and Unix books, and can't seem to find a
> good explanation of this - maybe someone can help... Is there a way
> to tell (from a C program) whether a given file contains text or
> data?

No.  It's not a well-defined distinction, for one thing.  Many files
are both text and data - any file interpreted by a program can be
considered data....

> The reason I'd like to know, is that I've noticed that if you have a
> file into which you have done something like
> write(fid,&an_int,sizeof(int)) and then you take this file to another
> machine via FTP (in binary mode), and try to read() the int back, it
> doesn't work (because of byte-order differences, I assume).

Possibly size differences as well; sometimes an int is only 16 bits.
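
One way to sidestep both problems is to stop writing the in-memory
representation and write a fixed number of bytes in a fixed order
instead. The sketch below assumes a 4-byte, most-significant-byte-first
layout; the function names are invented for this example.

```c
/*
 * Store the low 32 bits of v as four bytes, most significant first.
 * The result does not depend on sizeof(int) or on the host byte order.
 */
void put_u32_be(unsigned char out[4], unsigned long v)
{
    out[0] = (unsigned char)((v >> 24) & 0xff);
    out[1] = (unsigned char)((v >> 16) & 0xff);
    out[2] = (unsigned char)((v >> 8) & 0xff);
    out[3] = (unsigned char)(v & 0xff);
}

/* Rebuild the value from four bytes laid out by put_u32_be(). */
unsigned long get_u32_be(const unsigned char in[4])
{
    return ((unsigned long)in[0] << 24) |
           ((unsigned long)in[1] << 16) |
           ((unsigned long)in[2] << 8) |
            (unsigned long)in[3];
}
```

The four bytes can then be passed to write() and read() as usual, and
the file can be moved between machines in binary mode.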

> So, what I'd like to know is, is there a difference (in terms of
> something stat() could tell me, for example) between straight text
> files and files which contain raw numbers (without searching through
> the whole file to check, hopefully)?

No.  The only distinction is the contents.  (It's true that executable
binaries typically have their execute bits turned on, but so do shell
scripts, and many binary files don't.)  UNIX is not a system like VMS,
with lots and lots of structure imposed on file contents by the
filesystem.

> the 'file' command seems to be able to do this - I've tried it on a
> text file, and on a file with raw ints and floats, and it says "text"
> and "data" respectively. Does it really know, or is it making a guess

It is making a guess based on reading some small portion of the file
(typically the first 1K or 4K or so) and applying various heuristics.
Often there is a file which describes various identifiable patterns,
such as the 0x1f 0x9d in the first two bytes of a compressed file, but
that's a frill for the purposes under discussion.

You were also lucky.  If your int happened to have the value 0x0a6f6f66
(175075174 in decimal) on a little-endian machine, a data file
containing just that int will look like a text file with just one line
reading "foo".  Of course, the chance of this goes down sharply with
the number of "raw" numbers being written, and other factors, but you
get the idea.
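
The "foo" case can be checked by laying the value out byte by byte the
way a little-endian machine would store it. This is only a sketch; the
helper name store_le32 is made up for the example.

```c
/*
 * Lay out the low 32 bits of v least significant byte first, which is
 * how a little-endian machine stores a 32-bit int in memory (and so
 * what write(fid, &an_int, sizeof(int)) would put in the file there).
 */
void store_le32(unsigned char out[4], unsigned long v)
{
    out[0] = (unsigned char)(v & 0xff);          /* 0x66 = 'f'  */
    out[1] = (unsigned char)((v >> 8) & 0xff);   /* 0x6f = 'o'  */
    out[2] = (unsigned char)((v >> 16) & 0xff);  /* 0x6f = 'o'  */
    out[3] = (unsigned char)((v >> 24) & 0xff);  /* 0x0a = '\n' */
}
```

The byte comments apply to the value 0x0a6f6f66. On a big-endian
machine the same int would come out as a newline followed by "oof".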

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu