jdubb@bucsf.bu.edu (jay dubb) (03/21/91)
I've looked in a bunch of C and Unix books, and can't seem to find a good explanation of this - maybe someone can help... Is there a way to tell (from a C program) whether a given file contains text or data?

The reason I'd like to know is that I've noticed that if you have a file into which you have done something like write(fid,&an_int,sizeof(int)) and then you take this file to another machine via FTP (in binary mode), and try to read() the int back, it doesn't work (because of byte-order differences, I assume). So, what I'd like to know is, is there a difference (in terms of something stat() could tell me, for example) between straight text files and files which contain raw numbers (without searching through the whole file to check, hopefully)?

The 'file' command seems to be able to do this - I've tried it on a text file, and on a file with raw ints and floats, and it says "text" and "data" respectively. Does it really know, or is it making a guess (and if so, how good is its method of guessing)?

Hoping for an explanation of Unix binary/text files...

P.S. If this is the wrong group for this type of question, or you know of a good book that describes this in detail, please let me know.
silverio@cass (Brian Silverio) (03/21/91)
In article <77384@bu.edu.bu.edu> jdubb@bucsf.bu.edu (jay dubb) writes:
>
>....... So, what
>I'd like to know is, is there a difference (in terms of something
>stat() could tell me, for example) between straight text files and
>files which contain raw numbers......

Try "man a.out". The key to what you are looking for is in the file header: f_magic indicates an executable file. Values for ASCII data do not (or should not) match valid magic numbers.
mwarren@rws1.ma30.bull.com (Mark Warren) (03/21/91)
jdubb@bucsf.bu.edu (jay dubb) writes:
> I've looked in a bunch of C and Unix books, and can't seem to find
>a good explanation of this - maybe someone can help... Is there a way
>to tell (from a C program) whether a given file contains text or data?

There is no difference in Unix. Other operating systems have some flags associated with files that indicate contents, format, etc., but Unix treats all files as uninterpreted byte streams. Unix supplies a "file" utility that tries (usually successfully) to guess what the file is, but it is nothing more than a guess that applies some heuristics to the first few bytes read from the file.

Sorry, there's no guaranteed way to do it.

--
== Mark Warren          Bull HN Information Systems Inc. ==
== (508) 294-3171 (FAX 294-3020)   300 Concord Road MS852A ==
== Mark.Warren@bull.com          Billerica, MA 01821 ==
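
[Editor's sketch] The kind of guess described above can be illustrated roughly as follows. This is not how any particular "file" implementation works; the function names, the 1024-byte sample size, and the rule "any control or non-ASCII byte means data" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if every one of the first n bytes is
 * printable ASCII or common whitespace, 0 otherwise. An empty buffer
 * counts as text. Note that 8-bit text (e.g. Latin-1) is classed as data. */
int looks_like_text(const unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c == '\n' || c == '\r' || c == '\t' || c == '\f')
            continue;               /* ordinary whitespace */
        if (c < 0x20 || c > 0x7e)
            return 0;               /* control or non-ASCII byte */
    }
    return 1;
}

/* Hypothetical helper: samples the start of a file and applies the
 * heuristic. Returns 1 for "probably text", 0 for "probably data",
 * -1 if the file cannot be opened or read. */
int guess_file_is_text(const char *path)
{
    unsigned char buf[1024];        /* assumed sample size */
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return looks_like_text(buf, n);
}
```

Calling guess_file_is_text("somefile") only inspects the first 1024 bytes, so like any such heuristic it can be wrong in both directions.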
mjr@hussar.dco.dec.com (Marcus J. Ranum) (03/21/91)
jdubb@bucsf.bu.edu (jay dubb) writes:
> I've looked in a bunch of C and Unix books, and can't seem to find
>a good explanation of this - maybe someone can help... Is there a way
>to tell (from a C program) whether a given file contains text or data?

Lots of applications use a "magic number" scheme, whereby the first long int (just for example) in a file is some value, or some string. That's why we are stuck with stuff like:

%!PS-Adobe

at the beginning of PostScript files, and so forth. It's a tricky problem. Other systems (not to name names) make arbitrary decisions based on naming conventions. If it ends in ".FOR" it is a FORTRAN program, etc.

The short answer to your question is, "no."

Whenever I have to tackle this kind of problem, I usually write a magic number in network byte order (see the man page on htonl()) at some known offset into the file, and scream loudly if it's not there. You can get arbitrarily insane trying to make sure that the file isn't one that someone accidentally picked out of a hat with that value.

mjr.
--
The world is just backing store for virtual reality.
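
[Editor's sketch] The magic-number approach described above might look something like this on a POSIX system. The value MY_MAGIC, the function names, and the choice of offset 0 as the "known offset" are hypothetical; only htonl() and ntohl() come from the post and the standard headers.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical magic value chosen for this example only. */
#define MY_MAGIC 0x4A444231u

/* Writes the magic number at the current position in network
 * (big-endian) byte order. Returns 0 on success, -1 on failure. */
int write_magic(FILE *fp)
{
    uint32_t be = htonl(MY_MAGIC);

    assert(fp != NULL);
    return fwrite(&be, sizeof be, 1, fp) == 1 ? 0 : -1;
}

/* Reads four bytes at the current position and compares them with the
 * magic number. Returns 1 on a match, 0 on a mismatch or short read. */
int check_magic(FILE *fp)
{
    uint32_t be;

    assert(fp != NULL);
    if (fread(&be, sizeof be, 1, fp) != 1)
        return 0;
    return ntohl(be) == MY_MAGIC;
}
```

A writer would call write_magic() before any payload, and a reader would call check_magic() first and refuse to continue if it returns 0. A match still does not prove the file is yours; it only makes an accident unlikely.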
mouse@thunder.mcrcim.mcgill.edu (der Mouse) (03/26/91)
In article <77384@bu.edu.bu.edu>, jdubb@bucsf.bu.edu (jay dubb) writes:
> I've looked in a bunch of C and Unix books, and can't seem to find a
> good explanation of this - maybe someone can help... Is there a way
> to tell (from a C program) whether a given file contains text or
> data?

No. It's not a well-defined distinction, for one thing. Many files are both text and data - any file interpreted by a program can be considered data....

> The reason I'd like to know, is that I've noticed that if you have a
> file into which you have done something like
> write(fid,&an_int,sizeof(int)) and then you take this file to another
> machine via FTP (in binary mode), and try to read() the int back, it
> doesn't work (because of byte-order differences, I assume).

Possibly size differences as well; sometimes an int is only 16 bits.

> So, what I'd like to know is, is there a difference (in terms of
> something stat() could tell me, for example) between straight text
> files and files which contain raw numbers (without searching through
> the whole file to check, hopefully)?

No. The only distinction is the contents. (It's true that executable binaries typically have their execute bits turned on, but so do shell scripts, and many binary files don't.) UNIX is not a system like VMS, with lots and lots of structure imposed on file contents by the filesystem.

> the 'file' command seems to be able to do this - I've tried it on a
> text file, and on a file with raw ints and floats, and it says "text"
> and "data" respectively. Does it really know, or is it making a guess

It is making a guess based on reading some small portion of the file (typically the first 1K or 4K or so) and applying various heuristics. Often there is a file which describes various identifiable patterns, such as the 0x1f 0x9d in the first two bytes of a compressed file, but that's a frill for the purposes under discussion.

You were also lucky. If your int happened to have the value 0x0a6f6f66 (175075174 in decimal) on a little-endian machine, a data file containing just that int will look like a text file with just one line reading "foo". Of course, the chance of this goes down sharply with the number of "raw" numbers being written, and other factors, but you get the idea.

der Mouse

old: mcgill-vision!mouse
new: mouse@larry.mcrcim.mcgill.edu
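
[Editor's sketch] The "foo" coincidence can be checked without depending on the host's byte order by laying the bytes out explicitly, least significant first. The same trick, writing a fixed number of bytes in a fixed order instead of calling write() on the raw int, is one common way around the byte-order and size differences mentioned in the post. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores v as exactly 4 bytes, least significant
 * byte first, regardless of the host's own byte order or int size. */
void put_u32_le(unsigned char out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v & 0xff);
    out[1] = (unsigned char)((v >> 8) & 0xff);
    out[2] = (unsigned char)((v >> 16) & 0xff);
    out[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Hypothetical helper: the inverse of put_u32_le(). */
uint32_t get_u32_le(const unsigned char in[4])
{
    assert(in != NULL);
    return  (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}

/* Returns 1 if the little-endian encoding of v is the four bytes
 * 'f', 'o', 'o', '\n', i.e. a one-line text file reading "foo". */
int encodes_as_foo(uint32_t v)
{
    unsigned char b[4];

    put_u32_le(b, v);
    return memcmp(b, "foo\n", 4) == 0;
}
```

encodes_as_foo(0x0a6f6f66) returns 1: the bytes come out as 0x66 0x6f 0x6f 0x0a, which is "foo" followed by a newline. A file written with put_u32_le() and read with get_u32_le() round-trips the value on either kind of machine.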