bbs@alchemy.UUCP (BBS Administration) (09/23/90)
Could someone explain how the command "file" works? Specifically, I am
writing a program that allows users to navigate their $HOME directory and
any subdirectories (they cannot leave their $HOME directory though, for
security reasons) to find files that are to be read into a text editor.
Some text editor forks this program, and when the user selects a file to
read, it writes the pathname to a temporary file which the editor reads
and then loads into its buffer.

I wrote this "navigator" program as a separate entity, so that either my
line based editor (non-curses) or my full screen editor (subset of curses)
can call upon it and use its facilities (the navigator does lots of other
things too) without giving the user shell access directly.

Anyhow, once they select a file for reading, I'd like to be able to
determine if the file is "ascii text" as the program "file" reports
when this is true, and if not, inform the user that the contents are
NOT ascii text and that they may want to reconsider.

Should I make a pass through the contents and make sure that each
character has the high bit OFF (so it's 7-bit data) or what? I don't
need to determine what kind of file it is, just whether or not it's
something the editors will "like."

Thanks in advance!

-- John

John Donahue, Senior Partner | UUCP: ucrmath!alchemy!{bbs, gumby}  | The Future
   Alchemy Software Designs  | INET: {bbs, gumby}@alchemy.UUCP     | Begins Now
-------------------+---------+-------------------------------------+-----------
Communique On-line | +1-714-243-7150 {3, 12, 24, 96HST} Bps. 8-N-1 | Next Wave:
Information System |   Alchemy Software Designs Support System     | Communique
robertb@cs.washington.edu (Robert Bedichek) (09/23/90)
In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
>
> Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and
<text deleted>

I suggest that you read the man page for 'file'. Also, read the file
that the man page specifies as the database that 'file' uses. You can
find lots of useful stuff by reading man pages and examining
user-readable system files. It is something that still distinguishes
most versions of UNIX from most other operating systems.

>Anyhow, once they select a file for reading, I'd like to be able to
>determine if the file is "ascii text" as the program "file" reports
>when this is true, and if not, inform the user that the contents are
>NOT ascii text and that they may want to reconsider.
>
> Should I make a pass through the contents and make sure that each
>character has the high bit OFF (so it's 7-bit data) or what? I don't
>need to determine what kind of file it is, just whether or not it's
>something the editors will "like."

There are many file types that editors will like besides files reported
by 'file' as text. For example, shell scripts are usually reported as
such and not as text. So the result of 'file' isn't what I think that
you want. Also, some text editors can edit any file, including
executable files.

>
>Thanks in advance!

Sure, I hope that this helps.

Rob Bedichek
robertb@cs.washington.edu
jeenglis@girtab.usc.edu (Joe English Muffin) (09/23/90)
robertb@cs.washington.edu (Robert Bedichek) writes:
>In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
>>
>> Could someone explain how the command "file" works? Specifically, I am
>>writing a program that allows users to navigate their $HOME directory and
><text deleted>
>I suggest that you read the man page for 'file'. Also, read the file
>that the man page specifies as the database that 'file' uses.

Not all versions of 'file' use a separate database; I
believe the 4.2BSD 'file' has it hardcoded. (Not to
mention the fact that not all Unices have on-line
man pages, and not all sites make the hard-copy versions
easy to get to, but that's another gripe :-)

To answer the original question, 'file' first does a
stat() to determine if the file is an executable,
setuid, symbolic link, etc. Then it reads in the
first N characters of the file and checks it against a
predefined set of patterns. Many of the patterns are
just ``magic numbers''; for example, under SunOS the
file types "mc68020 demand paged dynamically linked
executable" and "shell script" are determined from the
first two bytes of the file.

Some of the other patterns it looks for are a little
more complicated; for example, a period at the
beginning of the line indicates "[nt]roff, tbl, or eqn
input" (which is why it tends to think makefiles are
for troff so often.) Certain patterns of punctuation
and capitalization (not too sure what they are)
distinguish "English text" from "ascii text."

If none of the patterns match, it looks for
non-printable characters; if there are any it will
report "data", otherwise "ascii text."

>There are many file types that editors will like besides files reported
>by 'file' as text. For example shell scripts are usually reported as
>such and not as text. So the result of 'file' isn't what I think that
>you want. Also, some text editors can edit any file, including
>executable files.

This is true. Your best bet is to write a simple C
program that reads in the first block of the file and
checks for non-printing characters and possibly for
lines that are too long as well.

--Joe English
jeenglis@alcor.usc.edu
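The check suggested above (read the first block, reject non-printing characters and overlong lines) might be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual implementation: the names looks_like_text and file_looks_like_text, the 512-byte BLOCK_SIZE and the 1024-character MAX_LINE_LEN are all hypothetical, and the test is deliberately strict (7-bit printable characters plus tab, CR, LF and form feed only).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE   512    /* hypothetical "first block" size */
#define MAX_LINE_LEN 1024   /* hypothetical limit; tune to the editor */

/* Return 1 if the first `len` bytes of `buf` contain only 7-bit printable
 * characters and common whitespace, with no line longer than MAX_LINE_LEN.
 * Return 0 otherwise.  An empty buffer counts as text. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i, line_len = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\n') {            /* newline ends the current line */
            line_len = 0;
            continue;
        }
        /* Anything outside 0x20..0x7e is rejected, except tab, CR and FF.
         * This also rejects bytes with the high bit set. */
        if (c != '\t' && c != '\r' && c != '\f' && (c < 0x20 || c > 0x7e))
            return 0;
        if (++line_len > MAX_LINE_LEN)
            return 0;
    }
    return 1;
}

/* Read the first block of `path` and apply the check above.
 * Returns 1 (text), 0 (not text) or -1 (could not read the file). */
int file_looks_like_text(const char *path)
{
    unsigned char block[BLOCK_SIZE];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(block, 1, sizeof block, fp);
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return looks_like_text(block, n);
}
```

A navigator could call file_looks_like_text(pathname) before handing the name to the editor and warn the user on a 0 result. Only the first block is examined, so a file that turns binary later on will still pass.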
tif@doorstop.austin.ibm.com (Paul Chamberlain) (09/24/90)
In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
> Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and
...

I agree that reading in the first block and making basic sanity checks
is probably the best way to decide whether the file is sane to edit.
However, if you desire any more detail, I would seriously consider
reading the output of the "file" command itself. Or if you have some
deep reason to avoid that, get one of the PD implementations of "file"
and suck it into your source.

Paul Chamberlain | I do NOT represent IBM
tif@doorstop, sc30661@ausvm6
512/838-7008 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif
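For anyone who does want to read the output of "file" from a program, a rough POSIX sketch follows. It rests on two assumptions that may not hold on every system: that the output has the form "path: description", and that every type worth editing has the word "text" in its description (a type such as "shell script" would be rejected). The function names description_is_text and file_says_text are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Given one line of file(1) output and the path that was passed to it,
 * return 1 if the description part mentions "text", else 0.
 * Assumes the "path: description" layout. */
int description_is_text(const char *line, const char *path)
{
    size_t n;

    assert(line != NULL && path != NULL);
    n = strlen(path);
    if (strncmp(line, path, n) != 0 || line[n] != ':')
        return 0;                   /* not the layout we assumed */
    return strstr(line + n + 1, "text") != NULL;
}

/* Run file(1) on `path` without going through a shell, so that odd
 * characters in the name are never interpreted as shell syntax.
 * A path beginning with '-' may still be taken as an option by file(1).
 * Returns 1 (text), 0 (anything else) or -1 (file(1) could not be run). */
int file_says_text(const char *path)
{
    int fds[2];
    char line[1024];
    FILE *out;
    pid_t pid;
    int result = -1;

    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: send stdout down the pipe */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execlp("file", "file", path, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    close(fds[1]);                  /* parent: read one line of output */
    out = fdopen(fds[0], "r");
    if (out != NULL) {
        if (fgets(line, sizeof line, out) != NULL)
            result = description_is_text(line, path);
        fclose(out);
    } else {
        close(fds[0]);
    }
    waitpid(pid, NULL, 0);
    return result;
}
```

Matching on a substring of the description is fragile by nature: the wording differs between versions of "file", so the test may need adjusting per system.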
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/27/90)
In article <12141@chaph.usc.edu> jeenglis@girtab.usc.edu (Joe English Muffin) writes:
: Not all versions of 'file' use a separate database; I
: believe the 4.2BSD 'file' has it hardcoded. (Not to
: mention the fact that not all Unices have on-line
: man pages, and not all sites make the hard-copy versions
: easy to get to, but that's another gripe :-)
:
: To answer the original question, 'file' first does a
: stat() to determine if the file is an executable,
: setuid, symbolic link, etc. Then it reads in the
: first N characters of the file and checks it against a
: predefined set of patterns. Many of the patterns are
: just ``magic numbers''; for example, under SunOS the
: file types "mc68020 demand paged dynamically linked
: executable" and "shell script" are determined from the
: first two bytes of the file.
:
: Some of the other patterns it looks for are a little
: more complicated; for example, a period at the
: beginning of the line indicates "[nt]roff, tbl, or eqn
: input" (which is why it tends to think makefiles are
: for troff so often.) Certain patterns of punctuation
: and capitalization (not too sure what they are)
: distinguish "English text" from "ascii text."
:
: If none of the patterns match, it looks for
: non-printable characters; if there are any it will
: report "data", otherwise "ascii text."
Nice summary.
The main problem with using "file" is that it might induce bitrot when "file"
mutates out from under you. Just because "file" reports "ascii text"
today is no guarantee that it won't report "D-News history file" sometime
next year. :-)
: >There are many file types that editors will like besides files reported
: >by 'file' as text. For example shell scripts are usually reported as
: >such and not as text. So the result of 'file' isn't what I think that
: >you want. Also, some text editors can edit any file, including
: >executable files.
:
: This is true. Your best bet is to write a simple C
: program that reads in the first block of the file and
: checks for non-printing characters and possibly for
: lines that are too long as well.
Why write another one? I've already got one you can use. :-)
perl -e 'print "text" if -T shift' filename
If you really do want a "simple" C program, rip out the routine that Perl
uses, do_fttext(). (But be advised that "simple" programs are just about
as hard to maintain across multiple architectures as complicated ones.
You get a lot of leverage by installing something like Perl across all
your architectures. End of sermon.)
Larry Wall
lwall@jpl-devvax.jpl.nasa.gov
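For readers without Perl at hand, the general idea behind a -T style test can be approximated in C. The sketch below is not Perl's do_fttext() routine; it is a loose reimplementation of the behaviour the Perl documentation describes for -T (any zero byte in the examined block means binary, otherwise the file is text unless more than about 30% of the bytes are odd control codes or have the high bit set). The name mostly_text, the exact set of "ordinary" control characters and the 30% constants are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ODD_NUM 3    /* assumed threshold: at most 3/10 of the bytes */
#define ODD_DEN 10   /* may be "odd" before the block counts as binary */

/* Return 1 if the block looks like text, 0 if it looks binary.
 * Unlike a strict 7-bit check, a few high-bit or control bytes
 * are tolerated.  An empty block counts as text. */
int mostly_text(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\0')
            return 0;               /* a zero byte: treat as binary */
        if (c >= 0x80 || c == 0x7f) {
            odd++;                  /* high bit set, or DEL */
        } else if (c < 0x20 && c != '\n' && c != '\r' && c != '\t' &&
                   c != '\b' && c != '\f' && c != 0x1b) {
            odd++;                  /* unusual control character */
        }
    }
    /* Text if odd/len <= ODD_NUM/ODD_DEN, written without division. */
    return odd * ODD_DEN <= len * ODD_NUM;
}
```

The caller would read the first block of the file into buf, as in any first-block check. The tolerance is the design point: a file with an occasional 8-bit character passes here, where a strict high-bit test would reject it.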
drd@siia.mv.com (David Dick) (09/27/90)
In <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
> Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and
>any subdirectories (they cannot leave their $HOME directory though, for
>security reasons) to find files that are to be read into a text editor.
>Some text editor forks this program, and when the user selects a file to
>read, it writes the pathname to a temporary file which the editor reads
>and then loads into its buffer.
[more description omitted]

I consider file(1) to be a useful heuristic program for manual use,
but I would never put it in a script for automatic use. In other words,
it's just a guesser, and does not contribute to making a robust
application.

If you have particular requirements of a target file, you should
establish them with your own code.

David Dick
Software Innovations, Inc.
drd@siia.mv.com (David Dick) (09/27/90)
[initial query about file(1) and answers elided]
>Not all versions of 'file' use a separate database; I
>believe the 4.2BSD 'file' has it hardcoded.

When we move a customer's applications to UNIX we often come up with
new file types. Part of fully integrating an application into UNIX is
establishing magic numbers and making file(1) work, IMHO.

Forcing a hard-coded database makes this difficult, as well as being
silly (e.g., is the extra efficiency really needed?) and quite contrary
to the original idea of editable control files in UNIX. (Of course, how
many other things are there in BSD UNIX that are contrary to the
original idea of UNIX? :-)

David Dick
Software Innovations, Inc. [the Software Moving Company (sm)]
chris@mimsy.umd.edu (Chris Torek) (09/29/90)
In article <1990Sep27.145844.28546@siia.mv.com> drd@siia.mv.com (David Dick) writes:
[in response to `the BSD file program database is contained inside the BSD file program']
>... a hard-coded database ... [is] quite contrary to the original
>idea of editable control files in UNIX. (Of course, how many other
>things are there in BSD UNIX that are contrary to the original idea
>of UNIX? :-)

Sounds like revisionist history to me :-)

The 4.[0123]BSD `file' program is a direct descendant of the original
research Unix `file' program. It was somewhere along the USG/USDL tree
that the System V `file' program acquired an external database.

CSRG have an external-database `file', but in some ways it has turned
out not to work as well. We are still running a locally-hacked version
of the 4.3BSD `file' here.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris