[comp.unix.programmer] File "type"

bbs@alchemy.UUCP (BBS Administration) (09/23/90)

	Could someone explain how the command "file" works? Specifically, I am
writing a program that allows users to navigate their $HOME directory and
any subdirectories (they cannot leave their $HOME directory though, for
security reasons) to find files that are to be read into a text editor.
Some text editor forks this program, and when the user selects a file to
read, it writes the pathname to a temporary file which the editor reads 
and then loads into its buffer.

	I wrote this "navigator" program as a separate entity, so that either
my line based editor (non-curses) or my full screen editor (subset of
curses) can call upon it and use its facilities (the navigator does lots
of other things too) without giving the user shell access directly.
Anyhow, once they select a file for reading, I'd like to be able to
determine if the file is "ascii text" as the program "file" reports
when this is true, and if not, inform the user that the contents are
NOT ascii text and that they may want to reconsider.

	Should I make a pass through the contents and make sure that each
character has the high bit OFF (so it's 7-bit data) or what? I don't
need to determine what kind of file it is, just whether or not it's
something the editors will "like."
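
The high-bit pass described above could look something like the sketch
below. The function name is_seven_bit() is made up for illustration. Note
that it only answers "is this 7-bit data?"; NUL bytes and control
characters still pass, so on its own it is a weak test of what an editor
will accept.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if every byte in buf[0..len) has its high bit clear
 * (7-bit data), 0 otherwise.  An empty buffer counts as 7-bit clean.
 * is_seven_bit() is a hypothetical helper name, not part of any library. */
int is_seven_bit(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)      /* high bit set: not 7-bit data */
            return 0;
    }
    return 1;
}
```

A caller would read the file (or just its first block) into a buffer and
pass the byte count actually read.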

Thanks in advance!

-- John

John Donahue, Senior Partner | UUCP: ucrmath!alchemy!{bbs, gumby}  | The Future
  Alchemy Software Designs   | INET: {bbs, gumby}@alchemy.UUCP     | Begins Now
-------------------+---------+-------------------------------------+-----------
Communique On-line | +1-714-243-7150 {3, 12, 24, 96HST} Bps. 8-N-1 | Next Wave:
Information System |    Alchemy Software Designs Support System    | Communique

robertb@cs.washington.edu (Robert Bedichek) (09/23/90)

In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
>
>	Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and
<text deleted>

I suggest that you read the man page for 'file'.  Also, read the file
that the man page specifies as the database that 'file' uses.  You can
find lots of useful stuff by reading man pages and examining
user-readable system files.  It is something that still distinguishes
most versions of UNIX from most other operating systems.

>Anyhow, once they select a file for reading, I'd like to be able to
>determine if the file is "ascii text" as the program "file" reports
>when this is true, and if not, inform the user that the contents are
>NOT ascii text and that they may want to reconsider.
>
>	Should I make a pass through the contents and make sure that each
>character has the high bit OFF (so it's 7-bit data) or what? I don't
>need to determine what kind of file it is, just whether or not it's
>something the editors will "like."

There are many file types that editors will like besides files reported
by 'file' as text.  For example shell scripts are usually reported as
such and not as text.  So the result of 'file' isn't what I think that
you want.  Also, some text editors can edit any file, including
executable files.

>
>Thanks in advance!

Sure, I hope that this helps.

	Rob Bedichek    robertb@cs.washington.edu


jeenglis@girtab.usc.edu (Joe English Muffin) (09/23/90)

robertb@cs.washington.edu (Robert Bedichek) writes:
>In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
>>
>>	Could someone explain how the command "file" works? Specifically, I am
>>writing a program that allows users to navigate their $HOME directory and
><text deleted>

>I suggest that you read the man page for 'file'.  Also, read the file
>that the man page specifies as the database that 'file' uses.

Not all versions of 'file' use a separate database; I
believe the 4.2BSD 'file' has it hardcoded. (Not to
mention the fact that not all Unices have on-line
man pages, and not all sites make the hard-copy versions 
easy to get to, but that's another gripe :-)

To answer the original question, 'file' first does a
stat() to determine if the file is an executable,
setuid, symbolic link, etc.  Then it reads in the
first N characters of the file and checks it against a
predefined set of patterns.  Many of the patterns are
just ``magic numbers''; for example, under SunOS the
file types "mc68020 demand paged dynamically linked
executable" and "shell script" are determined from the
first two bytes of the file.
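
A rough sketch of those two steps in C, under some assumptions: the
function names and the one-entry magic table are invented for
illustration, and "#!" is taken to be the two bytes that mark a script.
This is not the source of any real 'file'; a real one also reports
things like setuid and executable bits and knows many more magic numbers.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for S_ISLNK and friends under strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Step 1: classify from the inode alone.  Returns NULL for a plain
 * regular file, meaning "go on and look at the contents". */
const char *describe_mode(mode_t mode)
{
    if (S_ISLNK(mode))  return "symbolic link";
    if (S_ISDIR(mode))  return "directory";
    if (S_ISCHR(mode))  return "character special";
    if (S_ISBLK(mode))  return "block special";
    if (S_ISFIFO(mode)) return "fifo";
    if (!S_ISREG(mode)) return "unknown file type";
    return NULL;
}

/* Step 2: compare the leading bytes against known magic numbers.
 * The table is hypothetical and holds a single entry. */
struct magic {
    const char *bytes;
    size_t len;
    const char *name;
};

static const struct magic magic_table[] = {
    { "#!", 2, "shell script" },    /* interpreter line */
};

/* Return the name of the first matching entry, or NULL if none match. */
const char *describe_magic(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < sizeof magic_table / sizeof magic_table[0]; i++) {
        const struct magic *m = &magic_table[i];
        if (len >= m->len && memcmp(buf, m->bytes, m->len) == 0)
            return m->name;
    }
    return NULL;
}
```

The intended use is to call lstat() on the path (so that a symbolic link
is seen as a link), pass st_mode to describe_mode(), and only when that
returns NULL read the first bytes and try describe_magic().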

Some of the other patterns it looks for are a little
more complicated; for example, a period at the
beginning of the line indicates "[nt]roff, tbl, or eqn
input" (which is why it tends to think makefiles are
for troff so often.)  Certain patterns of punctuation
and capitalization (not too sure what they are)
distinguish "English text" from "ascii text."

If none of the patterns match, it looks for
non-printable characters; if there are any it will
report "data", otherwise "ascii text."
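
That last fallback might be sketched as follows. Which bytes count as
"printable" is an assumption here (printable ASCII plus tab, newline,
carriage return and form feed); different versions of 'file' may draw
the line elsewhere. The function name is invented.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition of a "text" byte: printable ASCII or common
 * whitespace.  Everything else counts as non-printable. */
static int is_text_byte(unsigned char c)
{
    if (c >= 0x20 && c <= 0x7e)
        return 1;
    return c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Return 1 where 'file' would fall back to "ascii text", 0 where it
 * would fall back to "data".  An empty buffer returns 1 here, although
 * a real 'file' usually reports an empty file separately. */
int is_ascii_text(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (!is_text_byte(buf[i]))
            return 0;           /* one bad byte is enough: "data" */
    }
    return 1;                   /* nothing suspicious: "ascii text" */
}
```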

>There are many file types that editors will like besides files reported
>by 'file' as text.  For example shell scripts are usually reported as
>such and not as text.  So the result of 'file' isn't what I think that
>you want.  Also, some text editors can edit any file, including
>executable files.

This is true.  Your best bet is to write a simple C
program that reads in the first block of the file and
checks for non-printing characters and possibly for
lines that are too long as well. 
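
Something along these lines, perhaps. The block size, the line-length
limit, the verdict names and both function names are assumptions chosen
for illustration; pick limits that match what your editors can really
handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE 512      /* assumed size of the "first block" */
#define MAX_LINE   256      /* assumed longest line the editor accepts */

enum verdict {
    CANNOT_READ     = -1,
    LOOKS_OK        = 0,
    HAS_NONPRINTING = 1,
    LINE_TOO_LONG   = 2
};

/* Scan one buffer: reject bytes outside printable ASCII (tab and
 * newline are allowed) and any line longer than max_line bytes. */
enum verdict check_block(const unsigned char *buf, size_t len, size_t max_line)
{
    size_t i, line_len = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\n') {
            line_len = 0;       /* new line starts after this byte */
            continue;
        }
        if (c != '\t' && (c < 0x20 || c > 0x7e))
            return HAS_NONPRINTING;
        if (++line_len > max_line)
            return LINE_TOO_LONG;
    }
    return LOOKS_OK;
}

/* Read the first block of the named file and check it. */
enum verdict check_file(const char *path)
{
    unsigned char buf[BLOCK_SIZE];
    size_t n;
    int failed;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return CANNOT_READ;
    n = fread(buf, 1, sizeof buf, fp);
    failed = ferror(fp);
    fclose(fp);
    return failed ? CANNOT_READ : check_block(buf, n, MAX_LINE);
}
```

Only the first block is examined, so a file that turns binary later on
will still pass. That is the same trade-off 'file' makes.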

--Joe English

  jeenglis@alcor.usc.edu

tif@doorstop.austin.ibm.com (Paul Chamberlain) (09/24/90)

In article <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:
>	Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and ...

I agree that reading in the first block and making basic sanity checks is
probably the best way to decide whether the file is safe to edit.  However,
if you desire any more detail, I would seriously consider reading the output
of the "file" command itself.  Or if you have some deep reason to avoid that,
get one of the PD implementations of "file" and suck it into your source.
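
For the first option, reading the output of "file", a minimal sketch
follows. It runs "file" through fork() and execlp() rather than through
a shell, so that odd characters in a filename cannot be interpreted as
shell syntax. The function names are invented, and the parsing assumes
the output looks like "path: description" with the word "text" somewhere
in the description for text files; both are assumptions about your
local "file", not guarantees.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for fdopen() under strict compilers */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parse one line of output.  Returns 1 if the description mentions
 * "text", 0 if it does not, -1 if the line does not start with "path:". */
int output_says_text(const char *line, const char *path)
{
    size_t n;

    assert(line != NULL && path != NULL);
    n = strlen(path);
    if (strncmp(line, path, n) != 0 || line[n] != ':')
        return -1;
    return strstr(line + n + 1, "text") != NULL;
}

/* Run "file path" and interpret its first line of output.
 * Returns 1, 0 or -1 as above; -1 also covers any failure to run it. */
int file_says_text(const char *path)
{
    char line[4096];
    int fds[2], result = -1;
    pid_t pid;
    FILE *fp;

    assert(path != NULL);
    if (pipe(fds) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: stdout goes to the pipe */
        close(fds[0]);
        if (dup2(fds[1], STDOUT_FILENO) < 0)
            _exit(127);
        close(fds[1]);
        execlp("file", "file", path, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    close(fds[1]);                  /* parent: read the child's output */
    fp = fdopen(fds[0], "r");
    if (fp == NULL) {
        close(fds[0]);
    } else {
        if (fgets(line, sizeof line, fp) != NULL)
            result = output_says_text(line, path);
        fclose(fp);
    }
    waitpid(pid, NULL, 0);
    return result;
}
```

Pass a path that begins with "/" or "./" so that a leading "-" cannot be
mistaken for an option. Matching on the word "text" is deliberately
loose, since the exact wording of the description varies between
versions of "file".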

Paul Chamberlain | I do NOT represent IBM         tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/27/90)

In article <12141@chaph.usc.edu> jeenglis@girtab.usc.edu (Joe English Muffin) writes:
: Not all versions of 'file' use a separate database; I
: believe the 4.2BSD 'file' has it hardcoded. (Not to
: mention the fact that not all Unices have on-line
: man pages, and not all sites make the hard-copy versions 
: easy to get to, but that's another gripe :-)
: 
: To answer the original question, 'file' first does a
: stat() to determine if the file is an executable,
: setuid, symbolic link, etc.  Then it reads in the
: first N characters of the file and checks it against a
: predefined set of patterns.  Many of the patterns are
: just ``magic numbers''; for example, under SunOS the
: file types "mc68020 demand paged dynamically linked
: executable" and "shell script" are determined from the
: first two bytes of the file.
: 
: Some of the other patterns it looks for are a little
: more complicated; for example, a period at the
: beginning of the line indicates "[nt]roff, tbl, or eqn
: input" (which is why it tends to think makefiles are
: for troff so often.)  Certain patterns of punctuation
: and capitalization (not too sure what they are)
: distinguish "English text" from "ascii text."
: 
: If none of the patterns match, it looks for
: non-printable characters; if there are any it will
: report "data", otherwise "ascii text."

Nice summary.

The main problem with using "file" is that it might induce bitrot when "file"
mutates out from under you.  Just because "file" reports "ascii text"
today is no guarantee that it won't report "D-News history file" sometime
next year.  :-)

: >There are many file types that editors will like besides files reported
: >by 'file' as text.  For example shell scripts are usually reported as
: >such and not as text.  So the result of 'file' isn't what I think that
: >you want.  Also, some text editors can edit any file, including
: >executable files.
: 
: This is true.  Your best bet is to write a simple C
: program that reads in the first block of the file and
: checks for non-printing characters and possibly for
: lines that are too long as well. 

Why write another one?  I've already got one you can use.  :-)

	perl -e 'print "text" if -T shift' filename

If you really do want a "simple" C program, rip out the routine that Perl
uses, do_fttext().  (But be advised that "simple" programs are just about
as hard to maintain across multiple architectures as complicated ones.
You get a lot of leverage by installing something like Perl across all
your architectures.  End of sermon.)
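
For those who do want the C version, here is a sketch in the spirit of
the -T test. It is not do_fttext() and was not copied from the Perl
source. It follows the behaviour described in Perl's documentation as I
understand it: a NUL byte in the examined block means binary, and
otherwise the block is text unless too many of its bytes are "odd". The
one-third threshold, the definition of "odd", and the function names
are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition of an "odd" byte: anything that is neither
 * printable ASCII nor one of a few common control characters. */
static int is_odd_byte(unsigned char c)
{
    if (c >= 0x20 && c <= 0x7e)
        return 0;
    if (c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\b')
        return 0;
    return 1;
}

/* Guess whether buf[0..len) is text.  Returns 1 for text, 0 for binary.
 * An empty buffer is treated as text. */
int guess_is_text(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == 0)
            return 0;           /* a NUL byte settles it: binary */
        if (is_odd_byte(buf[i]))
            odd++;
    }
    return odd * 3 <= len;      /* text unless more than a third are odd */
}
```

Unlike an all-or-nothing scan, this tolerates the occasional 8-bit
character in otherwise ordinary text.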

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

drd@siia.mv.com (David Dick) (09/27/90)

In <171@alchemy.UUCP> bbs@alchemy.UUCP (BBS Administration) writes:


>	Could someone explain how the command "file" works? Specifically, I am
>writing a program that allows users to navigate their $HOME directory and
>any subdirectories (they cannot leave their $HOME directory though, for
>security reasons) to find files that are to be read into a text editor.
>Some text editor forks this program, and when the user selects a file to
>read, it writes the pathname to a temporary file which the editor reads 
>and then loads into its buffer.

[more description omitted]

I consider file(1) to be a useful heuristic program for manual use,
but I would never put it in a script for automatic use.

In other words, it's just a guesser, and does not contribute to 
making a robust application.

If you have particular requirements of a target file, you should
establish them with your own code.

David Dick
Software Innovations, Inc.

drd@siia.mv.com (David Dick) (09/27/90)

[initial query about file(1) and answers elided]

>Not all versions of 'file' use a separate database; I
>believe the 4.2BSD 'file' has it hardcoded. 

When we move a customer's applications to UNIX we often come
up with new file types.  Part of fully integrating an application
into UNIX is establishing magic numbers and making file(1) work, IMHO.

Forcing a hard-coded database makes this difficult, as well as 
being silly (e.g., is the extra efficiency really needed?) and 
quite contrary to the original idea of editable
control files in UNIX.  (Of course, how many other things are there
in BSD UNIX that are contrary to the original idea of UNIX? :-)

David Dick
Software Innovations, Inc. [the Software Moving Company (sm)]

chris@mimsy.umd.edu (Chris Torek) (09/29/90)

In article <1990Sep27.145844.28546@siia.mv.com> drd@siia.mv.com (David Dick)
writes:
[in response to `the BSD file program database is contained inside the BSD
file program']
>... a hard-coded database ... [is] quite contrary to the original
>idea of editable control files in UNIX.  (Of course, how many other
>things are there in BSD UNIX that are contrary to the original idea
>of UNIX? :-)

Sounds like revisionist history to me :-)

The 4.[0123]BSD `file' program is a direct descendant of the original
research Unix `file' program.  It was somewhere along the USG/USDL tree
that the System V `file' program acquired an external database.

CSRG have an external-database `file', but in some ways it has turned
out not to work as well.  We are still running a locally-hacked version
of the 4.3BSD `file' here.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris