[net.unix-wizards] Magic Numbers

rob@alice.UUCP (03/05/84)

opus!rcd complains that the kernel has no business looking for the #!.
apparently, he doesn't understand that #! is a magic number, just like
0407 or 0413.  the string that is exec'ed is inside the header that the
kernel must read (note: must read; so much for 'no business opening' files)
to determine that the file is a binary and exec it.

trb@masscomp.UUCP (03/06/84)

I just think that it was really nice that whoever put in the #! feature
used the magic number 20443 for editable executables.  They used the
old 407 format for binary executables because it would have been hard
to type and noisy to print out (407 is ^G^A).

Seriously, 0407 was a PDP-11 branch instruction (a branch 7 words forward,
i.e. just past the 16-byte a.out header), which was actually executed to
jump over the header.

	Andy Tannenbaum   Masscomp Inc  Westford MA   (617) 692-6200 x274

tll@druxu.UUCP (LaidigTL) (03/06/84)

One little annoyance that appears with using "#!" as a magic number is
the old byte-ordering problem.  The magic number for an executable file
is defined to be a short int (maybe unsigned, I forget), so that the old
octal 407 is ^G^A on a PDP-11 or a VAX, but is ^A^G on some other
machines.  Similarly, depending on your machine, #! can have either of
two values.  You can get around this in several (semi-)portable ways, for
instance (a sketch of both tests follows the list):

	1)  Have the kernel do a strncmp to test against "#!", and integer
	tests for the magic numbers of binary executable files.  This is
	less efficient than one would like.

	2)  Test for the "#!" with an integer comparison against the
	multi-character constant '#!', if you believe in the portability
	of that construct.
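
A sketch of both tests, with hypothetical function names (this is not
taken from any real kernel):

	#include <string.h>

	#define A_MAGIC1 0407		/* old "impure" executable */
	#define A_MAGIC2 0410		/* read-only text */
	#define A_MAGIC3 0413		/* demand-paged */

	/* Method 1: byte-by-byte compare; immune to byte order. */
	int is_script_strncmp(char *hdr)
	{
		return strncmp(hdr, "#!", 2) == 0;
	}

	/* Method 2: integer compare against the multi-character constant
	   '#!' (020443 on a PDP-11 or VAX); its value is compiler- and
	   byte-order-dependent, so portability is a matter of faith. */
	int is_script_int(char *hdr)
	{
		short magic;
		memcpy(&magic, hdr, sizeof magic);
		return magic == '#!';
	}

	/* Binary executables are still checked as integers. */
	int is_binary(char *hdr)
	{
		short magic;
		memcpy(&magic, hdr, sizeof magic);
		return magic == A_MAGIC1 || magic == A_MAGIC2 || magic == A_MAGIC3;
	}

Whether method 2 actually matches depends on how your compiler packs
multi-character constants, which is the faith referred to above.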


			Tom Laidig
			AT&T Information Systems Laboratories, Denver
			...!ihnp4!druxu!tll

rcd@opus.UUCP (03/07/84)

 > opus!rcd complains that the kernel has no business looking for the #!.
 > apparently, he doesn't understand that #! is a magic number, just like
 > 0407 or 0413.  the string that is exec'ed is inside the header that the
 > kernel must read (note: must read; so much for 'no business opening' files)
 > to determine that the file is a binary and exec it.

Yes, the kernel has to open the file to look at the magic number and get
the header, etc.  However, it need not recognize a 020443 ("#!") magic
number as valid - that was really my point.  In the "old way", the kernel
would just fail an attempt to exec a shell script and the exec library code
in the process would pick up from there.  (This is just why the ENOEXEC
error code has the meaning it does.)  Exec(2) could be made to look for a
"#!" line and handle it appropriately.  This seems a little clunky perhaps,
but remember that the kernel just opened the file and read the start of it,
so the library code probably isn't going to generate any disk accesses when
it reopens the file to look again.
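
Sketchily, the library fallback amounts to something like this
(illustrative only, not the actual exec(3) source from any system):

	#include <errno.h>
	#include <unistd.h>

	int exec_with_fallback(char *path, char *argv[], char *envp[])
	{
		char *shargv[64];
		int i;

		execve(path, argv, envp);
		if (errno != ENOEXEC)
			return -1;		/* some other failure */

		/* Kernel didn't recognize it: assume a shell script and
		   hand the pathname to /bin/sh along with the old args. */
		shargv[0] = "sh";
		shargv[1] = path;
		for (i = 1; argv[i] != 0 && i < 62; i++)
			shargv[i + 1] = argv[i];
		shargv[i + 1] = 0;
		execve("/bin/sh", shargv, envp);
		return -1;
	}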

There's a lot of code in the Berkeley kernels to handle shell scripts -
take a look at it.  Since (1) the number of applications which need to do
exec's of arbitrary programs is relatively small, (2) the cost of handling
scripts in user code is small compared to the rest of the fork/exec/start-
up-a-shell overhead, and (3) the shell-script handling can be reasonably
put in the library code where it is pageable, it seems that it would be
better off there, particularly if you're at all tight on memory.  I will
also (re)state my opinion that the kernel is a bad (or at least awkward)
place from which to explain why it could not "exec" a shell script - it
would take STILL MORE precious kernel space for the code to generate good
error values, as well as yet more expansion of the set of errno values.
-- 

{hao,ucbvax,allegra}!nbires!rcd

kre@mulga.SUN (Robert Elz) (03/08/84)

The point of doing '#!' stuff inside the kernel is that it allows
setuid interpreted programs (including 'sh' scripts as a special case).

That can't be accomplished in any library routine, no matter
how hard you try.

With that, shell scripts become as versatile as
compiled (a.out format) executables, you can ALWAYS use
whichever is most appropriate, without being stopped by
implementation restrictions - which is just as it should be.

Another effect is that the name of the script (interpreted
program) goes into /usr/adm/acct instead of the ubiquitous 'sh'.

I might add that the original idea & code to do this were
by Dennis Ritchie (if my sources are correct, & they are
fairly good sources I think), and I added it to 4.1bsd.

Robert Elz
decvax!mulga!kre

ps: the code is reasonably portable - the "magic number" is the
string "#!", so it works, as is, whichever way your bytes are arranged.
And yes, it's fractionally slower than treating the magic number
as some horrible octal constant!

thomas@utah-gr.UUCP (Spencer W. Thomas) (03/09/84)

The ONE thing that putting #! into the kernel gets you over having the
exec library code do it (besides putting it in a probably more obvious
place) is setuid shell scripts.  This gives shell scripts an equal
footing with all other programs - you don't have to explain to users
that if they write a C program and chmod u+s it works, but if they
write a shell script and chmod u+s it doesn't.

=Spencer

merlyn@sequent.UUCP (03/09/84)

[from nbires!rcd...]

[[ There's a lot of code in the Berkeley kernels to handle shell scripts -
[[ take a look at it.  Since (1) the number of applications which need to do
[[ exec's of arbitrary programs is relatively small, (2) the cost of handling
[[ scripts in user code is small compared to the rest of the fork/exec/start-
[[ up-a-shell overhead, and (3) the shell-script handling can be reasonably
[[ put in the library code where it is pageable, it seems that it would be
[[ better off there, particularly if you're at all tight on memory.  I will
[[ also (re)state my opinion that the kernel is a bad (or at least awkward)
[[ place from which to explain why it could not "exec" a shell script - it
[[ would take STILL MORE precious kernel space for the code to generate good
[[ error values, as well as yet more expansion of the set of errno values.
[[ -- 

The shell script code CANNOT be put into a library for certain cases.
The one I think of is that you cannot then have execute-only shell
scripts.  The kernel MUST open the file, and make it stdin to the
designated program.  Arbitrary interpreters (a la /bin/sh) would not
have enough power to open the file (if read-denied) and start up the
desired other interpreter.  Setuid interpreted scripts also become
available (such as an execute-only, read-protected, setuid csh
script); this, too, can only be done at the kernel level, not the
application level.  I thank Berkeley
for putting in this feature, and appreciate it (and use it) quite
regularly.

Randal L. Schwartz, esq.
Sequent Computer Systems, Inc.
UUCP: ...!tektronix!ogcvax!sequent!merlyn
BELL: (503)626-5700

P.S. It's true... the infamous UUCP-breaker from last year is back!
After 10 months of VMS, I get to play with UNIX again!  What a deal.

henry@utzoo.UUCP (Henry Spencer) (03/11/84)

Yup, #! in the kernel permits setuid shell scripts.  I'm not sure
that this is a virtue, considering that people seem to be unaware of
the simply appalling number of security holes this opens up.  If you
think about the consequences of feeding a setuid shell file a non-
standard value of the IFS variable, with some suitably-named programs
lying around ready and waiting, you will have some idea of the sort
of things I'm referring to.  Shell files simply are not in a good
position to handle things like this; the interpretation process for
them is too complex and there is too little control over it.

This does not mean that I'm opposed to #! in the kernel, just that
setuid shell scripts seem a very weak justification for it, given
that they are grossly unsafe.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

george@idis.UUCP (George Rosenberg) (03/11/84)

I think that giving the kernel the ability to exec files
that can be interpreted is a good feature.
I don't use a system with this feature.
Since there is a discussion about it, I will bring up some
points that I am curious about.

The programs that perform the most execs, sh and csh,
will also be the interpreters that most scripts employ.
If an exec fails the Bourne shell will interpret
a script via a longjmp.
I assume that a longjmp is much cheaper than completing an exec.
Has anyone compared the two?
Is there a way that an informed program such as a shell interpreter
can tell the kernel to return without completing the exec,
if the old program and new interpreter are the same file and
the setuid and setgid bits of the script are not set?

From what I understand, exec'ing a file that begins with "#!"
will examine the next two tokens on the first line.
The first token must be a path name of an interpreter.
The tokens are inserted before the old argument list.
The execute bit and the setuid and setgid bits of the script are honored.
The interpreter is responsible for opening the script.
Can the interpreter also have "#!" as its magic number?
It seems to me that a design that also accommodated access modes
in which a file containing a script was executable but not readable
would also have been desirable.
This could have been done by simulating an open for reading (and perhaps
also a deferred FIOCLEX) on a particular file descriptor (such as 19).
I think that simulating an lseek to the beginning of the second line
would also be desirable.
Does anyone know why nothing like this was added at the same time?
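
For concreteness, here is my (quite possibly mistaken) understanding of
the argument rewriting, expressed as the user-level equivalent; the
"/bin/awk -f" interpreter line is just an example:

	#include <unistd.h>

	/* Exec'ing a script whose first line is "#! /bin/awk -f" with
	   arguments argv/envp behaves roughly as if the caller had done: */
	int exec_script(char *script_path, char *argv[], char *envp[])
	{
		char *newargv[256];
		int i, n = 0;

		newargv[n++] = "/bin/awk";	/* first token: interpreter */
		newargv[n++] = "-f";		/* optional second token */
		newargv[n++] = script_path;	/* so the interpreter can
						   open the script itself */
		for (i = 1; argv[i] != 0 && n < 255; i++)
			newargv[n++] = argv[i];	/* then the old arguments */
		newargv[n] = 0;
		return execve(newargv[0], newargv, envp);
	}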

	George Rosenberg
	duke!mcnc!idis!george
	decvax!idis!george

aegl@edee.UUCP (Tony Luck) (03/19/84)

One really useful effect of having the kernel look for '#!' is that shell
scripts are then really just like other executables (there should be no
way to tell by running it whether a program is a shell file or a binary)
i.e. you can make them setuid if you need to, or use them as login shells
without having to fix login/newgrp/su to know about ENOEXEC.

Tony Luck      { ... UK !ukc!edcaad!edee!aegl }

knop@dutesta.UUCP (Peter Knoppers) (04/08/84)

We have a PWB/UNIX system running on some PDP11/45's.  Our shell is
altered to operate on set-uid scripts.  The shell is set-uid to root.
On entry it checks the mode of the script.  If the s-bit is set, the
shell does a setuid to the user id of the file; if not, the shell
changes its privileges back to those of the real user.
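
In outline, the check amounts to this (a sketch only; the shell's actual
code differs in detail):

	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Called by our (set-uid root) shell just after it has decided
	   which script file it is about to interpret. */
	void take_script_identity(char *script)
	{
		struct stat st;

		if (stat(script, &st) == 0 && (st.st_mode & S_ISUID))
			setuid(st.st_uid);	/* run as the script's owner */
		else
			setuid(getuid());	/* drop back to the real user */
	}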

-- 
Peter Knoppers, Delft Univ. of Technology
..!{decvax,philabs}!mcvax!dutesta!knop

allbery@ncoast.UUCP (Brandon Allbery) (12/05/85)

Quoted from <416@ihdev.UUCP> ["Re: magic numbers? (teach me, please)"], by pdg@ihdev.UUCP (P. D. Guthrie)...
+---------------
| In article <124@rexago1.UUCP> rich@rexago1.UUCP (K. Richard Magill) writes:
| >	1)  How does the shell (exec?) know whether the command I just typed
| >		is a shell script or one of several possible types of
| >		executable?
| 
| The shell doesn't know.  The shell merely tells the kernel to exec the
| file, after doing a fork.  The kernel determines if a file is a binary
| executable by the magic number, which is obtained by reading an a.out.h
| structure (4.1,4.2) or filehdr.h (sys 5) and comparing it against
| hardcoded numbers in the kernel. In 4.1 for instance only 407,413 and
| 410 are legal.  This also tells the kernel the specific type of
| executable, and in some cases can set emulation modes. The kernel also
| recognizes 
| #! /your/shellname
| at the beginning of a file and execs off the appropriate shell instead.
+---------------

In 4.2, the #! is recognized.  In all other Unices, the exec will fail, and the
shell will decide that the file must be a shell script; it proceeds to fork off
a copy of itself to run the script.  (Csh on non-4.2 systems checks for a # as
the first character of the file, and forks itself if it sees it; if not, it
forks a /bin/sh.)

+---------------
| >	2)  Presuming the answer to #1 above has something to do with
| >		magic numbers, who issues them?  is there a common
| >		(definitive) base of them or does each
| >		manufacturer/environment make up their own set?
| 
| The magic number is issued by the linker/loader.  Pretty much the magic
| number is decided by the manufacturer, but from what I have seen, is
| kept constant over machines. Forgive me if this is wrong, but I do not
| have any method of checking, but the magic numbers for say plain
| executable 4.x Vax and plain executable SysV.x Vax are the same, but
| SysV.x Vax and SysV.x 3B20 are different.  Could someone comfirm this?
+---------------

Executables using ``standard'' binary formats, i.e. a.out (PDP-11, Z8000)
and b.out (MC68000) use the standard magic numbers 0405, 0407, 0410, 0411.
Non-standard formats, like Xenix x.out (0x0206) and COFF (flames to /dev/null;
most systems are [ab].out) use distinctive magic numbers.

There are other magic numbers.  Old-style archives (ar) have 0177545 as a
magic number; again, the loader knows about this, since a library is an
archive.  System V archives begin with the magic ``number'' "!<arch>\n".
Cpio archives also have magic numbers in them, but at the archive-member
level.
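
For illustration, a toy classifier built from the magic numbers above
(the function is made up; the binary magics are read assuming
PDP-11/VAX byte order):

	#include <stdio.h>
	#include <string.h>

	char *classify(FILE *f)
	{
		unsigned char buf[8];
		unsigned int magic;

		if (fread(buf, 1, sizeof buf, f) != sizeof buf)
			return "too short to tell";
		magic = buf[0] | (buf[1] << 8);	/* low byte first (PDP-11/VAX) */

		if (buf[0] == '#' && buf[1] == '!')
			return "interpreter script";
		if (magic == 0405 || magic == 0407 || magic == 0410 ||
		    magic == 0411 || magic == 0413)
			return "[ab].out executable";
		if (magic == 0177545)
			return "old-style ar archive";
		if (memcmp(buf, "!<arch>\n", 8) == 0)
			return "System V (or 4.xBSD) ar archive";
		if (magic == 070707 || memcmp(buf, "070707", 6) == 0)
			return "cpio archive";
		return "something else";
	}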
-- 

			Lord Charteris (thurb)

ncoast!allbery@Case.CSNet (ncoast!allbery%Case.CSNet@CSNet-Relay.ARPA)
..decvax!cwruecmp!ncoast!allbery (..ncoast!tdi2!root for business)
6615 Center St., Mentor, OH 44060 (I moved) --Phone: +01 216 974 9210
CIS 74106,1032 -- MCI MAIL BALLBERY (WARNING: I am only a part-time denizen...)

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (12/06/85)

> Csh on non-4.2 systems checks for a # as
> the first character of the file, and forks itself if it sees it; if not, it
> forks a /bin/sh.

Yes, some of them do, but if they do that's a bug.
Virtually all of my Bourne shell scripts (including
SVR2 system utility scripts) start with "#".

guy@sun.uucp (Guy Harris) (12/07/85)

> Executables using ``standard'' binary formats, i.e. a.out (PDP-11, Z8000)
> and b.out (MC68000) use the standard magic numbers 0405, 0407, 0410, 0411.
> Non-standard formats, like Xenix x.out (0x0206) and COFF (flames to
> /dev/null; most systems are [ab].out) use distinctive magic numbers.

Well, VAX UNIX (32V, 4.xBSD, System III, Version 8?) also uses those magic
numbers (with 413 added for demand paged executables on 4.xBSD), and
probably lots of other 4.xBSD systems (Sun's does).  Does "most" mean "most
UNIX implementations" or "most boxes running UNIX"?  If the latter, I think
Xenix is running on a lot of systems, possibly most.  Then again, *my* copy
of "Xenix(TM) Standard Object File Format (January 1983)" implies that that
"0x0206" is the "magic number" and is *not* distinctive; the "x_cpu" field
indicates what CPU it's intended for.  (This is sort of like the new Sun
UNIX 3.0 object file format, where the "a_machtype" field indicates whether
it's intended for a 68010 or 68020).
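
Schematically, the idea is a header whose magic number names the format
and whose separate field names the CPU (this layout is invented for
illustration, not copied from the Xenix or Sun include files):

	struct generic_header {
		unsigned short	magic;	/* format: e.g. 0x0206 for x.out */
		unsigned short	cpu;	/* target CPU (cf. x_cpu, a_machtype) */
		/* ... sizes of text, data, bss, symbols, etc. ... */
	};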

COFF seems to invert this, since the "file header" indicates what machine
it's intended for (and tons of other glop) and the "UNIX header" (which is
basically the old a.out header) has the 0405, 0407, 0410, 0411, and 0413
(yes, that's what they use for paged executables, surprise surprise) which
indicates the format of the image but is machine-independent (modulo byte
ordering).  Then again, the "file header" magic number seems to indicate
something about the format of the executable, but see a previous posting of
mine for some dyspepsia caused by the proliferation of multiple file header
magic numbers.

> There are other magic numbers.  Old-style archives (ar) have 0177545 as a
> magic number; again, the loader knows about this, since a library is an
> archive.  System V archives begin with the magic ``number'' "!<arch>\n".

System V Release 2 archives, anyway; System V Release 1 had a portable
archive format different from the 4.xBSD one, which was the first to use
the "!<arch>\n" magic "number".  I'm told they came to their senses
because Version 8, being 4.1BSD-based, used that format.

> Cpio archives also have magic numbers in them, but at the archive-member
> level.

No, it has a magic number at the beginning - 070707 (either as a "short" or
a string, depending on whether it's an old cruddy "cpio" archive or a nice
new "gee, we've finally caught up with 'tar' when it comes to portability"
"cpio -c" archive).  (S3 had "-c", but it had a bug so it wasn't really
portable.  S5 fixed this bug.  S5 also broke the byte-swapping garbage:

	S3 had an option to swap the bytes within 2-byte quantities.
	Presumably, this was because running the tape through "dd" to
	byte-swap *everything*, and then byte-swapping the data and
	pathnames inside "cpio", thus swapping the binary portion of the
	header once and everything else twice, is obviously more efficient
	than just swapping the binary portion of the header once.  ("cpio"
	already has hacks to deal with 4-byte quantities - namely,
	file size and modified time - automagically, by shoving "1L" into
	a "long" and seeing whether the 0th byte of that "long" is 0 or
	not, so PDP-11s and VAXes don't have problems.)  It is also
	obvious that forcing the user to specify a byte-swapping option
	is better than just looking at the magic number and seeing whether
	it's 070707 or a byte-swapped 070707 and deciding whether to
	swap or not based on that.

	Whoever worked on "cpio" for S5 obviously figured that the
	purpose of this byte-swapping crap was to make it possible to
	move binary data between machines with different byte orders
	(as everybody knows, most files with binary data are continuous
	streams of 2-byte or 4-byte quantities), not to provide a gross
	and kludgy way of byte-swapping the binary portion of a "cpio"
	header, so they added an option to swap the 2-byte portions
	of 4-byte quantities ("stupid FP-11", to quote - if I remember
	correctly - the VAX System III linker, that particular piece of
	DEC hardware being responsible for some PDP-11 software, including
	but *NOT* limited to UNIX, having a different format for 32-bit
	integers than the VAX's hardware supports) and an option to
	swap both bytes and 2-byte quantities.  They also "fixed" it
	not to swap the bytes of the pathnames.  This "fix" means that
	running the "cpio" archive through "dd" to swap the bytes, and
	then doing a byte swap again in "cpio", results in path names
	with their bytes swapped!  ("/nuxi", anyone?)  In effect, you
	are now screwed if you have a "cpio" tape, not made with "-c",
	which was produced on a machine with a different byte order.
	You can't read it in conveniently.  (This has been experimentally
	verified.  I had to whip up a version of "cpio" which does what
	"cpio" should have done in the first place - namely, just byte
	swap the damn "short"s in the header - to read a tape made on
	a System V VAX using the System V "cpio" on a Sun.))
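
For the record, the check in question is trivial; something along these
lines (a sketch of the idea, not the code I actually whipped up):

	#define CPIO_MAGIC	070707
	#define HDR_SHORTS	13	/* 26-byte binary cpio header = 13 shorts */

	static unsigned short swab16(unsigned short s)
	{
		return (unsigned short)((s << 8) | (s >> 8));
	}

	/* Returns 1 if the header had to be byte-swapped, 0 if it was in
	   native order, -1 if it isn't a binary cpio header at all.
	   (The two-short "long" fields - mtime and filesize - may still
	   need their halves exchanged, as discussed above.) */
	int fix_cpio_header(unsigned short hdr[HDR_SHORTS])
	{
		int i;

		if (hdr[0] == CPIO_MAGIC)
			return 0;
		if (swab16(hdr[0]) != CPIO_MAGIC)
			return -1;
		for (i = 0; i < HDR_SHORTS; i++)
			hdr[i] = swab16(hdr[i]);
		return 1;
	}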

There are a number of quite intelligent and talented people working on UNIX
development at AT&T Information Systems.  It looks like the people in charge
of keeping track of COFF magic numbers, and in charge of "cpio", are in need
of some supervision by the aforementioned people.  (Fortunately, it looks
like the IEEE P1003 committee is looking at a "tar"-based format, with fixes
to support storing information about directories and special files, for
tapes.)  I'm told that the European UNIX vendor consortium, X/OPEN, chose a
"cpio" format because of the "cpio" *program*'s byte-swapping
"capabilities".  Aside from the basic stupidity (and incorrectness, in the
case of the S5 "cpio") of these "capabilities", they are irrelevant to the
choice of tape *format* because:

	1) "tar" doesn't need byte-swapping options because the
	   control information is in printable ASCII string format
	   (any tape controller which is good for anything other than
	   a target for skeet-shooting will write character strings
	   in memory out to the tape in character-string order)

	2) "cpio" has the "-c" option which does the same thing,
	   so it doesn't need those options except for reading old
	   tapes (any reasonable "cpio"-format-based standard would
	   be based on "cpio -c" format, not "cpio" format),

and
	3) a *good* program which handles "cpio" format can figure
	   out the byte order it needs for reading pre-"cpio -c"
	   tapes by looking at the magic number anyway!

(Flame off, until next time a collection of stupidities this gross comes to
light.)

	Guy Harris