[comp.unix.wizards] Aliasing text and data segments of a process

mjy@sdti.UUCP (Michael J. Young) (01/21/88)

Is there a way in Unix to create an "alias" between the text and data
segments of a process?  More specifically, how does one go about executing a
block of code that was generated in a data segment?

I'm not really talking about self-modifying code, in which a program
attempts to modify its own text segment, but rather self-generating code, in
which a program "compiles" a block of code into its data segment (created
via malloc() perhaps?) and then tries to execute it.  An obvious application
of this might be an incremental compiler, but I can think of other reasons
why I might want to do this as well.
-- 
Mike Young - Software Development Technologies, Inc., Sudbury MA 01776
UUCP     : {decvax,harvard,linus,mit-eddie}!necntc!necis!mrst!sdti!mjy
Internet : mjy%sdti.uucp@harvard.harvard.edu      Tel: +1 617 443 5779

alex@umbc3.UMD.EDU (Alex S. Crain) (01/21/88)

In article <202@sdti.UUCP> mjy@sdti.UUCP (Michael J. Young) writes:
>
>Is there a way in Unix to create an "alias" between the text and data
>segments of a process?  More specifically, how does one go about executing a
>block of code that was generated in a data segment?

*** SYSTEM 5 ****

Kyoto Common Lisp does this when it loads object code. The system builds
object files where the first symbol in the text segment is a function that
knows about all the other symbols in the file. There is an external loader
that makes a copy of the .o file and resolves all external symbols against
the lisp executable's symbol table. Lisp allocates space with brk(), and
loads the .o file as data, and then branches to the start of the text area
of the .o file, assuming that there is a function there that will put the 
rest of the symbols on the common obstack.

Boy, do things get weird when the .o file is corrupted :-).

But to answer the question, Nothing. That is, lisp doesn't do anything 
special to accomplish this, it just works. There is a short file that 
demonstrates this behavior, which I can send you if you like.
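
The registration scheme described above (one function at a known place that
publishes every other symbol in the file) can be sketched in portable C with
no machine-code tricks at all. In this sketch the "loaded file" is just a pair
of ordinary compiled functions, and every name (symtab_add, symtab_lookup,
module_init, load_and_call) is hypothetical; none of it is taken from Kyoto
Common Lisp itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side symbol table; the real Lisp structures differ. */
typedef int (*fn_t)(int, int);

struct symbol {
    const char *name;
    fn_t        fn;
};

#define MAX_SYMBOLS 16
static struct symbol symtab[MAX_SYMBOLS];
static size_t nsymbols;

/* Called by a module's init function to publish one of its symbols. */
int symtab_add(const char *name, fn_t fn)
{
    if (nsymbols == MAX_SYMBOLS)
        return -1;
    symtab[nsymbols].name = name;
    symtab[nsymbols].fn = fn;
    nsymbols++;
    return 0;
}

/* Returns the function registered under name, or NULL if there is none. */
fn_t symtab_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < nsymbols; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].fn;
    return NULL;
}

/* --- contents of the stand-in "loaded module" --- */
int mod_add(int x, int y) { return x + y; }
int mod_mul(int x, int y) { return x * y; }

/* First function in the module: the only address the host has to know. */
int module_init(int (*publish)(const char *, fn_t))
{
    if (publish("add", mod_add) != 0)
        return -1;
    return publish("mul", mod_mul);
}

/* Host side: run the module's entry point once, then call by name. */
int load_and_call(const char *name, int x, int y)
{
    fn_t fn;

    if (nsymbols == 0 && module_init(symtab_add) != 0)
        return -1;
    fn = symtab_lookup(name);
    return fn ? fn(x, y) : -1;
}
```

Here load_and_call("add", 7, 9) returns 16. The host never needs the module's
own symbol names in advance; it only needs to reach the entry point, which is
why a corrupted file fails so badly.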

-- 
					:alex.

alex@umbc3.umd.edu

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/22/88)

In article <202@sdti.UUCP> mjy@sdti.UUCP (Michael J. Young) writes:
>in which a program "compiles" a block of code into its data segment (created
>via malloc() perhaps?) and then tries to execute it.

In general, there is no support for doing this.  Some specific
implementations may have the necessary hooks, if the architecture
permits it.

There are several obvious alternative portable approaches to doing
something of this general nature.  Which to pursue would depend on
what you're really trying to accomplish, functionally.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/22/88)

In article <730@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
>loads the .o file as data, and then branches to the start of the text area
>of the .o file

This cannot possibly work on an architecture that enforces the
distinction between Instruction and Data spaces.

ed@mtxinu.UUCP (Ed Gould) (01/22/88)

>Is there a way in Unix to create an "alias" between the text and data
>segments of a process?  More specifically, how does one go about executing a
>block of code that was generated in a data segment?

It depends on the hardware architecture and on the implementation of
the operating system.  Some hardware (e.g., some PDP-11 models) allows
for a strict separation of instruction ("text") and data spaces.  On such
a machine, if the OS uses the feature, then (unless the text space is
writable, which it typically isn't) you're out of luck.

Other machines, like the VAX, do not rigidly separate instruction
and data spaces.  Even in this class of machine, though, there may
be pitfalls:  If the hardware has separate read and execute permissions
on regions of memory, then the same problem arises.  If the OS does
not supply execute permission for the data segment, then code can't
be executed from there.
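
Whether a given region carries execute permission can be checked at run time
on some systems. The sketch below is an assumption about present-day Linux
only: it reads the /proc/self/maps text file, which did not exist on the
systems discussed here. The function name region_is_executable is made up for
illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if addr lies in a mapping with execute permission,
 * 0 if it lies in a mapping without it, -1 if that cannot be determined.
 * Linux-specific: depends on the /proc/self/maps line format
 * "start-end perms ...", where perms looks like "r-xp". */
int region_is_executable(const void *addr)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[1024];
    uintptr_t target = (uintptr_t)addr;
    int result = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        uint64_t lo, hi;
        char perms[5] = "";

        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s", &lo, &hi, perms) != 3)
            continue;               /* not the start of a mapping line */
        if (target >= lo && target < hi) {
            result = (perms[2] == 'x');
            break;
        }
    }
    fclose(fp);
    return result;
}
```

Passing the address of a function would normally report 1, and the address of
a local variable would normally report 0, but that depends on how the program
was linked and on the kernel's policy.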

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,uunet}!mtxinu!ed    +1 415 644 0146

"`She's smart, for a woman, wonder how she got that way'..."

david@linc.cis.upenn.edu (David Feldman) (01/22/88)

You can execute out of the data segment, at least on SOME Unix systems.  In
Ultrix, you can tell the loader to make the code "IMPURE", although with cc
you usually get demand paged pure executables unless you specify the right
option for ld.  You can also execute code out of the stack, of course, and if
you catch signals you are forced to do this.  On receiving a signal, Ultrix
inserts a segment of code above the stack in the stack space - on a VAX at
least.  This code is the infamous 'sigtramp'.  So, yes, a program can be
modified while it is running.

As an aside, I had planned on writing a machine simulator which executed code
out of a malloc'ed memory space.  I never started the project, but I was able
to get some assembly running that jumped into a malloc space and then out
again.  I would assume that any Unix running on a machine that does not enforce
separate I & D could do this.  Check the manual page for ld.

					Dave F.
					david@linc.cis.upenn.edu

jc@minya.UUCP (01/23/88)

In article <7156@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> In article <730@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
> >loads the .o file as data, and then branches to the start of the text area
> >of the .o file
> 
> This cannot possibly work on an architecture that enforces the
> distinction between Instruction and Data spaces.

Jeez, why do they let such obvious non-wizards post responses to
unix.wizards? (:-)  There have been far too many such comments from
people who obviously haven't RTFM, in this case K&R.

Study the following program, which should work anywhere you have
a C compiler.  (If your compiler doesn't do it right, send it back
to the factory; it's obviously broken.)

|	#include <stdio.h>
|	char *code;		/* This can point to any address in memory */
|	int (*fct)();	/* We will point this at *code */
|	
|	foobar(x,y)
|	{	printf("foobar(%d,%d)\n",x,y);
|		return x + y;
|	}
|	main()
|	{	int i;
|	
|		code = (char*)foobar;	/* This could be malloc() */
|		fct = (int(*)())code;	/* Stuff random pointer into fct */
|		i = (*fct)(7,9);		/* Call random memory location */
|		printf("(*fct)(7,9)=%d\n",i);
|		exit(0);
|	}

Well, OK, he asked if there is Unix support, and there isn't.
So who needs it?  This oughta work on VMS or MS/DOS, too.

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

jc@minya.UUCP (01/23/88)

You know, it occurred to me that the I-space vs D-space just might
be a problem, so I looked in the manuals for this machine (which
might as well remain incognito).  Sure enough, there are supposedly
separate address spaces for both, as well as for IO-space.  So I
modified the little program I posted earlier:

| #include <stdio.h>
| extern char *malloc();
| char *code;		/* This will point into D-space */
| int (*fct)();	/* So will this */
| 
| foobar(x,y)
| {	printf("foobar(%d,%d)\n",x,y);
| 	return x + y;
| }
| main()
| {	int i;
| 	char *p, *q;
| 
| 	code = (char*)malloc(1000);
| 	p = code;
| 	q = (char*)foobar; 
| 	while (p < code+1000)	/* Make copy of foobar() in code[] */
| 		*p++ = *q++;
| 	fct = (int(*)())code;	/* Point to the malloc()ed data area */
| 	i = (*fct)(7,9);	/* Call the copy */
| 	printf("(*fct)(7,9)=%d\n",i);
| 	exit(0);
| }

It compiled and linked without problems, so with a bit of trepidation
I told Unix (Sys/V) to run it, and guess what it said? Give up?  OK,
here's the output:

| foobar(7,9)
| (*fct)(7,9)=16

Seems like it worked, didn't it?  Clever people, those C compiler writers!

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/25/88)

In article <452@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>In article <7156@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
>> In article <730@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
>> >loads the .o file as data, and then branches to the start of the text area
>> >of the .o file
>> This cannot possibly work on an architecture that enforces the
>> distinction between Instruction and Data spaces.
>Jeez, why do they let such obvious non-wizards post responses to
>unix.wizards? (:-)  There have been far too many such comments from
>people who obviously haven't RTFM, in this case K&R.

This issue has nothing to do with K&R.  It has to do with
hardware realities.  If the I&D space distinction is enforced,
as it is for example using "cc -i" on PDP-11s, then it is
indeed impossible to execute anything out of data space.
In fact, for such PDP-11s, the same range of addresses means
two totally different things, depending on whether data is
being accessed or an instruction is being fetched for
execution.

>Study the following program, which should work anywhere you have
>a C compiler.

Your example takes an I-space address, stashes it in a pointer
(of inappropriate type, but that's not the issue here), then
invokes an already-compiled function (which lives in I-space)
using it.  Of COURSE you can invoke an I-space function via a
pointer.  That is NOT AT ALL the same as what was requested,
which was to invoke a portion of D-space as a function.  THAT
cannot be done on a split-I&D PDP-11, for example.  Different
physical memory locations correspond to an I-space virtual
address and the SAME NUMERICAL VALUE as a D-space virtual
address.

If you still don't understand this, go find a split I&D PDP-11
and play with it for a while, or contact me for clarification,
rather than spreading erroneous information across the net.
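
The point about one numerical address naming two different physical locations
can be modeled in a few lines of C. This is a toy simulation, not PDP-11
memory-management code; the structure and function names are invented for
illustration, and the only machine-specific fact used is that 000207 (octal)
is the PDP-11 encoding of RTS PC.

```c
#include <stdint.h>

#define SPACE_WORDS 32768u   /* 64 KB of 16-bit words in each space */

/* Toy model of a split I&D machine: one virtual address, two memories. */
struct split_machine {
    uint16_t ispace[SPACE_WORDS];  /* reached only by instruction fetch */
    uint16_t dspace[SPACE_WORDS];  /* reached only by loads and stores  */
};

/* Data store: always lands in D-space. */
void store_word(struct split_machine *m, uint16_t vaddr, uint16_t value)
{
    m->dspace[vaddr >> 1] = value;
}

/* Data load: always reads D-space. */
uint16_t load_word(const struct split_machine *m, uint16_t vaddr)
{
    return m->dspace[vaddr >> 1];
}

/* Instruction fetch: always reads I-space, even for the same vaddr. */
uint16_t fetch_instruction(const struct split_machine *m, uint16_t vaddr)
{
    return m->ispace[vaddr >> 1];
}

/* Returns 1 if an instruction "generated" by a store is visible to fetch. */
int generated_code_is_fetchable(void)
{
    static struct split_machine m;     /* zero-initialized */
    const uint16_t addr = 01000;

    store_word(&m, addr, 0000207);     /* write RTS PC as data */
    return fetch_instruction(&m, addr) == 0000207;
}
```

In this model generated_code_is_fetchable() returns 0: the store and the
fetch use the same address but never touch the same word, and no cast of a
pointer in the program can change that.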

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/25/88)

In article <453@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>Sure enough, there are supposedly separate address spaces for both...

You have to ask the linker for this feature, assuming your hardware
and OS support it; it's not the default.

>| 	code = (char*)malloc(1000);

code -> D-space.

>| 	fct = (int(*)())code;	/* Point to the malloc()ed data area */
>| 	i = (*fct)(7,9);	/* Call the copy */

If this works, you do not have split I&D space.

rbj@icst-cmr.arpa (Root Boy Jim) (01/26/88)

   From: Doug Gwyn  <gwyn@brl-smoke.arpa>

   In article <730@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
   >loads the .o file as data, and then branches to the start of the text area
   >of the .o file

   This cannot possibly work on an architecture that enforces the
   distinction between Instruction and Data spaces.

While true, most such machines do not *insist* on enforcing the distinction,
or provide mechanisms around it where appropriate. Thus it is possible to
build three or four types of executables on a given system:

(1) Text and Data glommed together with only limits protection.
(2) Text sharable (and thus unmodifiable), data separate & possibly executable.
(3) Text and Data separate in separate I/D spaces. Here they share the same
    address range and as Doug mentions, never the twain shall meet.
(4) Demand paged, which is more or less like (2) above.

Each format has its own advantages and drawbacks. If you want to set
breakpoints in code, you must use the first type. If you want to dynamically
load code, you must either use this format, or execute from data space if
possible. This kind of stuff tends to vary across machines and systems.

The split I/D space PDP-11 is perhaps a bad example, as it is (or was)
possible to build other format executables, but I'm sure Doug *has* seen
machines where this is impossible.

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688
	I feel like a wet parking meter on Darvon!

mjy@sdti.UUCP (Michael J. Young) (01/28/88)

In article <11476@brl-adm.ARPA> rbj@icst-cmr.arpa (Root Boy Jim) writes:
>(1) Text and Data glommed together with only limits protection.
> ...
>Each format has its own advantages and drawbacks. If you want to set
>breakpoints in code, you must use the first type.

Actually, setting breakpoints is no problem even with the other three types.
That's what the ptrace(2) system call is for.  I suppose you could even
use ptrace() to "poke" an entire function into the text space.  The problem
arises when you want to change the size of the text region; ptrace() doesn't
let you do that.
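
The bookkeeping behind a poked breakpoint can be sketched as follows. This
is a sketch under stated assumptions: it assumes an x86-style machine where
the breakpoint instruction is the single byte 0xCC and where the byte at the
target address is the low-order byte of the word read there. The struct and
function names are hypothetical, and the ptrace() calls themselves are left
out so that the fragment stays portable.

```c
/* x86 assumption: INT3 is the one-byte opcode 0xCC. */
#define BREAKPOINT_OPCODE 0xCCul

struct breakpoint {
    unsigned long addr;        /* address in the traced process's text */
    unsigned long saved_word;  /* original contents, kept for restore  */
};

/* Records the original word and returns the word to poke in its place. */
unsigned long breakpoint_arm(struct breakpoint *bp,
                             unsigned long addr, unsigned long word)
{
    bp->addr = addr;
    bp->saved_word = word;
    return (word & ~0xFFul) | BREAKPOINT_OPCODE;
}

/* Returns the word to poke back when the breakpoint is removed. */
unsigned long breakpoint_disarm(const struct breakpoint *bp)
{
    return bp->saved_word;
}
```

A debugger would obtain word by reading the traced process's text through
ptrace() and would write the returned value back the same way. On current
Linux those requests are named PTRACE_PEEKTEXT and PTRACE_POKETEXT; older
systems spell them differently.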
-- 
Mike Young - Software Development Technologies, Inc., Sudbury MA 01776
UUCP     : {decvax,harvard,linus,mit-eddie}!necntc!necis!mrst!sdti!mjy
Internet : mjy%sdti.uucp@harvard.harvard.edu      Tel: +1 617 443 5779

naren@vcvax1.UUCP (naren) (01/28/88)

> In article <7156@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> > In article <730@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
> > >loads the .o file as data, and then branches to the start of the text area
> > >of the .o file
> > 
> > This cannot possibly work on an architecture that enforces the
> > distinction between Instruction and Data spaces.
> 
> Jeez, why do they let such obvious non-wizards post responses to
> unix.wizards? (:-)  There have been far too many such comments from
> people who obviously haven't RTFM, in this case K&R.
>
> [Sample program that malloc()'s and typecasts result to a func. ptr. deleted]
>
> John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

	Doug Gwyn is right about architectures that enforce distinctions 
between code and data spaces (ex: 80386). On UNIX/386, an sbrk() allocates 
space in the Data Segment of the process. Type casting this pointer and 
issuing a 'call' to this address will result in a protection exception. 
	Now, if you REALLY want to do this, you could write a new system call 
like mktext(vaddr, length) where vaddr is the start of the data space 
you would like to fill in with code.  mktext() would just create a new code 
segment descriptor in the LDT of your task that includes the desired 
section of data space and then you'd be all set. 
	I am of course leaving out a lot of the nitty-gritty details of 
how this feature would interact with other things like shared texts, etc.
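
Later Unix systems grew a call in roughly the spirit of the hypothetical
mktext(): mprotect(), which asks the kernel to change the permissions on a
range of pages. The sketch below is an assumption about such later systems
(mmap, mprotect, and MAP_ANONYMOUS are not part of the System V discussed
here), and the machine-code bytes are supplied only for x86 and AArch64. The
function name run_generated_code is made up; it returns -1 on any other
architecture or if the kernel refuses the request.

```c
#define _DEFAULT_SOURCE 1      /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Machine code for a function that returns 42, per architecture. */
#if defined(__x86_64__) || defined(__i386__)
static const unsigned char code_bytes[] = {
    0xB8, 0x2A, 0x00, 0x00, 0x00,   /* mov eax, 42 */
    0xC3                            /* ret         */
};
#define HAVE_CODE_BYTES 1
#elif defined(__aarch64__)
static const uint32_t code_bytes[] = {
    0x52800540u,                    /* mov w0, #42 */
    0xD65F03C0u                     /* ret         */
};
#define HAVE_CODE_BYTES 1
#else
#define HAVE_CODE_BYTES 0
#endif

/* Builds the code in writable memory, asks the kernel to make it
 * executable, calls it, and returns its result (or -1 on failure). */
int run_generated_code(void)
{
#if HAVE_CODE_BYTES
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    void *mem;
    int (*fn)(void);
    int result;

    if (page <= 0)
        return -1;
    len = (size_t)page;
    mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memcpy(mem, code_bytes, sizeof code_bytes);      /* "compile" */

    /* The step that plays the role of mktext(vaddr, length). */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }
#if defined(__GNUC__)
    /* Needed on machines whose instruction cache is not coherent. */
    __builtin___clear_cache((char *)mem, (char *)mem + sizeof code_bytes);
#endif
    fn = (int (*)(void))mem;   /* accepted on POSIX systems, not ISO C */
    result = fn();
    munmap(mem, len);
    return result;
#else
    return -1;
#endif
}
```

The memory is never writable and executable at the same time, which some
systems require. Everything that actually changes the protection happens
inside the kernel; the process only asks.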

...!{harvard,mit-eddie}!cybvax0!vcvax1!naren 	Naren Nachiappan.(617/661-1230)

weiser.pa@xerox.com (01/28/88)

"...in which a program "compiles" a block of code into its data segment (created
via malloc() perhaps?) and then tries to execute it.  "

Just do it.  Works fine on Suns and Vaxes.

-mark

jc@minya.UUCP (John Chambers) (01/28/88)

In article <11476@brl-adm.ARPA>, rbj@icst-cmr.arpa (Root Boy Jim) writes:
>  From: Doug Gwyn  <gwyn@brl-smoke.arpa>
>  > This cannot possibly work on an architecture that enforces the
>  > distinction between Instruction and Data spaces.
> While true, most such machines do not *insist* on enforcing the distinction,
> or provide mechanisms around it where appropriate. Thus it is possible to
> build three or four types of executables on a given system:
>
>	[examples deleted]

Jeez, what a turkey!  Here I was enjoying all the flames I was getting from
people telling me what a fool I was thinking that my examples might work on
machines with separate I and D spaces, and you had to go post descriptions
of how it might be implemented.  Now I'm going to have to find some other,
much less entertaining stuff to read.  

At least I have a few good SF books to turn to.

BTW, do you recall back in the early days of the obscure-C contest, there
was a cute entry that started:
	short main[] = {
followed by a jumbled list of numbers in various formats?  It was a program
that ran on PDP-11s and VAXen and did something reasonably silly.  Anyhow,
I tried it out on a PDP-11/75 that I had handy.  The machine definitely had
separate I and D spaces, and the program quite definitely worked.  I didn't
tell the compiler anything special, and I doubt the linker recognized that
_main was special and belonged in I space.  But neither the compiler nor the
linker was fazed by having main as a data array.

So far, I haven't heard from anyone that has claimed to try either of my
posted examples and found them not to work.  I was really hoping I'd get
responses from people telling me where they failed and how.  Well, maybe
if I wait long enough, I'll learn something.  [I do know personally of one
commercial system where breakpoints don't work because I space is unwritable;
I won't let on which one, in hopes I'll learn of more.]

On to the SF books...

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

wolfgang@mgm.mit.edu (Wolfgang Rupprecht) (01/29/88)

In article <207@sdti.UUCP> mjy@sdti.UUCP (0000-Michael J. Young) writes:
>Actually, setting breakpoints is no problem even with the other three types.
>That's what the ptrace(2) system call is for.

If you have a shared-text type of executable, you can't guarantee
ptrace-ability. If someone else is executing the same text, the system
is forced to deny you write-permission with the error 'text busy'.
(Otherwise the other process would also get hit with the breakpoints.)
The other process can still be corrupted however, if it is started
*after* you insert the breakpoints. Now it gets amusing, since you
can't *remove* the breakpoints once the other process is started.
--
Wolfgang Rupprecht	ARPA:  wolfgang@mgm.mit.edu (IP 18.82.0.114)
Freelance Consultant	UUCP:  mit-eddie!mgm.mit.edu!wolfgang
Boston, Ma.		VOICE: Hey_Wolfgang!_(617)_267-4365

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/30/88)

In article <749@umbc3.UMD.EDU> alex@umbc3.UMD.EDU (Alex S. Crain) writes:
>And if you can do it on your hardware, the method described will work.

But it may not work next year, even on the same hardware but especially
if you're porting to another system.  If these are valid concerns, then
he should find another way to accomplish what he's after.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/30/88)

In article <459@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>>  From: Doug Gwyn  <gwyn@brl-smoke.arpa>
>>  > This cannot possibly work on an architecture that enforces the
>>  > distinction between Instruction and Data spaces.
>I tried it out on a PDP-11/75 that I had handy.  The machine definitely had
>separate I and D spaces, and the program quite definitely worked.  I didn't
>tell the compiler anything special, and I doubt the linker recognized that
>_main was special and belonged in I space.  But neither the compiler nor the
>linker was fazed by having main as a data array.

You don't listen very well, do you?  Just because the underlying hardware
CAN enforce the distinction between I&D space doesn't mean that it always
DOES so.  In fact, the usual UNIX C implementation for a PDP-11 defaults
to a single shared address space, and only programs that need more space
(such as the f77 compiler) request split I&D spaces by specifying the
cc -i option when linking.  Try running these example programs with I&D
separation enforced, AS I SPECIFIED, and see what happens.

As someone (rbj?) said, the PDP-11 isn't the best example, due to its being
possible to set it up to blur the I&D distinction by default.  (Some cheaper
models couldn't be set up to enforce the distinction!)  I used the PDP-11 as
an example because it seemed the machine you were most likely to have access
to.  I have seen segment-based architectures (Burroughs B5500 comes to mind)
where the default behavior is to enforce the distinction, and I would be
very surprised if IBM's System/38 or the H-P 3000 didn't also do so.

The way to understand an issue is not to resort to blind experiments.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/30/88)

In article <246@vcvax1.UUCP> naren@vcvax1.UUCP (naren) writes:
>	Now, if you REALLY want to do this, you could write a new system call 
>like mktext(vaddr, length)...

In case it isn't obvious to everyone, the reason why this can be done is that
the operating system kernel has special privileges and can therefore shuffle
around virtual->physical address mappings and associated attributes, but an
ordinary user-mode process cannot do this itself.  That's why IPC via shared
memory requires kernel support, for example.
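
As an illustration of a process asking the kernel to arrange a shared
mapping, here is a minimal sketch. It assumes a system with mmap() and
MAP_SHARED | MAP_ANONYMOUS, which is a later interface than the System V
shm calls of this era; the function name share_int_with_child is invented.

```c
#define _DEFAULT_SOURCE 1      /* for MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* The child stores value into a page both processes can see; the parent
 * reads it back after the child exits. Returns the value read, or -1. */
int share_int_with_child(int value)
{
    int *shared;
    pid_t pid;
    int status;
    int result;

    /* The kernel sets up one physical page visible to both processes. */
    shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {             /* child: write through the mapping */
        *shared = value;
        _exit(0);
    }

    /* parent: wait for the child, then read what it wrote */
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        result = -1;
    else
        result = *shared;
    munmap(shared, sizeof *shared);
    return result;
}
```

With an ordinary private mapping the parent would still see 0 after the
child's store; only the mapping the kernel was asked to share carries the
value across.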

mjy@sdti.UUCP (Michael J. Young) (02/02/88)

In article <246@vcvax1.UUCP> naren@vcvax1.UUCP (naren) writes:
>	Doug Gwyn is right about architectures that enforce distinctions 
>between code and data spaces (ex: 80386). On UNIX/386, an sbrk() allocates 
>space in the Data Segment of the process. Type casting this pointer and 
>issuing a 'call' to this address will result in a protection exception. 

This happens on many other 80x86 ports as well.  Microport (the only 286
port I'm familiar with) enforces separation between text and data regions
as well.  Unfortunately, they don't seem to provide ld(1) options to
override the protection.  I received an email reply from T. Andrews, who
said that Xenix/286 provides a service and an ld(1) option to support this,
but I have no personal experience with it.

>	Now, if you REALLY want to do this, you could write a new system call 
>like mktext(vaddr, length) where vaddr is the start of the data space 
>you would like to fill in with code.  mktext() would just create a new code 
>segment descriptor in the LDT of your task that includes the desired 
>section of data space and then you'd be all set. 
>	I am of course leaving out a lot of the nitty-gritty details of 
>how this feature would interact with other things like shared texts, etc.

... and where to get the money to buy a source license! :-)

Seriously, though.  It seems to me that any implementation of Unix that
enforces separation should also provide a means around it, preferably in a
portable manner.  Does POSIX address this issue?

On systems that enforce separation of text and data, with no means of
"turning it off", it seems you are forced into using exec(2).  Can you
imagine trying to implement an incremental compiler where each new
function you create has to have its own a.out and be its own process?
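
A minimal sketch of that fallback, to make the cost concrete. It assumes a
POSIX-style fork/exec/waitpid and uses a shell command as a stand-in for a
separately linked a.out that holds one generated function; run_as_process is
a hypothetical name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `sh -c command` as a child process and returns its exit status
 * (0..255), or -1 if the child could not be started or died abnormally.
 * The command stands in for one "generated function" living in its own
 * executable file. */
int run_as_process(const char *command)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);             /* reached only if exec itself failed */
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_as_process("exit 16") returns 16. Every call costs a fork
and an exec, arguments have to be passed as strings, and the result is
squeezed into eight bits of exit status unless a pipe or file is added.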
-- 
Mike Young - Software Development Technologies, Inc., Sudbury MA 01776
UUCP     : {decvax,harvard,linus,mit-eddie}!necntc!necis!mrst!sdti!mjy
Internet : mjy%sdti.uucp@harvard.harvard.edu      Tel: +1 617 443 5779

jc@minya.UUCP (John Chambers) (02/04/88)

In article <7209@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> In article <246@vcvax1.UUCP> naren@vcvax1.UUCP (naren) writes:
> >	Now, if you REALLY want to do this, you could write a new system call 
> >like mktext(vaddr, length)...
> 
> In case it isn't obvious to everyone, the reason why this can be done is that
> the operating system kernel has special privileges and can therefore shuffle
> around virtual->physical address mappings and associated attributes, but an
> ordinary user-mode process cannot do this itself.  That's why IPC via shared
> memory requires kernel support, for example.

Uh, no it doesn't.  (Well, maybe it does on a PDP-11 :-).  I've personally
worked on one system where we had shared memory with no support whatsoever
from the kernel.  The poor li'l kernel didn't even suspect we were doing it.

Guess how we did it?  Give up?  (OK, turkey, tell 'em!)

The processor in question came up with the MM registers mapped in the obvious
way, so that real and virtual memory were identical.  The kernel was kept in
ignorance of the last few MM registers, which by some strange coincidence just
happened to point to real memory 'way up in the address space.  (Did I say that
this was a machine with 32-bit addresses?  Well, it was.)  This chunk of real
memory was quite distant from the main memory, and when the kernel did its scan
for usable memory, it didn't find the high chunk.  Like most Unix kernels, it
only believed in a single contiguous piece of real memory.

The effect was to map this small chunk of memory into the virtual address space
of every process, without the knowledge of the kernel.  Most processes also
didn't realize it was there.  Our run-time library did, and did some very
interesting things with it, and very fast.

OK, it's a kludge.  But then, memory-mapped anything is exactly the same
kind of kludge.  Just classify it as a memory-mapped device (as are quite
a lot of network interface boards these days), and it all makes sense.  
For instance, go talk to CMC about their ethernet boards.

This example had the advantage that we didn't have to try to figure out 
how the Sys/V shm package works.  I mean, it was pure elegance in comparison
with how AT&T would have liked us to do it. (:-)

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)