l-aron@obelix.liu.se (Lars Aronsson) (05/29/87)
Fork Off -- A UNIX Tool for Parallel Processing
Lars Aronsson
ABSTRACT
UNIX supports parallel processing and inter-
process communication through system calls like
fork and pipe. The ordinary shells (command
interpreters) and, hence, most existing programs
restricts parallel processing to next-to-
uniprogramming level concepts, supporting one-
dimensional "pipelines". Implementing more complex
dataflow structures using the system calls is dif-
ficult, and a solution with "temporary files" is
often chosen.
This paper introduces a programming language
for these, more complex, dataflow applications.
Fork Off is an interpreter for this programming
language.
Low Level UNIX
In UNIX, each running processes uses a certain number
of file descriptors, used for input and output operations.
These file descriptors may be connected to files or pipes. A
pipe is a channel for interprocess communication. Each pipe
has got two ends, one of which is allowed for write opera-
tions, the other for read. The ends should be connected to
different processes.
The UNIX operating system provides the user with the
following system calls to manipulate the input/output
environment.
fork Create a copy of the running process and continue
to execute both copies.
exec Exchange the running process with one running a
certain program.
pipe Create a pipe. Both ends are connected to different
file descriptors of the calling process.
dup Move the file referencies or pipe ends from one file
descriptors to another.
open Connect a file to a file descriptor.
close Release a certain file descriptor.
Using these system calls, virtually any configuration
of parallel, intercommunicating processes can be set up.
Traditional Shells
The traditional shells (e.g. the Bourne shell), allow
for so called pipelines. A pipeline is a one-dimensional
string of processes, all of which uses only two file
descriptors, one for input (called stdin, by convention hav-
ing the number zero) and one for output (stdout, number
one). This resembles the old FORTRAN tradition of letting
unit five default to the card reader and unit six to the
line printer. Sometimes, also special file descriptor is
used for error messages (stderr, number two). Error messages
from all process, then are mixed and printed on the termi-
nal.
The first process in the string reads its input (if
any) from the terminal or from a file. The output of the
first process is fed as input to the second process and so
on. The output of the last process is printed on the termi-
nal, printed on a line printer or saved into a file. In
Bourne shell, the syntax to command this kind of behaviour
is:
$ proc1 | proc2 | proc3 | ... | procN
This next-to-uniprogramming level of parallel process-
ing (it can be simulated on any uniprogrammed system using
only one temporary file) has been found very useful, and
resulted in a lot of programs called "filters", which only
read input and write output.
The algorithm (no AT&T code included!) for creating
such a pipeline is rather simple:
{ for (J=N;J>1;J--) {
pipe;
if (fork) {
dup; exec(procJ);
}
else
close; /* pipe ends that won't be used */
}
dup; exec(proc1);
}
Problems Arise
Although filters and pipelines solve lots of problems,
they don't solve all. The common solution when programs
require more than one input and/or more than one output
still is to use temporary files. Examples of such tasks are
merge sorting, concatenation, and using output of one pro-
cess as input for different operations. Examples of solu-
tions are:
$ sort -m file1 file2 Merge two sorted files.
$ cat file1 file2 Concatenate two files.
$ tee file Copy input to output and
also save it into the file.
The reasons for these uniprogramming-fasioned solutions
lie not in the UNIX kernel, but in the shells and the con-
vention of having one "stdin" and one "stdout".
A Non-Fork Off Solution
The following article has been posted to USEnet.
Subject: 2dpipe - implement 2 dimensional sh(1) pipes
Newsgroups: net.sources
Date: 23 Oct 86 05:53:23 MET
Message-ID: <188@metro.oz>
Reply-To: rossc@metro.ucc.su.oz (Ross Cartlidge)
Distribution: world
Organization: Uni of Sydney Computing Centre, Sydney AUSTRALIA
This is, no doubt, a solution to the problem, provided
your UNIX systems supports "named pipes". The article
inculdes three almost identical Bourne shell scripts, named
"0", "1", and "2", each of which accepts a "pipeline call"
as an argument, then forks and executes the pipeline, creats
a pipe to the file descriptor of that pipeline having the
same number as is the name of the shell script and, at last,
prints the name of the other end of the created pipe.
This calls for an example. Mr. Cartlidge suggests this
one, used to print the common lines between two unsorted
files, file1 and file2:
$ comm -12 `1 sort file1` `1 sort file2`
This is equivalent to:
$ sort file1 > temp1
$ sort file2 > temp2
$ comm -12 temp1 temp2
Or, why not "write out the list of all group name and
gid pairs which have a corresponding user name and uid pair
sorted by name and then by number into name_order and
number_order respectively"?
$ comm -12 `1 'cut -f1,3 -d: /etc/passwd | sort'`
`1 'cut -f1,3 -d: /etc/group | sort'` |
tee `0 'sort +0 -t: > name_order'` |
sort +1n -t: > number_order
As the reader can see, this eliminates the need for
temporary files, but results in a rather hard-to-read code.
The Fork Off Approach
In order to solve the problem with more style, a pro-
gram named Fork Off has been constructed. The synopsis for
calling Fork Off is:
$ fkoff script_file
The executable program is called fkoff for means of
simplicity. A script, describing the dataflow network, must
be supplied using a file. This is not contradictory to the
dataflow concept, where files are avoided. The Fork Off
call may be used as any UNIX program call (using the exec
system call). It might be part of a pipeline or it might be
referenced in a fkoff script file (this does not make it
recursive).
Dataflow Programming with Fork Off
In traditional, sequential programming, the main model
for describing programs is the control flow diagram. The
abstract notion of "control" starts at the top, circulates
through the program, and ends at the bottom of the diagram.
In dataflow programming, a data flow diagram is used
instead. Here, all "boxes" represent processes, which exe-
cute in parallel. Arrows between the boxes describe data
channels or pipes. Surruounding the entire diagram is a
box, which defines the "system". Arrows between boxes and
the surrounding box describe external communication.
In the Fork Off language, each process has its own
name. The surrounding box is represented by a virtual,
predefined process called "external". Each arrow (pipe) has
a number at each end. This is the number of the file
descriptor used to write or read the pipe.
The Fork Off Language
In order to create a certain "network" of processes and
pipes, a language to describe the network has to be
designed. Of course, a graphic language would be desirable,
but this would take too much effort to parse and be "incom-
patible" with the rest of the character-based UNIX system.
The language used here contains one line of "code" for each
process definition and one for each communication link (pipe
or file).
The full syntax of the Fork Off language is described
below. No LEX or YACC is needed to parse it. "n" and "m"
are decimal numbers ranging from zero to the constant _NFILE
(normally approx. twenty). Efforts have been made to make
the syntax resemble the Bourne shell input/output redirec-
tion syntax.
# ...comment... Line ignored.
name ::= process_call Define and name a process.
name1 n | m name2 Create a pipe from file descriptor n
of process name1 to file descriptor m
of process name2.
name1 n >& - File descriptor n of process name1
is closed.
name1 n <& - Same.
name1 n >& m name2 Writes to file descriptor n of
process name1 mix with writes to
file descriptor m of process name2.
name1 n <& m name2 Same for read operations.
name1 n > file Save output from file descriptor n
of process name1 to the file.
name1 n >> file Same, but append to file.
name1 n < file Read data for file descriptor n
of process name1 from the file.
Examples
First, we shall consider a one-dimensional pipeline
implemented with Fork Off.
The Bourne shell command line:
$ filter1 -opt | filter2 -opt
is equivalent to:
$ fkoff myscript
with myscript looking like:
# simple linear script
f1 ::= filter1 -opt
f2 ::= filter2 -opt
external 0 | 0 f1
f1 1 | 0 f2
f2 1 | 1 external
As shown in the second example, Fork Off dataflow
networks may be constructed using other similar networks.
Suppose we want to use the filter defined in our first exam-
ple to filter the error messages from a preprocessor to a
compiler before they are saved into a file.
# Filter error messages from preprocessor.
pp ::= /bin/preprocessor -flags
cc ::= /bin/compiler -flags
ff ::= fkoff myscript
external 0 | 0 pp
pp 1 | 0 cc
pp 2 | 0 ff
ff 1 > file.error
The third example shows the concept of a dataflow loop.
Imagine we have two chess-playing programs called chessmas-
ter and masterchess. Chessmaster and masterchess are imple-
mented as a filters, reading moves from "stdin" (file
descriptor zero) and writing moves to "stdout" (file
descriptor one). Moves are to be machine-readable. This is
not user-friendly. Chessmaster and masterchess each takes
one argument; a flag telling it whether to use black or
white pieces. White starts.
The Fork Off script is shown below. It uses no external
communication. All moves are stored into two files.
# Fork Off script to compare chess programs.
# Process definitions:
black ::= chessmaster -black
blackdog ::= tee blackfile
white ::= masterchess -white
whitedog ::= tee whitefile
# Description of interprocess communication:
black 1 | 0 blackdog
blackdog 1 | 0 white
white 1 | 0 whitedog
whitedog 1 | 0 black
These three examples uses standard filters. To really
explore the power of Fork Off, several new programs using
multiple inputs and outputs would have to be written.
Current Project Status
Fork Off has developed step by step since the early
fall of 1986. It is programmed in the C language on an
PDP-11/70 running 2.9BSD UNIX. Early kludges (incorporating
only processes and pipes) have worked. Creeping featuranism
(including the syntax described above and reforming the
internal representation of the dataflow network) has delayed
the project, however, and there is not yet a complete
version working. When a complete version will work, the
intension is to post it to USEnet. Until then, the author
would greatly appreciate any comments or suggestions on this
paper. The interrested reader may also acquire a copy of the
existing, non-working kludge called Fork Off.
Personal Information
The author of this paper and the programmer of Fork
Off, Lars Aronsson, is a twenty-one year old, undergraduate
student of Computer Science at the Linkoping University,
Sweden. Until mid June 1987 he can be reached through any
of the addresses below. Between mid June 1987 and mid April
1988, no communications but through the last (Orebro Snail-
Mail) address can be guaranteed. During that time, the
author will be trained to defend his King and his Father-
land.
Internet, UUCP, Arpa: l-aron@obelix.liu.se
Snail-Mail: Lars Aronsson
Rydsvagen 256 A:10
S-582 48 Linkoping, Sweden
Snail-Mail: Lars Aronsson,
Gustavsgatan 26
S-703 55 Orebro, Sweden