[comp.unix] FORK OFF --UNIX Dataflow programming

l-aron@obelix.liu.se (Lars Aronsson) (05/29/87)

      Fork Off -- A UNIX Tool for Parallel Processing


                       Lars Aronsson


                          ABSTRACT

          UNIX supports parallel processing and inter-
     process communication through system calls like
     fork and pipe.  The ordinary shells (command
     interpreters) and, hence, most existing programs
     restrict parallel processing to next-to-
     uniprogramming level concepts, supporting only
     one-dimensional "pipelines".  Implementing more
     complex dataflow structures using the system calls
     directly is difficult, so a solution with
     "temporary files" is often chosen.

          This paper introduces a programming language
     for these more complex dataflow applications.
     Fork Off is an interpreter for this programming
     language.



Low Level UNIX

     In UNIX, each running process uses a certain number of
file descriptors for input and output operations.  These file
descriptors may be connected to files or pipes.  A pipe is a
channel for interprocess communication.  Each pipe has two
ends: one is allowed for write operations, the other for
read.  The two ends should be connected to different
processes.

     The UNIX operating system provides the  user  with  the
following   system  calls  to  manipulate  the  input/output
environment.

fork    Create a copy of the running process and continue
        to execute both copies.
exec    Replace the running process with one running a
        certain program.
pipe    Create a pipe.  The two ends are connected to
        different file descriptors of the calling process.
dup     Copy a file reference or pipe end from one file
        descriptor to another.
open    Connect a file to a file descriptor.
close   Release a certain file descriptor.

     Using these system calls, virtually  any  configuration
of parallel, intercommunicating processes can be set up.
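
     As an aside (this editor's illustration, not part of the
paper), the Bourne shell's own redirection machinery wraps
open, dup, and close, so three of these calls can be tried
directly from a shell prompt; the scratch file name below is
made up:

```shell
# the shell's exec builtin manipulates file descriptors directly
echo hello > demo_file     # a scratch file (hypothetical name)
exec 3< demo_file          # open:  connect the file to descriptor 3
read line <&3              # dup:   read through a copy of descriptor 3
exec 3<&-                  # close: release descriptor 3
echo "$line"               # prints: hello
rm -f demo_file
```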

Traditional Shells

     The traditional shells (e.g. the Bourne shell) allow for
so-called pipelines.  A pipeline is a one-dimensional string
of processes, all of which use only two file descriptors: one
for input (called stdin, by convention having the number
zero) and one for output (stdout, number one).  This
resembles the old FORTRAN tradition of letting unit five
default to the card reader and unit six to the line printer.
Sometimes, a third, special file descriptor is also used for
error messages (stderr, number two).  Error messages from all
processes are then mixed and printed on the terminal.

     The first process in the string  reads  its  input  (if
any)  from  the  terminal or from a file.  The output of the
first process is fed as input to the second process  and  so
on.  The output of the last process is printed on the termi-
nal, printed on a line printer or  saved  into  a  file.  In
Bourne  shell,  the syntax to command this kind of behaviour
is:

$ proc1 | proc2 | proc3 | ... | procN


     This next-to-uniprogramming level of parallel  process-
ing  (it  can be simulated on any uniprogrammed system using
only one temporary file) has been  found  very  useful,  and
resulted  in  a lot of programs called "filters", which only
read input and write output.
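
     A typical instance, counting word frequencies with three
standard filters (a sketch by this editor, not an example
from the paper):

```shell
# every stage reads stdin and writes stdout; the shell glues
# the filters together with pipes
printf 'to be or not to be\n' |
    tr ' ' '\n' |          # split the line into one word per line
    sort |                 # bring equal words together
    uniq -c                # count each run of equal words
```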

     The algorithm (no AT&T  code  included!)  for  creating
such a pipeline is rather simple:

{   for (J = N; J > 1; J--) {
        pipe;                   /* connect procJ-1 to procJ */
        if (fork) {
            dup; exec(procJ);   /* one copy moves the pipe ends
                                   onto fd 0 and 1, then becomes
                                   procJ */
        }
        else
            close;              /* the other closes the pipe ends
                                   it won't use and continues the
                                   loop */
    }
    dup; exec(proc1);           /* the last copy becomes proc1 */
}


Problems Arise

     Although filters and pipelines solve lots of problems,
they don't solve them all.  The common solution when a
program requires more than one input and/or more than one
output is still to use temporary files.  Examples of such
tasks are merge sorting, concatenation, and using the output
of one process as input to several different operations.
Examples of solutions are:

$ sort -m file1 file2   Merge two sorted files.
$ cat file1 file2       Concatenate two files.
$ tee file              Copy input to output and
                        also save it into the file.
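
     For instance, merging the sorted outputs of two separate
pipelines forces a detour over the disk; a small sketch by
this editor, with made-up data and file names:

```shell
# without multi-input pipes, each sorted stream must first be
# staged in a temporary file before sort -m can merge them
printf 'apple\ncherry\n' > temp1     # stand-in for a first pipeline
printf 'banana\ndate\n'  > temp2     # stand-in for a second pipeline
sort -m temp1 temp2                  # prints the four lines in order
rm -f temp1 temp2
```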


     The reasons for these uniprogramming-fashioned solutions
lie not in the UNIX kernel, but in the shells and in the
convention of having one "stdin" and one "stdout".

A Non-Fork Off Solution

     The following article has been posted to USEnet.

Subject:        2dpipe - implement 2 dimensional sh(1) pipes
Newsgroups:     net.sources
Date:           23 Oct 86 05:53:23 MET
Message-ID:     <188@metro.oz>
Reply-To:       rossc@metro.ucc.su.oz (Ross Cartlidge)
Distribution:   world
Organization:   Uni of Sydney Computing Centre, Sydney AUSTRALIA


     This is, no doubt, a solution to the problem, provided
your UNIX system supports "named pipes".  The article
includes three almost identical Bourne shell scripts, named
"0", "1", and "2".  Each of them accepts a "pipeline call" as
an argument, forks and executes the pipeline, creates a named
pipe connected to the file descriptor of that pipeline whose
number matches the name of the script and, at last, prints
the name of the other end of the created pipe.
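
     The idea can be sketched as a Bourne shell function.
Everything below (the name out1, the use of mktemp and
mkfifo) is this editor's guess at the technique, not
Mr. Cartlidge's code:

```shell
# out1 (hypothetical name): run the given command in the background
# with its file descriptor 1 writing into a named pipe, and print
# the pipe's name so the caller can pass it on as a file argument
out1 () {
    np=`mktemp -u` || return 1   # a fresh pathname (mktemp assumed)
    mkfifo "$np"                 # "mknod $np p" on older systems
    ( eval "$*" > "$np"; rm -f "$np" ) > /dev/null 2>&1 &
    echo "$np"
}
```

The background subshell's standard output is pointed at
/dev/null so that a caller using command substitution sees
end-of-file as soon as the pipe's name has been printed,
instead of waiting for the whole pipeline to finish.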

     This calls for an example.  Mr. Cartlidge suggests this
one,  used  to  print  the common lines between two unsorted
files, file1 and file2:

$ comm -12 `1 sort file1` `1 sort file2`

This is equivalent to:

$ sort file1 > temp1
$ sort file2 > temp2
$ comm -12 temp1 temp2
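
     With some small, made-up files (this editor's data) the
effect can be seen directly; both forms print the lines the
two files have in common:

```shell
printf 'banana\napple\ncherry\n' > file1   # made-up contents
printf 'cherry\ndate\napple\n'   > file2
sort file1 > temp1
sort file2 > temp2
comm -12 temp1 temp2     # prints: apple, then cherry
rm -f file1 file2 temp1 temp2
```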


     Or, why not "write out the list of all group  name  and
gid  pairs which have a corresponding user name and uid pair
sorted by name  and  then  by  number  into  name_order  and
number_order respectively"?

$ comm -12 `1 'cut -f1,3 -d: /etc/passwd | sort'`
  `1 'cut -f1,3 -d: /etc/group | sort'` |
  tee `0 'sort +0 -t: > name_order'` |
  sort +1n -t: > number_order
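
     Written out with temporary files, the same dataflow
becomes a plain sequence.  In this editor's rendering the old
"sort +0" and "sort +1n" keys are also spelled in the newer
"-k" syntax, which names the same fields:

```shell
# the same task, step by step, at the cost of three temporary files
cut -f1,3 -d: /etc/passwd | sort > temp_p   # name:uid pairs
cut -f1,3 -d: /etc/group  | sort > temp_g   # name:gid pairs
comm -12 temp_p temp_g > temp_c             # pairs present in both
sort -t: -k1,1  temp_c > name_order         # "sort +0 -t:"  in old syntax
sort -t: -k2,2n temp_c > number_order       # "sort +1n -t:" in old syntax
rm -f temp_p temp_g temp_c
```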


     As the reader can see, this eliminates the need for
temporary files, but results in rather hard-to-read code.

The Fork Off Approach

     In order to solve the problem with more style,  a  pro-
gram  named  Fork Off has been constructed. The synopsis for
calling Fork Off is:

$ fkoff script_file


     The executable program is called fkoff for simplicity.
A script describing the dataflow network must be supplied in
a file.  This does not contradict the dataflow concept, in
which temporary data files are avoided: the script describes
the network, while the data itself flows through pipes.  A
Fork Off call may be used like any UNIX program call (using
the exec system call).  It might be part of a pipeline, or it
might be referenced in another fkoff script file (this does
not make Fork Off recursive).

Dataflow Programming with Fork Off

     In traditional, sequential programming, the main  model
for  describing  programs  is  the control flow diagram. The
abstract notion of "control" starts at the  top,  circulates
through the program, and ends at the bottom of the diagram.

     In dataflow programming, a data flow  diagram  is  used
instead.   Here, all "boxes" represent processes, which exe-
cute in parallel.  Arrows between the  boxes  describe  data
channels or pipes.  Surrounding the entire diagram is a box,
which defines the "system".  Arrows between boxes and the
surrounding box describe external communication.

     In the Fork Off language,  each  process  has  its  own
name.   The  surrounding  box  is  represented by a virtual,
predefined process called "external".  Each arrow (pipe) has
a  number  at  each  end.  This  is  the  number of the file
descriptor used to write or read the pipe.

The Fork Off Language

     In order to create a certain "network" of processes and
pipes,  a  language  to  describe  the  network  has  to  be
designed.  Of course, a graphic language would be desirable,
but  this would take too much effort to parse and be "incom-
patible" with the rest of the character-based  UNIX  system.
The  language used here contains one line of "code" for each
process definition and one for each communication link (pipe
or file).

     The full syntax of the Fork Off language  is  described
below.   No  LEX or YACC is needed to parse it.  "n" and "m"
are decimal numbers ranging from zero to the constant _NFILE
(normally  approx.  twenty).  Efforts have been made to make
the syntax resemble the Bourne shell  input/output  redirec-
tion syntax.

# ...comment...         Line ignored.
name ::= process_call   Define and name a process.
name1 n | m name2       Create a pipe from file descriptor n
                        of process name1 to file descriptor m
                        of process name2.
name1 n >& -            File descriptor n of process name1
                        is closed.
name1 n <& -            Same.
name1 n >& m name2      Writes to file descriptor n of
                        process name1 mix with writes to
                        file descriptor m of process name2.
name1 n <& m name2      Same for read operations.
name1 n > file          Save output from file descriptor n
                        of process name1 to the file.
name1 n >> file         Same, but append to file.
name1 n < file          Read data for file descriptor n
                        of process name1 from the file.


Examples

     First, we shall  consider  a  one-dimensional  pipeline
implemented with Fork Off.

The Bourne shell command line:

$ filter1 -opt | filter2 -opt

is equivalent to:

$ fkoff myscript

with myscript looking like:

# simple linear script
f1 ::= filter1 -opt
f2 ::= filter2 -opt
external 0 | 0 f1
f1 1 | 0 f2
f2 1 | 1 external


     The second example shows that Fork Off dataflow networks
may be constructed using other, similar networks.  Suppose we
want to use the filter defined in our first example to filter
the error messages from a preprocessor to a compiler before
they are saved into a file.

# Filter error messages from preprocessor.
pp ::= /bin/preprocessor -flags
cc ::= /bin/compiler -flags
ff ::= fkoff myscript
external 0 | 0 pp
pp 1 | 0 cc
pp 2 | 0 ff
ff 1 > file.error


     The third example shows the concept of a dataflow loop.
Imagine we have two chess-playing programs called chessmaster
and masterchess.  Chessmaster and masterchess are implemented
as filters, reading moves from "stdin" (file descriptor zero)
and writing moves to "stdout" (file descriptor one).  The
moves are machine-readable, which is not user-friendly.
Chessmaster and masterchess each take one argument: a flag
telling the program whether to use the black or the white
pieces.  White starts.

     The Fork Off script is shown below. It uses no external
communication.  All moves are stored into two files.

# Fork Off script to compare chess programs.
# Process definitions:
black ::= chessmaster -black
blackdog ::= tee blackfile
white ::= masterchess -white
whitedog ::= tee whitefile
# Description of interprocess communication:
black 1 | 0 blackdog
blackdog 1 | 0 white
white 1 | 0 whitedog
whitedog 1 | 0 black


     These three examples use standard filters.  To really
explore the power of Fork Off, several new programs using
multiple inputs and outputs would have to be written.

Current Project Status

     Fork Off has developed step by step since the early fall
of 1986.  It is programmed in the C language on a PDP-11/70
running 2.9BSD UNIX.  Early kludges (incorporating only
processes and pipes) have worked.  Creeping featurism
(including the syntax described above and reforming the
internal representation of the dataflow network) has delayed
the project, however, and there is not yet a complete working
version.  When a complete version works, the intention is to
post it to USEnet.  Until then, the author would greatly
appreciate any comments or suggestions on this paper.  The
interested reader may also acquire a copy of the existing,
non-working kludge called Fork Off.

Personal Information

     The author of this paper and the programmer of Fork Off,
Lars Aronsson, is a twenty-one year old undergraduate student
of Computer Science at Linkoping University, Sweden.  Until
mid June 1987 he can be reached through any of the addresses
below.  Between mid June 1987 and mid April 1988, no
communication except through the last (Orebro Snail-Mail)
address can be guaranteed.  During that time, the author will
be trained to defend his King and his Fatherland.

Internet, UUCP, Arpa:   l-aron@obelix.liu.se

Snail-Mail:             Lars Aronsson
                        Rydsvagen 256 A:10
                        S-582 48  Linkoping, Sweden

Snail-Mail:             Lars Aronsson,
                        Gustavsgatan 26
                        S-703 55  Orebro, Sweden