l-aron@obelix.liu.se (Lars Aronsson) (05/29/87)
Fork Off -- A UNIX Tool for Parallel Processing Lars Aronsson ABSTRACT UNIX supports parallel processing and inter- process communication through system calls like fork and pipe. The ordinary shells (command interpreters) and, hence, most existing programs restricts parallel processing to next-to- uniprogramming level concepts, supporting one- dimensional "pipelines". Implementing more complex dataflow structures using the system calls is dif- ficult, and a solution with "temporary files" is often chosen. This paper introduces a programming language for these, more complex, dataflow applications. Fork Off is an interpreter for this programming language. Low Level UNIX In UNIX, each running processes uses a certain number of file descriptors, used for input and output operations. These file descriptors may be connected to files or pipes. A pipe is a channel for interprocess communication. Each pipe has got two ends, one of which is allowed for write opera- tions, the other for read. The ends should be connected to different processes. The UNIX operating system provides the user with the following system calls to manipulate the input/output environment. fork Create a copy of the running process and continue to execute both copies. exec Exchange the running process with one running a certain program. pipe Create a pipe. Both ends are connected to different file descriptors of the calling process. dup Move the file referencies or pipe ends from one file descriptors to another. open Connect a file to a file descriptor. close Release a certain file descriptor. Using these system calls, virtually any configuration of parallel, intercommunicating processes can be set up. Traditional Shells The traditional shells (e.g. the Bourne shell), allow for so called pipelines. A pipeline is a one-dimensional string of processes, all of which uses only two file descriptors, one for input (called stdin, by convention hav- ing the number zero) and one for output (stdout, number one). This resembles the old FORTRAN tradition of letting unit five default to the card reader and unit six to the line printer. Sometimes, also special file descriptor is used for error messages (stderr, number two). Error messages from all process, then are mixed and printed on the termi- nal. The first process in the string reads its input (if any) from the terminal or from a file. The output of the first process is fed as input to the second process and so on. The output of the last process is printed on the termi- nal, printed on a line printer or saved into a file. In Bourne shell, the syntax to command this kind of behaviour is: $ proc1 | proc2 | proc3 | ... | procN This next-to-uniprogramming level of parallel process- ing (it can be simulated on any uniprogrammed system using only one temporary file) has been found very useful, and resulted in a lot of programs called "filters", which only read input and write output. The algorithm (no AT&T code included!) for creating such a pipeline is rather simple: { for (J=N;J>1;J--) { pipe; if (fork) { dup; exec(procJ); } else close; /* pipe ends that won't be used */ } dup; exec(proc1); } Problems Arise Although filters and pipelines solve lots of problems, they don't solve all. The common solution when programs require more than one input and/or more than one output still is to use temporary files. Examples of such tasks are merge sorting, concatenation, and using output of one pro- cess as input for different operations. Examples of solu- tions are: $ sort -m file1 file2 Merge two sorted files. $ cat file1 file2 Concatenate two files. $ tee file Copy input to output and also save it into the file. The reasons for these uniprogramming-fasioned solutions lie not in the UNIX kernel, but in the shells and the con- vention of having one "stdin" and one "stdout". A Non-Fork Off Solution The following article has been posted to USEnet. Subject: 2dpipe - implement 2 dimensional sh(1) pipes Newsgroups: net.sources Date: 23 Oct 86 05:53:23 MET Message-ID: <188@metro.oz> Reply-To: rossc@metro.ucc.su.oz (Ross Cartlidge) Distribution: world Organization: Uni of Sydney Computing Centre, Sydney AUSTRALIA This is, no doubt, a solution to the problem, provided your UNIX systems supports "named pipes". The article inculdes three almost identical Bourne shell scripts, named "0", "1", and "2", each of which accepts a "pipeline call" as an argument, then forks and executes the pipeline, creats a pipe to the file descriptor of that pipeline having the same number as is the name of the shell script and, at last, prints the name of the other end of the created pipe. This calls for an example. Mr. Cartlidge suggests this one, used to print the common lines between two unsorted files, file1 and file2: $ comm -12 `1 sort file1` `1 sort file2` This is equivalent to: $ sort file1 > temp1 $ sort file2 > temp2 $ comm -12 temp1 temp2 Or, why not "write out the list of all group name and gid pairs which have a corresponding user name and uid pair sorted by name and then by number into name_order and number_order respectively"? $ comm -12 `1 'cut -f1,3 -d: /etc/passwd | sort'` `1 'cut -f1,3 -d: /etc/group | sort'` | tee `0 'sort +0 -t: > name_order'` | sort +1n -t: > number_order As the reader can see, this eliminates the need for temporary files, but results in a rather hard-to-read code. The Fork Off Approach In order to solve the problem with more style, a pro- gram named Fork Off has been constructed. The synopsis for calling Fork Off is: $ fkoff script_file The executable program is called fkoff for means of simplicity. A script, describing the dataflow network, must be supplied using a file. This is not contradictory to the dataflow concept, where files are avoided. The Fork Off call may be used as any UNIX program call (using the exec system call). It might be part of a pipeline or it might be referenced in a fkoff script file (this does not make it recursive). Dataflow Programming with Fork Off In traditional, sequential programming, the main model for describing programs is the control flow diagram. The abstract notion of "control" starts at the top, circulates through the program, and ends at the bottom of the diagram. In dataflow programming, a data flow diagram is used instead. Here, all "boxes" represent processes, which exe- cute in parallel. Arrows between the boxes describe data channels or pipes. Surruounding the entire diagram is a box, which defines the "system". Arrows between boxes and the surrounding box describe external communication. In the Fork Off language, each process has its own name. The surrounding box is represented by a virtual, predefined process called "external". Each arrow (pipe) has a number at each end. This is the number of the file descriptor used to write or read the pipe. The Fork Off Language In order to create a certain "network" of processes and pipes, a language to describe the network has to be designed. Of course, a graphic language would be desirable, but this would take too much effort to parse and be "incom- patible" with the rest of the character-based UNIX system. The language used here contains one line of "code" for each process definition and one for each communication link (pipe or file). The full syntax of the Fork Off language is described below. No LEX or YACC is needed to parse it. "n" and "m" are decimal numbers ranging from zero to the constant _NFILE (normally approx. twenty). Efforts have been made to make the syntax resemble the Bourne shell input/output redirec- tion syntax. # ...comment... Line ignored. name ::= process_call Define and name a process. name1 n | m name2 Create a pipe from file descriptor n of process name1 to file descriptor m of process name2. name1 n >& - File descriptor n of process name1 is closed. name1 n <& - Same. name1 n >& m name2 Writes to file descriptor n of process name1 mix with writes to file descriptor m of process name2. name1 n <& m name2 Same for read operations. name1 n > file Save output from file descriptor n of process name1 to the file. name1 n >> file Same, but append to file. name1 n < file Read data for file descriptor n of process name1 from the file. Examples First, we shall consider a one-dimensional pipeline implemented with Fork Off. The Bourne shell command line: $ filter1 -opt | filter2 -opt is equivalent to: $ fkoff myscript with myscript looking like: # simple linear script f1 ::= filter1 -opt f2 ::= filter2 -opt external 0 | 0 f1 f1 1 | 0 f2 f2 1 | 1 external As shown in the second example, Fork Off dataflow networks may be constructed using other similar networks. Suppose we want to use the filter defined in our first exam- ple to filter the error messages from a preprocessor to a compiler before they are saved into a file. # Filter error messages from preprocessor. pp ::= /bin/preprocessor -flags cc ::= /bin/compiler -flags ff ::= fkoff myscript external 0 | 0 pp pp 1 | 0 cc pp 2 | 0 ff ff 1 > file.error The third example shows the concept of a dataflow loop. Imagine we have two chess-playing programs called chessmas- ter and masterchess. Chessmaster and masterchess are imple- mented as a filters, reading moves from "stdin" (file descriptor zero) and writing moves to "stdout" (file descriptor one). Moves are to be machine-readable. This is not user-friendly. Chessmaster and masterchess each takes one argument; a flag telling it whether to use black or white pieces. White starts. The Fork Off script is shown below. It uses no external communication. All moves are stored into two files. # Fork Off script to compare chess programs. # Process definitions: black ::= chessmaster -black blackdog ::= tee blackfile white ::= masterchess -white whitedog ::= tee whitefile # Description of interprocess communication: black 1 | 0 blackdog blackdog 1 | 0 white white 1 | 0 whitedog whitedog 1 | 0 black These three examples uses standard filters. To really explore the power of Fork Off, several new programs using multiple inputs and outputs would have to be written. Current Project Status Fork Off has developed step by step since the early fall of 1986. It is programmed in the C language on an PDP-11/70 running 2.9BSD UNIX. Early kludges (incorporating only processes and pipes) have worked. Creeping featuranism (including the syntax described above and reforming the internal representation of the dataflow network) has delayed the project, however, and there is not yet a complete version working. When a complete version will work, the intension is to post it to USEnet. Until then, the author would greatly appreciate any comments or suggestions on this paper. The interrested reader may also acquire a copy of the existing, non-working kludge called Fork Off. Personal Information The author of this paper and the programmer of Fork Off, Lars Aronsson, is a twenty-one year old, undergraduate student of Computer Science at the Linkoping University, Sweden. Until mid June 1987 he can be reached through any of the addresses below. Between mid June 1987 and mid April 1988, no communications but through the last (Orebro Snail- Mail) address can be guaranteed. During that time, the author will be trained to defend his King and his Father- land. Internet, UUCP, Arpa: l-aron@obelix.liu.se Snail-Mail: Lars Aronsson Rydsvagen 256 A:10 S-582 48 Linkoping, Sweden Snail-Mail: Lars Aronsson, Gustavsgatan 26 S-703 55 Orebro, Sweden