renga@beaver.cs.washington.edu (Renganathan Sundararajan) (11/07/88)
*************************************** Long ***************************************

Threads, Lightweight Processes.
===============================
(A Summary)

This is a summary of the e-mail messages that I received on this subject.
In the following summary, I do not attempt to attribute the contributions
individually.

My thanks to:
    peter%sugar@uunet.UU.NET (Peter da Silva)
    marzullo%cs.cornell.edu (Keith Marzullo)
    sjs%ctt.bellcore.com (Stan Switzer)
    mudd-j%hickory.cis.ohio-state.edu (John R. Mudd)
    ariel%lznh@att.att.com (Ariel Aloni)
    tim%crackle@amd.com (Tim Olson)
    gray%Pescadero.stanford.edu (Cary Gray)
    retrac%rice.edu (John Carter)
    peter%ficc@uunet.uu.net (Peter da Silva)
    campbell%redsox.bsw.com (Larry Campbell)
    mason%pescadero.Stanford.EDU (Tony Mason)
    irvin%cwi.nl%mcvax%mcvax%uunet.UU.NET (Irvin Shizgal)
    robbin@MoliEnergy.BC.CA (Robbin W. Johnson)
    jesup%cbmvax.cbm.commodore.com (Randell Jesup)
    johnl@ima.isc.com (John Levine)
    fouts@lemming.nas.nasa.gov (Marty Fouts)
    schwan%wayward@gatech.edu (Karsten Schwan)
=======================================================================

The words 'process' and 'task' have been used to mean different things in
different operating systems.  Adding to the confusion is the use of these
terms in languages such as Ada and Modula-2, which allow concurrent
execution of 'processes', 'tasks', etc. through explicit or implicit
creation.

Historically, a process was thought of as the activation of a sequential
program.  Before virtual memory and caches came into being, a process had
just an address space (which included code, heap, and stack space) and a
set of registers.  Hence, context switching involved saving and restoring
registers (including base address registers), condition flags, and OS
table state, if any.  With the introduction of virtual memory and caches,
it became necessary to save the MMU state as well, and to flush the
cache, mark its entries invalid, or otherwise deal with them.
Because switching between such processes involves so much work, they are
called 'Heavy Weight Processes' (HWPs).

If a programming language allows the creation and concurrent execution
of 'processes', 'tasks', or whatever, is it necessary to have an OS
process represent each of these?  Obviously not.  These typically share
the code space and the data space, and may have private heap and stack
spaces.  In other words, we have multiple threads of control executing
within a single address space.  These resemble procedure activations
(each activation has its own stack frame, and heap space if necessary)
and are quite unlike the HWPs described in the previous paragraph.  We
now have 'LWPs' or 'Threads'.

Judging from the responses I got, threads and light-weight processes
denote the same thing.  They refer to a single address space with
multiple execution contexts.  A LWP or thread consists of the PC,
registers, and flags: everything that needs to be saved to service an
interrupt.  Threads/LWPs share a virtual address context with other
threads, or, in the case of some LWP schemes, they execute in the kernel
and so just use the kernel's context.  Some people use 'LWPs' to refer
to things that run only in the kernel and 'threads' to refer to things
that run in user mode, but the terminology is far from standardized.

Threads/LWPs serve a number of purposes:

- Cheap creation, deletion, and context switching.
- Faster inter-process communication, since the address space is shared:
  fast, efficient communication among a set of cooperating processes
  where the processes aren't necessarily 'parallel' in nature.
- Capturing I/O/computation concurrency that would otherwise require a
  context switch, often when servicing incoming network requests.

Even the term 'thread' has been used in two different ways:

- a thread of control within a *** single *** address space, and
- a ** distributed ** thread of control executing in ** different **
  address spaces.
The common feature between these two usages is the existence of a stack.
In the former case, the stack exists in a single address space and RPCs
are implemented in terms of messages.  In the latter case, the stack may
span address spaces (and machines) and the *thread* executes the RPCs.

REFERENCES:
===========

Rashid, R.F., ``Threads of a New System,'' Unix Review, Volume 4,
Number 8, August 1986, pp. 37-49.

Cheriton, D.R. and W. Zwaenepoel, ``Distributed Process Groups in the V
Kernel,'' ACM Transactions on Computer Systems, Volume 3, Number 2,
May 1985, pp. 77-107.

Abu-Sufah, W., H.E. Husmann, and D.J. Kuck, ``On Input/Output Speedup in
Tightly Coupled Multiprocessors,'' IEEE Transactions on Computers,
Volume C-35, June 1986, pp. 520-30.

Baron, R.V., D. Black, W. Bolosky, J. Chew, D.B. Golub, R. Rashid,
A. Tevanian, Jr., and M.W. Young, ``MACH Kernel Interface Manual,''
Carnegie-Mellon University Department of Computer Science, October 1986.

Black, A.P., ``Supporting Distributed Applications: Experience with
Eden,'' Proceedings of the Tenth ACM Symposium on Operating System
Principles, December 1985.

Brooks, E.D. III, ``A Multitasking Kernel for the C and FORTRAN
Programming Languages,'' Lawrence Livermore Laboratory, University of
California, September 1984.

Clark, D.D., ``The Structuring of Systems Using Upcalls,'' Proceedings
of the Tenth ACM Symposium on Operating System Principles, December
1985, pp. 171-80.

Jones, M.B., R.F. Rashid, and M.R. Thompson, ``Matchmaker: An Interface
Specification Language for Distributed Processing,'' Proceedings of the
12th Annual Symposium on Principles of Programming Languages, January
1985, pp. 225-35.

Kameda, H., ``Effects of Job Loading Policies for Multiprogramming
Systems in Processing a Job Stream,'' ACM Transactions on Computer
Systems, Volume 4, February 1986, pp. 71-106.

Kepecs, J. and M. Solomon, ``A Simplified Operating System for
Distributed Applications,'' 3rd PODC Conference Proceedings, 1984.

Rovner, P., R. Levin, and J. Wick, ``On Extending Modula-2 for Building
Large, Integrated Systems,'' DEC Systems Research Center, January 1985,
pp. 16-20.

Schwan, K., T. Bihari, B.W. Weide, and G. Taulbee, ``High-Performance
Operating System Primitives for Robotics and Real-Time Control
Systems,'' The Ohio State University Department of Computer and
Information Science, July 1986, report number OSU-CISRC-86TR1KS (to
appear in ACM Transactions on Computer Systems, 1987).

Anderson, T.E., E.D. Lazowska, and H.M. Levy, ``The Performance
Implications of Thread Management Alternatives for Shared-Memory
Multiprocessors,'' Department of Computer Science FR-35, University of
Washington, Seattle, WA 98195, Tech Report 88-09-04.

(Note: Some of the above do not directly relate to the topic under
discussion.)

====================== END of SUMMARY =================================

The following is an extract of comments/opinions from the messages
posted on os.research and the mail I got.  Although I edited the
comments/opinions, I have tried to preserve the context as much as
possible.  There may be a few typos.

On the usefulness of Threads.
=============================

>irvin@cwi.nl (Irvin Shizgal) writes:

There are several reasons that light-weight processes are useful
("thread" here refers to a light-weight process).

(1) Light-weight processes share an address space, and thus switching
context between them is simpler and faster (the address map doesn't
change; generally a light-weight context consists of only the program
counter and registers).  If non-blocking system calls are available,
they can even be implemented at user level.

(2) Since light-weight processes share an address space, their effective
communication bandwidth is high.  The heavy-weight process thus
encapsulates a unit which is assigned to a single (possibly
multi-processor) machine.
On a multi-processor, one can really run threads concurrently.

(3) It is sometimes advantageous to design a program by splitting it up
into concurrent tasks.  In many such cases, code (and data structures)
can be simplified by letting the operating system do the multiplexing.
If multi-tasking is to be used as a fundamental programming abstraction,
there must be an efficient implementation.

From: peter%sugar@uunet.UU.NET (Peter da Silva)

One thing that occurs to me is that threads basically refine the
question of what the basic unit of resource allocation is, by allowing
multiple execution streams to share a block of other resources.  Some of
these have been available from the beginning.  Open files, for example,
though the prevalence of the standard I/O library has limited the
usefulness of this feature... you may recall that in the version 6 shell
this was how control structures were implemented.

From: mcvax!cwi.nl!jack@uunet.UU.NET (Jack Jansen)

LWPs are a very nice mechanism to program with, much nicer than
non-blocking I/O with completion interrupts or polling, etc.  Just look
at your average Unix utility that uses SIGURG or select() and imagine
what it would look like when written as a set of cooperating LWPs using
blocking primitives, and you'll get the point.

On the origin of Threads, LWPs.
===============================

>irvin@cwi.nl (Irvin Shizgal) writes:

I don't know how far back the notion of light-weight processes goes.  I
do remember hearing about it back in the mid-70's in distributed/portable
OS research at the University of Waterloo.

From: perry@apollo.COM (Jim Perry)

"Threads" or multitasking is a very old technology, a natural outgrowth
of the fact that I/O is too slow to be done synchronously.  In more
recent operating systems (well, in Unix, which seems to be what everyone
means by OS), I/O scheduling is done among processes: when a process
initiates an I/O operation, it is suspended until that operation
completes.
Other systems provide mechanisms that allow the process to continue past
the I/O call, with some means of announcing when pending operations
complete.  A program using such a facility is multi-threaded in the
simplest sense, and a more elegant model is easily constructed from such
a system.

The DTSS system (the Dartmouth Time-Sharing System) relies heavily on
this model.  The command processor is a single program which holds open
all user terminals, with a task (thread, LWP) to handle each.  A few
years ago the OS, the PL/I compiler, and the runtime system were
modified to provide an excellent multitasking model that's cleaner and
more usable than anything I've seen since (to be fair, if you get to
control the OS, the language, and the runtime system, and are
unconcerned about portability, you have a lot more leeway than, e.g.,
trying to make something that does all this in portable C).

From: marzullo%cs.cornell.edu (Keith Marzullo)

For a historical perspective, look at the paper "On the Duality of
Operating System Structure" by Roger Needham and Hugh Lauer.  It is in
the ACM Operating System Review (the SIG newsletter) in the mid to late
seventies.  This was an extremely important paper for the time, as it
brought to a head the debate on light- vs. heavy-weight processes.
Several now reasonably famous people got their start disagreeing with
this paper.

Needham and Lauer's argument is superficially correct, but not correct
at a deeper level.  The best argument I've heard about it (from Loretta
Reid, now at DEC) is that the client-server model is a pervasive one,
and N&L showed how to recognize it and transform a HWP implementation
into a LWP one and vice versa.

From: jrg@Apple.COM (John R. Galloway)

Data General has had multiple tasks (threads) running around in the same
address space at least since the mid-70's.  I think such support existed
in the early versions of RDOS (just foreground and background processes)
as well, and that would be in the early 70's.
On who schedules the LWP for execution: the OS or the parent HWP?
=================================================================

From: dpm@cs.cmu.edu (David Maynard)

Lightweights are often "scheduled" by code that is part of their parent,
allowing the parent to schedule them intelligently based on known
interaction patterns.  The OS scheduler only deals with real processes.

PROBLEM: The classic example is page fault handling.  Suppose one of the
lightweight processes faults.  As far as the OS is concerned, the entire
process has faulted, so none of the lightweights can proceed until the
page is brought in.

On the problems of using LWPs
=============================

From: dpm@cs.cmu.edu (David Maynard)

By making the LWPs share an address space, you give up a level of fault
containment and take on the responsibility of coordinating access to
shared resources.  This isn't necessarily bad, but you have to keep it
in mind.  See above for page fault handling.

>sjs@ctt.bellcore.com (Stan Switzer) writes:

What happens when one lightweight process "blocks" for I/O?  Well, one
approach is to write I/O surrogate routines which invoke non-blocking
cognates of "read" and "write", mark the LWP as waiting for I/O, and
call the LWP dispatcher.  Note that this is still non-preemptive, at
least with the understanding that all I/O operations potentially release
control.  (This is essentially what the SunOS 4.0 LWP scheme does.)

But suppose that a LWP page faults.  This too is effectively a blocking
I/O (and indeed, in MACH, if I understand correctly, users can provide
their own handling for page faults), but things get complicated here.
If we don't block the whole process, then we effectively have a
preemptive scheduler, because we can lose control at any instruction or
data fetch.  At this point we might as well let the operating system in
on the game and let it schedule our LWPs.
Although the terminology is by no means standard, when the OS knows
about your LWPs, they are often called "threads."  But now your tasks
have to worry about synchronization amongst themselves using monitors,
semaphores, and suchlike, so in some sense our gains have been
ill-gotten: we lose some of the convenience we sought to gain (but we
gain concurrency, because multiple processors can execute the various
threads).

My feeling is that the OS itself should be nothing more than a processor
and memory multiplexor, leaving everything else up to user-level code.
(It needs to provide some form of IPC too.)  MACH, the V system, and
AMOEBA (I understand AMOEBA best; it's quite elegant) are all based on
this basic concept.  In this context, "threads" can also be thought of
as unbundling processor multiplexing from memory multiplexing.

In the final analysis, there is some tradeoff among modularity (and its
close relatives, reusability and pluggability), convenience, efficiency,
safety (especially regarding shared memory usage), concurrency
(especially in a multi-processor environment), and elegance.

On the hardware support for threads, LWPs
=========================================

From: riedl@purdue.edu (John T. Riedl)

Most architectures already provide instructions for efficiently saving
processor state to improve procedure call performance.  Some machines
(like Suns) provide hardware support for address space context
switching, in the form of multiple hardware address space contexts.
This is the reason for the performance 'knee' on Suns when more than a
certain number of contexts are in the run queue (7 on a 3/50; 15 on a
Sun 4, I think).  In one way threads help avoid this knee, since a
group of related asynchronous activities can be organized as a single
address space.

On systems that support Threads/LWPs, and on hardware support:
==============================================================

Both MACH and V support threads.
The XTMOS (the XTM OS of Cogent Research, Salem (I think), Oregon)
supports threads along with the LINDA memory model on their "desktop
supercomputer."  The XTM machine provides micro-coded support for the
creation and scheduling of threads.

From: irvin@cwi.nl (Irvin Shizgal) [AMOEBA project, CWI, Amsterdam]

Other operating systems using threads are Amoeba, V (David Cheriton at
Stanford), and even OS/2 (the latest Microsoft/IBM release).

From: johnl@ima.isc.com (John Levine)

There have been many implementations of threads, and they have been
reinvented in many places.  IBM's OS/360 MVT, the ancestor of MVS, had
them in the late 1960's.  OS/2 has them.

From: schwan%wayward@gatech.edu (Karsten Schwan)

There are now thread packages on the Encore Multimax (Brown built one)
and on the BBN Butterfly running Chrysalis (we built it at Ohio State),
and we are working to add some kernel thread support to the Mach
version of the BBN Butterfly, which initially does not support threads.
In parallel to this work, real-time operating system folks have
recognized the importance of low-overhead executable units for some
time; look at some real-time OS papers in ACM TOCS or in IEEE TC or
TSE.

Cheers, Karsten Schwan, Assoc. Prof., School of ICS, Georgia Tech.

From: marzullo@cs.cornell.edu (Keith Marzullo)

DEC SRC has a system called Topaz, on the Firefly multiprocessor.
Almost every research group using Unix has added a version of threads
to their system.

From: fouts@lemming.nas.nasa.gov (Marty Fouts)

I offer, only half humorously, TSO running under OS/360 as the first
example of "multiple THREADS of control executing within a single
address space."  Consider that TSO ran as a partition and invoked its
own round-robin scheduler to schedule its quantum among its various
terminal users, all of which were (sort of) within the same address
space.  If TSO doesn't qualify, IBM implementations of APL workspaces
might...
From: modcomp!nigel@uunet.UU.NET (Nigel Gamble)

The INMOS transputer is a microprocessor which supports the creation
and scheduling of processes in hardware.  See Byte, November 1988.

From: mason@pescadero.Stanford.EDU (Tony Mason)

I've not seen a machine that gives you explicit LWP support, but quick
register saves, register windows, and other such mechanisms can be used
to make LWPs quite fast.

On the coexistence of HWPs and LWPs
===================================

Although many LWPs existing in the context of one HWP is what one would
expect, it is not uncommon for LWPs and HWPs to coexist without any
relationship between them.

>John Carter (retrac@rice.edu) writes:

In V there is the notion of a "team", which is a HWP (each team has a
different virtual address space), as well as a "process", which is a
LWP.  Teams are made up of one or more processes.

>karl@triceratops.cis.ohio-state.edu (Karl Kleinpaste) writes:

LWPs can exist independent of any other processes.  Mine did; they were
an extra process-context level of interrupt processing.  LWPs tend to be
used for very specialized functions (mine were deeply concerned with tty
I/O), whereas typical processes tend to be intended for a whole bunch of
things, which results in heavier usage of comparatively expensive
resources like memory or peripheral I/O.  My LWPs were spawned at boot
time by the kernel, just after the creation of init, and they then sat
waiting for tty events.  They were not attached to any other process in
any way, and they did not interfere with normal HWPs in any way.

>Peter da Silva (peter%ficc@uunet.uu.net) writes:

AmigaOS is an example of an environment with no HWPs.  Mach is an
example of an environment where LWPs are within HWPs.  I believe Minix
is an example of an environment where the kernel contains LWPs but all
user tasks are HWPs.  Certainly TUNIS is.

>Larry Campbell (campbell@bsw.com) writes:

I believe that OS/2 and Mach both support threads as well as traditional
processes.
You can implement non-preemptive threads yourself pretty easily under
most operating systems if you don't mind writing a bit of assembler.

Renganathan
renga@cs.uoregon.edu