E1AR0002@SMUVM1.BITNET (Leff, Southern Methodist University) (05/03/87)
Listed below are entries (in the format used by refer(1)) for papers written by members of the Clouds project at Georgia Tech, which has been concerned with the design and implementation of a reliable, decentralized operating system prototype since late 1981. Copies of most of these reports may be obtained by writing to the following address:

Technical Reports Librarian
School of Information and Computer Science
Georgia Institute of Technology
Atlanta, GA 30332-0280

Please mention the technical report number.

----------

%A M. Ahamad
%A M. H. Ammar
%A J. Bernabeu
%A M. Y. A. Khalidi
%T A Multicast Scheme for Locating Objects in a Distributed Operating System
%R Technical Report GIT-ICS-87/01
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D January 1987
%K Clouds
%X Object-oriented distributed operating systems that provide location-independent access to objects must locate and invoke an object remotely when the object is not local and its location is not known. Commonly, a remote invocation is broadcast and each node performs a search to determine if the invoked object exists locally. When objects are not replicated, only a single node finds the object while the search performed by the other nodes is wasted computation.
%X In this paper, we present a scheme for distributing and locating objects that reduces this wasteful computation. We describe a set of protocols for object creation, invocation, deletion and migration. These protocols exploit the multicast capability of the underlying communication network and hence only a subset of the nodes receives a remote invocation. A mathematical model of the system using the proposed scheme is presented and analyzed in order to demonstrate how various parameters affect system performance.

%A M. Ahamad
%A P. Dasgupta
%A R. J. LeBlanc
%A C. T. Wilkes
%T Fault-Tolerant Computing in Object Based Distributed Operating Systems
%J Proceedings of the Sixth Symposium on Reliability in Distributed Software and Database Systems
%I IEEE Computer Society
%C Williamsburg, VA
%D March 1987
%P 115-125
%K Clouds replication naming PET
%X Replication of data has been used for enhancing its availability in the presence of failures in distributed systems. Data can be replicated with greater ease than generalized objects. We review some of the techniques used to replicate objects for resilience in distributed operating systems. We discuss the problems associated with the replication of objects and present a scheme of replicated actions and replicated objects, using a paradigm we call PETs (parallel execution threads).
%X The PET scheme not only exploits the high availability of replicated objects but also tolerates site failures that happen while an action is executing. We show how this scheme can be implemented in a distributed object based system, and use the Clouds operating system as an example testbed.

%A M. Ahamad
%A P. Dasgupta
%T Parallel Execution Threads: An Approach to Fault-Tolerant Actions
%R Technical Report GIT-ICS-87/16
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D March 1987
%K Clouds replication naming
%X A distributed system can support fault tolerant atomic actions by replicating data and computation at sites that have independent failure modes. We present a scheme called parallel execution threads (PET) that can be used to implement fault tolerant actions in an object-based distributed environment. This scheme tolerates existing as well as transient failures. The details of the PET scheme as well as the commit protocols used by it are described. We also consider the integration of the PET scheme in the \fIClouds\fP distributed operating system.

%A J. E. Allchin
%A M. S. McKendry
%T Object-Based Synchronization and Recovery
%R Technical Report GIT-ICS-82/15
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D September 1982
%X Using abstract data types and nested actions as system structuring tools can help create more robust systems. In pursuing the goal of creating an operating system using these tools, several interesting principles have been encountered. First, in this environment synchronization and recovery should be associated with each object. By associating synchronization with each object and by using the semantics of the object operations, it is possible to achieve higher concurrency. Binding recovery to objects permits efficient recovery techniques which might not be possible without the specific implementation knowledge available to the programmer of the object. Second, it is important to distinguish between the abstract behavior of an object and its implementation when analyzing concurrency. Third, using serializability for the abstract behavior of an object is sometimes undesirable or unnecessary. Whether an object provides serializability as the abstract behavior depends on the semantics of how the object is used. Examples of object types which motivate the principles are presented.

%A J. E. Allchin
%T A Suite of Algorithms for Maintaining Replicated Data Using Weak Correctness Conditions
%R Technical Report GIT-ICS-82/18
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D December 1982
%X A suite of decentralized algorithms for maintaining distributed replicated data is presented. The algorithms do not necessarily achieve serial consistency, but they are adequate for many simple data storage problems in operating systems and realtime systems.
Applications which appear well-suited to the suite include mail systems, naming servers, appointment calendars, certain types of file dictionaries, operating system load tables (e.g., routing), and device state in distributed process control systems. The algorithms are robust and are intuitively easy to understand. The algorithms assume an unreliable network and tolerate node failures, network partitions, lost, duplicate, and out-of-order messages. Both goals for replicating data--high availability and rapid response time--are met by the algorithms. The basic algorithms use resolution tables to state the outcome of information conflicts caused by concurrent actions or unreliable nodes and communication. Each algorithm is oriented toward different application requirements and provides a different degree of message traffic overhead and availability. The efficiency of the algorithms depends on the acceptability of weak correctness conditions in the applications. The correctness condition for one of the algorithms is formally defined and the algorithm is proved to be correct (with other proofs following in a straightforward manner from the framework presented). This algorithm has also been implemented.

%A J. E. Allchin
%A M. S. McKendry
%T Facilities for Supporting Atomicity in Operating Systems
%R Technical Report GIT-ICS-83/01
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D January 1983
%X One of the problems fundamental to distributed computing is maintaining the atomicity of a sequence of operations despite concurrent activity or system/application failures. \fIAtomic actions\fP have been used for this purpose in database systems and recently in programming languages. This paper introduces support for atomicity in the kernel of an operating system.
This support is not limited to managing just one type of data (\fIe.g.,\fP files) and could be used to ensure that any action (or task) be accomplished atomically on a set of user definable objects. The atomicity framework presented uses processes, actions, and objects. Requirements for atomicity are discussed and system primitives are defined which include the ability to create and terminate nested actions, control concurrency between actions, and recover from action aborts. The facilities presented provide system designers and programmers with the ability to control consistency requirements using whatever semantic knowledge is available. The atomicity thus attained is called \fIsemantic atomicity\fP. Unlike other work, we do not tightly bind processes to actions, thus allowing the facilities presented to be applicable to a wide class of systems (including applications where actions are supported by cooperating processes). One possible approach for integration of the facilities into a programming language is discussed related to the Clouds decentralized global operating system. The desirability for semantic atomicity is illustrated through a file directory system example. Use of the facilities to address the problem of actions supported by cooperating processes is also illustrated through an example.

%A J. E. Allchin
%T How to Shadow a Shadow
%R Technical Report GIT-ICS-83/05
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D February 1983
%X Several file and database systems have used a shadowing technique for recovery purposes on data files which are not concurrently accessed. Essentially there are two versions: the current version and a shadow version. Transactions manipulate only the current version. When a change is first made to a data page, a new page is allocated and the current version page directory is updated with the new page location.
The usual implementation is exceptionally efficient for small to medium-sized files because on transaction termination the only processing required is to determine which version should become the shadow; the other version is discarded. This paper discusses an efficient solution for using this approach with concurrent transactions. We present a technique for building not only single-level concurrent transactions, but nested transactions which may be concurrent as desired.

%A J. E. Allchin
%A M. S. McKendry
%T Support for Objects and Actions in Clouds: Status Report
%R Technical Report GIT-ICS-83/11
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D May 1983
%X This status report describes the current work of the \fBClouds\fP project at Georgia Tech. The Clouds project is studying techniques for construction of reliable computing systems in environments of distributed machines interconnected by local area networks. This report emphasises the functional requirements for architectural support. To support reliability, the architecture supports \fIobjects\fP and \fIactions\fP. Objects are instances of abstract data types. They provide a basis for building system components and for controlling the behaviour of a system when failures occur. Atomic actions are a means of dynamically grouping invocations of operations on objects into units of work that either complete in their entirety or do not have any effect whatsoever. Recovery mechanisms assist in maintaining this abstraction and synchronization mechanisms control interactions between actions.
%X The techniques described are oriented particularly toward highly dynamic applications in which the payoff for reliability is high and the loads placed on the system vary substantially. The architectural support may be tailored to particular applications, even within a single system.
It is possible, for example, to use `hot spares' (an on-line spare is maintained so that no time is lost upon failure), or slower but cheaper recovery in which computations are restarted after failures. Mechanisms are provided to make it possible to bring failed machines on-line and integrate them with the remainder of the system without disruption. To improve efficiency and limit the propagation of the effects of failures, mechanisms are provided to construct \fInested\fP actions, which function as components of larger actions while failing independently of their containing actions.

%A J. E. Allchin
%A M. S. McKendry
%T Synchronization and Recovery of Actions
%J Proceedings of the Second Symposium on Principles of Distributed Computing
%C Montreal
%I ACM SIGACT/SIGOPS
%D August 1983
%K Clouds
%X We introduce an approach to robust computation in distributed systems. This approach is the foundation for reliability in the \fBClouds\fP decentralized operating system. It is based on atomic actions operating on instances of abstract data types (objects). We present an event-based model of computation in which scheduling of responses to operation invocations is controlled by objects. We discuss an integrated strategy for synchronization \fIand\fP recovery which uses relationships between the abstract states of objects to track dependencies between actions. Serializability is defined in terms of the semantics of operations. This permits high concurrency to be obtained in non-serializable implementations without deviation from serializable abstract behavior. We define a class of schedulers that allows objects to make autonomous scheduling decisions. We present the use of non-serializable operation semantics. Finally, we discuss implementation of the model, including action synchronization, object operation ordering using action-based counting semaphores, and action recovery.

%A J. E. Allchin
%T An Architecture for Reliable Decentralized Systems
%R Ph.D. Diss.
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%O Also released as technical report GIT-ICS-83/23
%K thesis Clouds
%D September 1983
%X Constructing reliable programs for distributed processing systems is a very difficult task. \fIActions (transactions),\fP indivisible units of work, can simplify this process by providing uniform treatment of failures and preventing interference. These units of work can also be \fInested,\fP further controlling the scope of concurrency and failures. The atomicity provided by actions is an important tool for building reliable decentralized systems.
%X Actions manipulate pieces of data called \fIobjects\fP. Objects are usually treated as uninterpreted (bit strings). However, treating all objects in this fashion can result in unacceptable concurrency or high recovery overhead. In order to take advantage of actions in the widest possible context, it is necessary to consider operations on generalized objects (instances of abstract data types).
%X This report presents a general architectural model for reliable decentralized systems constructed using actions and objects. We include one prototype design created from the model. We also include practical algorithms necessary to implement this design.
%X We present a nested action management algorithm that, to our knowledge, is the first such algorithm to separate remote call semantics from action units. It also guarantees that \fIorphans\fP, computational parts of actions that will be eventually aborted, view consistent system states. We describe a design for synchronization and recovery that is oriented toward a programming-based view of objects (as well as simple data). We demonstrate the usefulness of our results through typical reliable programming problems.
%X Availability is another important dimension of distributed systems.
We describe a novel collection of simple, yet very robust, replication algorithms which can increase data availability. Algorithms from this suite can be customized to balance particular tradeoffs required in different application systems. The efficiency of the algorithms depends on the acceptability of weak consistency conditions in the applications. One member of this suite is formally modelled and proven correct. The others follow in a straightforward manner.

%A P. Dasgupta
%A R. LeBlanc
%A E. Spafford
%T The Clouds Project: Design and Implementation of a Fault-Tolerant Distributed Operating System
%R Technical Report GIT-ICS-85/29
%D 1985
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%X The \fIClouds\fP project at Georgia Tech was initiated to conduct research into failure resistant, efficient distributed architectures and operating systems. The project used state of the art techniques to design a distributed operating system kernel that can be supported on conventional, unreliable hardware, and be more reliable than the underlying electronics. Several approaches to the problem were considered, and after substantial research and construction effort, the current design emerged. This design unifies simplicity with efficiency and advanced concepts. The resulting system is quite versatile and can be adapted easily to suit most requirements of reliable distributed computing, in many different hardware configurations. The design is largely hardware independent and independent of system configuration.
%X This report describes the object and action based approach to building operating systems as incorporated in Clouds. We also describe in some detail the salient features of the system and the research directions that the project is expected to take.

%A P. Dasgupta
%A M. Morsi
%T An Object-Based Distributed Database System Supported on the Clouds Operating System
%R Technical Report GIT-ICS-86/07
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D 1986
%X Many database systems are built on top of conventional off-the-shelf operating systems. However, most operating systems lack the kind of support necessary to measure up to the consistency and recovery needs of database systems. This entails the need for modifying and adding some operating system services, and the creation of a database system service layer on top of the existing system to provide the database services. These services generally consist of synchronization routines (or concurrency control), crash recovery protocols, transaction commit, and rollback protocols.
%X At Georgia Tech, the \fIClouds\fP project is actively involved in building the \fIClouds\fP operating system that provides all of the above mentioned services in a distributed integrated environment. We are interested in investigating techniques to implement reliable distributed database systems on the \fIClouds\fP environment.
%X This paper presents a design for such a database system. We discuss the design of a distributed relational database system using the object paradigm. We discuss approaches to techniques that handle storage and handling of relations, relational operators and their implementations, concurrency control, failure and recovery, and transaction commit. We show how our design exploits the \fIClouds\fP environment and fits in with the services provided by \fIClouds\fP. The design of the database is conceptually quite simple, elegant and yet completely general, effective and efficient.
%X We also deal with replication of data in the database system. \fIClouds\fP does not effectively handle replication, as the location independence of data in the \fIClouds\fP system nearly does away with the need for replication in most applications.
However, in efficient implementations of database systems, there is a need for providing support for replicated data, and we present a scheme that provides quicker data access through replication.

%A P. Dasgupta
%T A Probe-Based Monitoring Scheme for an Object-Oriented Distributed Operating System
%J Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications
%C Portland, OR
%D Sept. 1986
%I ACM SIGPLAN
%O Also available as Technical Report GIT-ICS-86/05
%P 57-66
%K OOPSLA Clouds
%X Research in the field of concurrency control for database systems has given rise to many techniques of ensuring consistency in multiuser database systems. However, claims of superiority of proposed protocols have mainly been supported by intuitive reasoning. Simulation is one of the methods that can be used to demonstrate efficiency and practicality of the mechanisms when analytical methods are not easily available.
%X A simulator provides a concrete, and often the only practical way of judging the merits of different strategies under various conditions of operation. It can provide statistical insight into the different factors that affect performance, and the correlation of these factors with desirable features. It can thus be also used for fine-tuning existing protocols. Finally, it can be used as a verification tool and for enhancing our intuition about the issues involved.
%X We first describe very briefly some of the previous research dealing with database performance evaluation; it serves mainly to lead the interested reader to relevant literature. We then proceed to the description of our model of the distributed database operating system and what needs to be accounted for in a useful simulator. Later we go into substantial detail about the simulation and implementation techniques to provide the reader with information about the exact simulation environment, sufficient to judge the validity and the usefulness of the results obtained.
Finally, we describe the results of simulating several concurrency protocols and present our interpretations.

%A G. G. Kenley
%T An Action Management System for a Distributed Operating System
%R M.S. Thesis
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D 1986
%K thesis Clouds
%O Also released as technical report GIT-ICS-86/01
%X The goal of constructing reliable programs has led to the introduction of transaction (action) software into programming environments. The further goal of constructing reliable programs in a distributed environment has led to the extension of transaction systems to operate in a more decentralized environment.
%X We present the design of a transaction manager that is integrated within the kernel of a decentralized operating system: the \fBClouds\fP kernel. This decentralized action management system supports nested actions, action-based locking, and efficient facilities for supporting recovery. The recovery facilities have been designed to support a systems programming language which recognizes the concept of an action. We also present a search protocol to locate objects in this distributed environment.
%X \fIOrphans\fP, disjoint parts of actions that have aborted, are identified and eliminated using a time-driven orphan detection scheme which requires a clock synchronization protocol; we present the facilities necessary to generate a system-wide global clock to support that protocol.
%X The design goal of this implementation has been to achieve the performance necessary to support an experimental testbed which can serve as the basis for further work in the area of decentralized systems.

%A R. J. LeBlanc
%A C. T. Wilkes
%T Systems Programming with Objects and Actions
%J Proceedings of the Fifth International Conference on Distributed Computing Systems
%C Denver
%D July 1985
%O Also released, in expanded form, as technical report GIT-ICS-85/03
%K 5ICDCS Aeolus Clouds
%X The goal of the Clouds project at Georgia Tech is the implementation of a fault-tolerant distributed operating system based on the notions of objects and actions, which will provide an environment for the construction of reliable applications. As part of the Clouds project, we are designing and implementing a high-level language in which those levels of the Clouds system above the kernel level will be implemented. The Aeolus language provides access to the synchronization and recovery features of Clouds. It also provides a framework with which to study programming methodologies suitable for action-object systems such as Clouds.
%X This paper provides a brief introduction to the features of the Clouds system which provide support for programming of objects and actions, and how these features are made available in the Aeolus language. We also present an example Aeolus object from our initial studies in programming methodologies for Clouds which demonstrates the use of these features for programming recoverable objects.

%A C. Lin
%T The Design of a Distributed Debugger for Action-Based Object-Oriented Programs
%R Ph.D. Diss.
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D 1987
%K thesis Clouds
%O In progress

%A M. S. McKendry
%A J. E. Allchin
%A W. C. Thibault
%T Architecture for a Global Operating System
%J Proceedings IEEE Infocom
%C San Diego, CA
%D April 1983
%K Clouds
%X Global operating systems are suited to distributed, local-area network environments.
A decentralized global operating system can manage all resources globally, relying on functional requirements for resource allocations, rather than the relative physical locations of the resource allocation mechanism and the resource itself. Among the advantages of global operating systems are the ability to use idle resources and to control the environment as a single cohesive entity. This paper introduces an architectural approach to supporting decentralized global operating systems. The approach addresses the problem of managing distributed data by incorporating specialized data management facilities in the kernel. This data management is especially useful to the operating system itself. A capability-based access scheme provides flexible control of resources and autonomy. The approach is being utilized in the \fBClouds\fP operating system project at Georgia Tech.

%A M. S. McKendry
%T Clouds: A Fault-Tolerant Distributed Operating System
%J Distributed Processing Technical Committee Newsletter
%I IEEE
%D 1984
%O Also issued as Clouds Technical Memo #42

%A M. S. McKendry
%T Fault-Tolerant Scheduling Mechanisms
%R (Unpublished Technical Report)
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D May 1984
%O Draft only
%K FTJS Clouds

%A M. S. McKendry
%T Ordering Actions for Visibility
%J Transactions on Software Engineering
%I IEEE
%V 11
%N 6
%D June 1985
%O Also released as technical report GIT-ICS-84/05
%K Clouds
%X Several research projects are studying architectures for distributed computing that are founded on the notion of \fIatomic actions\fP operating on \fIobjects\fP (instances of abstract data types). The \fIClouds\fP project at Georgia Tech is evaluating this approach as the foundation for constructing distributed operating systems. Objects are not new to operating systems.
They provide substantial benefits in such dimensions as protection and synchronization, as well as their inherent organizational characteristics. This paper is concerned with synchronization to control ordering. Conventional approaches require substantial extension for the action environment. Typically, they are based on (or equivalent to) general semaphores. Semaphores take no account of the visibility requirements of actions however, and consequently they can allow an action to progress beyond the point at which its effects can be undone. Also, they do not account for failures.
%X This paper introduces examples to illustrate requirements for ordering mechanisms. A model of nested actions is then used as a basis for categorizing visibility requirements. These requirements go beyond those typical of database systems, because often the entities managed by operating systems cannot be recovered if an action fails. Several simplifications that apply to many operating system problems are discussed. Algorithms for controlling ordering are then presented, with examples of their use. We establish several expediencies that result from ordering requirements. In many situations, recovery for nested actions can be implemented with a single backup copy of each item, a single synchronization variable can be used to control blocking, and generalized locking is not required. These savings appear to be fundamental to making the object-action approach viable for operating system construction.

%A D. V. Pitts
%A E. H. Spafford
%T Notes on a Storage Manager for the Clouds Kernel
%R Technical Report GIT-ICS-85/02
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D 1985
%X The Clouds project is research directed towards producing a reliable distributed computing system. The initial goal of the project is to produce a kernel which provides a reliable environment with which a distributed operating system can be built.
The Clouds kernel consists of a set of replicated sub-kernels, each of which runs on a machine in the Clouds system. Each sub-kernel is responsible for the management of resources on its machine; the sub-kernel components communicate to provide the cooperation necessary to meld the various machines into one kernel.
%X This report documents a portion of that research, namely, the implementation of a kernel-level storage manager that supports reliability. The storage manager is a part of each sub-kernel and maintains the secondary storage residing at each machine in our distributed system. In addition to providing the usual data transfer services, the storage manager ensures that data being stored survives machine and system crashes, and that the secondary storage of a failed machine is recovered (made consistent) automatically when the machine is restarted. Since the storage manager is a part of the Clouds kernel, efficiency of operation is also a concern. We wish to reduce the overhead required to ensure the recoverability of secondary storage as much as possible, while adhering to the design goals associated with the storage manager.

%A D. V. Pitts
%T Storage Management for a Reliable Decentralized Operating System
%R Ph.D. Diss.
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D 1986
%K thesis Clouds
%O Also released as Technical Report GIT-ICS-86/21
%X Decentralization of computing systems has several attractions: performance enhancements due to increased parallelism; resource sharing; and the increased reliability and availability of data due to redundant copies of the data. Providing these characteristics in a decentralized system requires proper organization of the system. With respect to increasing the reliability of a system, one model which has proven successful is the object/action model, where tasks performed by the system are organized as sequences of atomic operations.
The system can determine which operations have been performed completely and so maintain the system in a consistent state.
%X This dissertation describes the design and a prototype implementation of a storage management system for an object-oriented, action-based decentralized kernel. The storage manager is responsible for providing reliable secondary storage structures. First, the dissertation shows how the object model is supported at the lowest levels in the kernel by the storage manager. It also describes how storage management facilities are integrated into the virtual memory management provided by the kernel to support the mapping of objects into virtual memory. All input and output to secondary storage is done via virtual memory management. This dissertation discusses the role of the storage management system in locating objects, and a technique intended to short circuit searches whenever possible by avoiding unnecessary secondary storage queries at each site. It also presents a series of algorithms which support two-phase commit of atomic actions and then argues that these algorithms do indeed provide consistent recovery of object data. These algorithms make use of virtual memory management information to provide recovery, and relieve the action management system of the maintenance of the stable storage.

%A D. V. Pitts
%T Object Memory and Storage Management in the Clouds Kernel
%R Technical Report GIT-ICS-87/15
%I School of Information and Computer Science, Georgia Institute of Technology
%C Atlanta, GA
%D March 1987
%X The Clouds kernel is a native layer distributed kernel supporting the Clouds operating system. Clouds is a distributed object based system, designed to support fault tolerance, location independence and an action/object programming environment.
%X Some of the key issues in supporting Clouds are the availability of Object Memory, Object Location and Object Recovery.
Object Memory provides a set of global, permanent, named address spaces for storing objects. The address spaces resemble conventional segmentation schemes, but are persistent and thus replace both the computational and storage systems used in conventional schemes with a more powerful paradigm. The Object Location system provides transparent object invocation mechanisms throughout the distributed environment. The Object Recovery system supports recoverable objects through shadowing and two-phase commit techniques to allow atomicity of actions. %X This paper describes, in brief, the key issues in the design and implementation of the Object Memory and Storage Management system, which provides all of the facilities mentioned. The implementation is operational and in use by the Clouds Project at Georgia Tech. %A E. H. Spafford %A M. S. McKendry %T Kernel Structures for Clouds %R Technical Report GIT-ICS-84/09 %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %D 1984 %X In the past few years, a great deal of research has been focused on the potential benefits of distributed systems. In particular, a distributed system offers the potential of a fault-tolerant computing environment. A distributed system also suggests increased computing power through the combination and application of resources. The presence of multiple machines, however, raises many questions relating to communication, consistency, reliability, configuration, and user interfaces, to name just a few. These questions are difficult to address, and that is perhaps the reason why so few attempts have been made to construct actual distributed systems. Interesting recent work in this area includes the \fIEden\fP project at the University of Washington, the \fIArgus\fP project at MIT, the \fIAccent\fP system at CMU, and the \fIISIS\fP project at Cornell. 
%X The \fIClouds\fP project is an approach to the construction and application of a distributed system that is intended to address these questions. We support the ``room full of computers'' view of distribution. In this view, the user sees a single resource, despite physical distinctions. In our research approach, this is achieved by constructing a highly-transparent multicomputer operating system with low-level support for maintaining consistent data items. A \fImulticomputer\fP or \fIcomputer cluster\fP is a system of many computers joined into one large system. The system's distribution is \fItransparent\fP to users and to most operating system components in the sense that the user is not aware of the nature or number of machines which compose the multicomputer. The user's data and processes may be distributed throughout the multicomputer system, or they all may be located on one processor -- there is no observable difference to the user, nor is there any need for the user to be aware of the configuration. We support this transparency during \fIupward configuration\fP -- the addition of more machines, and during \fIdownward reconfiguration\fP -- the removal or failure of machines. %X \fIClouds\fP supports abstract data objects at a very low level. These objects are used to build the operating system and applications. Some of these objects may be made \fIrecoverable\fP (operations on those objects may be undone or reversed in the event of failure or error). %X \fIAtomic transactions\fP or \fIactions\fP are used by both the operating system and user applications to maintain consistency and recoverability of data and operations. The design makes use of actions and objects to provide reliable operating system services, such as job schedulers, and thus provide a fault-tolerant system. %X The principles and motivations behind the \fIClouds\fP project have been described in more depth in several other documents. 
The authors assume that the reader is already acquainted with the \fIClouds\fP project and is somewhat familiar with the goals outlined in those documents. This paper is intended to be an introduction to the internal structures of the \fIClouds\fP kernel. We will be constructing an experimental \fIClouds\fP system during the next few years using dedicated minicomputers and personal computers. Further description of the \fIClouds\fP kernel will be done as this experimental system continues to be designed and constructed. %A E. H. Spafford %T Kernel Structures for a Distributed Operating System %R Ph.D. Diss. %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %D 1986 %K thesis Clouds %O Also released as technical report GIT-ICS-86/16 %X In recent years there has been considerable interest in developing distributed computing systems. Distribution of computing resources suggests many possible benefits including greater flexibility, enhanced computing power through greater parallelism, and increased reliability. %X In practice, achieving any of these benefits has been difficult, since a distributed system also presents potential problems in naming, synchronization, and the effective use of resources. Consistency problems arise when dealing with operations and data structures that may span machine and device boundaries; that is, should a communications or machine failure occur at an inopportune time, the data may be left in an unknown, incorrect, or inaccessible condition. This type of problem is certainly undesirable in user programs, but special problems arise when operating system data structures become inconsistent. Due to the larger number of components involved in a distributed system, these problems are more likely to occur and more damaging in their effects. %X Since 1982, the Clouds project has been researching an approach to the construction of a distributed computing environment intended to address these concerns. 
The Clouds operating system is intended to reliably support effective use of distributed resources. Some of that design is derived from the action/object model of computation developed in Jim Allchin's dissertation. That work suggested an architecture for a distributed, reliable computing system built from abstract data objects and atomic transactions. The architecture, properly implemented, can be used to address many of the problems presented by distributed systems. However, Allchin's work does not address the structure or implementation of the kernel and operating system services necessary for a functional distributed system. %X This dissertation explores the requirements for services and structures needed to support a distributed computing environment as suggested by Allchin's work. It contains the design of a distributed operating system kernel which meets these requirements and which could flexibly support various implementations of the Clouds reliable system as well as other forms of object-oriented distributed systems. This dissertation also describes a prototype implementation, which was done to help refine and validate the design and provide a testbed for further research. %A Eugene H. Spafford %T Object Operation Invocation in Clouds %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %R Technical Report GIT-ICS-87/14 %D February 1987 %K RPC %O Submitted to SOSP %X Many distributed operating systems have been developed in recent years based on the action/object paradigm. The Clouds multicomputer system provides a fault-tolerant distributed computing environment built from passive data objects, fault-atomic transactions, and a global kernel interface. Large portions of the Clouds operating system and supporting software, and all user-level software are being constructed from these constructs. %X Important to the successful functioning of Clouds is the uniform operation invocation mechanism. 
The mechanism is flexible, powerful and easily understood. It allows plain processes or nested transactions to access user and system objects in a transparent, uniform manner, whether those objects are local to the current machine or on some remote processor. The same basic interface used to make operation invocation requests on objects can be used to spawn processes and actions, and to gain access to restricted kernel services. %X This paper presents an abbreviated description of the Clouds philosophy and some of its kernel features as they relate to object operation invocation. Included is a presentation of the structure and operation of the invocation mechanism and its support for some of Clouds' design goals. Support for remote invocation, per-object access control, and location independent invocation are also presented. This should give the reader some understanding of the integrated nature of the three basic Clouds primitives--objects, actions, and processes--as well as insight into how they are supported. %A H. Strickland %T Networking Support for a Distributed Operating System %R M.S. Thesis %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %D 1987 %K Clouds %O In progress %A Peter Wan %T A Disk Driver for an Action-Oriented Operating System %R M.S. Thesis %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %D 1987 %O In progress %K Clouds %A C. T. Wilkes %T Preliminary Aeolus Reference Manual %R Technical Report GIT-ICS-85/07 %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %D 1985 %O Last Revision: 17 March 1986 %K Clouds %X The goal of the Clouds project at Georgia Tech is the implementation of a fault-tolerant distributed operating system based on the notions of objects, actions, and processes, which will provide an environment for the construction of reliable applications. 
The Aeolus programming language developed from the need for an implementation language for those portions of the Clouds system above the kernel level. Aeolus has evolved with these purposes: to provide the power needed for systems programming without sacrificing readability or maintainability; to provide abstractions of the Clouds notions of objects, actions, and processes as features within the language; to provide access to the recoverability and synchronization features of the Clouds system; and to serve as a testbed for the study of programming methodologies for action-object systems such as Clouds. %X Thus, the main interest of Aeolus lies not in the language itself, but in what may be done with the language. We have avoided providing high-level features for programming actions with the intention of evolving designs for such features out of our experience with programming in Aeolus. These features will then be incorporated into an applications language for the Clouds system. %X This report is not intended to be a tutorial on the Aeolus language; rather, it strives to be a concise definition of the syntax and semantics of Aeolus, and thus should serve as a reference for programmers and implementors. %A C. T. Wilkes %A R. J. LeBlanc %T Rationale for the Design of Aeolus: A Systems Programming Language for an Action/Object System %J Proceedings of the 1986 International Conference on Computer Languages %I IEEE Computer Society %C Miami, FL %D October 1986 %P 107-122 %O Also available as Technical Report GIT-ICS-86/12 %K Clouds %X The goal of the Clouds project at Georgia Tech is the implementation of a fault-tolerant distributed operating system based on the notions of objects, actions, and processes, to provide an environment for the construction of reliable applications. The Aeolus programming language developed from the need for an implementation language for those portions of the Clouds system above the kernel level. 
%X Aeolus has evolved with these purposes: to provide the power needed for systems programming without sacrificing readability or maintainability; to provide abstractions of the Clouds notions of objects, actions, and processes as features within the language; to provide access to the recoverability and synchronization features of the Clouds system; and to serve as a testbed for the study of programming methodologies for action-object systems such as Clouds. %X In this paper, the features provided by the language for the support of readability and maintainability in systems programming are described briefly, as is the rationale underlying their design. Considerably more detail is devoted to features provided for support of object and action programming. Finally, an example making use of advanced features for action programming is presented, and the current status of the language and its use in the Clouds project is described. %A C. T. Wilkes %T Programming Methodologies for Resilience and Availability %R Ph.D. Diss. %I School of Information and Computer Science, Georgia Institute of Technology %C Atlanta, GA %K thesis Clouds %D 1987 %O In progress %X The goal of the Clouds project at Georgia Tech is the implementation of a fault-tolerant distributed operating system based on the notions of objects and actions, which will provide an environment for the construction of reliable applications. As part of the Clouds project, we have designed and implemented a high-level language in which those levels of the Clouds system above the kernel level are being implemented. The Aeolus language provides access to the synchronization and recovery features of Clouds. It also provides a framework within which to study programming methodologies suitable for action-object systems such as Clouds. This dissertation describes programming methodologies appropriate to the design of fault tolerant servers needed in the Clouds system. 
Among the properties needed by these objects are resilience and availability. %X As part of this research, several case studies which will serve as designs for actual Clouds servers have been developed in Aeolus. Among the issues examined using these case studies are: the use of knowledge about the semantics of an object, as opposed to automatic provisions, in designing for resilience and availability; the tradeoffs between consistency and availability for such objects; the support from the Aeolus runtime system and from the Clouds kernel needed for providing fault tolerance; and high-level language features for resilience and availability which may be derived from experience with programming in Aeolus. ---------- C.T. "Tom" Wilkes School of Information & Computer Science, Georgia Tech, Atlanta GA 30332 CSNet: wilkes @ gatech ARPA: wilkes @ ics.gatech.edu uucp: ...!{akgua,allegra,hplabs,ihnp4,seismo,ulysses}!gatech!stratus!wilke o