Next: About this document Up: Distributed Operating Systems Previous: Local-Area Networks

Multicomputers

The computers on a local-area (or long-haul) network run independent, possibly different, operating systems that offer limited services for sharing. An alternative approach is to present a single operating system that manages all the computers. The operating system can run different processes of an application on different computers. In addition, if a particular site becomes overloaded, the operating system can migrate processes running at that site to other sites. Migration has several potential benefits: load balancing, moving a process closer to the data and other resources it accesses frequently, fault tolerance, execution on specialized hardware, etc. Migration requires a mechanism to gather load information, a distributed policy that decides when a process should be moved, and a mechanism to effect the transfer. Migration has been demonstrated in systems such as Locus, Demos/MP, and Charlotte.
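The three ingredients named above (load information, a policy, and a transfer mechanism) can be sketched as follows. This is only an illustration, not how Locus, Demos/MP, or Charlotte work: the `Site` class, the use of process count as the load measure, and the `threshold` value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A hypothetical computer in the multicomputer and its runnable processes."""
    name: str
    processes: list = field(default_factory=list)

    def load(self) -> int:
        # Load information: here simply the number of runnable processes.
        return len(self.processes)


def pick_migration(sites, threshold=2):
    """Policy: return (process, source, destination), or None for no move.

    A move is suggested only when the busiest site exceeds the idlest
    one by at least `threshold` processes (an assumed, tunable value).
    """
    source = max(sites, key=Site.load)
    dest = min(sites, key=Site.load)
    if source.load() - dest.load() < threshold:
        return None
    return source.processes[-1], source, dest


def migrate(process, source, dest):
    """Mechanism: a stand-in for the real transfer of process state."""
    source.processes.remove(process)
    dest.processes.append(process)


def balance(sites, threshold=2):
    """Apply the policy repeatedly until it suggests no further move."""
    moves = 0
    while (decision := pick_migration(sites, threshold)) is not None:
        migrate(*decision)
        moves += 1
    return moves
```

A real policy would be distributed and would work from stale, partial load information; the centralized loop here only shows how the three pieces fit together.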

Migration poses several problems. A system cannot simply copy the state associated with the migrating process to the destination, since some of this state may be host-relative (e.g., process ids, pending Unix signals, and messages). A process may therefore be migratable only under certain conditions. Some systems do not guarantee that all calls will behave the same after the process is migrated. Even if there is no host-relative state, the system must ensure that all messages directed to the migrated process reach its new location. In some systems, such as Charlotte, explicit bound ports are established between communicating processes. When a process with existing links moves, the system can update the information at the other end of each bound port. The situation is complicated by the fact that the two processes at opposite ends of a bound port might move simultaneously. In a system with input ports, a forwarding message can be left at the original host and returned to any sender that still uses the old address. Because of these problems, some systems migrate only new processes (those still in the scheduler queue) and not those that have already executed and built up site-specific state. (These problems are similar to those faced when an entire (mobile) computer with communicating processes moves to another network.)
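The forwarding scheme for input ports can be sketched as below. The `Host` and `Sender` classes and their method names are hypothetical, and the sketch assumes the forwarding message simply names the host the process moved to; it does not model any particular system.

```python
class Host:
    def __init__(self, name):
        self.name = name
        self.mailboxes = {}   # pid -> list of delivered messages
        self.forwarding = {}  # pid -> Host the process moved to

    def deliver(self, pid, message):
        """Return None on success, or the forwarding Host if pid has moved."""
        if pid in self.mailboxes:
            self.mailboxes[pid].append(message)
            return None
        if pid in self.forwarding:
            return self.forwarding[pid]
        raise KeyError(f"{pid} unknown at {self.name}")


def migrate(pid, source, dest):
    """Move the mailbox and leave a forwarding entry at the old host."""
    dest.mailboxes[pid] = source.mailboxes.pop(pid)
    dest.forwarding.pop(pid, None)  # drop a stale entry if pid has returned
    source.forwarding[pid] = dest


class Sender:
    def __init__(self):
        self.location = {}  # pid -> Host last known to hold the process

    def send(self, pid, message):
        """Send, following forwarding entries; return the number of attempts."""
        attempts = 1
        while (moved_to := self.location[pid].deliver(pid, message)) is not None:
            self.location[pid] = moved_to  # update the cached location
            attempts += 1
        return attempts
```

After one extra round trip per move, the sender's cached location is current again, so later messages go straight to the new host. The sketch ignores the harder case mentioned above, in which both ends move at the same time.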

One of the potential drawbacks of process migration is migration cost: the time taken to migrate a process might be more than the time required to complete it at the original site. A related problem is latency: during migration, the process does not respond to the user. Two solutions have been proposed to address the latency problem. One is precopying: the process continues to run at the original computer until it is completely copied to the remote computer. The other, adopted by Accent, is lazy copying: the complete memory image of a process is not copied when it is migrated. Instead, the state is copied on reference, and studies show this works well because programs tend to use only a small part of their state. This approach also addresses the migration cost problem, since state that is never referenced is never copied. Processes might be moved only in certain states, e.g., while suspended, to ensure that their interactive response time does not suffer. Migration might also increase the cost of accessing resources that were on the original machine. These resources can be migrated with the process.
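Copy-on-reference can be illustrated with a toy memory image that fetches a page from the original host only when it is first touched. This is a sketch of the idea, not Accent's implementation; the `LazyImage` class and the dictionary standing in for the remote host's memory are assumptions.

```python
class LazyImage:
    """Memory image of a migrated process, fetched page by page on first use."""

    def __init__(self, remote_pages):
        self._remote = remote_pages  # stands in for the original host's memory
        self._local = {}             # pages already copied to the new host
        self.fetches = 0             # number of pages actually transferred

    def read(self, page_no):
        if page_no not in self._local:
            # Copy on reference: transfer the page the first time it is used.
            self._local[page_no] = self._remote[page_no]
            self.fetches += 1
        return self._local[page_no]
```

If a process with a 100-page image touches only two distinct pages after it moves, only those two pages are ever transferred, at the price of a remote fetch on each first reference.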

The benefits of migration, thus, have to be weighed against the overheads involved.
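One simple way to express this trade-off is to migrate only when finishing at the destination, including the cost of the move, is expected to take less time than staying put. The function and its parameters are hypothetical, and a real system would have to estimate all three quantities.

```python
def worth_migrating(remaining_here, remaining_there, migration_cost):
    """True if moving and finishing remotely beats finishing locally.

    All three arguments are estimated times in the same unit.
    """
    return migration_cost + remaining_there < remaining_here
```

For example, a process with 10 seconds of work left on an overloaded site, which would need 4 seconds elsewhere and 3 seconds to move, is worth migrating; one with only 2 seconds left is not.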

How is the operating system on a multicomputer organized? Typically, each machine has a copy of the kernel, which provides the minimum functionality that includes communication between remote processes. Most of the operating system tasks are handled by servers which reside on different machines. Often a service is provided by a team of distributed servers instead of a centralized server, for several reasons:

Each computer connected to devices needs servers on that machine to manage the devices, as in 242-Xinu, which creates terminal servers on each machine connected to a tty device.

Requests to local servers may be satisfied faster than requests to remote servers.

A centralized server may become a bottleneck.

A service may be lost if the machine running the centralized server goes down.
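The last three reasons suggest how a client might choose among a team of servers: prefer a server on its own machine, and fall back to any other live server otherwise. The sketch below assumes a hypothetical table mapping host names to whether the server there is up; how such a table would be kept current is outside its scope.

```python
def choose_server(client_host, servers):
    """Pick a live server for a request, preferring the client's own host.

    `servers` maps a host name to True (server up) or False (server down).
    Returns the chosen host name, or None if no server is available.
    """
    if servers.get(client_host):
        return client_host  # the local server avoids a network round trip
    for host in sorted(servers):
        if servers[host]:
            return host     # any live remote server keeps the service up
    return None
```

With a single centralized server, the fallback loop would have nothing to fall back on, which is the loss of service described in the last item above.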






Prasun Dewan
Tue Apr 30 11:14:08 EDT 2002