Agile Computing: Bridging the Gap between Grid Computing and Ad-hoc Peer-to-Peer Resource Sharing

Niranjan Suri 1,2, Jeffrey M. Bradshaw 1, Marco M. Carvalho 1, Thomas B. Cowin 1, Maggie R. Breedy 1, Paul T. Groth 1, and Raul Saavedra 1

1 Institute for Human & Machine Cognition, University of West Florida
2 Lancaster University

{nsuri,jbradshaw,mcarvalho,tcowin,mbreedy,pgroth,rsaavedra}@ai.uwf.edu

Abstract

Agile computing may be defined as opportunistically (or on user demand) discovering and taking advantage of available resources in order to improve capability, performance, efficiency, fault tolerance, and survivability. The term agile is used to highlight both the need to react quickly to changes in the environment and the need to exploit transient resources that are available only for short periods of time. Agile computing builds on current research in grid computing, ad-hoc networking, and peer-to-peer resource sharing. This paper describes both the general notion of agile computing and one particular approach that exploits mobility of code, data, and computation. Some performance metrics are also suggested to measure the effectiveness of any approach to agile computing.

1. Introduction

Agile computing may be defined as opportunistically (or on user demand) discovering and taking advantage of available resources in order to improve capability, performance, efficiency, fault tolerance, and survivability. We use the term agile to highlight both the need to react quickly to changes in the environment and the ability to take advantage of transient resources that are available only for short periods of time. By resources, we mean any kind of computational resource, including CPU, memory, disk, and network bandwidth, as well as any specialized resources. By available, we mean resources that are under-utilized and may be reached via a network connection. Note that the availability of a resource may be in a constant state of flux due to varying loads on that resource or due to changes in its reachability.

The notion of agile computing builds on current research in grid computing, ad-hoc networking, and peer-to-peer resource sharing. The specific realization described in this paper exploits mobility of code, data, and computation and an architecture-independent uniform execution environment to achieve the desired property of agility. We would like to emphasize that the approach described here is only one way to realize the overall notion of agile computing, and we hope that other researchers will pursue alternative approaches.

Agile computing can be applied in a variety of domains to achieve one or more of the five goals (improvements in capability, performance, efficiency, survivability, and fault tolerance). The specific domain described in this paper is military sensor networks.

2. Motivations for Agile Computing

Agile computing will provide several advantages over the current state of the art. The following four examples illustrate the survivability, cost, and performance advantages and the ability to opportunistically take advantage of new capabilities.

2.1. Improving Survivability

Survivability may be defined as the resiliency of a system under attack. In certain environments (such as the military), computer systems may be under attack in order to reduce their effectiveness. Such attacks may be kinetic (physical, such as a location being destroyed through explosives) or electronic (information warfare attacks such as denial of service).
An agile computing infrastructure will allow critical functionality to be moved to other available computing platforms on demand. For example, a Navy ship at sea may have dedicated computers for fire control, for logistics, and for a variety of other functions, as well as general-purpose workstations and laptops for the crew. If the operators knew that a missile was incoming and there was a possibility of some part of the ship being hit, the system should be able to duplicate the critical functionality onto other systems (on the same ship or on other ships in the vicinity). If one of the systems (for example, the fire control system) is overloaded, it should be able to tap into the resources available on the logistics system (or even preempt the less critical functionality of the logistics system). In case one of the dedicated systems is damaged, the functionality should be dynamically moved to other available systems (such as the general-purpose workstations).

2.2. Reducing Investment in Duplicate Hardware

Critical systems are often constructed with redundant systems (hot standbys) that take over when the primary system fails. The space shuttle is an example that carries redundant hardware in case of emergencies. However, astronauts also carry laptop computers with them. Currently, there is no infrastructure that would allow a crew laptop to take over part of the operation of the shuttle in case of a failure in the shuttle computer. The goal of agile computing research would be to allow the space shuttle's systems to be designed in such a way that functionality of the shuttle's main system can be dynamically moved to a crew laptop in case of an emergency. Such an infrastructure would reduce the need to duplicate dedicated hardware. Agile computing emphasizes survivability through software and process migration and redundancy, not hardware redundancy. This is particularly powerful given that computational hardware is now becoming ubiquitous and interconnected.

2.3. Improving Performance

In any given environment (micro or macro), the utilization of systems is not uniform. This may be observed in groups ranging from very small ones, such as laptops in a meeting room, to very large organization-wide (or even nationwide) groups. Using the available resources within such groups will enhance overall performance. Note that groups may be transient or stable, and even in a stable group, systems may disconnect and reconnect. The goal of agile computing is to take advantage of not just stable groups but transient ones as well.

Much work has been done on resource sharing via peer-to-peer protocols. SETI@home [1] is one example of utilizing computational resources in a networked environment. Entropia has built systems such as GIMPS (Great Internet Mersenne Prime Search) [2] that also take advantage of resources in a network of PCs. Both of these are examples of working with stable groups and hence lack the dynamism of agile computing. Protocols such as Gnutella [3] are highly dynamic and ad-hoc but are limited to file sharing. Arguably, due to security concerns, it would be difficult to use a Gnutella-style protocol for sharing computational resources without additional protection mechanisms.

2.4. Dynamically Discovering and Using Capabilities

Consider an advanced military sensor network (such as the one envisioned for the U.S.
Army Future Combat Systems program) whose goal is to allow soldiers to discover and task available sensors in any given environment. As opposed to tasking a static set of deployed sensors, agile computing would allow opportunistic use of available resources. For example, imagine a soldier interested in visual images from a certain area. The soldier might be using a terrestrial camera sensor to obtain the visual images. However, a helicopter with a camera (possibly on a different mission) might be over the area of interest for the next two minutes. The soldier should be able to dynamically discover the new resource and switch to using it in order to retrieve much higher quality images than the original terrestrial sensor provides. When the helicopter leaves the area of interest, the sensor feed should automatically switch back to the original (less capable) sensor. An agile computing system would allow such opportunistic discovery of capabilities and provide the mechanisms to take advantage of them (even if only for a short, transient period of time).

3. Technical Requirements and Challenges

Several technical requirements must be satisfied before agile computing can be realized. These requirements are discussed below.

3.1. Group Formation

A group is defined as a set of systems that specifies the scope for resource sharing. Therefore, the formation and regulation of groups is central to agile computing. Groups may be formed through static configuration or through ad-hoc discovery. For example, all of the workstations in a department may be configured to be part of a group. At the other extreme, a group may be formed when four laptop computers in a meeting room discover each other through a protocol such as Bluetooth. Group formation needs to be controllable through policies for security reasons, so that the systems that are allowed to share resources can be regulated.

3.2. Architecture Independence

Agile computing implies that computations must be able to take advantage of any available resources independent of the underlying hardware architecture. If a system fails unexpectedly, the infrastructure should be able to compensate by running those computations on another system, independent of architecture. Similarly, if an ad-hoc group consists of laptops of different architectures (say, Apple PowerBooks running Mac OS X and PC laptops running Windows and Linux), any one of the laptops should still be able to take advantage of any available resources in the other laptops.

3.3. Environmental Independence

Architecture independence alone is not sufficient to support migration of computations between systems. If the environment on each system is different, the computation will fail after migration due to the sudden change. Environmental factors include the data resident on a system, the configuration of the software on the system, as well as any specialized hardware in the system.

3.4. Mobility of Code, State, and Data

In our proposed solution, one of the key enhancements provided by agile computing over existing approaches is the exploitation of mobility of code, computational state, and data. Mobility of code is needed to support dynamically pushing or pulling code to a system in order to change its configuration or download a new capability. Mobility of computational state is needed to allow computations to move from one system to another within a group.
Such movement may be triggered for survivability, for performance, or to accommodate changes in group membership. Finally, mobility of data along with the computation and/or the code is necessary to handle potential network disconnections as well as to optimize network bandwidth usage.

3.5. Security

Security is of paramount importance for agile computing to be used in a practical environment. A system that joins a group and contributes resources should be protected from abuse. If computations are pushed onto a system, the computations must be executed in a restricted environment to guarantee that they do not access private areas of the host system or abuse the resources available on the host. Similarly, the computations themselves need to be protected from a possibly malicious host that joins a group.

In addition to satisfying the above requirements, the following two challenges must be addressed to make agile computing successful.

3.6. Overcoming Dedicated Roles / Owners for Systems

One of the problems with current systems is that they are often dedicated to certain tasks or assigned to particular users. Such a priori classification of systems prevents exploitation of available resources. Agile computing relies on the notion that any available resource should be utilizable. In order to make this a reality, hardware should be generic and ubiquitous, with specialization derived through software. If this were the case, then one system could easily be substituted for another by moving the software functionality as needed.

Similarly, if systems are assigned to individual owners who are protective of their systems, then resource sharing will be ineffective. One solution to this problem lies in resource accounting, which allows the owner of a system to contribute resources and then to quantify the contribution in order to receive compensation in some manner.

3.7. Achieving a High Degree of Agility

The degree of agility may be defined as the length of time for which a system needs to be part of a group in order for its resources to be effectively exploited. The shorter the length of time, the higher the degree of agility. The degree of agility that can be realized is a direct function of the overhead involved. When a system joins a group, there is overhead in the group reformation process, in setting up communication channels, and in moving computations, code, and data to the system. Before a system leaves a group, there is potentially more overhead in moving active computations off of the system. The degree of agility may also be defined in terms of the minimum time required to reconfigure when one or more systems are under threat. A system with a higher degree of agility will be more survivable.

4. Design and Implementation

The realization of an agile computing framework involves the design and implementation of a platform-independent way to coordinate the communication, distribution, and execution of processes on heterogeneous networks. As part of our requirements, the framework must be capable, amongst other things, of providing local resource control, accounting, and redirection, as well as high-level services such as lookup, discovery, and policy enforcement.
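To make these responsibilities concrete, the sketch below outlines one way a kernel might expose such local services to hosted processes. The interfaces and method names are hypothetical illustrations for this paper's discussion, not the actual Agile Computing Kernel API.

    // Hypothetical sketch of the local services an agile computing kernel
    // might expose to hosted processes. All names are illustrative only.

    import java.util.List;

    /** Tracks per-process resource consumption (CPU, memory, network). */
    interface AccountingService {
        long cpuMillisUsed(String processId);
        long networkBytesUsed(String processId);
    }

    /** Answers whether an action by a process is permitted under current policy. */
    interface PolicyManager {
        boolean isAllowed(String processId, String action, String resource);
    }

    /** Reports local resource availability and requests migration when overloaded. */
    interface ResourceManager {
        double availableCpuFraction();
        long availableMemoryBytes();
        void requestMigration(String processId, String reason);
    }

    /** Maintains group membership and locates services offered by peer hosts. */
    interface GroupManager {
        List<String> currentMembers();
        List<String> lookupService(String serviceType);
    }

    /** Ties the local services together for processes hosted by the kernel. */
    interface AgileComputingKernel {
        AccountingService accounting();
        PolicyManager policy();
        ResourceManager resources();
        GroupManager group();
    }

Keeping accounting, policy, resource, and group concerns behind separate interfaces mirrors the component structure described in the subsections that follow.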
A dynamically formed group is a fundamental structural notion in agile computing. A group is essentially defined as a set of hosts that have joined together to share resources. That could be, for instance, a set of laptops in a conference room during a meeting. Figure 1 shows one possible arrangement for a set of hosts. Note that groups may be disjoint or overlapping. An overlapping group is created by the existence of one or more shared hosts. A host may join multiple groups. Various grouping principles are possible, but likely candidates are physical proximity, network reachability, and ownership.

Figure 1: Runtime Grouping of Hosts

Hosts in a group might belong to different administrative domains, which creates another type of grouping. Domains are used to express common policies to hosts and to administer them conveniently. Domains tend to be more static compared to runtime groups. Figure 2 shows a configuration with two domains and one group.

Figure 2: Relationship between Domains and Groups

Each participating node runs a specialized Java-compatible kernel (the Agile Computing Kernel) that provides a platform-independent execution environment and a set of local services such as policy enforcement and resource control. The kernels on different hosts also interact amongst themselves to provide a lower-level, group-wide set of services such as global directory and coordination services. The Agile Computing Kernel contains a uniform execution environment, a policy manager, a resource manager, a group manager, and a local coordinator. Figure 3 shows the main components of the Agile Computing Kernel.

Figure 3: The Agile Computing Kernel

These components provide the set of capabilities that running processes rely upon to take advantage of the agile computing framework. They constitute a middleware through which processes communicate and migrate between nodes. The following subsections provide a brief explanation of each of the components.

4.1. Uniform Execution Environment

The Uniform Execution Environment provides a common abstraction layer for code execution that hides underlying architectural differences such as CPU type and operating system. The execution environment will also support dynamic migration of computations between kernels, secure execution of incoming computations, resource redirection, resource accounting, and policy enforcement.

The implementation of the Uniform Execution Environment is currently based on Aroma [4] [5], a clean-room implementation of a Java-compatible VM developed under DARPA funding, to provide architecture independence and support for agile computing requirements. Aroma was designed from the ground up to support capture of the execution state of Java threads, to provide accounting services for resource usage, and to control the resource consumption of Java threads. Moreover, the captured execution state is platform independent, which allows migration of computations between Aroma VMs running on different hardware platforms [6]. The capabilities of the Aroma VM are critical to ensure secure execution of mobile code.
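Aroma's actual interfaces are not reproduced here; the hypothetical sketch below merely illustrates the kind of capture-and-resume API that thread-level, platform-independent state migration implies.

    // Hypothetical illustration of a thread-level state capture and resume
    // interface of the kind implied by the Aroma VM's migration support.
    // This is not Aroma's actual API.

    import java.io.Serializable;

    /** Opaque, platform-independent snapshot of a running computation. */
    interface ExecutionState extends Serializable {
        String processId();         // which computation this snapshot belongs to
        long capturedAtBytecode();  // state is captured between two bytecodes
    }

    /** Capture and resume operations a migration-capable VM might expose. */
    interface StateCaptureService {
        /** Suspend the computation and capture its complete execution state. */
        ExecutionState capture(String processId);

        /** Resume a previously captured computation on this (possibly different) host. */
        void resume(ExecutionState state);
    }

Because the captured state is expressed in platform-independent terms, the resume operation can run on a kernel with a different CPU type or operating system from the one where the capture occurred.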
There are no implicit requirements to use Java as the language for the implementation of the agile computing infrastructure. We do recognize, though, that Java provides many desirable features for a mobile-code-based framework. Moreover, the virtual machine architecture of Java provides platform independence. Therefore, our approach relies on Java.

Besides the Java-compatible VM, the execution environment also includes a set of software components that support interaction between the kernel and locally running processes. These components are: a) the Security Enforcer, b) the Accounting Service, c) the State Capture Mechanism, and d) the Resource Redirection Service.

The Security Enforcer will ensure that running processes have limited access to system resources in order to avoid denial-of-service (DoS) attacks [7]. This component will receive instructions from the Policy Manager component specifying usage restrictions for each process running in the VM. The restrictions can be established during process migration or even at runtime, after process execution has started. This component also provides authentication and encryption services for secure data and state transfer.

The Accounting Service provides a facility to track resource utilization at the process level inside the VM. The service is used by the Security Enforcer and the Resource Manager components to estimate overall kernel load and resource availability.

The Resource Redirection Service provides the means to transparently move links to local or remote resources when code is migrated between kernels. Consider, for example, a scenario where a computation has two socket connections open to remote hosts. Due to an imminent power failure, the computation needs to move to another intermediate host. In this case, the Resource Redirection Service of each kernel will negotiate a redirection of the resources (the socket connections in this case) to transparently move the computation with no apparent interruption of the links. For the computation in this example, the migration happens seamlessly and the socket connections with the remote hosts are maintained during the migration. The implementation of this feature leverages previous research conducted on resource redirection for the Aroma VM using Mockets [8].

The State Capture Mechanism provides the necessary means to capture the execution state of one or more processes running in the execution environment. The execution state can be captured at any point between the execution of two Java bytecodes. The state information can then be persisted or moved to another host to resume execution at the very next bytecode.

All of these components work in concert with the Policy Manager and Resource Manager components, which are also part of the kernel but not directly integrated with the execution environment. The Policy Manager and the Resource Manager are primarily concerned with higher-level interactions with the Coordinator and other kernels, but they do rely on the execution environment components to locally perform and enforce most of their tasks.

4.2. Policy Manager

The Policy Manager is responsible for the specification, conflict resolution, and distribution of policies. This component provides a facility for other components in the kernel to query and determine policies and restrictions that apply to local and remote processes and nodes. We are experimentally evaluating different policy disclosure strategies and conflict resolution algorithms, building on our previous DARPA- and NASA-funded work in this arena [9].
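As a rough illustration of the interplay between the Policy Manager and the Security Enforcer described above, the following sketch shows how per-process restrictions might be represented and checked. The types, limits, and method names are assumptions made for the example, not the framework's actual interfaces.

    // Hypothetical sketch of per-process restrictions flowing from the Policy
    // Manager to the Security Enforcer. Names and limits are illustrative only.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Resource usage limits that a policy assigns to a single hosted process. */
    record ResourcePolicy(long maxCpuMillis, long maxMemoryBytes, long maxNetworkBytes) {}

    /** Receives restrictions from the Policy Manager and enforces them at runtime. */
    class SecurityEnforcer {
        private final Map<String, ResourcePolicy> policies = new ConcurrentHashMap<>();

        /** Applied when a process migrates in, or updated later at runtime. */
        void applyPolicy(String processId, ResourcePolicy policy) {
            policies.put(processId, policy);
        }

        /** Consulted by the execution environment before granting more memory. */
        boolean mayAllocate(String processId, long currentBytes, long requestedBytes) {
            ResourcePolicy p = policies.get(processId);
            return p == null || currentBytes + requestedBytes <= p.maxMemoryBytes();
        }
    }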
4.3. Resource Manager

The Resource Manager provides an interface for the Coordinator and remote kernels to query and provide information about local resource utilization. Resource availability is one of the metrics considered by the Coordinator when calculating optimum paths for data distribution. The Resource Manager will act as a bridge between the Accounting Service in the execution environment and the Coordinator. It will monitor local resource utilization in the execution environment and interact with the Coordinator to request migration of local computations or to notify it of local resource availability for the framework.

4.4. Group Manager

The Group Manager is the component responsible for identifying all the nodes that participate in a group. The framework is designed to handle highly dynamic environments, where nodes join and leave the framework at any arbitrary rate. Therefore, a fundamental requirement is to efficiently and accurately identify other available nodes and services. The role of the Group Manager is to coordinate with the Policy Manager to ensure proper advertisement of services and to identify and locate services required by local processes.

Service lookup implies the notion of a registry that accepts service registration and deregistration requests as well as queries from clients looking for specific providers or capabilities. In general, lookup registries provide no guarantees about service availability. Both queries and registration/deregistration requests are initiated by clients or service providers. The registry is usually a passive entity in the framework.
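As a minimal sketch of the passive registry model described above, the following hypothetical interface accepts registrations and deregistrations from providers and answers client queries without making availability guarantees; the names are illustrative only.

    // Hypothetical sketch of the passive lookup registry described in
    // Section 4.4. Names are illustrative; this is not the framework's API.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    /** Advertisement for a service offered by some host in the group. */
    record ServiceAdvertisement(String serviceType, String providerHost) {}

    /** Providers register and deregister; clients query. No availability guarantees. */
    class ServiceRegistry {
        private final Map<String, List<ServiceAdvertisement>> byType = new ConcurrentHashMap<>();

        void register(ServiceAdvertisement ad) {
            byType.computeIfAbsent(ad.serviceType(), t -> new CopyOnWriteArrayList<>()).add(ad);
        }

        void deregister(ServiceAdvertisement ad) {
            List<ServiceAdvertisement> ads = byType.get(ad.serviceType());
            if (ads != null) {
                ads.remove(ad);
            }
        }

        /** Returns currently registered providers of a capability, possibly stale. */
        List<ServiceAdvertisement> lookup(String serviceType) {
            return List.copyOf(byType.getOrDefault(serviceType, List.of()));
        }
    }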