A Dependable Middleware for Enhancing the Fault Tolerance of Distributed Computations in Grid Environments
Product format: Book
Grid computing envisions the sharing of compute, storage, network and software resources across multi-institutional virtual organizations (VOs) to provide more effective solutions for important scientific, engineering and business problems. With the growing adoption of Grid infrastructures in science and industry, fault tolerance and self-healing are becoming increasingly important. The more resources and components are involved, the more complicated and error-prone the system becomes. High failure rates are a major concern, in particular for long-running applications. Many scientific applications need to run for days or weeks. For example, a simulation of gamma-ray bursts, an astrophysical phenomenon, requires over 100 days of runtime on a 1 PFlop/s machine. Running such an application requires a large supercomputer or a Grid. Unfortunately, these systems are very error-prone. A single node failure usually causes the entire application to abort. Thus, efficient support for fault tolerance is essential.
The Migol middleware, the main contribution of this thesis, addresses the fault tolerance of long-running applications as found in many sciences, e.g. astrophysics or the life sciences. Migol supports applications in performing complex tasks such as resource allocation, monitoring, checkpointing and, if necessary, recovery.