Department of Computer Science at UH

University of Houston
Department of Computer Science
In Partial Fulfillment of the Requirements for the Degree of
 Master of Science
Georgi Kutiev
Will defend his thesis

Robust MPI Execution with Controlled Redundancy

Abstract
With the ever-increasing number and size of grids (both local and globally connected), handling parallel execution jobs becomes difficult due to scalability bottlenecks and frequent faults in individual machines across the network.

In this work, we address these issues by proposing a controlled-redundancy meta-algorithm as a scalable, fail-safe grid engine for executing parallel algorithms of any computation and communication complexity. We employ a topology in which several instances (replicas) of a user algorithm run redundantly in parallel. Each node in a replica communicates exclusively with an intermediary (control) node. The control node then forwards the received messages to all interested recipients, including those in other replicas. As a result, message data is disseminated as soon as the fastest (leading-edge) replica makes it available. Slow or failing nodes can be bypassed completely, and correct execution is guaranteed in all but extreme cases of massive failure.
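
The forwarding rule described above can be illustrated with a small in-memory sketch. This is not the thesis implementation and it does not use MPI. The class name ControlNode, the receive method, and the sequence number seq are hypothetical names chosen for illustration. The sketch also assumes that replicas are deterministic, so the same logical message (same sender rank, receiver rank, and sequence number) carries the same payload in every replica.

```python
from collections import defaultdict


class ControlNode:
    """Toy stand-in for the intermediary (control) node; all names are hypothetical."""

    def __init__(self, num_replicas):
        self.num_replicas = num_replicas
        # Logical messages already forwarded, keyed by (src_rank, dst_rank, seq).
        self.seen = set()
        # inbox[(replica, rank)] holds the payloads delivered to that node, in order.
        self.inbox = defaultdict(list)

    def receive(self, replica, src_rank, dst_rank, seq, payload):
        """Handle one message sent by node `src_rank` of `replica`.

        Returns True if this copy was forwarded, or False if an identical
        logical message had already arrived from a faster replica.
        """
        key = (src_rank, dst_rank, seq)
        if key in self.seen:
            return False  # duplicate from a slower replica: drop it
        self.seen.add(key)
        # Forward to the destination rank in every replica, not only the sender's.
        for r in range(self.num_replicas):
            self.inbox[(r, dst_rank)].append(payload)
        return True


# Replica 1 is the fastest; replica 0 sends the same message later;
# replica 2 never sends it (for example, because its rank 0 has failed).
control = ControlNode(num_replicas=3)
first = control.receive(replica=1, src_rank=0, dst_rank=2, seq=0, payload="block-A")
late = control.receive(replica=0, src_rank=0, dst_rank=2, seq=0, payload="block-A")
```

In this toy run, rank 2 of every replica receives the payload once, including the replica whose sender never produced it, which is the sense in which a slow or failed node is bypassed.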

Finally, we present runs of sample applications and the standard NAS MPI benchmarks, which show an overhead of 0.1-7% for most real-world implementations as the method scales up to make full use of available resources. The method maintains transparent resistance to failure while eliminating the need for complex and often restrictive checkpoint/restore mechanisms.

Date: Friday, April 27th, 2007
Time: 10:30 AM
Place: 362-PGH
Faculty, students, and the general public are invited.
Advisor: Prof. Subhlok