Online Reliability Improvement via Smart Redundancy in Systems with Faulty and Untrusted Participants

Keywords Online Reliability Improvement via Smart Redundancy in Systems with Faulty and Untrusted Participants

Suppose r = 0:7 and the desired system reliability is s = 0:97. Iterative redundancy uses s as the confidence threshold and calculates how many jobs’ results must unanimously agree to be sure of the result with probability s. For
example, if the task server distributes only one job, there is a 0:7 0:7+0:3 = 0:7 chance that the result is correct, but if the task server distributes four jobs and they all return the same result, there is a 0:74 0:74+0:34 > 0:97 chance that the result is correct. Thus, four is the minimum number of jobs that can achieve the
confidence threshold, and the task server distributes four jobs in the first iteration.

If all four jobs return the same result, the task is finished. If some jobs return a disagreeing result, the task server determines the minimum number of additional jobs that must be distributed to achieve the confidence threshold
and produce the desired system reliability. For example, if three jobs return agreeing results and one returns a disagreeing result, the task server determines that at least two more jobs must return the majority result to achieve s; thus, the task server automatically distributes two more jobs. As we will show in Equation (5), the cost of iterative redundancy, for this particular example, is the use of 9:4 times as many resources as a system without redundancy. Note that this cost is 1:5 times less than the cost of progressive redundancy and 2:0 times less
than the cost of traditional redundancy.

Metadata
Date published
UNKNOWN
Document type
technical white paper
Pages
12
Defines standard
Replaced/Superseded by document(s)
Cancelled by
Amended by
File MIME type Size (KB) Language Download
usc-csse-2009-510.pdf application/pdf   377.93 KB English DOWNLOAD!
File attachments
Abstract

Many software systems today, such as computational grids,
include faulty and untrusted components. As faults are inevitable,
these systems utilize redundancy to achieve fault tolerance.
In this paper, we present two new, “smart” redundancy
techniques: iterative redundancy and progressive redundancy.
The two techniques are efficient, adaptive, and automated.
They are efficient in that they leverage runtime information to
improve system reliability using fewer resources than existing
methods. They are automated in that they inject redundancy in
situations where it is most beneficial and eliminate it where it
is unnecessary. Finally, they are adaptive in that they increase
redundancy on-the-fly when component reliability drops and
decrease redundancy when component reliability improves.
We enumerate examples of systems that can benefit from our
techniques but focus in this paper on computational grid
systems. We present formal analytical and empirical analyses,
demonstrating our techniques on a real-world computational
grid and comparing them to existing methods.

Visit also