Online Reliability Improvement via Smart Redundancy in Systems with Faulty and Untrusted Participants

Keywords Online Reliability Improvement via Smart Redundancy in Systems with Faulty and Untrusted Participants

Suppose r = 0:7 and the desired system reliability is s = 0:97. Iterative redundancy uses s as the confidence threshold and calculates how many jobs’ results must unanimously agree to be sure of the result with probability s. For
example, if the task server distributes only one job, there is a 0:7 0:7+0:3 = 0:7 chance that the result is correct, but if the task server distributes four jobs and they all return the same result, there is a 0:74 0:74+0:34 > 0:97 chance that the result is correct. Thus, four is the minimum number of jobs that can achieve the
confidence threshold, and the task server distributes four jobs in the first iteration.

If all four jobs return the same result, the task is finished. If some jobs return a disagreeing result, the task server determines the minimum number of additional jobs that must be distributed to achieve the confidence threshold
and produce the desired system reliability. For example, if three jobs return agreeing results and one returns a disagreeing result, the task server determines that at least two more jobs must return the majority result to achieve s; thus, the task server automatically distributes two more jobs. As we will show in Equation (5), the cost of iterative redundancy, for this particular example, is the use of 9:4 times as many resources as a system without redundancy. Note that this cost is 1:5 times less than the cost of progressive redundancy and 2:0 times less
than the cost of traditional redundancy.

Metadata

Date published

UNKNOWN

Document type

technical white paper

Pages

Defines standard

Replaced/Superseded by document(s)

Cancelled by

Amended by

File	MIME type		Size (KB)	Language	Download
usc-csse-2009-510.pdf	application/pdf		377.93 KB	English	DOWNLOAD!

File attachments

Provides definitions

Abstract

Many software systems today, such as computational grids,
include faulty and untrusted components. As faults are inevitable,
these systems utilize redundancy to achieve fault tolerance.
In this paper, we present two new, “smart” redundancy
techniques: iterative redundancy and progressive redundancy.
The two techniques are efficient, adaptive, and automated.
They are efficient in that they leverage runtime information to
improve system reliability using fewer resources than existing
methods. They are automated in that they inject redundancy in
situations where it is most beneficial and eliminate it where it
is unnecessary. Finally, they are adaptive in that they increase
redundancy on-the-fly when component reliability drops and
decrease redundancy when component reliability improves.
We enumerate examples of systems that can benefit from our
techniques but focus in this paper on computational grid
systems. We present formal analytical and empirical analyses,
demonstrating our techniques on a real-world computational
grid and comparing them to existing methods.

Site Slogan

Online Reliability Improvement via Smart Redundancy in Systems with Faulty and Untrusted Participants

Australia
PO Box 2385
Ringwood North, VIC 3134, Australia
P: +61 (0) 3 9876 7345
enquiries@ppi-int.com

USA (answers in Australia)
P: +1 888 772 5174

Brazil
P: +55 12 9 9780 3490

China 绝对不是
P: +86 188 5117 2867