Optimal Fault Tolerance
in Real Time Cloud Computing
Dr. Mohammad Abdur Rouf (Professor)
Dhaka University of Engineering and Technology, Gazipur
B.Sc. Engg. (KU), M.Sc. Engg. (BUET), Ph.D. (KAIST)
Presented by
Md. Mostafijur Rahman
Masters Student ID #: 132431(p)
Department of Computer Science and Engineering
Outline of the talk
• Real-time fault tolerance model in cloud computing
• Technology:
• Amazon Web Services EC2.
o Virtual machine: algorithms X1…Xm, acceptance test.
o Adjudicator: input buffer, reliability weights, time checker, reliability
assessor, decision mechanism, checkpoint; comparison of performance, fault tolerance
and scalability.
• Related work
• “Time Stamped Fault Tolerance in Distributed Real Time Systems”, proposed by S. Malik and
M. J. Rehman. The model is based on an artificial neural network whose neurons are connected to
each other by weighted links. The system has to produce its output within the threshold. The
algorithm sorts the input values, and the DM takes its decision according to the weights, like a
“voter mechanism”. The acceptance tester checks the node status; depending on that status, the
system performs forward or backward recovery.
• X. Kong et al. presented “Performance, Fault-tolerance and Scalability Analysis of
Virtual Infrastructure Management System”. The paper analyses the performance, fault
tolerance and scalability of three structures: centralized, hierarchical and
peer-to-peer.
ABSTRACT
• A real-time system can take advantage of the intensive computing
capabilities and scalable virtualized environment of cloud computing
to execute real-time tasks.
• The proposed system tolerates faults and makes its decisions on
the basis of the reliability of the processing nodes.
• If no processing node achieves the System Reliability Level (SRL),
the system performs backward recovery or safety measures.
Nodes are removed or added depending on their reliability
weights.
• The system provides both forward and backward recovery
mechanisms.
• The reliability weight assigned to each processing node adapts
over time, and nodes are added or removed on the basis of
reliability.
Operation Work
The system has two kinds of nodes: a set of virtual machines running on cloud
infrastructure, and an adjudication node.
Operation
In this scheme, we have m virtual machines, which run m variant algorithms.
Algorithm X1 runs on virtual machine 1, X2 runs on virtual machine 2, and so on
up to Xm, which runs on virtual machine m.
Then there is the acceptance test (AT) module, which verifies the output of each
node. The outputs are then passed to the time checker (TC) module, which checks
the timing of each result.
On the basis of the timing, the reliability assessor (RA) module calculates and
reassigns the reliability of each node. All the results are then forwarded to the
decision mechanism (DM) module, which selects the output on the basis of best
reliability: the output of the node with the highest reliability becomes the
output of the system cycle.
The real-time application algorithm is the application logic that performs the
real-time computations. Variant algorithms run on different virtual machines;
they can differ from each other in implementation language, software engineering
process, or functional logic. Each algorithm passes its result to the acceptance
test for verification.
This scheme provides automatic forward recovery. If a node fails to produce
output, or produces it after the time limit has overrun, the system does not
fail. It continues to operate with the remaining nodes.
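The flow of one cycle described above can be sketched roughly as follows. This is only an illustration, not the implementation used in the experiment: the names `NodeResult` and `run_cycle`, the numeric result type and the per-cycle `deadline` are assumptions made for the example, and the reliability update done by the RA is left out here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class NodeResult:
    node_id: str
    value: Optional[float]  # None means the node produced no output
    elapsed: float          # time taken to produce the output, in seconds


def run_cycle(
    results: List[NodeResult],
    acceptance_test: Callable[[float], bool],
    deadline: float,
    reliability: Dict[str, float],
) -> Optional[Tuple[str, float]]:
    """Return (node_id, value) of the most reliable accepted, timely result."""
    candidates = []
    for r in results:
        passed_at = r.value is not None and acceptance_test(r.value)  # AT
        passed_tc = r.elapsed <= deadline                             # TC
        if passed_at and passed_tc:
            candidates.append(r)
    if not candidates:
        return None  # no usable output in this cycle
    # DM: pick the output of the node with the highest reliability
    best = max(candidates, key=lambda r: reliability[r.node_id])
    return best.node_id, best.value


results = [
    NodeResult("vm1", 4.0, 0.5),
    NodeResult("vm2", 4.0, 2.0),   # correct but late
    NodeResult("vm3", None, 0.1),  # no output
]
reliability = {"vm1": 0.9, "vm2": 1.2, "vm3": 1.0}
selected = run_cycle(results, lambda v: v >= 0, 1.0, reliability)
print(selected)
```

Here the late node and the silent node are simply skipped, which is the forward recovery behaviour: the cycle still yields an output from the remaining node.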
Reliability Assessment Flow Chart:
Fault Tolerance Mechanism Algorithm
Reliability Assessment Algorithm:
Begin
  Initially reliability := 1, n := 1
  Input from configuration: RF, maxReliability, minReliability
  Input nodeStatus
  if nodeStatus = Pass then
    reliability := reliability + (reliability * RF)
    if n > 1 then
      n := n - 1
  else if nodeStatus = Fail then
    reliability := reliability - (reliability * RF * n)
    n := n + 1
  if reliability >= maxReliability then
    reliability := maxReliability
  if reliability < minReliability then
    nodeStatus := dead
    call_proc: remove_this_node
    call_proc: add_new_node
End
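The reliability assessment pseudocode can be written as a small Python function. This is a sketch under assumptions: the `NodeState` class and the `assess` function are names invented for the example, and removing or adding a node is reduced to setting a status flag that the caller would act on.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    reliability: float = 1.0
    n: int = 1               # penalty multiplier, grows with consecutive failures
    status: str = "alive"


def assess(node: NodeState, passed: bool, rf: float,
           max_reliability: float, min_reliability: float) -> NodeState:
    """Update one node's reliability after a cycle (Pass or Fail)."""
    if passed:
        node.reliability += node.reliability * rf
        if node.n > 1:
            node.n -= 1
    else:
        node.reliability -= node.reliability * rf * node.n
        node.n += 1
    if node.reliability >= max_reliability:
        node.reliability = max_reliability  # cap at the configured maximum
    if node.reliability < min_reliability:
        node.status = "dead"  # caller should remove this node and add a new one
    return node


node = NodeState()
assess(node, True, rf=0.2, max_reliability=1.2, min_reliability=0.7)
assess(node, False, rf=0.2, max_reliability=1.2, min_reliability=0.7)
print(round(node.reliability, 3), node.n, node.status)
```

Because the penalty is multiplied by n, repeated failures cost more than isolated ones, while each pass slowly brings n back down.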
Decision Mechanism Flow Chart:
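[Flow chart not reproduced. Recoverable elements: Input Buffer; get nodeReliability and
numCandNodes from RA, and SRL (System Reliability Level); decision "bestReliability >= SRL?"
with branches Yes and No; Forward Recovery; Backward Recovery; remove node with minReliability,
add new node.]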
Decision Mechanism Algorithm
Begin
  Initially reliability := 1, n := 1
  Input from RA: nodeReliability, numCandNodes
  Input from configuration: SRL
  bestReliability := find_reliability of node with highest reliability
  if bestReliability >= SRL then
    status := success
  else
    perform_backward_recovery
    call_proc: remove_node_minReliability
    call_proc: add_new_node
End
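A minimal Python sketch of the decision mechanism follows. The function name `decide` and the returned tuple are assumptions made for the example. The slide text does not show clearly whether replacing the weakest node belongs only to the failure branch; this sketch assumes it does.

```python
from typing import Dict, Optional, Tuple


def decide(node_reliability: Dict[str, float],
           srl: float) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (status, selected_node, node_to_replace) for one cycle."""
    if not node_reliability:
        # No candidate nodes at all: nothing to select, nothing to replace
        return "backward_recovery", None, None
    best = max(node_reliability, key=node_reliability.get)
    worst = min(node_reliability, key=node_reliability.get)
    if node_reliability[best] >= srl:
        return "success", best, None
    # Best node is below the SRL: roll back and replace the weakest node
    return "backward_recovery", None, worst


print(decide({"vm1": 0.9, "vm2": 1.1, "vm3": 0.75}, srl=0.8))
print(decide({"vm1": 0.75, "vm2": 0.72}, srl=0.8))
```

The first call selects the output of vm2; the second signals backward recovery and names vm2 as the node to replace.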
Reliability Assessment Impact Analysis
A metric analysis is given to show the impact of
reliability assessment.
Change in reliability for a single node
Different Scenarios
1) Complete Failure-Free Scenario
All the algorithms on every virtual machine produce a result, and the acceptance tests pass all of
them. All the results are produced before the time overrun, so the TC also clears them. The RA
computes and assigns new reliability weights to each virtual machine. The decision mechanism
selects the output of the VM with maximum reliability.
2) Partial Failure Scenario – All ATs pass, TC passes some nodes
All the virtual machines produce the correct result, but some results arrive in time and
some after the time overrun. All ATs pass the results and forward them to the time checker. The
TC receives the results of only some virtual machines before the time overrun. It passes them to
the RA, which assesses their reliability. The RA forwards the results to the
decision mechanism for adjudication.
3) Partial Failure Scenario – Some ATs pass, TC also passes
The acceptance tests pass only the correct results to the TC. For failed virtual machines, the AT
sends an error signal to the TC. The time checker receives the results of the passed virtual
machines before the time overrun. It passes them to the RA, which assesses their reliability. The
RA forwards the results to the decision mechanism for adjudication. The decision mechanism
selects the output of the VM with maximum reliability. In this scenario, the system continues to
operate with forward recovery: the adjudicator selects the output from the subset of nodes that
produced the correct output within the time limit.
Different Scenarios
4) Failure Scenario – ATs fail, TC fails
In this scenario, either all the ATs reject the results of the algorithms, or some ATs pass
but the TC fails to find a single output within the time limit. In this case, the cycle fails and
the TC informs the adjudicator to perform backward recovery. Backward recovery is then
done with the help of the checkpoints stored in the recovery cache.
5) Failure Scenario – ATs pass, TC passes, DM fails
In this scenario, all or some of the ATs pass the results, and the TC also finds output within
the time limit. The reliability assessor computes and assigns the reliability of the virtual
machines, but the VM with the highest reliability does not achieve the system reliability
level (SRL). In this case, the DM raises the failure signal for the whole computing
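The five scenarios can be told apart from a few per-cycle counts. The following is an illustrative sketch only: the function `classify_cycle` and its parameters are assumptions made for the example, and a cycle in which some ATs fail and some accepted results are also late is counted under scenario 3 here.

```python
def classify_cycle(total: int, at_passed: int, in_time: int,
                   best_reliability: float, srl: float) -> int:
    """Map one cycle onto scenarios 1 to 5.

    total            number of virtual machines
    at_passed        nodes whose result passed the acceptance test
    in_time          nodes that passed the AT and met the time limit
    best_reliability highest reliability among the in-time nodes
    """
    if in_time == 0:
        return 4  # no usable output: TC requests backward recovery
    if best_reliability < srl:
        return 5  # output exists but the DM rejects it
    if at_passed < total:
        return 3  # some ATs failed, forward recovery with the rest
    if in_time < total:
        return 2  # all ATs passed, some results were late
    return 1      # failure-free cycle


print(classify_cycle(total=3, at_passed=3, in_time=2, best_reliability=1.0, srl=0.8))
```

Scenarios 1 to 3 end with a selected output (forward recovery); scenarios 4 and 5 are the two points at which backward recovery is signalled.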
EXPERIMENTS & RESULTS
Our experiment implements the proposed model as follows.
Virtual Machine 1:
• Algorithm ‘X1’ implementation
• Acceptance Test-I
Virtual Machine 2:
• Algorithm ‘X2’ implementation
• Acceptance Test-II
Virtual Machine 3:
• Algorithm ‘X3’ implementation
• Acceptance Test-III
Adjudication Node:
There is one adjudicator node, which contains the following modules:
• Input Buffer
• Time Checker
• Reliability Assessor
• Decision Mechanism
• Recovery Cache
RESEARCH ISSUES
• The values of the environment variables are as follows:
reliability = 1, n = 1, RF = 0.2, SRL = 0.8, maxReliability = 1.2,
minReliability = 0.7
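With these values, the change in reliability of a single node can be traced by hand or with a few lines of Python. The sketch below applies the reliability assessment rules to a sequence of Pass/Fail outcomes; the function name `trace` and the rounding to three decimals are assumptions made for the example.

```python
from typing import List

RF = 0.2
MAX_RELIABILITY = 1.2
MIN_RELIABILITY = 0.7


def trace(outcomes: List[bool]) -> List[float]:
    """Return the node's reliability after each cycle, stopping if it dies."""
    reliability, n = 1.0, 1
    history = []
    for passed in outcomes:
        if passed:
            reliability += reliability * RF
            if n > 1:
                n -= 1
        else:
            reliability -= reliability * RF * n
            n += 1
        reliability = min(reliability, MAX_RELIABILITY)
        history.append(round(reliability, 3))
        if reliability < MIN_RELIABILITY:
            break  # node is dead and would be replaced
    return history


print(trace([True, False, False]))  # [1.2, 0.96, 0.576]
print(trace([False, False]))        # [0.8, 0.48]
```

One pass takes the node to the cap of 1.2. A first failure costs 20% (0.96), and a second consecutive failure costs 40% (0.576), which is below minReliability, so the node would be removed. A fresh node that fails twice in a row drops to 0.8 and then 0.48, so it is also removed after two consecutive failures.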
DISCUSSION & CONCLUSION
The proposed scheme is a good option for fault tolerance in real-time computing on cloud
infrastructure. It has all the advantages of a forward recovery mechanism, and its reliability
configuration is dynamic.
The scheme is highly fault tolerant. Reliability is made adaptive so that the scheme
can take advantage of the dynamic scalability of cloud infrastructure.
The system takes full advantage of diverse software. Our experiment used three virtual
machines, all running in parallel. The scheme bases fault tolerance on the reliability of the
algorithm on each VM, and the decision mechanism converges towards the result of the
algorithm with the highest reliability.
The probability of failure is very low in our scheme. It relies on forward recovery until all
the nodes fail to produce a result, and it assures reliability by providing backward recovery at
two points. The first backward recovery point is the TC: if all the nodes fail to produce a
result, it performs backward recovery. The second backward recovery point is the DM: it
performs backward recovery if the node with the best reliability does not achieve the SRL.
Another big advantage of this scheme is that it does not suffer from the domino effect, because
checkpointing is done at the end of a cycle, when all the nodes have produced their results.
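The checkpointing idea can be illustrated with a small recovery cache that is committed only at the end of a successful cycle and restored on backward recovery. This is a sketch under assumptions: the class `RecoveryCache`, the function `run_cycles` and the dictionary-shaped state are invented for the example, and a failed cycle is represented by an output of `None`.

```python
import copy
from typing import Any, Dict, List, Optional


class RecoveryCache:
    """Holds the last checkpoint taken at the end of a successful cycle."""

    def __init__(self, initial_state: Dict[str, Any]) -> None:
        self._checkpoint = copy.deepcopy(initial_state)

    def commit(self, state: Dict[str, Any]) -> None:
        self._checkpoint = copy.deepcopy(state)

    def rollback(self) -> Dict[str, Any]:
        return copy.deepcopy(self._checkpoint)


def run_cycles(cache: RecoveryCache, state: Dict[str, Any],
               cycle_outputs: List[Optional[float]]) -> Dict[str, Any]:
    for output in cycle_outputs:
        state["cycle"] += 1            # tentative work for this cycle
        state["last_output"] = output
        if output is None:             # TC or DM signalled a failure
            state = cache.rollback()   # backward recovery
        else:
            cache.commit(state)        # checkpoint at the end of the cycle
    return state


cache = RecoveryCache({"cycle": 0, "last_output": None})
final_state = run_cycles(cache, cache.rollback(), [10.0, None, 12.0])
print(final_state)
```

Because there is a single checkpoint per completed cycle, a rollback never has to cascade through earlier checkpoints of other nodes.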
FUTURE WORK
• We are working on enhancements to this scheme to make the system more
fault tolerant. The major focus in future will be on using the reliability
factor for more effective decision making.
• We are also working on a module named Resource Awareness Module
(RAM). It is intended to help the cloud scheduler make scheduling decisions
on the basis of certain network and infrastructure characteristics.
• This fault tolerance mechanism is going to be a part of RAM. Initially, RAM
is targeted for integration with the ProActive scheduler. After this
integration, the ProActive scheduler will schedule work on a virtual machine
node on the basis of node reliability.
References & useful links
• 1. Anjali D. Meshram, A. S. Sambare and S. D. Zade, “Fault Tolerance Model in
Cloud Computing”, Foundation of Computer Science FCS, New York, USA
• 2. S. Malik, M. J. Rehman, “Time Stamped Fault Tolerance in Distributed Real Time
Systems”, IEEE International Multitopic Conference, Karachi, Pakistan, 2005
• 3. X. Kong, J. Huang, C. Lin, P. D. Ungsunan, “Performance, Fault-tolerance and
Scalability Analysis of Virtual Infrastructure Management System”
• 4. J. Barr, A. Narin, “Building Fault-Tolerant applications on AWS”, Amazon Web
Services, http://media.amazonwebservices.com/
Questions and Answers
Thanks