CENG 5334 - Fault Tolerant Computing
Fall 2012

Fatih Karabacak
   What is Cloud Computing?
   Reliability of Cloud Service.
   A Fault Tolerance Framework in Cloud Computing.
FT Architecture For Cloud Service Computing
“Cloud computing is Web-based processing, whereby shared resources, software, and information are provided to computers and other devices (such as smart phones) on demand over the Internet.”

Common implies multi-tenancy, not single or isolated tenancy
Location-independent
Online
Utility implies pay-for-use pricing
Demand implies ~infinite, ~immediate, ~invisible scalability
Cloud Service System
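A cloud service system of this kind is usually described as being coordinated by a management layer that queues incoming job requests, splits each request into subtasks, and assigns the subtasks to computing resources. The following is a minimal sketch of that idea only. The class name, the resource names, and the round-robin assignment are assumptions made for illustration and are not taken from the slides.

```python
from collections import deque
from itertools import cycle
from typing import Deque, Dict, List


class CloudManagementSystem:
    """Toy manager: queues requests and assigns their subtasks to resources."""

    def __init__(self, computing_resources: List[str]) -> None:
        if not computing_resources:
            raise ValueError("at least one computing resource is required")
        self.computing_resources = computing_resources
        # Each queued request is represented as a list of subtask names.
        self.request_queue: Deque[List[str]] = deque()

    def submit(self, subtasks: List[str]) -> None:
        """Place a job request (already divided into subtasks) on the queue."""
        self.request_queue.append(subtasks)

    def schedule_next(self) -> Dict[str, str]:
        """Take the oldest request and assign its subtasks round-robin.

        Round-robin is an assumption; a real scheduler would also account
        for load and for the data resources each subtask needs.
        """
        subtasks = self.request_queue.popleft()
        resources = cycle(self.computing_resources)
        return {subtask: next(resources) for subtask in subtasks}
```

Calling `submit(["a", "b", "c"])` on a manager built with two resources and then `schedule_next()` spreads the three subtasks across both resources in turn.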
   Overflow
   Timeout
   Data resource missing
   Computing resource missing
   Software failure
   Database failure
   Hardware failure
   Network failure
   Request Stage Failures: Overflow and Timeout.
   Execution Stage Failures: Data resource missing, Computing resource missing, Software failure, Database failure, Hardware failure, and Network failure.
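The two-stage grouping above can be written down as a simple lookup table. This is only an illustrative sketch. The `Stage` enum and the `classify` helper are hypothetical names, and the mapping itself is copied from the slide.

```python
from enum import Enum


class Stage(Enum):
    REQUEST = "request"
    EXECUTION = "execution"


# Mapping copied from the slide: two request-stage failures and
# six execution-stage failures.
FAILURE_STAGE = {
    "overflow": Stage.REQUEST,
    "timeout": Stage.REQUEST,
    "data resource missing": Stage.EXECUTION,
    "computing resource missing": Stage.EXECUTION,
    "software failure": Stage.EXECUTION,
    "database failure": Stage.EXECUTION,
    "hardware failure": Stage.EXECUTION,
    "network failure": Stage.EXECUTION,
}


def classify(failure: str) -> Stage:
    """Return the stage in which the named failure occurs.

    Raises KeyError for a failure type that is not on the slide.
    """
    return FAILURE_STAGE[failure.strip().lower()]
```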
   The checkpointing overhead created by proactive and reactive fault tolerance (FT) should be minimized.
   A good fault tolerance framework should be transparent; it should not require source code or application modifications.
   It should use fault prediction mechanisms to determine when to checkpoint.
   It should use failure detection mechanisms to determine when to recover the application from a failure.
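One way to picture the last two requirements is as a small decision rule: a predicted fault triggers a proactive checkpoint, and a detected failure triggers a reactive recovery. The sketch below assumes the predictor outputs a failure probability and the detector outputs an alive flag. The `NodeStatus` type, the `choose_action` function, and the 0.7 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    failure_probability: float  # assumed fault-predictor output, in [0, 1]
    alive: bool                 # assumed failure-detector output


def choose_action(status: NodeStatus, threshold: float = 0.7) -> str:
    """Pick a fault tolerance action for one node.

    The threshold value is an illustrative assumption, not a value
    from the slides.
    """
    if not status.alive:
        # Reactive path: a failure was detected, so recover the application.
        return "recover"
    if status.failure_probability >= threshold:
        # Proactive path: a fault is predicted, so checkpoint now.
        return "checkpoint"
    # No predicted fault and no failure: avoid checkpoint overhead.
    return "continue"
```

Checkpointing only when the predictor raises an alarm, rather than at a fixed interval, is what keeps the overhead named in the first requirement low.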
1)   Fault predictor
2)   PLR (Process-Level Redundancy) Controller Daemon
3)   Fault Tolerant Policy (Proactive-Reactive)
4)   Fault Tolerance Daemon Protocol
5)   Checkpoint/Restart Module
Process-Level Redundancy (PLR): a redundancy technique which uses the software-centric model of transient fault detection.
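The core idea of process-level redundancy, running the same computation redundantly and comparing the outputs to expose a transient fault, can be sketched as a majority vote. To keep the example short and deterministic, the replicas below are plain callables executed one after another inside a single process, standing in for separate operating-system processes. That simplification and the `run_redundant` name are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def run_redundant(replicas: Sequence[Callable[[], Hashable]]) -> Tuple[Hashable, bool]:
    """Run every replica and compare their outputs.

    Returns (majority_output, fault_detected). A disagreement among the
    replicas is treated as evidence of a transient fault in the minority.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    outputs = [replica() for replica in replicas]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        # No strict majority, so the correct output cannot be chosen.
        raise RuntimeError("no majority among replica outputs")
    fault_detected = votes < len(outputs)
    return winner, fault_detected
```

With three replicas, one corrupted output is both detected and masked by the vote; with only two replicas, a disagreement can be detected but not resolved, which is why the function raises in that case.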
Resources
http://en.wikipedia.org/wiki/Cloud_computing
http://www.focus.com/articles/hosting-bandwidth/top-10-cloud-computing-trends/
Y. Dai, B. Yang, J. Dongarra, G. Zhang, “Cloud Service Reliability: Modeling and Analysis”
I. Egwutuoha, S. Chen, D. Levy, B. Selic, “A Fault Tolerance Framework for High Performance Computing in Cloud”

Thank you


Editor's Notes

  • #4: The cloud symbol was used to mark the demarcation point between the responsibility areas of the customer and the provider.
  • #5: Some characteristics of cloud computing can be summarized simply. Their first letters spell the word CLOUD, which makes them easy to remember: Common, Location-independent, Online, Utility, and Demand. Cloud computing has been called the fifth utility (after electric power, gas, water, and telephony), and it could change the way individuals and companies operate.
  • #7: The cloud management system (CMS) mainly fulfills four different functions, as shown in Fig. 1: 1) to manage a request queue that receives job requests from different users for cloud services; 2) to manage computing resources (such as PCs, clusters, supercomputers, etc.) all over the Internet; 3) to manage data resources (such as databases, publicized information, URL contents, etc.) all over the Internet; and 4) to schedule a request, divide it into different subtasks, and assign the subtasks to different computing resources that may access different data resources over the Internet.
  • #8: A model for cloud computing reliability has to consider all of these failure types, which would be very complicated. Moreover, these different types of failures are actually correlated with one another (i.e., not independent) in a cloud service. This is another reason why a cloud reliability model cannot simply reuse any single existing model from an individual topic (such as software reliability, hardware reliability, or network reliability). Given such correlations, a new holistic model has to be developed for cloud reliability.
  • #12: The framework consists of five modules: (1) fault predictor, (2) PLR controller daemon, (3) fault tolerance policy, (4) fault tolerance daemon protocol, and (5) checkpoint/restart module. The fault predictor (FP) runs on each compute node and filters local information to predict failures based on system data.
  • #13: Fault definitions. Transient: a fault resulting from temporary environmental conditions (a soft fault). Permanent: a failure or fault that is continuous and stable (a hard fault); in hardware, a permanent fault is an irreversible physical change that persists until repaired. Intermittent: a fault that is only occasionally present due to unstable hardware or varying hardware or software states, e.g., as a function of load or activity.
  • #15: Step 1: The fault predictor module predicts a future fault and sends an alarm to the PLR controller daemon.
    Step 2: The PLR controller daemon monitors the VMs and the MPI (Message Passing Interface) applications. It has visibility of the HPC applications running on the VMs and of the virtualized environments. The PLR controller daemon ensures that redundant nodes are available for live migration and checkpointing; if they are not available, it provisions redundant nodes. It also initiates and carries out live migrations of VMs to redundant nodes, and it initiates checkpointing of the MPI applications after migration.
    Step 3: The fault tolerance daemon protocol notifies the PLR controller daemon when a failure occurs, using "I am alive" messages, and continues monitoring the communications of the MPI applications.
    Step 4: After live migration of the VMs to the redundant nodes, checkpointing is initiated with the BLCR checkpointing library, which is available in the Open MPI implementation [9]. The checkpoint files are saved on the network and on a neighboring node, both for easy recovery and to eliminate a single point of failure.
    Step 5: After checkpointing, the resources that were used for migration are freed, because cloud resources can be relinquished at will.
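
The "I am alive" messages in Step 3 describe a heartbeat-style failure detector: a node is presumed failed once its messages stop arriving. The following is a minimal sketch of that mechanism, assuming a fixed timeout and timestamps passed in by the caller. The `HeartbeatMonitor` class, its method names, and the node names are hypothetical and do not come from the framework described in the notes.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last "I am alive" message received from each node."""

    def __init__(self, timeout_s: float) -> None:
        # A node is presumed failed after this many seconds of silence.
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def i_am_alive(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` received at time `now` (seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        """Return, in sorted order, the nodes silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test. A controller could call `failed_nodes` periodically and start recovery from the latest checkpoint for every node it returns.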