Exploration of Fault Identification and Automatic Recovery in Cloud-based FPGA Systems

© LY Corporation
Satoshi Konno 1) 2) and Inoguchi Yasushi 1)
ICCE 2024
5-8 JANUARY 2024 | LAS VEGAS, NV, USA
1) Japan Advanced Institute of Science and Technology (JAIST)
2) LY Corporation

Presentation Outline
01 Introduction
02 Related Works
03 Challenges in Cloud-Based FPGA Systems
04 Proposal for Autonomous Monitoring and Recovery
05 Application of Solutions in Cloud-Based FPGA Systems
06 Conclusion and Future Plans

Introduction

Introduction to Cloud-based FPGA Systems
Exploring Challenges and Solutions
• Growing Use of FPGAs in Cloud Environments: FPGAs are increasingly utilized for data center hardware acceleration, offering flexibility and efficiency in cloud computing.
• Key Challenges: Resource management, scalability, security, and radiation-induced soft errors pose significant challenges in cloud-based FPGA systems.
• Need for Automatic Fault Detection and Recovery: Automatic fault detection and recovery methods are essential to maintaining system reliability and availability.

Related Works

Challenges in Cloud-Based FPGA Systems
Identifying Key Issues
● Diverse Application Requirements [5][11]: Varied applications in cloud-based FPGA systems, from simple calculations to intensive computations, complicate management and optimization.
● Complexity in Standardization [5][11]: Error information and collection methods are difficult to standardize across different FPGA vendors, which adds to operational complexity.
● Radiation Risks in High-Density FPGA Clouds [11][12]: Hundreds of thousands of FPGAs employed in cloud computing are susceptible to radiation-induced errors, posing a significant risk in large-scale deployments.
[5] Bobda, Christophe, et al. "The future of FPGA acceleration in datacenters and the cloud." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15.3 (2022): 1-42.
[11] Keller, Andrew M., and Michael J. Wirthlin. "Impact of soft errors on large-scale FPGA cloud computing." Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019.
[12] Keller, Andrew M., and Michael J. Wirthlin. "The Impact of Terrestrial Radiation on FPGAs in Data Centers." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15.2 (2021): 1-21.

Cloud Operation Metrics and Monitoring Systems
Monitoring Indicators and Trends
• Trends in Cloud Monitoring Systems [4][9][21]:
  • Need for rapid processing of large volumes of metric data in modern cloud environments.
  • Transition from single-node, storage-based time-series databases to in-memory distributed databases.
[4] Betsy Beyer. Site reliability engineering: How Google runs production systems. O’Reilly Media, Sebastopol, CA, 2016.
[9] Brian Harrington and Roy Rapoport. Introducing atlas: Netflix’s primary telemetry platform. Netflix Technology Blog, 2014.
[21] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.

Proposed Autonomous Monitoring System for Cloud Systems [15][16]
• Autonomous Distributed Monitoring: The Foreman system is proposed as an autonomous distributed monitoring system.
• Symmetric Distributed System: The system uses a symmetric distributed architecture to prevent any single point of failure.
• In-Memory Time-Series Database: The metrics database is implemented as a cyclical in-memory time-series database, in line with the trend in modern cloud monitoring systems.
[15] Satoshi Konno and Défago Xavier. Approximate QoS rule derivation based on root cause analysis for cloud computing. In 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC), pages 33–3309. IEEE, 2019.
[16] Satoshi Konno, Défago Xavier, Tomita Takashi, and Inoguchi Yasushi. Inference QoS rule derivation based on time series root cause analysis. Digital Practice of Information Processing Society of Japan, 2(3):11–26, 2021.
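As an illustration of what a cyclical in-memory time-series store can look like, here is a minimal sketch. It is not Foreman's implementation: the class name `CyclicSeries`, its methods, and the sample data are all hypothetical, and the fixed capacity stands in for whatever retention policy the real system uses.

```python
from collections import deque
from typing import Deque, List, Tuple


class CyclicSeries:
    """Fixed-capacity, in-memory series of (timestamp, value) samples.

    Hypothetical illustration only; not taken from the Foreman system.
    """

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest sample once the buffer is full,
        # which gives the cyclical (ring buffer) behavior.
        self._samples: Deque[Tuple[float, float]] = deque(maxlen=capacity)

    def append(self, timestamp: float, value: float) -> None:
        self._samples.append((timestamp, value))

    def query(self, start: float, end: float) -> List[Tuple[float, float]]:
        """Return samples with start <= timestamp <= end, oldest first."""
        return [(t, v) for (t, v) in self._samples if start <= t <= end]


# Four samples written into a buffer of three: the first one is overwritten.
series = CyclicSeries(capacity=3)
for t, temperature in enumerate([40.0, 41.5, 43.0, 47.5]):
    series.append(float(t), temperature)
```

Memory use stays bounded regardless of how long the monitor runs, at the cost of losing samples older than the buffer capacity.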

Challenges in Cloud-Based FPGA Systems

Vulnerability to Radiation-Induced Errors
• Difficulty in Detecting Silent Data Corruption (SDC) [11][12]:
  • FPGAs are susceptible to radiation-induced soft errors and SDC, which pose a risk to system integrity.
• Significant Vulnerability in Large-Scale FPGA Deployments [11][12]:
  • In a hypothetical scenario with 100,000 FPGAs in Denver, Colorado, Single Event Upsets (SEUs) are expected every half-hour, and SDC failures approximately every 0.5–11 days.
[11] Keller, Andrew M., and Michael J. Wirthlin. "Impact of soft errors on large-scale FPGA cloud computing." Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019.
[12] Keller, Andrew M., and Michael J. Wirthlin. "The Impact of Terrestrial Radiation on FPGAs in Data Centers." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15.2 (2021): 1-21.
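To put the fleet-level figures in per-device terms, the short calculation below converts them. It uses only the numbers quoted on the slide and assumes, purely for illustration, that upsets are spread evenly over identical devices; the variable names are hypothetical.

```python
FLEET_SIZE = 100_000             # FPGAs in the hypothetical deployment
SEU_INTERVAL_HOURS = 0.5         # fleet-wide: one SEU every half-hour
SDC_INTERVAL_DAYS = (0.5, 11.0)  # fleet-wide: one SDC failure every 0.5 to 11 days

HOURS_PER_YEAR = 24 * 365

# Fleet-wide SEU count per day.
fleet_seus_per_day = 24 / SEU_INTERVAL_HOURS

# Assumption: upsets are spread evenly over identical devices, so the mean
# time between SEUs on one device is the fleet interval times the fleet size.
device_hours_per_seu = FLEET_SIZE * SEU_INTERVAL_HOURS
device_years_per_seu = device_hours_per_seu / HOURS_PER_YEAR

# Fleet-wide SDC failures per year, as a (low, high) range.
fleet_sdc_per_year = tuple(365 / days for days in reversed(SDC_INTERVAL_DAYS))

print(f"Fleet SEUs per day: {fleet_seus_per_day:.0f}")
print(f"Years between SEUs on one device: {device_years_per_seu:.1f}")
print(f"Fleet SDC failures per year: {fleet_sdc_per_year[0]:.0f} to {fleet_sdc_per_year[1]:.0f}")
```

Under these assumptions a single device sees an upset only about once every 5.7 years, while the fleet sees 48 per day, which is why the problem becomes visible mainly at cloud scale.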

Proposal for Autonomous Monitoring and Recovery

Proposal for Autonomous Monitoring and Recovery
Innovative Approach for Cloud-based FPGA Systems
• Event-Driven Monitoring Rule Inference Method: Utilizes case-based reasoning and correlation analysis to detect and address failures, enhancing system resilience.
• Enhancing Reliability and Availability: Aims to improve the reliability and availability of cloud services by proactively identifying and recovering from failures.
• Autonomous Recovery in Cloud Environments: Focuses on developing systems that can autonomously recover from failures, minimizing downtime and service disruptions.
[15] Satoshi Konno and Défago Xavier. Approximate QoS rule derivation based on root cause analysis for cloud computing. In 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC), pages 33–3309. IEEE, 2019.

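Event-Driven QoS Management in Cloud Computing
Four Steps of μQoS Framework
Figure legend: mx: unsatisfied metrics; ma: candidate root cause metrics for mx; action01: monitoring action related to ma.
[15] Satoshi Konno and Défago Xavier. Approximate QoS rule derivation based on root cause analysis for cloud computing. In 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC), pages 33–3309. IEEE, 2019.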

Application of Solutions in Cloud-Based FPGA Systems

Assumptions for FPGA System Monitoring
This proposal is based on two key assumptions:
● Detectability: Failures in FPGA computation results within cloud environments can be identified using periodic homeostasis monitoring.
● Correlation: Homeostasis failures are observable through metrics collected from the FPGA system and its node environment.
Figure: metrics (Temperature, ..., Status, Error) through which homeostasis failures are observable.
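The slides do not specify how a homeostasis check is performed. One plausible reading, shown here only as a sketch, is a periodic known-answer test: run a fixed probe input through the FPGA and compare the output with a precomputed reference. The function `check_homeostasis`, the `run_on_fpga` callable, and both stand-in kernels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class HomeostasisResult:
    healthy: bool
    mismatches: int


def check_homeostasis(
    run_on_fpga: Callable[[Sequence[int]], Sequence[int]],
    probe_input: Sequence[int],
    golden_output: Sequence[int],
) -> HomeostasisResult:
    """Run a fixed probe and count positions that differ from the reference.

    Assumption: the accelerated function is deterministic, so any difference
    from the precomputed reference indicates a computation failure.
    """
    actual = run_on_fpga(probe_input)
    mismatches = sum(1 for a, g in zip(actual, golden_output) if a != g)
    # A length difference also counts as corrupted output.
    mismatches += abs(len(actual) - len(golden_output))
    return HomeostasisResult(healthy=(mismatches == 0), mismatches=mismatches)


def healthy_kernel(xs: Sequence[int]) -> List[int]:
    # Software stand-in for an FPGA kernel that doubles each input.
    return [x * 2 for x in xs]


def corrupted_kernel(xs: Sequence[int]) -> List[int]:
    # Same kernel with one flipped bit in the second result,
    # imitating a silent data corruption.
    out = [x * 2 for x in xs]
    out[1] ^= 0x4
    return out


PROBE = [1, 2, 3]
GOLDEN = [2, 4, 6]
ok = check_homeostasis(healthy_kernel, PROBE, GOLDEN)
bad = check_homeostasis(corrupted_kernel, PROBE, GOLDEN)
```

In a real deployment the check would be scheduled periodically, and its pass or fail result would be recorded as a metric alongside temperature, status, and error counters.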

Application to Cloud-based FPGA Systems
Introducing Homeostasis Monitoring for Cloud-based FPGA Systems
● Deriving Preventive Rules for Radiation-Induced Errors: Implement homeostasis monitoring; when an anomaly occurs, analyze its correlation with existing monitoring rules and derive preventive monitoring rules.
Figure legend: mf: homeostasis metrics; Rule01: existing monitoring rule; Rule02: homeostasis monitoring rule; mx: correlation metrics for mf; Rule03: derived new monitoring rule.
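The correlation step can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: it uses plain Pearson correlation between a 0/1 homeostasis failure series (mf) and each candidate metric, picks the strongest candidate as mx, and proposes a threshold from the values seen during failures. The function names, the 0.8 cutoff, and the sample data are assumptions.

```python
from math import sqrt
from typing import Dict, Optional, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def derive_rule(
    mf: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    min_abs_corr: float = 0.8,  # assumed cutoff, not from the source
) -> Optional[Tuple[str, str, float]]:
    """Return (metric, comparator, threshold) for the best-correlated metric."""
    best_name, best_corr = None, 0.0
    for name, series in candidates.items():
        corr = pearson(series, mf)
        if abs(corr) > abs(best_corr):
            best_name, best_corr = name, corr
    if best_name is None or abs(best_corr) < min_abs_corr:
        return None
    # Values of the chosen metric in the intervals where homeostasis failed.
    failing = [x for x, f in zip(candidates[best_name], mf) if f > 0]
    if best_corr > 0:
        return (best_name, ">=", min(failing))
    return (best_name, "<=", max(failing))


# Hypothetical samples: failures appear only in the two hottest intervals.
mf = [0, 0, 0, 1, 1]
candidates = {
    "temperature_c": [50.0, 52.0, 55.0, 78.0, 81.0],
    "fan_rpm": [3000.0, 3010.0, 2990.0, 3005.0, 2995.0],
}
rule03 = derive_rule(mf, candidates)
```

With this data the temperature series correlates strongly with the failures and the fan speed does not, so the derived rule fires when the temperature reaches the lowest value observed during a failure.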

Conclusion

Summarizing Key Findings and Next Steps
● Introducing Homeostasis Monitoring: To cost-effectively enhance the reliability of FPGA systems, we introduced homeostasis monitoring to address radiation-induced errors.
● Deriving Preventive Rules for Radiation-Induced Errors: We leverage homeostasis monitoring and its correlations with existing rules to derive new preventive monitoring rules.
● Need for Standardization and Further Evaluation: Error information needs to be standardized, and further evaluation is needed to improve FPGA management.

Related Works

Cloud-based FPGA Systems Issues
• Standardization Issues [5][11]:
  • Non-standardized application interfaces within FPGAs complicate effective operation and management in cloud computing.
  • Error logs and hardware error information are inconsistent across different FPGA vendors.
[5] Bobda, Christophe, et al. "The future of FPGA acceleration in datacenters and the cloud." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15.3 (2022): 1-42.
[11] Keller, Andrew M., and Michael J. Wirthlin. "Impact of soft errors on large-scale FPGA cloud computing." Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019.

Monitoring Indicators and Trends
• Important Monitoring Indicators in Cloud Computing [4]:
  • Service Level Indicators (SLIs) define quantitative measures of service levels.
  • Service Level Objectives (SLOs) combine SLI metrics with target values or acceptable ranges.
• Trends in Cloud Monitoring Systems [4][9][21]:
  • Need for rapid processing of large volumes of metric data in modern cloud environments.
  • Transition from single-node, storage-based time-series databases to in-memory distributed databases.
[4] Betsy Beyer. Site reliability engineering: How Google runs production systems. O’Reilly Media, Sebastopol, CA, 2016.
[9] Brian Harrington and Roy Rapoport. Introducing atlas: Netflix’s primary telemetry platform. Netflix Technology Blog, 2014.
[21] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
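The relationship between an SLI and an SLO can be shown with a small sketch. The `SLO` class, the `success_ratio` helper, the metric name, and the 99.9% target are hypothetical examples, not values from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An SLI name combined with its acceptable range."""
    sli_name: str
    lower: float
    upper: float

    def is_met(self, sli_value: float) -> bool:
        return self.lower <= sli_value <= self.upper


def success_ratio(successes: int, total: int) -> float:
    """SLI: fraction of successful requests.

    Assumption: an interval with no requests counts as fully successful.
    """
    return successes / total if total else 1.0


# Hypothetical objective: at least 99.9% of FPGA jobs succeed.
slo = SLO(sli_name="fpga_job_success_ratio", lower=0.999, upper=1.0)

good_interval = success_ratio(9_995, 10_000)  # 0.9995
bad_interval = success_ratio(9_900, 10_000)   # 0.99
```

An SLO violation such as `bad_interval` is the kind of event that an event-driven framework can use as the trigger for root cause analysis.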

Event-Driven SLO Management in Cloud Computing
A New Approach for Autonomous Recovery and SLO Guarantee
• Overview of μQoS:
  • A novel event-driven approach addressing QoS management issues.
  • Focuses on autonomous recovery based on root cause analysis.
• Key Objectives of μQoS:
  • Accurate and automatic identification of failure root causes from vast numbers of metrics.
  • Establishment of a QoS guarantee framework grounded in root cause analysis.
[15] Satoshi Konno and Défago Xavier. Approximate QoS rule derivation based on root cause analysis for cloud computing. In 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC), pages 33–3309. IEEE, 2019.

Implementing the Proposed Framework
● Metrics Collection and Initial Rule Setting: Gathers relevant metrics for FPGA systems and establishes initial monitoring rules for effective system management.
● Homeostasis Monitoring: Detects infrequent FPGA errors and informs recovery rule updates, enhancing system reliability cost-effectively.
● Preventive Monitoring and Rule Generation: Derives preventive monitoring strategies and generates rules dynamically to anticipate and mitigate potential system failures.
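One way to picture initial rule setting and later rule generation is to treat monitoring rules as data that can be extended at run time. The sketch below is an assumption about representation, not the framework's actual design; the `Rule` fields, metric names, thresholds, and action names are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    action: str  # hypothetical recovery or prevention action identifier


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Return the actions of every rule whose condition holds for the sample."""
    return [
        rule.action
        for rule in rules
        if rule.metric in sample and rule.compare(sample[rule.metric], rule.threshold)
    ]


# Initial rule set: react once the FPGA reports an error.
rules = [Rule("Rule01", "fpga_error_count", operator.gt, 0.0, "reload_bitstream")]

# A preventive rule generated later is simply appended to the same list.
rules.append(Rule("Rule03", "temperature_c", operator.ge, 78.0, "migrate_workload"))

actions = evaluate(rules, {"fpga_error_count": 0.0, "temperature_c": 80.0})
```

Here no error has been reported yet, but the generated rule already triggers a preventive action, which is the intended benefit of deriving rules from correlated metrics.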


Summarizing Key Findings and Next Steps
● Importance of Reliable FPGA Systems in Cloud: Reliable and autonomous FPGA systems are critical in cloud environments for enhanced service availability.
● QoS Assurance and Automated Recovery: QoS assurance is significant, and effective automated failure recovery mechanisms are needed.
● Need for Standardization and Further Research: Error information needs to be standardized, and further research is needed to improve FPGA management.
