© LY Corporation
Exploration of Fault Identification and
Automatic Recovery in Cloud-based
FPGA Systems
Satoshi Konno 1) 2) and Inoguchi Yasushi 1)
ICCE 2024
5-8 JANUARY 2024 | LAS VEGAS, NV, USA
1) Japan Advanced Institute of Science and Technology (JAIST)
2) LY Corporation
Presentation Outline
01 Introduction
02 Related Works
03 Challenges in Cloud-Based FPGA Systems
04 Proposal for Autonomous Monitoring and Recovery
05 Application of Solutions in Cloud-Based FPGA Systems
06 Conclusion and Future Plans
Introduction
Introduction to Cloud-based FPGA Systems
Exploring Challenges and Solutions
• Growing Use of FPGAs in Cloud Environments: FPGAs are
increasingly utilized for data center hardware acceleration,
offering flexibility and efficiency in cloud computing.
• Key Challenges: Resource management, scalability, security,
and radiation-induced soft errors pose significant
challenges in cloud-based FPGA systems.
• Need for Automatic Fault Detection and Recovery:
Automatic fault detection and recovery methods are essential
to maintaining system reliability and availability.
Related Works
Challenges in Cloud-Based FPGA Systems
Identifying Key Issues
● Diverse Application Requirements [5][11]:
Applications in cloud-based FPGA systems range from simple calculations to
intensive computations, which complicates management and optimization.
● Complexity in Standardization [5][11]:
Error information and collection methods are difficult to standardize across
different FPGA vendors, which adds to operational complexity.
● Radiation Risks in High-Density FPGA Clouds [11][12]:
Hundreds of thousands of FPGAs employed in cloud computing are susceptible
to radiation-induced errors, posing a significant risk in large-scale deployments.
[5] Bobda, Christophe, et al. "The future of FPGA acceleration in datacenters and the cloud." ACM Transactions on Reconfigurable Technology and
Systems (TRETS) 15.3 (2022): 1-42.
[11] Keller, Andrew M., and Michael J. Wirthlin. "Impact of soft errors on large-scale FPGA cloud computing." Proceedings of the 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. 2019.
[12] Keller, Andrew M., and Michael J. Wirthlin. "The Impact of Terrestrial Radiation on FPGAs in Data Centers." ACM Transactions on
Reconfigurable Technology and Systems (TRETS) 15.2 (2021): 1-21.
Cloud Operation Metrics and Monitoring Systems
Monitoring Indicators and Trends
• Trends in Cloud Monitoring Systems [4][9][21]:
• Need for rapid processing of large volumes of metric data in modern
cloud environments.
• Transition from single-node, storage-based time-series databases to
in-memory distributed databases.
[4] Betsy Beyer. Site reliability engineering: How Google runs production systems. O’Reilly Media, Sebastopol, CA, 2016.
[9] Brian Harrington and Roy Rapoport. Introducing atlas: Netflix’s primary telemetry platform. Netflix Technology Blog, 2014.
[21] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan.
Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
Cloud Operation Metrics and Monitoring Systems
Proposed Autonomous Monitoring System for Cloud Systems [15][16]
[15] Satoshi Konno and Défago Xavier. Approximate qos rule derivation based on root cause analysis for cloud computing.
In 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC), pages 33–3309. IEEE, 2019.
[16] Satoshi Konno, Défago Xavier, Tomita Takashi, and Inoguchi Yasushi. Inference qos rule derivation based on time series
root cause analysis. Digital Practice of Information Processing Society of Japan, 2(3):11–26, 2021.
• Autonomous Distributed Monitoring: The
Foreman system is proposed as an
autonomous distributed monitoring system.
• Symmetric Distributed System: The system
uses a symmetric distributed architecture to
avoid any single point of failure.
• In-Memory Time-Series Database: The
metrics database is implemented as a
cyclical in-memory time-series database,
in line with trends in modern cloud
monitoring systems.
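To make the idea of a cyclical in-memory time-series database concrete, here is a minimal sketch of a bounded per-metric buffer that overwrites its oldest samples. It is not Foreman's implementation: the class name `CyclicMetricStore`, its methods, and the metric name `fpga0.temperature` are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class CyclicMetricStore:
    """Keeps only the newest `capacity` samples per metric, in memory."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._series: Dict[str, Deque[Tuple[float, float]]] = {}

    def append(self, metric: str, timestamp: float, value: float) -> None:
        # deque(maxlen=...) drops the oldest sample automatically when full,
        # which gives the cyclical (ring buffer) behaviour.
        series = self._series.setdefault(metric, deque(maxlen=self._capacity))
        series.append((timestamp, value))

    def query(self, metric: str, start: float, end: float) -> List[Tuple[float, float]]:
        # A linear scan is enough for a sketch; the buffer size is bounded.
        return [(t, v) for t, v in self._series.get(metric, ()) if start <= t <= end]


# Hypothetical usage: four samples go into a store that holds three.
store = CyclicMetricStore(capacity=3)
for t, temp in enumerate([61.0, 62.5, 64.0, 70.5]):
    store.append("fpga0.temperature", float(t), temp)
latest = store.query("fpga0.temperature", 0.0, 10.0)
```

Because the buffer is bounded, memory use stays constant regardless of how long the monitor runs, at the cost of losing history older than the window.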
Challenges in Cloud-Based FPGA Systems
Introduction to Cloud-based FPGA Systems
Vulnerability to Radiation-Induced Errors
• Difficulty in detecting Silent Data Corruption (SDC) [11][12]:
• FPGAs are susceptible to radiation-induced soft errors and
SDC, which pose a risk to system integrity.
• Significant vulnerability in large-scale FPGA deployments [11][12]:
• In a hypothetical scenario with 100,000 FPGAs in Denver, Colorado,
Single Event Upsets (SEUs) are expected every half-hour, and SDC failures
approximately every 0.5–11 days.
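The fleet-wide figures in the Denver scenario can be turned into per-device figures with a short back-of-envelope calculation. The sketch below assumes that events are independent and spread evenly across the fleet, so that the per-device interval is simply the fleet interval multiplied by the fleet size. That assumption, and the helper name `per_device_years`, are mine and not taken from [11][12].

```python
FLEET_SIZE = 100_000             # FPGAs in the hypothetical deployment
SEU_INTERVAL_HOURS = 0.5         # fleet-wide: one SEU every half-hour
SDC_INTERVAL_DAYS = (0.5, 11.0)  # fleet-wide: one SDC failure every 0.5 to 11 days
HOURS_PER_YEAR = 24 * 365


def per_device_years(fleet_interval_hours: float, fleet_size: int = FLEET_SIZE) -> float:
    """Mean interval between events on one device, in years.

    Assumes independent events spread evenly over the fleet.
    """
    return fleet_interval_hours * fleet_size / HOURS_PER_YEAR


seu_per_day_fleet = 24 / SEU_INTERVAL_HOURS
seu_years_per_device = per_device_years(SEU_INTERVAL_HOURS)
sdc_years_per_device = tuple(per_device_years(days * 24) for days in SDC_INTERVAL_DAYS)

print(f"Fleet-wide SEUs per day: {seu_per_day_fleet:.0f}")
print(f"Per-device years between SEUs: {seu_years_per_device:.1f}")
print(f"Per-device years between SDC failures: "
      f"{sdc_years_per_device[0]:.0f} to {sdc_years_per_device[1]:.0f}")
```

Under these assumptions an individual FPGA sees an SEU only about once every several years, which is why the problem is easy to overlook on a single device and only becomes operationally visible at fleet scale.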
Proposal for Autonomous Monitoring and Recovery
Proposal for Autonomous Monitoring and Recovery
Innovative Approach for Cloud-based FPGA Systems
• Event-Driven Monitoring Rule Inference Method [15][16]:
Uses case-based reasoning and correlation analysis to detect and
address failures, enhancing system resilience.
• Enhancing Reliability and Availability:
Aims to improve the reliability and availability of cloud services by
proactively identifying and recovering from failures.
• Autonomous Recovery in Cloud Environments:
Focuses on developing systems that can autonomously recover from
failures, minimizing downtime and service disruptions.
Event-Driven QoS Management in Cloud Computing
Four Steps of μQoS Framework [15][16]
[Figure: the four steps of the μQoS framework]
mx: Unsatisfied metrics, ma: Candidate root cause metrics for mx, action01: Monitoring action related to ma
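As an illustration of how candidate root cause metrics (ma) might be selected for an unsatisfied metric (mx), the sketch below ranks other metrics by the absolute Pearson correlation of their time series with mx. This is only one plausible reading of "correlation analysis"; it is not the μQoS algorithm from [15][16], and the function names, metric names, sample values, and the 0.8 cut-off are all assumptions.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(
    mx: Sequence[float],
    others: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cut-off, not from the source
) -> List[Tuple[str, float]]:
    """Return metrics whose |correlation| with mx meets the threshold, strongest first."""
    scored = [(name, pearson(mx, series)) for name, series in others.items()]
    strong = [(name, r) for name, r in scored if abs(r) >= threshold]
    return sorted(strong, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data: mx is a latency metric that violates its target near the end.
mx_series = [10.0, 12.0, 11.0, 30.0, 45.0]
other_metrics = {
    "fpga_temperature": [60.0, 61.0, 60.5, 75.0, 88.0],
    "host_free_disk": [50.0, 52.0, 49.0, 51.0, 50.0],
}
ranked = rank_candidates(mx_series, other_metrics)
```

In this toy data only `fpga_temperature` tracks the latency spike, so it is the only candidate ma returned; a monitoring action such as action01 would then be attached to it.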
Application of Solutions in Cloud-Based FPGA Systems
Application of Solutions in Cloud-Based FPGA Systems
Assumptions for FPGA System Monitoring
This proposal is based on two key assumptions:
● Detectability: Failures in FPGA
computation results within cloud
environments can be identified using
periodic homeostasis monitoring.
● Correlation: Homeostasis failures are
observable through metrics collected from
the FPGA system and its node
environment.
[Figure: FPGA and node metrics (Temperature, ・・・, Status, Error) through which homeostasis failures are observable]
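The Detectability assumption can be pictured as a periodic known-answer check. The slides do not define the mechanism, so the sketch below is an interpretation: it assumes homeostasis monitoring means re-running fixed reference inputs on the accelerator and comparing the results with known-good values. The two accelerator functions are software stand-ins, not a real FPGA API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ReferenceCase:
    """A fixed input with its known-good result."""
    inputs: Sequence[int]
    expected: int


def homeostasis_check(
    run_on_fpga: Callable[[Sequence[int]], int],
    cases: Sequence[ReferenceCase],
) -> int:
    """Return how many reference cases produced a deviating result."""
    return sum(1 for case in cases if run_on_fpga(case.inputs) != case.expected)


def healthy_accelerator(values: Sequence[int]) -> int:
    # Stand-in for an offloaded computation that behaves correctly.
    return sum(values)


def corrupted_accelerator(values: Sequence[int]) -> int:
    # Stand-in for silent data corruption: one bit of the result is flipped.
    return sum(values) ^ 0x4


CASES = [ReferenceCase((1, 2, 3), 6), ReferenceCase((10, 20), 30)]
healthy_failures = homeostasis_check(healthy_accelerator, CASES)
corrupted_failures = homeostasis_check(corrupted_accelerator, CASES)
```

In a deployment, a scheduler would call `homeostasis_check` at a fixed period and publish the failure count as a metric, so that it can be correlated with the other FPGA and node metrics.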
Application to Cloud-based FPGA Systems
Introducing Homeostasis Monitoring for Cloud-based FPGA Systems
● Deriving Preventive Rule for Radiation-Induced Errors
Implement homeostasis monitoring; when it detects an anomaly, analyze the correlation with
existing monitoring rules and derive preventive monitoring rules.
mf: Homeostasis metrics, Rule01: Existing monitoring rule, Rule02: Homeostasis monitoring rule, mx: Correlation metrics for mf, Rule03: Derived new monitoring rule
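One way to picture the derivation of Rule03 is as a threshold learned from how a correlated metric mx behaved when the homeostasis metric mf failed. The slide does not say how the new rule is formed, so the sketch below makes an assumption: if every failing sample of mx lies above every healthy sample, the rule threshold is placed at the midpoint between the two groups. `ThresholdRule` and `derive_preventive_rule` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ThresholdRule:
    """Fires when the metric reaches or exceeds the threshold."""
    metric: str
    threshold: float

    def fires(self, value: float) -> bool:
        return value >= self.threshold


def derive_preventive_rule(
    metric: str,
    samples: List[Tuple[float, bool]],  # (mx value, did the homeostasis check fail?)
) -> Optional[ThresholdRule]:
    """Derive a threshold rule, or None if mx does not separate failures."""
    failing = [value for value, failed in samples if failed]
    healthy = [value for value, failed in samples if not failed]
    if not failing or not healthy:
        return None  # nothing to learn from
    if min(failing) <= max(healthy):
        return None  # the two groups overlap, so a single threshold is unsafe
    # Assumed policy: midpoint between the highest healthy and lowest failing value.
    return ThresholdRule(metric, (max(healthy) + min(failing)) / 2)


# Hypothetical observations of a temperature metric alongside homeostasis results.
observations = [(61.0, False), (63.5, False), (66.0, False), (82.0, True), (88.5, True)]
rule03 = derive_preventive_rule("fpga_temperature", observations)
overlapping = derive_preventive_rule("fpga_temperature", [(70.0, True), (75.0, False)])
```

The derived rule can then be evaluated on the cheap, frequently sampled metric, so that recovery can start before the next homeostasis check would have caught the fault.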
Conclusion
Application of Solutions in Cloud-Based FPGA Systems
Summarizing Key Findings and Next Steps
● Introducing Homeostasis Monitoring:
To cost-effectively enhance the reliability of FPGA systems, we
introduced homeostasis monitoring to address radiation-induced errors.
● Deriving Preventive Rule for Radiation-Induced Errors:
We leverage homeostasis monitoring and correlations with existing rules to
derive new preventive monitoring rules.
● Need for Standardization and Further Evaluation:
Error information needs to be standardized, and further evaluation
in this domain is needed to improve FPGA management.
Thank you
Related Works
Introduction to Cloud-based FPGA Systems
Cloud-based FPGA Systems Issues
• Standardization Issues [5][11]:
• Non-standardized application interfaces within FPGAs
complicate effective operation and management in cloud
computing.
• Error logs and hardware error information are inconsistent across
different FPGA vendors.
Cloud Operation Metrics and Monitoring Systems
Monitoring Indicators and Trends
• Important Monitoring Indicators in Cloud Computing [4]:
• Service Level Indicators (SLIs) define quantitative measures of service levels.
• Service Level Objectives (SLOs) combine SLI metrics with target values or
acceptable ranges.
• Trends in Cloud Monitoring Systems [4][9][21]:
• Need for rapid processing of large volumes of metric data in modern cloud
environments.
• Transition from single-node, storage-based time-series databases to
in-memory distributed databases.
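The relationship between an SLI and an SLO described on this slide can be sketched in a few lines: the SLI is a measured number, and the SLO pairs that SLI with an acceptable range. The `SLO` class, the `availability_sli` helper, and the 99.9% target below are illustrative assumptions, not values from [4] or from the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An SLI name paired with its acceptable range."""
    sli_name: str
    lower: float
    upper: float

    def is_met(self, sli_value: float) -> bool:
        return self.lower <= sli_value <= self.upper


def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total == 0:
        raise ValueError("no requests observed")
    return successful / total


# Hypothetical target: at least 99.9% of requests succeed.
slo = SLO("availability", lower=0.999, upper=1.0)
measured = availability_sli(successful=99_950, total=100_000)
slo_met = slo.is_met(measured)
```

An unsatisfied SLO of this kind is the event that would trigger the root cause analysis in an event-driven monitoring approach.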
Proposal for Autonomous Monitoring and Recovery
Event-Driven SLO Management in Cloud Computing
A New Approach for Autonomous Recovery and SLO Guarantee
• Overview of μQoS [15][16]:
• A novel event-driven approach addressing QoS management issues.
• Focuses on autonomous recovery based on root cause analysis.
• Key Objectives of μQoS:
• Accurate and automatic identification of failure root causes from a vast
number of metrics.
• Establishment of a QoS guarantee framework grounded in root cause
analysis.
Application of Solutions in Cloud-Based FPGA Systems
Application of Solutions in Cloud-Based FPGA Systems
Implementing the Proposed Framework
● Metrics Collection and Initial Rule Setting:
Gathers relevant metrics for FPGA systems and
establishes initial monitoring rules for effective system
management.
● Homeostasis Monitoring:
Detects infrequent FPGA errors and informs recovery rule
updates, enhancing system reliability cost-effectively.
● Preventive Monitoring and Rule Generation:
Derives preventive monitoring strategies and generates rules
dynamically to anticipate and mitigate potential system failures.
Application to Cloud-based FPGA Systems
Introducing Homeostasis Monitoring for Cloud-based FPGA Systems
Conclusion
Application of Solutions in Cloud-Based FPGA Systems
Summarizing Key Findings and Next Steps
● Importance of Reliable FPGA Systems in Cloud:
Reliable and autonomous FPGA systems are critical in cloud
environments for enhanced service availability.
● QoS Assurance and Automated Recovery:
QoS assurance is significant, and effective automated
failure recovery mechanisms are needed.
● Need for Standardization and Further Research:
Error information needs to be standardized, and further
research in this domain is needed to improve FPGA management.
