Engineering Cross-Layer Fault-Tolerance in Many-Core Systems
PhD Student: Rem Gensh
Supervisor: Professor Alexander Romanovsky
Thesis Committee: Professor Alex Yakovlev, Dr Alexei Iliasov
Definitions
• Fault tolerance is a means of achieving dependability: it prevents
system failure in the presence of faults.
• Layer: an abstraction layer or system component.
• A cross-layer approach (or design) is used when distributing a task
across several layers is more efficient than executing it at a single
layer.
• Many-core systems contain tens, hundreds or thousands of cores
(multi-core systems have 2-8 cores).
Introduction
• Systems’ complexity and abstraction
• TCP/IP as a motivating example
• Many-core systems
• Layered fault tolerance
• Cross-layer fault tolerance
Systems’ Complexity and Abstraction
• Abstraction simplifies the understanding of the system structure
• Layers of the computer system
• OSI model
• TCP/IP (Internet protocol suite)
• Object-oriented programming
• Components of the system are considered as black boxes
• Each component should provide a predefined service according to
its interface
TCP/IP as a motivating example
TCP/IP cross-layer fault tolerance
• All layers participate in error detection and error recovery
• Error detection and recovery are performed through the cooperative
activity of several layers
• An error that is not detected at a lower layer can still be detected
and recovered at a higher layer
• Efficiency and flexibility of TCP/IP
Layer        Error detection                Error recovery
Application  Status codes                   Retransmission or custom recovery
Transport    16-bit checksum                TCP: ack., neg. ack., ARQ, seq. number
Internet     16-bit header checksum (IPv4)  Discard corrupted packet
Link         CRC-32                         Discard corrupted packet
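As a rough illustration of the layered checks above, the sketch below runs a link-level CRC-32 and an end-to-end 16-bit one's-complement checksum (the kind TCP and the IPv4 header use, computed as in RFC 1071) over the same payload. The `deliver` function, its return strings and the sample payload are assumptions made for this example; a real TCP checksum also covers a pseudo-header, which is omitted here.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum computed as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def deliver(payload: bytes, frame_crc: int, segment_checksum: int) -> str:
    """Hypothetical receive path: each layer runs its own check in turn."""
    if zlib.crc32(payload) != frame_crc:
        return "link: frame discarded"
    if internet_checksum(payload) != segment_checksum:
        return "transport: segment rejected, retransmission expected"
    return "application: payload delivered"


sent = b"hello world"
checksum = internet_checksum(sent)
corrupted = b"hello worle"  # one corrupted byte

# Corruption on the wire: the link-layer CRC-32 catches it.
on_wire = deliver(corrupted, zlib.crc32(sent), checksum)

# Corruption inside a forwarding node, after the link check and before the
# frame CRC is recomputed for the next hop: only the end-to-end checksum
# at the transport layer catches it.
in_node = deliver(corrupted, zlib.crc32(corrupted), checksum)
```

The second case is the one the slide points at: the lower layer sees a valid frame, and the error is still caught one layer up.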
Many-core systems
• 10, 100 or even 1000 cores
• Heterogeneous architectures
• Redundant cores for ensuring fault tolerance
• Performance, energy efficiency and reliability are very important
factors for many-core systems
Layered fault tolerance
• Faults can occur at different layers of the system stack
• Most errors are handled at the layer where they are detected
• Convenient for the developer
• Convenience takes priority over system efficiency
Layered fault tolerance
• System layers are considered separately
• Unnecessary error corrections are possible
• An upper layer cannot specify the quality of
service it requires from the layer below
• Not optimal in terms of performance and
energy consumption
Cross-Layer Fault Tolerance
• Fault tolerance will be distributed across
the system stack
• Useful information about the system state
will be shared among the layers
• Various application domains
• Upper layers will be able to specify their
current needs and the required service
level
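One way to picture the last two points is the minimal sketch below, in which an upper layer tells the layer beneath it which service level it currently needs, and the lower layer reports errors upward instead of always correcting them. All names here (`ServiceLevel`, `LowerLayer`, `UpperLayer`, `fetch`, `average`) are hypothetical and chosen only for illustration; they do not describe a specific mechanism from the thesis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ServiceLevel(Enum):
    BEST_EFFORT = "best_effort"  # errors may be reported instead of corrected
    RELIABLE = "reliable"        # every detected error must be corrected


@dataclass
class LowerLayer:
    level: ServiceLevel = ServiceLevel.RELIABLE
    corrections: int = 0      # costly recovery actions performed
    reported_errors: int = 0  # errors passed up without correction

    def fetch(self, value: int, corrupted: bool) -> Optional[int]:
        if not corrupted:
            return value
        if self.level is ServiceLevel.RELIABLE:
            self.corrections += 1  # stands in for an expensive correction
            return value
        self.reported_errors += 1  # share the error with the layer above
        return None


@dataclass
class UpperLayer:
    lower: LowerLayer

    def average(self, samples: List[Tuple[int, bool]], level: ServiceLevel) -> float:
        self.lower.level = level  # upper layer states its current need
        values = [self.lower.fetch(v, bad) for v, bad in samples]
        good = [v for v in values if v is not None]
        return sum(good) / len(good)  # tolerate dropped samples


samples = [(10, False), (20, True), (30, False)]
app = UpperLayer(LowerLayer())
strict = app.average(samples, ServiceLevel.RELIABLE)
relaxed = app.average(samples, ServiceLevel.BEST_EFFORT)
```

In the relaxed run the lower layer skips the correction entirely, which is the kind of saving a purely layered design cannot offer because the lower layer never learns that the application can tolerate the loss.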
Cross-layer design for wireless sensor
networks
• A single-layer approach cannot share important information among different
layers
• No layer has complete information, so optimal operation of the
entire network cannot be guaranteed
• A single-layer approach cannot adapt to changes in the environment
L. Carnevali, L. Ridi, E. Vicario, "Stochastic Fault Trees for cross-layer power management of WSN monitoring systems," IEEE Conference on Emerging Technologies & Factory Automation, pp. 1-8, 2009.
P. Rachelin Sujae, M. Vigneshpandi, "A Cross Layer Fault Tolerant Communication Architecture for Wireless Sensor Networks," Middle-East Journal of Scientific Research, pp. 1292-1296, 2014.
Y. Wang, H. Wu, F. Lin, N.F. Tzeng, "Cross-Layer Protocol Design and Optimization for Delay/Fault-Tolerant Mobile Sensor Networks (DFT-MSN’s)," IEEE Journal on Selected Areas in Communications, vol. 26, no. 5, pp. 809-819, 2008.
Challenges
• Investigate the trade-off between reliability, performance and
energy consumption in many-core systems
• Ensure cross-layer fault tolerance for many-core systems
• Demonstrate that applying cross-layer fault tolerance can
improve performance and energy efficiency
Plan
• Implement a case study to gain experience in developing
cross-layer fault tolerance
• Apply Order Graphs to model cross-layer fault tolerance, power
consumption and performance of many-core systems
• Design novel mechanisms, libraries and patterns that will help in
engineering cross-layer fault tolerance of many-core systems
Case study: Car number plate recognition application
• Several character recognition algorithms
• Possibility to specify the operational mode: reliability,
performance, energy efficiency or a trade-off between
these parameters
• Recover from two types of errors:
  • CPU core error
  • Insufficient Quality of Service
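The recovery logic of such a case study might look roughly like the sketch below. Everything in it is assumed for illustration: the three algorithms and their confidence, latency and energy figures are invented, `run_on_core` is a placeholder for real recognition, and a core failure is simulated by listing the core in `failed_cores`.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


class CoreError(Exception):
    """Raised when the (simulated) CPU core fails."""


@dataclass(frozen=True)
class Algorithm:
    name: str
    confidence: float  # assumed recognition confidence in [0, 1]
    latency: float     # assumed relative run time
    energy: float      # assumed relative energy cost


ALGORITHMS = [
    Algorithm("fast", 0.70, 1.0, 1.0),
    Algorithm("balanced", 0.85, 2.0, 2.0),
    Algorithm("accurate", 0.97, 5.0, 4.0),
]


def order_for_mode(mode: str) -> List[Algorithm]:
    """Pick the order in which algorithms are tried for a given mode."""
    if mode == "reliability":
        return sorted(ALGORITHMS, key=lambda a: -a.confidence)
    if mode == "performance":
        return sorted(ALGORITHMS, key=lambda a: a.latency)
    return sorted(ALGORITHMS, key=lambda a: a.energy)  # energy efficiency


def run_on_core(core: int, algorithm: Algorithm, image: str,
                failed_cores: FrozenSet[int]) -> Tuple[str, float]:
    if core in failed_cores:
        raise CoreError(f"core {core} failed")
    return image.upper(), algorithm.confidence  # placeholder recognition


def recognise(image: str, mode: str, cores: List[int], min_confidence: float,
              failed_cores: FrozenSet[int] = frozenset()) -> Tuple[str, str, int]:
    for algorithm in order_for_mode(mode):
        for core in cores:
            try:
                text, confidence = run_on_core(core, algorithm, image, failed_cores)
            except CoreError:
                continue  # CPU core error: retry on the next core
            if confidence >= min_confidence:
                return text, algorithm.name, core
            break  # insufficient QoS: escalate to the next algorithm
    raise RuntimeError("no core and algorithm met the requested quality")


result = recognise("ab12 cde", "energy_efficiency", cores=[0, 1],
                   min_confidence=0.8, failed_cores=frozenset({0}))
```

Here the core error is handled by re-running on another core, while insufficient quality of service is handled one level up by switching algorithm, so the two error types are recovered at different layers.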
Conclusion
• Systems’ complexity and abstraction
• Layered fault tolerance
• Cross-layer fault tolerance
