Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager Oracle RAC and Oracle RAC One Node
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remain at the sole discretion of Oracle.




Agenda
• Oracle Clusterware 11.2.0.1 Processes

• Node Monitoring Basics

• Node Eviction Basics

• Re-bootless Node Fencing (restart)

• Advanced Node Management

• The Corner Cases

• More Information / Q&A
Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – focus!
The processes relevant for node management:
• OHASD
• CSSD (resource: ora.cssd)
• CSSDMONITOR (was: oprocd; resource: ora.cssdmonitor)



 Node Monitoring Basics




Basic Hardware Layout Oracle Clusterware
Node management is hardware independent

[Diagram: cluster nodes running CSSD, connected via public LANs, a private LAN (interconnect), and a SAN network to the Voting Disk]
What does CSSD do?
CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
   – Private Interconnect → Network Heartbeat
   – Voting Disk based communication → Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster)
  based on heartbeat feedback (failures)





Network Heartbeat
Interconnect basics
• Each node in the cluster is “pinged” every second
• Nodes must respond within css_misscount time (default: 30 seconds)
   – Reducing css_misscount is generally not supported


• Network heartbeat failures will lead to node evictions
   – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node
     mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
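The countdown in the log line above follows from css_misscount: once a node has missed heartbeats for the full misscount window, it is removed. A minimal sketch of that timing logic (function names and the 50%/75% warning thresholds are illustrative assumptions, not Oracle's implementation):

```python
CSS_MISSCOUNT = 30  # seconds; default network heartbeat timeout

def removal_in(seconds_since_last_heartbeat, misscount=CSS_MISSCOUNT):
    """Seconds left before the node is removed (0 if already due)."""
    return max(misscount - seconds_since_last_heartbeat, 0.0)

def heartbeat_state(seconds_since_last_heartbeat, misscount=CSS_MISSCOUNT):
    """Classify how far into the timeout window a node is."""
    pct = seconds_since_last_heartbeat / misscount
    if pct >= 1.0:
        return "evict"
    if pct >= 0.75:
        return "fatal"    # cf. '75% heartbeat fatal, removal in N seconds'
    if pct >= 0.5:
        return "warning"
    return "ok"
```

With misscount 30, a node last heard from 23.23 seconds ago is past the 75% mark with roughly 6.77 seconds to removal, matching the log excerpt.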




Disk Heartbeat
Voting Disk basics – Part 1
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• Nodes must receive a response within (long / short) diskTimeout time
   – I/O errors indicate clear accessibility problems → timeout is irrelevant


• Disk heartbeat failures will lead to node evictions
   – CSSD-log: … [CSSD] [1115699552] >TRACE:   clssnmReadDskHeartbeat:
     node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
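The disk heartbeat check can be sketched the same way: a node is healthy as long as it keeps writing to the voting disk within the applicable diskTimeout, and an I/O error short-circuits the timeout entirely. The timeout values below are illustrative defaults for this sketch, not authoritative settings:

```python
LONG_DISK_TIMEOUT = 200   # seconds (illustrative "long" diskTimeout)
SHORT_DISK_TIMEOUT = 27   # seconds (illustrative "short" diskTimeout)

def disk_heartbeat_ok(seconds_since_last_write, io_error=False, short_timeout=False):
    """Disk heartbeat is healthy if the node keeps writing within the
    applicable diskTimeout; an I/O error is an immediate failure."""
    if io_error:
        return False  # clear accessibility problem -> timeout is irrelevant
    timeout = SHORT_DISK_TIMEOUT if short_timeout else LONG_DISK_TIMEOUT
    return seconds_since_last_write < timeout
```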








Voting Disk Structure
Voting Disk basics – Part 2
• Voting Disks contain dynamic and static data:
   – Dynamic data: disk heartbeat logging
   – Static data: information about the nodes in the cluster


• With 11.2.0.1, Voting Disks got an “identity”:
   – E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
     1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]


• Voting Disks must therefore no longer be copied using “dd” or “cp”




“Simple Majority Rule”
Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• “Simple Majority Rule” applies:
  – Each node must “see” the simple majority of configured Voting Disks
     at all times in order not to be evicted (to remain in the cluster)

         trunc(n/2+1) with n=number of voting disks configured and n>=1
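The majority formula above can be written down directly; this sketch just restates the rule so the eviction condition is explicit (function names are illustrative):

```python
def votes_required(n):
    """Simple majority of n configured voting disks: trunc(n/2 + 1)."""
    return int(n / 2 + 1)

def may_stay_in_cluster(disks_visible, disks_configured):
    """A node remains in the cluster only while it sees a simple
    majority of the configured voting disks."""
    return disks_visible >= votes_required(disks_configured)
```

For 3 configured disks a node must see at least 2; a node that can reach only 1 of 3 is evicted.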








Insertion 1: “Simple Majority Rule”…
… In extended Oracle clusters



                      • http://www.oracle.com/goto/rac
                          – Using standard NFS to support
                            a third voting file for extended
                            cluster configurations (PDF)






                        • Same principles apply
                        • Voting Disks are just
                          geographically dispersed
Insertion 2: Voting Disk in Oracle ASM
The way of storing Voting Disks doesn’t change its use

 [GRID]> crsctl query css votedisk
  1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
  2.   2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
  3.   2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
 Located 3 voting disk(s).



• Oracle ASM automatically creates 1/3/5 Voting Files
  – Based on External/Normal/High redundancy
    and on Failure Groups in the Disk Group
  – Per default there is one failure group per disk
  – ASM will enforce the required number of disks
  – New failure group type: Quorum Failgroup
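The redundancy-to-voting-file mapping and the failure group requirement can be summarized in a small table-driven sketch (the enforcement check is an illustration of the rule stated above, not ASM's actual code):

```python
VOTING_FILES_BY_REDUNDANCY = {"external": 1, "normal": 3, "high": 5}

def voting_files_created(redundancy, failure_groups):
    """ASM creates 1/3/5 voting files for External/Normal/High redundancy,
    provided the disk group offers enough failure groups to place them."""
    needed = VOTING_FILES_BY_REDUNDANCY[redundancy]
    if failure_groups < needed:
        raise ValueError(
            f"need at least {needed} failure groups, have {failure_groups}")
    return needed
```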







  Node Eviction Basics
Why are nodes evicted?
 To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent consequences of a split brain:
   – Shared data must not be written by independently operating nodes
   – The easiest way to prevent this is to forcibly remove a node from the cluster








How are nodes evicted in general?
“STONITH like” or node eviction basics – Part 1
• Once it is determined that a node needs to be evicted,
   – A “kill request” is sent to the respective node(s)
   – Using all (remaining) communication channels


• A node (CSSD) is requested to “kill itself” → “STONITH like”
   – Classic “STONITH” foresees that a remote node kills the node to be evicted




How are nodes evicted?
EXAMPLE: Heartbeat failure
• The network heartbeat between nodes has failed
   – It is determined which nodes can still talk to each other
   – A “kill request” is sent to the node(s) to be evicted
Using all (remaining) communication channels → Voting Disk(s)


• A node is requested to “kill itself”; executor: typically CSSD







How can nodes be evicted?
Using IPMI / Node eviction basics – Part 2
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
   – Intelligent Platform Management Interface (IPMI) drivers required


• IPMI allows remote shutdown of nodes using additional hardware
   – A Baseboard Management Controller (BMC) per cluster node is required




Insertion: Node Eviction Using IPMI
EXAMPLE: Heartbeat failure
• The network heartbeat between the nodes has failed
   – It is determined which nodes can still talk to each other
   – IPMI is used to remotely shutdown the node to be evicted








Which node is evicted?
Node eviction basics – Part 3
• Voting Disks and heartbeat communication are used to determine the node


• In a 2-node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
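Both survival rules can be captured in one selection function. The tie-break by lowest node number is an assumption generalized from the 2-node case described above; the function name is illustrative:

```python
def surviving_subcluster(subclusters):
    """Given the disjoint sub-clusters (lists of node numbers) whose members
    can still talk to each other, pick the one that survives: the biggest
    sub-cluster wins; on a tie, the sub-cluster containing the lowest node
    number (consistent with the 2-node rule above)."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))
```

In a 3-node split into {1, 2} and {3}, the pair survives; in a 2-node split, node 1 survives.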







  Re-bootless Node
  Fencing (restart)




Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
   – Re-boots affect applications that might run on a node, but are not protected
   – Customer requirement: prevent a reboot, just stop the cluster – implemented...




[Diagram: two nodes, each running a standalone app (App X / App Y) and an Oracle RAC DB instance on top of CSSD]
Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• It starts with a failure – e.g. network heartbeat or interconnect failure




[Diagram: the network heartbeat between the two nodes fails]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• Then IO-issuing processes are killed; it is made sure that no IO process remains
   – For a RAC DB, mainly the log writer and the database writer are of concern




[Diagram: the DB instance on the fenced node is killed; the standalone apps keep running]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all IO-issuing processes are killed, remaining processes are stopped
   – IF the check for a successful kill of the IO processes fails → reboot




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all remaining processes are stopped, the stack stops itself with a “restart flag”




[Diagram: on the fenced node only OHASD remains; the stack has stopped with the restart flag set]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• OHASD will finally attempt to restart the stack after the graceful shutdown




Re-bootless Node Fencing (restart)
EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
   –   IF the check for a successful kill of the IO processes fails → reboot
   –   IF CSSD gets killed during the operation → reboot
   –   IF cssdmonitor (oprocd replacement) is not scheduled → reboot
   –   IF the stack cannot be shut down in “short_disk_timeout” seconds → reboot
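The exception list above amounts to a simple decision: a reboot happens only if any of the four safety checks fails, otherwise the stack restarts gracefully. A sketch of that flow (names illustrative, not Oracle's implementation):

```python
def fence_action(io_kill_succeeded, cssd_alive,
                 cssdmonitor_scheduled, shutdown_within_timeout):
    """Re-bootless fencing decision: fall back to a reboot if any of
    the exception conditions above holds; otherwise restart the stack."""
    if not io_kill_succeeded:          # IO-process kill check failed
        return "reboot"
    if not cssd_alive:                 # CSSD got killed during the operation
        return "reboot"
    if not cssdmonitor_scheduled:      # cssdmonitor not scheduled
        return "reboot"
    if not shutdown_within_timeout:    # stack not down in short_disk_timeout
        return "reboot"
    return "restart stack"
```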











  Advanced Node
  Management
Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second




[Diagram: nodes 1, 2, and 3, each pinging the others and writing its heartbeat to the voting disk]

Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• In a n-node cluster, the biggest sub-cluster should survive (votes based)




[Diagram: after an interconnect split, each sub-cluster writes its votes to the voting disk; the biggest sub-cluster survives]
Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Redundant Voting Disks  Oracle managed redundancy




• Assume for a moment only 2 voting disks were supported…
[Diagram: three nodes (1, 2, 3), each running CSSD, with two voting disks]




Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Advanced scenarios need to be considered




• Without the “Simple Majority Rule”, what would we do?
• Even with the “Simple Majority Rule” in place:
   – Each node can see only one voting disk, which would lead
     to an eviction of all nodes
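The scenario above is why an odd disk count matters: with 2 disks, the majority is 2, so a split that leaves each node seeing one disk evicts everyone; with 3 disks, a node seeing 2 of them stays up. A sketch (function names illustrative):

```python
def votes_required(n):
    """Simple majority of n configured voting disks: trunc(n/2 + 1)."""
    return int(n / 2 + 1)

def cluster_survives_disk_split(disks_configured, disks_visible_per_node):
    """Under the simple majority rule, the cluster survives a split only
    if at least one node still sees a majority of the voting disks."""
    needed = votes_required(disks_configured)
    return any(visible >= needed for visible in disks_visible_per_node)
```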



 The Corner Cases




Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• A properly configured cluster with 3 voting disks as shown
[Diagram: two nodes with shared access to the storage hosting the voting disks]
• What happens if there is a storage network failure as shown (lost remote access)?
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• There will be no node eviction!
• IF storage mirroring is used (for data files), the respective
  solution must handle this case.
• Covered in Oracle ASM 11.2.0.2:
   – _asm_storagemaysplit = TRUE
   – Backported to 11.1.0.7




Case 2: CSSD is stuck
CSSD cannot execute request
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
   – CSSD failed for some reason
   – CSSD is not scheduled within a certain margin


→ CSSDMONITOR (was: oprocd) will take over and execute



Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Cluster members (e.g. Oracle RAC instances) can request
  Oracle Clusterware to kill a specific member of the cluster

• Oracle Clusterware will then attempt to kill the requested member




[Diagram: Instance 1 asks Clusterware (CSSD) to kill Instance 2]
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Oracle Clusterware will then attempt to kill the requested member


• If the requested member kill is unsuccessful, a node eviction
  escalation can be issued, which leads to the eviction of the
  node, on which the particular member currently resides



[Diagram: the member kill fails; the node hosting Instance 2 is evicted, leaving only node 1 in the cluster]




More Information
• My Oracle Support Notes:
  – ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
  – ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration
    for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing,
    Panic and Reboot


• http://www.oracle.com/goto/clusterware
  – Oracle Clusterware 11g Release 2 Technical Overview


• http://www.oracle.com/goto/asm


• http://www.oracle.com/goto/rac

 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 

Oracle Clusterware Node Management and Voting Disks

  • 1. <Insert Picture Here> Node Management in Oracle Clusterware Markus Michalewicz Senior Principal Product Manager Oracle RAC and Oracle RAC One Node
  • 2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
       Agenda <Insert Picture Here>
       • Oracle Clusterware 11.2.0.1 Processes
       • Node Monitoring Basics
       • Node Eviction Basics
       • Re-bootless Node Fencing (restart)
       • Advanced Node Management
       • The Corner Cases
       • More Information / Q&A
  • 3. Oracle Clusterware 11g Rel. 2 Processes
       Most are not important for node management – focus!
       OHASD
         CSSD (ora.cssd)
         CSSDMONITOR (was: oprocd) (ora.cssdmonitor)
  • 4. <Insert Picture Here> Node Monitoring Basics
       Basic Hardware Layout Oracle Clusterware – node management is hardware independent
       [Diagram: nodes each running CSSD, connected via Public LAN, Private LAN / Interconnect, and SAN network to the Voting Disk]
  • 5. What does CSSD do? CSSD monitors and evicts nodes
       • Monitors nodes using 2 communication channels:
         – Private Interconnect  Network Heartbeat
         – Voting Disk based communication  Disk Heartbeat
       • Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)
       Network Heartbeat – Interconnect basics
       • Each node in the cluster is “pinged” every second
       • Nodes must respond in css_misscount time (defaults to 30 secs.)
         – Reducing the css_misscount time is generally not supported
       • Network heartbeat failures will lead to node evictions
         – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
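The polling behavior described on this slide can be sketched roughly as follows. This is an illustrative Python model, not actual Clusterware code; the 30-second misscount default and the 75% warning threshold are taken from the slide, everything else (names, structure) is our own:

```python
# Illustrative model of CSSD network-heartbeat monitoring.
# Not Oracle code: the real logic lives inside CSSD's polling thread.
CSS_MISSCOUNT = 30  # seconds a node may miss heartbeats before eviction (default)

def heartbeat_state(seconds_since_last_beat: float) -> str:
    """Classify a peer node by how long its network heartbeat has been missing."""
    if seconds_since_last_beat >= CSS_MISSCOUNT:
        return "evict"    # misscount exceeded -> node will be removed
    if seconds_since_last_beat >= 0.75 * CSS_MISSCOUNT:
        # corresponds to the "at 75% heartbeat fatal, removal in N seconds" log line
        return "warning"
    return "healthy"
```

For example, a node silent for 23 seconds is past the 75% mark (22.5 s) and triggers the warning log, while at 30 seconds it is evicted.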
  • 6. Disk Heartbeat – Voting Disk basics – Part 1
       • Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
       • Nodes must receive a response in (long / short) diskTimeout time
         – I/O errors indicate clear accessibility problems  timeout is irrelevant
       • Disk heartbeat failures will lead to node evictions
         – CSSD-log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
       Voting Disk Structure – Voting Disk basics – Part 2
       • Voting Disks contain dynamic and static data:
         – Dynamic data: disk heartbeat logging
         – Static data: information about the nodes in the cluster
       • With 11.2.0.1 Voting Disks got an “identity”:
         – E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
           1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
       • Voting Disks must therefore not be copied using “dd” or “cp” anymore
  • 7. “Simple Majority Rule” – Voting Disk basics – Part 3
       • Oracle supports redundant Voting Disks for disk failure protection
       • “Simple Majority Rule” applies:
         – Each node must “see” the simple majority of configured Voting Disks at all times in order not to be evicted (to remain in the cluster)
          trunc(n/2+1) with n=number of voting disks configured and n>=1
       Insertion 1: “Simple Majority Rule”… in extended Oracle clusters
       • http://www.oracle.com/goto/rac
         – Using standard NFS to support a third voting file for extended cluster configurations (PDF)
       • Same principles apply – Voting Disks are just geographically dispersed
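The trunc(n/2+1) formula can be checked with a tiny sketch (illustrative Python, not part of any Oracle tool; the function name is ours):

```python
def required_votes(n_voting_disks: int) -> int:
    """Simple Majority Rule: number of voting disks a node must 'see'
    to remain in the cluster, trunc(n/2 + 1)."""
    assert n_voting_disks >= 1
    return int(n_voting_disks / 2 + 1)  # int() truncates for positive values

# With 3 voting disks a node may lose access to 1 and survive;
# with 5 it may lose 2. An even count buys no extra tolerance:
# 4 disks still require 3 to be visible, so still only 1 may fail --
# which is why voting disks are configured in odd numbers.
```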
  • 8. Insertion 2: Voting Disk in Oracle ASM
       The way of storing Voting Disks doesn’t change their use
       [GRID]> crsctl query css votedisk
       1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
       2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
       3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
       Located 3 voting disk(s).
       • Oracle ASM auto creates 1/3/5 Voting Files
         – Based on External/Normal/High redundancy and on Failure Groups in the Disk Group
         – Per default there is one failure group per disk
         – ASM will enforce the required number of disks
         – New failure group type: Quorum Failgroup
       <Insert Picture Here> Node Eviction Basics
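The 1/3/5 rule above can be expressed as a small lookup (illustrative Python; the mapping comes from the slide, the names are ours):

```python
# Number of voting files Oracle ASM auto-creates per disk-group redundancy level
VOTING_FILES_BY_REDUNDANCY = {
    "external": 1,  # no ASM mirroring -> a single voting file
    "normal": 3,    # two-way mirroring -> 3 voting files (majority = 2)
    "high": 5,      # three-way mirroring -> 5 voting files (majority = 3)
}

def voting_files(redundancy: str) -> int:
    """Voting files ASM places for a disk group of the given redundancy."""
    return VOTING_FILES_BY_REDUNDANCY[redundancy.lower()]
```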
  • 9. Why are nodes evicted?  To prevent worse things from happening…
       • Evicting (fencing) nodes is a preventive measure (a good thing)!
       • Nodes are evicted to prevent consequences of a split brain:
         – Shared data must not be written by independently operating nodes
         – The easiest way to prevent this is to forcibly remove a node from the cluster
       How are nodes evicted in general? “STONITH like” or node eviction basics – Part 1
       • Once it is determined that a node needs to be evicted,
         – A “kill request” is sent to the respective node(s)
         – Using all (remaining) communication channels
       • A node (CSSD) is requested to “kill itself”  “STONITH like”
         – “STONITH” foresees that a remote node kills the node to be evicted
  • 10. How are nodes evicted? EXAMPLE: Heartbeat failure
       • The network heartbeat between nodes has failed
         – It is determined which nodes can still talk to each other
         – A “kill request” is sent to the node(s) to be evicted
            Using all (remaining) communication channels  Voting Disk(s)
       • A node is requested to “kill itself”; executor: typically CSSD
       How can nodes be evicted? Using IPMI / Node eviction basics – Part 2
       • Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
         – Intelligent Platform Management Interface (IPMI) drivers required
       • IPMI allows remote-shutdown of nodes using additional hardware
         – A Baseboard Management Controller (BMC) per cluster node is required
  • 11. Insertion: Node Eviction Using IPMI – EXAMPLE: Heartbeat failure
       • The network heartbeat between the nodes has failed
         – It is determined which nodes can still talk to each other
         – IPMI is used to remotely shutdown the node to be evicted
       Which node is evicted? Node eviction basics – Part 3
       • Voting Disks and heartbeat communication are used to determine the node
       • In a 2 node cluster, the node with the lowest node number should survive
       • In a n-node cluster, the biggest sub-cluster should survive (votes based)
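The two tie-breaking rules above can be modeled in a few lines (an illustrative Python sketch; Clusterware's actual arbitration happens inside CSSD using the voting disks):

```python
def surviving_subcluster(subclusters: list[list[int]]) -> list[int]:
    """Given the sub-clusters left after an interconnect failure
    (each a list of node numbers), pick the one that should survive:
    the largest sub-cluster, and on a size tie the one containing the
    lowest node number (which covers the 2-node, 1-vs-1 split)."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))
```

In a 2-node split, both sub-clusters have size 1, so the tuple comparison falls through to `-min(sc)` and node 1's side wins; in a 3-node split 1 vs {2, 3}, the larger sub-cluster wins regardless of node numbers.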
  • 12. <Insert Picture Here> Re-bootless Node Fencing (restart)
       Re-bootless Node Fencing (restart) – Fence the cluster, do not reboot the node
       • Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
         – Re-boots affect applications that might run on a node, but are not protected
         – Customer requirement: prevent a reboot, just stop the cluster – implemented...
  • 13. Re-bootless Node Fencing (restart) – How it works
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
         – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
       • It starts with a failure – e.g. network heartbeat or interconnect failure
  • 14. Re-bootless Node Fencing (restart) – How it works (continued)
       • Then IO issuing processes are killed; it is made sure that no IO process remains
         – For a RAC DB mainly the log writer and the database writer are of concern
       • Once all IO issuing processes are killed, remaining processes are stopped
         – IF the check for a successful kill of the IO processes fails → reboot
  • 15. Re-bootless Node Fencing (restart) – How it works (continued)
       • Once all remaining processes are stopped, the stack stops itself with a “restart flag”
       • OHASD will finally attempt to restart the stack after the graceful shutdown
  • 16. Re-bootless Node Fencing (restart) – EXCEPTIONS
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
         – IF the check for a successful kill of the IO processes fails → reboot
         – IF CSSD gets killed during the operation → reboot
         – IF cssdmonitor (oprocd replacement) is not scheduled → reboot
         – IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot
       <Insert Picture Here> Advanced Node Management
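Slides 13–16 describe a decision flow that can be summarized in a short sketch (illustrative Python; the condition names mirror the exception list above but are our own invention, not Clusterware parameters):

```python
def fencing_action(io_kill_ok: bool,
                   cssd_alive: bool,
                   cssdmonitor_scheduled: bool,
                   shutdown_within_timeout: bool) -> str:
    """Re-bootless fencing (11.2.0.2+): attempt a graceful stack shutdown
    followed by an OHASD restart, falling back to a node reboot when any
    of the exception conditions from the slides is hit."""
    if not io_kill_ok:               # IO-issuing processes survived the kill
        return "reboot"
    if not cssd_alive:               # CSSD got killed during the operation
        return "reboot"
    if not cssdmonitor_scheduled:    # cssdmonitor not scheduled
        return "reboot"
    if not shutdown_within_timeout:  # stack not down within short_disk_timeout
        return "reboot"
    return "graceful restart"        # OHASD restarts the stack
```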
  • 17. Determine the Biggest Sub-Cluster – Voting Disk basics – Part 4
       • Each node in the cluster is “pinged” every second (network heartbeat)
       • Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
       • In a n-node cluster, the biggest sub-cluster should survive (votes based)
  • 18. Redundant Voting Disks – Why odd? Voting Disk basics – Part 5
       • Redundant Voting Disks  Oracle managed redundancy
       • Assume for a moment only 2 voting disks were supported…
       • Advanced scenarios need to be considered
       • Without the “Simple Majority Rule”, what would we do?
       • Even with the “Simple Majority Rule” in place:
         – Each node can see only one voting disk, which would lead to an eviction of all nodes
  • 19. Redundant Voting Disks – Why odd? Voting Disk basics – Part 5
       [Diagram: 3-node cluster with 3 voting disks, illustrating the Simple Majority Rule]
  • 20. <Insert Picture Here> The Corner Cases
       Case 1: Partial Failures in the Cluster – When somebody uses a pair of scissors in the wrong way…
       • A properly configured cluster with 3 voting disks as shown
       • What happens if there is a storage network failure as shown (lost remote access)?
  • 21. Case 1: Partial Failures in the Cluster – When somebody uses a pair of scissors in the wrong way…
       • There will be no node eviction!
       • IF storage mirroring is used (for data files), the respective solution must handle this case.
       • Covered in Oracle ASM 11.2.0.2:
         – _asm_storagemaysplit = TRUE
         – Backported to 11.1.0.7
       Case 2: CSSD is stuck – CSSD cannot execute request
       • A node is requested to “kill itself”
       • BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
         – CSSD failed for some reason
         – CSSD is not scheduled within a certain margin
        OCSSDMONITOR (was: oprocd) will take over and execute
  • 22. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests
       • Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster
       • Oracle Clusterware will then attempt to kill the requested member
  • 23. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests
       • Oracle Clusterware will then attempt to kill the requested member
       • If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which the particular member currently resides
  • 24. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests (continued)
       <Insert Picture Here> More Information
  • 25. More Information
       • My Oracle Support Notes:
         – ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
         – ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
       • http://www.oracle.com/goto/clusterware
         – Oracle Clusterware 11g Release 2 Technical Overview
       • http://www.oracle.com/goto/asm
       • http://www.oracle.com/goto/rac