Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager Oracle RAC and Oracle RAC One Node
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remain at the sole discretion of Oracle.




Agenda
• Oracle Clusterware 11.2.0.1 Processes

• Node Monitoring Basics

• Node Eviction Basics

• Re-bootless Node Fencing (restart)

• Advanced Node Management

• The Corner Cases

• More Information / Q&A
Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – focus!
The processes relevant for node management:
• OHASD
• CSSD (resource: ora.cssd)
• CSSDMONITOR (was: oprocd; resource: ora.cssdmonitor)



 Node Monitoring Basics




Basic Hardware Layout Oracle Clusterware
Node management is hardware independent

[Diagram: cluster nodes running CSSD, connected via public LANs, a private LAN (interconnect), and a SAN network to the Voting Disk]
What does CSSD do?
CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
   – Private Interconnect → Network Heartbeat
   – Voting Disk based communication → Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster)
  based on heartbeat feedback (failures)





Network Heartbeat
Interconnect basics
• Each node in the cluster is “pinged” every second
• Nodes must respond within css_misscount time (default: 30 seconds)
   – Reducing css_misscount is generally not supported


• Network heartbeat failures will lead to node evictions
   – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node
     mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
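The countdown in the log line above follows from css_misscount: once a node has missed heartbeats for the full misscount window, it is removed. A minimal sketch of that timing logic (function names and the 50%/75% warning thresholds are illustrative assumptions, not Oracle's implementation):

```python
CSS_MISSCOUNT = 30  # seconds; default network heartbeat timeout

def removal_in(seconds_since_last_heartbeat, misscount=CSS_MISSCOUNT):
    """Seconds left before the node is removed (0 if already due)."""
    return max(misscount - seconds_since_last_heartbeat, 0.0)

def heartbeat_state(seconds_since_last_heartbeat, misscount=CSS_MISSCOUNT):
    """Classify how far into the timeout window a node is."""
    pct = seconds_since_last_heartbeat / misscount
    if pct >= 1.0:
        return "evict"
    if pct >= 0.75:
        return "fatal"    # cf. '75% heartbeat fatal, removal in N seconds'
    if pct >= 0.5:
        return "warning"
    return "ok"
```

With misscount 30, a node last heard from 23.23 seconds ago is past the 75% mark with roughly 6.77 seconds to removal, matching the log excerpt.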




Disk Heartbeat
Voting Disk basics – Part 1
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• Nodes must receive a response within (long / short) diskTimeout time
   – I/O errors indicate clear accessibility problems → timeout is irrelevant


• Disk heartbeat failures will lead to node evictions
   – CSSD-log: … [CSSD] [1115699552] >TRACE:   clssnmReadDskHeartbeat:
     node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
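The disk heartbeat check can be sketched the same way: a node is healthy as long as it keeps writing to the voting disk within the applicable diskTimeout, and an I/O error short-circuits the timeout entirely. The timeout values below are illustrative defaults for this sketch, not authoritative settings:

```python
LONG_DISK_TIMEOUT = 200   # seconds (illustrative "long" diskTimeout)
SHORT_DISK_TIMEOUT = 27   # seconds (illustrative "short" diskTimeout)

def disk_heartbeat_ok(seconds_since_last_write, io_error=False, short_timeout=False):
    """Disk heartbeat is healthy if the node keeps writing within the
    applicable diskTimeout; an I/O error is an immediate failure."""
    if io_error:
        return False  # clear accessibility problem -> timeout is irrelevant
    timeout = SHORT_DISK_TIMEOUT if short_timeout else LONG_DISK_TIMEOUT
    return seconds_since_last_write < timeout
```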








Voting Disk Structure
Voting Disk basics – Part 2
• Voting Disks contain dynamic and static data:
   – Dynamic data: disk heartbeat logging
   – Static data: information about the nodes in the cluster


• With 11.2.0.1, Voting Disks got an “identity”:
   – E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
     1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]


• Voting Disks must therefore no longer be copied using “dd” or “cp”




“Simple Majority Rule”
Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• “Simple Majority Rule” applies:
  – Each node must “see” the simple majority of configured Voting Disks
     at all times in order not to be evicted (to remain in the cluster)

         trunc(n/2+1) with n=number of voting disks configured and n>=1
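The majority formula above can be written down directly; this sketch just restates the rule so the eviction condition is explicit (function names are illustrative):

```python
def votes_required(n):
    """Simple majority of n configured voting disks: trunc(n/2 + 1)."""
    return int(n / 2 + 1)

def may_stay_in_cluster(disks_visible, disks_configured):
    """A node remains in the cluster only while it sees a simple
    majority of the configured voting disks."""
    return disks_visible >= votes_required(disks_configured)
```

For 3 configured disks a node must see at least 2; a node that can reach only 1 of 3 is evicted.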








Insertion 1: “Simple Majority Rule”…
… In extended Oracle clusters



                      • http://www.oracle.com/goto/rac
                          – Using standard NFS to support
                            a third voting file for extended
                            cluster configurations (PDF)






                        • Same principles apply
                        • Voting Disks are just
                          geographically dispersed
Insertion 2: Voting Disk in Oracle ASM
The way of storing Voting Disks doesn’t change its use

 [GRID]> crsctl query css votedisk
  1.   2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
  2.   2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
  3.   2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
 Located 3 voting disk(s).



• Oracle ASM automatically creates 1/3/5 Voting Files
  – Based on External/Normal/High redundancy
    and on Failure Groups in the Disk Group
  – Per default there is one failure group per disk
  – ASM will enforce the required number of disks
  – New failure group type: Quorum Failgroup
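The redundancy-to-voting-file mapping and the failure group requirement can be summarized in a small table-driven sketch (the enforcement check is an illustration of the rule stated above, not ASM's actual code):

```python
VOTING_FILES_BY_REDUNDANCY = {"external": 1, "normal": 3, "high": 5}

def voting_files_created(redundancy, failure_groups):
    """ASM creates 1/3/5 voting files for External/Normal/High redundancy,
    provided the disk group offers enough failure groups to place them."""
    needed = VOTING_FILES_BY_REDUNDANCY[redundancy]
    if failure_groups < needed:
        raise ValueError(
            f"need at least {needed} failure groups, have {failure_groups}")
    return needed
```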







  Node Eviction Basics
Why are nodes evicted?
 To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent consequences of a split brain:
   – Shared data must not be written by independently operating nodes
   – The easiest way to prevent this is to forcibly remove a node from the cluster








How are nodes evicted in general?
“STONITH like” or node eviction basics – Part 1
• Once it is determined that a node needs to be evicted,
   – A “kill request” is sent to the respective node(s)
   – Using all (remaining) communication channels


• A node (CSSD) is requested to “kill itself” → “STONITH like”
   – Classic “STONITH” foresees that a remote node kills the node to be evicted




How are nodes evicted?
EXAMPLE: Heartbeat failure
• The network heartbeat between nodes has failed
   – It is determined which nodes can still talk to each other
   – A “kill request” is sent to the node(s) to be evicted
Using all (remaining) communication channels → Voting Disk(s)


• A node is requested to “kill itself”; executor: typically CSSD







How can nodes be evicted?
Using IPMI / Node eviction basics – Part 2
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
   – Intelligent Platform Management Interface (IPMI) drivers required


• IPMI allows remote shutdown of nodes using additional hardware
   – A Baseboard Management Controller (BMC) per cluster node is required




Insertion: Node Eviction Using IPMI
EXAMPLE: Heartbeat failure
• The network heartbeat between the nodes has failed
   – It is determined which nodes can still talk to each other
   – IPMI is used to remotely shutdown the node to be evicted








Which node is evicted?
Node eviction basics – Part 3
• Voting Disks and heartbeat communication are used to determine the node


• In a 2-node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
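Both survival rules can be captured in one selection function. The tie-break by lowest node number is an assumption generalized from the 2-node case described above; the function name is illustrative:

```python
def surviving_subcluster(subclusters):
    """Given the disjoint sub-clusters (lists of node numbers) whose members
    can still talk to each other, pick the one that survives: the biggest
    sub-cluster wins; on a tie, the sub-cluster containing the lowest node
    number (consistent with the 2-node rule above)."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))
```

In a 3-node split into {1, 2} and {3}, the pair survives; in a 2-node split, node 1 survives.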







  Re-bootless Node
  Fencing (restart)




Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
   – Re-boots affect applications that might run on a node, but are not protected
   – Customer requirement: prevent a reboot, just stop the cluster – implemented...




[Diagram: two nodes, each running a standalone app (App X / App Y) and an Oracle RAC DB instance on top of CSSD]
Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• It starts with a failure – e.g. network heartbeat or interconnect failure




[Diagram: the network heartbeat between the two nodes fails]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted


• Then IO-issuing processes are killed; it is made sure that no IO process remains
   – For a RAC DB, mainly the log writer and the database writer are of concern




[Diagram: the DB instance on the fenced node is killed; the standalone apps keep running]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all IO-issuing processes are killed, remaining processes are stopped
   – IF the check for a successful kill of the IO processes fails → reboot




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Once all remaining processes are stopped, the stack stops itself with a “restart flag”




[Diagram: on the fenced node only OHASD remains; the stack has stopped with the restart flag set]




Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
   – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• OHASD will finally attempt to restart the stack after the graceful shutdown




Re-bootless Node Fencing (restart)
EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
   –   IF the check for a successful kill of the IO processes fails → reboot
   –   IF CSSD gets killed during the operation → reboot
   –   IF cssdmonitor (oprocd replacement) is not scheduled → reboot
   –   IF the stack cannot be shut down in “short_disk_timeout” seconds → reboot
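The exception list above amounts to a simple decision: a reboot happens only if any of the four safety checks fails, otherwise the stack restarts gracefully. A sketch of that flow (names illustrative, not Oracle's implementation):

```python
def fence_action(io_kill_succeeded, cssd_alive,
                 cssdmonitor_scheduled, shutdown_within_timeout):
    """Re-bootless fencing decision: fall back to a reboot if any of
    the exception conditions above holds; otherwise restart the stack."""
    if not io_kill_succeeded:          # IO-process kill check failed
        return "reboot"
    if not cssd_alive:                 # CSSD got killed during the operation
        return "reboot"
    if not cssdmonitor_scheduled:      # cssdmonitor not scheduled
        return "reboot"
    if not shutdown_within_timeout:    # stack not down in short_disk_timeout
        return "reboot"
    return "restart stack"
```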











  Advanced Node
  Management
Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second




[Diagram: nodes 1, 2, and 3, each pinging the others and writing its heartbeat to the voting disk]

Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• In a n-node cluster, the biggest sub-cluster should survive (votes based)




[Diagram: after an interconnect split, each sub-cluster writes its votes to the voting disk; the biggest sub-cluster survives]
Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Redundant Voting Disks  Oracle managed redundancy




• Assume for a moment only 2 voting disks were supported…
[Diagram: three nodes (1, 2, 3), each running CSSD, with two voting disks]




Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Advanced scenarios need to be considered




• Without the “Simple Majority Rule”, what would we do?
• Even with the “Simple Majority Rule” in place:
   – Each node can see only one voting disk, which would lead
     to an eviction of all nodes
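The scenario above is why an odd disk count matters: with 2 disks, the majority is 2, so a split that leaves each node seeing one disk evicts everyone; with 3 disks, a node seeing 2 of them stays up. A sketch (function names illustrative):

```python
def votes_required(n):
    """Simple majority of n configured voting disks: trunc(n/2 + 1)."""
    return int(n / 2 + 1)

def cluster_survives_disk_split(disks_configured, disks_visible_per_node):
    """Under the simple majority rule, the cluster survives a split only
    if at least one node still sees a majority of the voting disks."""
    needed = votes_required(disks_configured)
    return any(visible >= needed for visible in disks_visible_per_node)
```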



 The Corner Cases




Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• A properly configured cluster with 3 voting disks as shown
[Diagram: two nodes with shared access to the storage hosting the voting disks]
• What happens if there is a storage network failure as shown (lost remote access)?
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…




• There will be no node eviction!
• IF storage mirroring is used (for data files), the respective
  solution must handle this case.
• Covered in Oracle ASM 11.2.0.2:
   – _asm_storagemaysplit = TRUE
   – Backported to 11.1.0.7




Case 2: CSSD is stuck
CSSD cannot execute request
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
   – CSSD failed for some reason
   – CSSD is not scheduled within a certain margin


→ CSSDMONITOR (was: oprocd) will take over and execute



Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Cluster members (e.g. Oracle RAC instances) can request
  Oracle Clusterware to kill a specific member of the cluster

• Oracle Clusterware will then attempt to kill the requested member




[Diagram: Instance 1 asks Clusterware (CSSD) to kill Instance 2]
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Oracle Clusterware will then attempt to kill the requested member


• If the requested member kill is unsuccessful, a node eviction
  escalation can be issued, which leads to the eviction of the
  node, on which the particular member currently resides



[Diagram: the member kill fails; the node hosting Instance 2 is evicted, leaving only node 1 in the cluster]




More Information
• My Oracle Support Notes:
  – ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
  – ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration
    for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing,
    Panic and Reboot


• http://www.oracle.com/goto/clusterware
  – Oracle Clusterware 11g Release 2 Technical Overview


• http://www.oracle.com/goto/asm


• http://www.oracle.com/goto/rac

 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 

Oracle Clusterware Node Management and Voting Disks

  • 1. <Insert Picture Here> Node Management in Oracle Clusterware Markus Michalewicz Senior Principal Product Manager Oracle RAC and Oracle RAC One Node
  • 2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
       Agenda <Insert Picture Here>
       • Oracle Clusterware 11.2.0.1 Processes
       • Node Monitoring Basics
       • Node Eviction Basics
       • Re-bootless Node Fencing (restart)
       • Advanced Node Management
       • The Corner Cases
       • More Information / Q&A
  • 3. Oracle Clusterware 11g Rel. 2 Processes
       Most are not important for node management – focus!
       OHASD
         CSSD (ora.cssd)
         CSSDMONITOR (was: oprocd) (ora.cssdmonitor)
  • 4. <Insert Picture Here> Node Monitoring Basics
       Basic Hardware Layout Oracle Clusterware – node management is hardware independent
       [Diagram: nodes each running CSSD, connected via Public LAN, Private LAN / Interconnect, and SAN network to the Voting Disk]
  • 5. What does CSSD do? CSSD monitors and evicts nodes
       • Monitors nodes using 2 communication channels:
         – Private Interconnect  Network Heartbeat
         – Voting Disk based communication  Disk Heartbeat
       • Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)
       Network Heartbeat – Interconnect basics
       • Each node in the cluster is “pinged” every second
       • Nodes must respond in css_misscount time (defaults to 30 secs.)
         – Reducing the css_misscount time is generally not supported
       • Network heartbeat failures will lead to node evictions
         – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
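The polling behavior described on this slide can be sketched roughly as follows. This is an illustrative Python model, not actual Clusterware code; the 30-second misscount default and the 75% warning threshold are taken from the slide, everything else (names, structure) is our own:

```python
# Illustrative model of CSSD network-heartbeat monitoring.
# Not Oracle code: the real logic lives inside CSSD's polling thread.
CSS_MISSCOUNT = 30  # seconds a node may miss heartbeats before eviction (default)

def heartbeat_state(seconds_since_last_beat: float) -> str:
    """Classify a peer node by how long its network heartbeat has been missing."""
    if seconds_since_last_beat >= CSS_MISSCOUNT:
        return "evict"    # misscount exceeded -> node will be removed
    if seconds_since_last_beat >= 0.75 * CSS_MISSCOUNT:
        # corresponds to the "at 75% heartbeat fatal, removal in N seconds" log line
        return "warning"
    return "healthy"
```

For example, a node silent for 23 seconds is past the 75% mark (22.5 s) and triggers the warning log, while at 30 seconds it is evicted.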
  • 6. Disk Heartbeat – Voting Disk basics – Part 1
       • Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
       • Nodes must receive a response in (long / short) diskTimeout time
         – I/O errors indicate clear accessibility problems  timeout is irrelevant
       • Disk heartbeat failures will lead to node evictions
         – CSSD-log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
       Voting Disk Structure – Voting Disk basics – Part 2
       • Voting Disks contain dynamic and static data:
         – Dynamic data: disk heartbeat logging
         – Static data: information about the nodes in the cluster
       • With 11.2.0.1 Voting Disks got an “identity”:
         – E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
           1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
       • Voting Disks must therefore not be copied using “dd” or “cp” anymore
  • 7. “Simple Majority Rule” – Voting Disk basics – Part 3
       • Oracle supports redundant Voting Disks for disk failure protection
       • “Simple Majority Rule” applies:
         – Each node must “see” the simple majority of configured Voting Disks at all times in order not to be evicted (to remain in the cluster)
          trunc(n/2+1) with n=number of voting disks configured and n>=1
       Insertion 1: “Simple Majority Rule”… in extended Oracle clusters
       • http://www.oracle.com/goto/rac
         – Using standard NFS to support a third voting file for extended cluster configurations (PDF)
       • Same principles apply – Voting Disks are just geographically dispersed
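The trunc(n/2+1) formula can be checked with a tiny sketch (illustrative Python, not part of any Oracle tool; the function name is ours):

```python
def required_votes(n_voting_disks: int) -> int:
    """Simple Majority Rule: number of voting disks a node must 'see'
    to remain in the cluster, trunc(n/2 + 1)."""
    assert n_voting_disks >= 1
    return int(n_voting_disks / 2 + 1)  # int() truncates for positive values

# With 3 voting disks a node may lose access to 1 and survive;
# with 5 it may lose 2. An even count buys no extra tolerance:
# 4 disks still require 3 to be visible, so still only 1 may fail --
# which is why voting disks are configured in odd numbers.
```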
  • 8. Insertion 2: Voting Disk in Oracle ASM
       The way of storing Voting Disks doesn’t change their use
       [GRID]> crsctl query css votedisk
       1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
       2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
       3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
       Located 3 voting disk(s).
       • Oracle ASM auto creates 1/3/5 Voting Files
         – Based on External/Normal/High redundancy and on Failure Groups in the Disk Group
         – Per default there is one failure group per disk
         – ASM will enforce the required number of disks
         – New failure group type: Quorum Failgroup
       <Insert Picture Here> Node Eviction Basics
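The 1/3/5 rule above can be expressed as a small lookup (illustrative Python; the mapping comes from the slide, the names are ours):

```python
# Number of voting files Oracle ASM auto-creates per disk-group redundancy level
VOTING_FILES_BY_REDUNDANCY = {
    "external": 1,  # no ASM mirroring -> a single voting file
    "normal": 3,    # two-way mirroring -> 3 voting files (majority = 2)
    "high": 5,      # three-way mirroring -> 5 voting files (majority = 3)
}

def voting_files(redundancy: str) -> int:
    """Voting files ASM places for a disk group of the given redundancy."""
    return VOTING_FILES_BY_REDUNDANCY[redundancy.lower()]
```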
  • 9. Why are nodes evicted?  To prevent worse things from happening…
       • Evicting (fencing) nodes is a preventive measure (a good thing)!
       • Nodes are evicted to prevent consequences of a split brain:
         – Shared data must not be written by independently operating nodes
         – The easiest way to prevent this is to forcibly remove a node from the cluster
       How are nodes evicted in general? “STONITH like” or node eviction basics – Part 1
       • Once it is determined that a node needs to be evicted,
         – A “kill request” is sent to the respective node(s)
         – Using all (remaining) communication channels
       • A node (CSSD) is requested to “kill itself”  “STONITH like”
         – “STONITH” foresees that a remote node kills the node to be evicted
  • 10. How are nodes evicted? EXAMPLE: Heartbeat failure
       • The network heartbeat between nodes has failed
         – It is determined which nodes can still talk to each other
         – A “kill request” is sent to the node(s) to be evicted
            Using all (remaining) communication channels  Voting Disk(s)
       • A node is requested to “kill itself”; executor: typically CSSD
       How can nodes be evicted? Using IPMI / Node eviction basics – Part 2
       • Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
         – Intelligent Platform Management Interface (IPMI) drivers required
       • IPMI allows remote-shutdown of nodes using additional hardware
         – A Baseboard Management Controller (BMC) per cluster node is required
  • 11. Insertion: Node Eviction Using IPMI – EXAMPLE: Heartbeat failure
       • The network heartbeat between the nodes has failed
         – It is determined which nodes can still talk to each other
         – IPMI is used to remotely shutdown the node to be evicted
       Which node is evicted? Node eviction basics – Part 3
       • Voting Disks and heartbeat communication are used to determine the node
       • In a 2 node cluster, the node with the lowest node number should survive
       • In a n-node cluster, the biggest sub-cluster should survive (votes based)
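The two tie-breaking rules above can be modeled in a few lines (an illustrative Python sketch; Clusterware's actual arbitration happens inside CSSD using the voting disks):

```python
def surviving_subcluster(subclusters: list[list[int]]) -> list[int]:
    """Given the sub-clusters left after an interconnect failure
    (each a list of node numbers), pick the one that should survive:
    the largest sub-cluster, and on a size tie the one containing the
    lowest node number (which covers the 2-node, 1-vs-1 split)."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))
```

In a 2-node split, both sub-clusters have size 1, so the tuple comparison falls through to `-min(sc)` and node 1's side wins; in a 3-node split 1 vs {2, 3}, the larger sub-cluster wins regardless of node numbers.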
  • 12. <Insert Picture Here> Re-bootless Node Fencing (restart)
       Re-bootless Node Fencing (restart) – Fence the cluster, do not reboot the node
       • Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
         – Re-boots affect applications that might run on a node, but are not protected
         – Customer requirement: prevent a reboot, just stop the cluster – implemented...
  • 13. Re-bootless Node Fencing (restart) – How it works
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
         – Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
       • It starts with a failure – e.g. network heartbeat or interconnect failure
  • 14. Re-bootless Node Fencing (restart) – How it works (continued)
       • Then IO issuing processes are killed; it is made sure that no IO process remains
         – For a RAC DB mainly the log writer and the database writer are of concern
       • Once all IO issuing processes are killed, remaining processes are stopped
         – IF the check for a successful kill of the IO processes fails → reboot
  • 15. Re-bootless Node Fencing (restart) – How it works (continued)
       • Once all remaining processes are stopped, the stack stops itself with a “restart flag”
       • OHASD will finally attempt to restart the stack after the graceful shutdown
  • 16. Re-bootless Node Fencing (restart) – EXCEPTIONS
       • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
         – IF the check for a successful kill of the IO processes fails → reboot
         – IF CSSD gets killed during the operation → reboot
         – IF cssdmonitor (oprocd replacement) is not scheduled → reboot
         – IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot
       <Insert Picture Here> Advanced Node Management
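Slides 13–16 describe a decision flow that can be summarized in a short sketch (illustrative Python; the condition names mirror the exception list above but are our own invention, not Clusterware parameters):

```python
def fencing_action(io_kill_ok: bool,
                   cssd_alive: bool,
                   cssdmonitor_scheduled: bool,
                   shutdown_within_timeout: bool) -> str:
    """Re-bootless fencing (11.2.0.2+): attempt a graceful stack shutdown
    followed by an OHASD restart, falling back to a node reboot when any
    of the exception conditions from the slides is hit."""
    if not io_kill_ok:               # IO-issuing processes survived the kill
        return "reboot"
    if not cssd_alive:               # CSSD got killed during the operation
        return "reboot"
    if not cssdmonitor_scheduled:    # cssdmonitor not scheduled
        return "reboot"
    if not shutdown_within_timeout:  # stack not down within short_disk_timeout
        return "reboot"
    return "graceful restart"        # OHASD restarts the stack
```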
  • 17. Determine the Biggest Sub-Cluster – Voting Disk basics – Part 4
       • Each node in the cluster is “pinged” every second (network heartbeat)
       • Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
       • In a n-node cluster, the biggest sub-cluster should survive (votes based)
  • 18. Redundant Voting Disks – Why odd? Voting Disk basics – Part 5
       • Redundant Voting Disks  Oracle managed redundancy
       • Assume for a moment only 2 voting disks were supported…
       • Advanced scenarios need to be considered
       • Without the “Simple Majority Rule”, what would we do?
       • Even with the “Simple Majority Rule” in place:
         – Each node can see only one voting disk, which would lead to an eviction of all nodes
  • 19. Redundant Voting Disks – Why odd? Voting Disk basics – Part 5
       [Diagram: 3-node cluster with 3 voting disks, illustrating the Simple Majority Rule]
  • 20. <Insert Picture Here> The Corner Cases
       Case 1: Partial Failures in the Cluster – When somebody uses a pair of scissors in the wrong way…
       • A properly configured cluster with 3 voting disks as shown
       • What happens if there is a storage network failure as shown (lost remote access)?
  • 21. Case 1: Partial Failures in the Cluster – When somebody uses a pair of scissors in the wrong way…
       • There will be no node eviction!
       • IF storage mirroring is used (for data files), the respective solution must handle this case.
       • Covered in Oracle ASM 11.2.0.2:
         – _asm_storagemaysplit = TRUE
         – Backported to 11.1.0.7
       Case 2: CSSD is stuck – CSSD cannot execute request
       • A node is requested to “kill itself”
       • BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
         – CSSD failed for some reason
         – CSSD is not scheduled within a certain margin
        OCSSDMONITOR (was: oprocd) will take over and execute
  • 22. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests
       • Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster
       • Oracle Clusterware will then attempt to kill the requested member
  • 23. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests
       • Oracle Clusterware will then attempt to kill the requested member
       • If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which the particular member currently resides
  • 24. Case 3: Node Eviction Escalation – Members of a cluster can escalate kill requests (continued)
       <Insert Picture Here> More Information
  • 25. More Information
       • My Oracle Support Notes:
         – ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
         – ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
       • http://www.oracle.com/goto/clusterware
         – Oracle Clusterware 11g Release 2 Technical Overview
       • http://www.oracle.com/goto/asm
       • http://www.oracle.com/goto/rac