SlideShare a Scribd company logo
Linux Synchronization Mechanism: RCU (Read-
Copy-Update)
Adrian Huang | Apr, 2023
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
task 0
task 1
task N read_unlock()
read_lock()
Critical Section
write_unlock()
write_lock()
Readers
Critical Section
[Overview] rwlock (reader-writer spinlock) vs RCU
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Deferred destruction: Safe to free memory
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers
task 0
task 1
task N
Writers/Updaters
(Publisher)
rwlock_t
1. Mutual exclusion between reader and writer
2. Writer might be starved
1. RCU is a non-blocking synchronization mechanism: No mutual exclusion between readers and a writer
2. No specific *lock* data structure
[Overview] rwlock (reader-writer spinlock) vs RCU
* Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
How does RCU achieve concurrent readers and one writer?
(No mutual exclusion between readers and one writer)
rwlock
• Mutual exclusion between reader and writer
• Writer might be starved
RCU
• A non-blocking synchronization mechanism
• No specific *lock* data structure
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
6 important APIs
Readers or
Updaters
Primitive Purpose
Readers
rcu_read_lock() Start an RCU read-side critical section
rcu_read_unlock() End an RCU read-side critical section
rcu_derefernce() Safely load an RCU-protected pointer
Updaters
synchronize_rcu() Wait for all pre-existing RCU read-side critical sections to complete
call_rcu() Invoke the specified function after all pre-existing RCU read-side critical sections
complete
rcu_assign_pointer() Safely update an RCU-protected pointer
RCU manipulates pointers
RCU Implementation Overview
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer:
removal phase
Wait for job completion of readers
Safe to free memory:
reclamation phase
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
• RCU is often used as a replacement for reader-writer locking
• Lockless readers
✓ rcu_read_lock() simply disables preemption: No lock mechanism (Ex: spinlock, mutex and so on)
✓ Readers need not wait for updates (non-blocking): low overhead and excellent scalability
Reader
RCU Implementation Overview
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer:
removal phase
Wait for job completion of readers
Safe to free memory:
reclamation phase
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
• [Writer] Waiting for readers
✓ Removal phase: remove references to data items (possibly by replacing them with references to new versions of these
data items)
➢ Can run concurrently with readers: readers see either the old or the new version of the data structure rather than
a partially updated reference.
✓ synchronize_rcu() and call_rcu(): wait for readers exiting critical section
✓ Block or register a callback that is invoked after active readers have completed.
✓ Reclamation phase: reclaim data items.
• Quiescent State (QS): [per-core] The time after pre-existing readers are done
✓ Context switch: a valid quiescent state
• Grace Period (GP): All cores have passed through the quiescent state → Complete a grace period
Typical RCU Update Sequence
RCU: List Manipulation – Old and New data
Head 5 2 9
Allocate/fill a structure 8
Insert to the list
1
2
Head 5 2 9
8
RCU: List Manipulation – Old and New data
Head 5 2 9
8
Remove
1
Head 5 2 9
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
2
Old data 2 8
New data 2 9
Either the old data or the new data is read by readers.
RCU: List Manipulation – Old and New data
Head 5 2 9
8
Remove
1
Head 5 2 9
2
Old data 2 8
New data 2 9
Either the old data or the new
data is read by readers.
RCU reader Old data is read
New data is read
Legend
* Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
Reader/Writer synchronization
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
synchronize_rcu();
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, NULL);
kfree(tmp);
RCU Reader
RCU Writer w/ valid pointer assignment
RCU Writer w/ NULL pointer
assignment and free memory
RCU Spatial/Temporal Synchronization: Sample Code
RCU Reader
RCU Writer/Updater
Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
RCU Spatial/Temporal Synchronization
5,25 9,81
curconfig
Address Space
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (5)
*cur_b = mcp->b; (25)
rcu_read_unlock();
mcp = kmalloc(…);
rcu_assign_pointer(curconfig, mcp);
synchronize_rcu();
…
…
kfree(old_mcp);
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (9)
*cur_b = mcp->b; (81)
rcu_read_unlock();
Grace
Period
Readers
Readers
• Temporal Synchronization
✓ [Reader] rcu_read_lock() / rcu_read_unlock()
✓ [Writer/Update] synchronize_rcu() / call_rcu()
• Spatial Synchronization
• [Reader] rcu_dereference()
• [Writer/Update] rcu_assign_pointer()
Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
Time
RCU Spatial/Temporal Synchronization
5,25 9,81
curconfig
Address Space
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (5)
*cur_b = mcp->b; (25)
rcu_read_unlock();
mcp = kmalloc(…);
rcu_assign_pointer(curconfig, mcp);
synchronize_rcu();
…
…
kfree(old_mcp);
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (9)
*cur_b = mcp->b; (81)
rcu_read_unlock();
Grace
Period
Readers
Readers
RCU combines temporal and spatial synchronization in order to approximate
reader-writer locking
• Temporal Synchronization
✓ [Reader] rcu_read_lock() / rcu_read_unlock()
✓ [Writer/Update] synchronize_rcu() / call_rcu()
• Spatial Synchronization
• [Reader] rcu_dereference()
• [Writer/Update] rcu_assign_pointer()
Reader: rcu_read_lock() & rcu_read_unlock():
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
✓ Simply disable/enable preemption when entering/exiting RCU critical section
• [Why] A QS is detected by a context switch.
Kernel Source Reference: 2.6.24
Reader: rcu_read_lock() & rcu_read_unlock()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
Kernel Source Reference: 2.6.24
• __acquire() / __release(): sparse checker (semantic parser)
✓ Sparse checking of RCU-protected pointers: Add __rcu marker
➢ Reference: PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
Reader: rcu_read_lock() & rcu_read_unlock()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
Kernel Source Reference: 2.6.24
• rcu_read_acquire() / rcu_read_release(): lock dep (runtime locking correctness validator)
✓ Programming error (deadlock)
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
It’s illegal to block while in an RCU read-side CS (Exception: CONFIG_PREEMPT_RCU=y)
Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%dn", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
rcu_assign_pointer() and rcu_dereference() invocations communicate spatial
synchronization via stores to and loads from the RCU-protected pointer
Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
Reader: rcu_dereference()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
printk("%dn", ptr->number);
RCU Reader
• Preserve order
• Load -> Load, Load -> Store and Store -> Store
• Might be re-order because of store buffer
• Store -> Load
Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system
Kernel Source Reference: 2.6.24
Writer: rcu_assign_pointer()
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
• Preserve order
• Load -> Load, Load -> Store and Store -> Store
• Might be re-order because of store buffer
• Store -> Load
Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system
RCU Writer
Kernel Source Reference: 2.6.24
Writer: synchronize_rcu() & call_rcu()
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
RCU Writer
• Mark the end of updater code and the beginning of reclaimer code
• [Synchronous] synchronize_rcu()
✓ Block until all pre-existing RCU read-side critical sections on all CPUs have completed
✓ leverages call_rcu()
• [Asynchronous] call_rcu()
✓ Queue a callback for invocation after a grace period
✓ [Scenario]
➢ It’s illegal to block RCU updater
➢ Update-side performance is critically important
[Updater] Ensure only one writer: removal phase
Wait for job completion of readers
[Reclaimer] Safe to free memory: reclamation phase
Why the name RCU (read-copy update)?
RCU Writer
Create a copy
Why the name RCU (read-copy update)?
RCU Writer
Update the copy
Why the name RCU (read-copy update)?
RCU Writer
Replace the old entry with
the newly created one
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
reader
list_del_rcu() synchronize_rcu() kfree()
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
Quiescent State (QS)
• A point in the code where there can be no references held to RCU-protected data
structures, which is normally any point outside of an RCU read-side critical section.
✓ RCU read-side critical section is defined by the range between rcu_read_lock() and
rcu_read_unlock().
Grace Period (GP)
• All threads (cores) pass through at least one quiescent state.
Quotes from the book “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
reader
list_del_rcu() synchronize_rcu() kfree()
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
reader
list_del_rcu() synchronize_rcu() kfree()
synchronize_rcu(): How to detect a grace period?
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers - 6 Approaches
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Reference Counter
• Reference counter (share data) updated by rcu_read_lock() and rcu_read_unlock()
✓ Scalability problem due to cache bouncing
Reference Counter
* Figure reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – CPU Register
• Check CPUs’ register (Prevent from accessing share data such as reference counter)
✓CPUs’ register: Check each CPU’s program counter (PC)
➢The updater polls each relevant PC. If the PC in not within read-side code, the corresponding CPU is within a quiescent state.
➢A complete grace period: All CPU’s PCs have been observed to be outside of read-side code.
✓Challenges
➢Readers might invoke other functions
➢Code-motion optimization
CPU Register
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Wait for a fixed period time
• Wait the enough time to comfortably exceed the lifetime of any reasonable reader
✓Unreasonable reader → issue!
Wait for a fixed period time
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Wait forever
• Accommodate the unreasonable reader.
• Bad reputation: leaking memory
✓ Memory leaks often require untimely and inconvenient reboots.
✓ Work well in a high-availability cluster where systems were periodically
crashed in order to ensure that cluster really remained highly available.
Wait forever
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Avoid the period crashes
• Covered by stop-the-world garbage collector
• In today’s always-connected always-on world, stopping the world can
gravely degrade response times.
Avoid the period crashes
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
period crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – QSBR
• Numerous applications already have states (termed quiescent states) that
can be reached only after all pre-existing readers are done.
✓ Transaction-processing application: the time between a pair of successive
transactions might be a quiescent state.
✓ Non-preemptive OS kernel: Context switch can be a quiescent state.
✓ [Non-preemptive OS kernel] RCU reader must be prohibited from blocking
while referencing a global data.
QSBR
* Reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
Grace Period: Waiting for readers – QSBR
[Concept] Implementation for non-preemptive Linux kernel
* [Not production quality] sched_setaffinity() function causes the
current thread to execute on the specified CPU, which forces the
destination CPU to execute a context switch.
[Concept] QSBR: non-production and non-preemptible
implementation
Force the destination CPU to execution switch: Completion of RCU reader
Reference: Page #142 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
synchronize_rcu()
rcu_read_lock() & rcu_read_unlock() → disable/enable preemption
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: five basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
✓Will discuss implementation details about classic RCU *only*
• RCU Flavors
• RCU Usage Summary & RCU Case Study
Classic RCU (< 2.6.29) and Tree RCU (>= 2.6.29)
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
• Global cpumask: each bit indicates each core
• A grace period is complete if rcu_ctrlblk.cpumask = 0
• Scalability problem happens when the number of cores are increased
✓ Lock contention: QS cores are contended to update rcu_ctrlblk.cpumask
✓ Cache bouncing for frequent writes
CPU 0 CPU 1 CPU N
. . .
Classic RCU
• Group cpumask: reduce lock contention
• A grace period is complete if root’s rcu_node->qsmask = 0
• Excellent scalability
Tree RCU
qsmask
struct rcu_node
CPU 2 CPU 3
qsmask
struct rcu_node
CPU M . . .
qsmask
struct rcu_node
Root node
struct rcu_state
Reference code: v2.6.24
Reference code: v2.6.29
Classic RCU (< 2.6.29)
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
• Global cpumask: each bit indicates each core
• Scalability problem happens when the number of cores are increased
✓ Lock contention
✓ Cache bouncing for frequent writes
start a new grace period
rcu_ctrlblk.cpumask=0xffff
schedule(): clear the
corresponding bit in
rcu_ctrlblk.cpumask
rcu_ctrlblk.cpumask=0?
Context switch
Y: Finish a grace period
N: Wait for all CPUs
to pass a QS
High-level concept: Suppose it’s a 16-core system
Kernel Source Reference: 2.6.24
Classic RCU: Data Structure
rcu_head
struct rcu_head *next
rcu_ctrlblk
func
cur
completed
next_pending
signaled
lock
cpumask
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
Global control block per-cpu
qs_pending: Avoid cacheline thrashing by assessing
this percpu instead of cpumask bitmap
rcu_head rcu_head
Callback supplied by call_rcu()
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
Kernel Source Reference: 2.6.24
Classic RCU: start_kernel() -> rcu_init()
rcu_ctrlblk
cur = -300
completed = -300
next_pending
signaled
lock
cpumask = 0
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
Global control block:
rcu_ctrlblk
per-cpu: rcu_data
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
Kernel Source Reference: 2.6.24
struct rcu_ctrlblk rcu_bh_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
rcu_ctrlblk
cur = -300
completed = -300
next_pending
signaled
lock
cpumask = 0
Global control block:
rcu_bh_ctrlblk
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
per-cpu: rcu_bh_data
1 2
init percpu data init percpu data
rcu_tasklet per-cpu
rcu_process_callbacks()
3 init rcu_tasklet
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
[Global control block]
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu variable initialized by rcu_init_percpu_data()
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: start_kernel() -> rcu_init(): show “rcu_ctrlblk” only
Classic RCU: call_rcu()
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
call_rcu(&my_rcu_head, my_rcu_func);
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: synchronize_rcu() leverages call_rcu()
Kernel Source Reference: 2.6.24
Classic RCU: Invoke rcu_pending() when a timer interrupt is triggered
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
tasklet_schedule if rdp->donelist: All callbacks are not
invoked yet due to rdp->blimit (limit on a processed batch)
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: Check if a CPU passes a QS
rcu_check_callbacks
update_process_times
rcu_pending()?
Y
timer interrupt
Kernel Source Reference: 2.6.24
rcu_qsctr_inc
schedule
Timer interrupt Scheduler: context switch
[Note] “rcu_data.passed_quiesc = 1” does not clear the corresponding rcu_ctrlblk.cpumask
directly. More checks are performed. See later slides.
Classic RCU: first timer interrupt
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
[Global control block]
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
address of
Manipulation
Kernel Source Reference: 2.6.24
Classic RCU
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = rcp->cur + 1 = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
__rcu_process_callbacks()
Kernel Source Reference: 2.6.24
Classic RCU: first timer interrupt
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = rcp->cur + 1 = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_start_batch()
• A new grace period is started
• All cpus must go through a quiescent state
✓ at least two calls to rcu_check_quiescent_state() are required
➢ The first call: A new grace period is running
➢ The second call: If there was a quiescent state, then
1. Update rcu_ctrlblk.cpumask: Clear the
corresponding CPU bit
2. rcu_ctrlblk.cpumask is empty: the grace period is
completed.
Kernel Source Reference: 2.6.24
Classic RCU: first timer interrupt
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 1
signaled = 0
lock
cpumask = 0
rcu_ctrlblk
cur++ → -299
completed = -300
next_pending = 1 0
signaled = 0 0
lock
cpumask = cpu_online_map & ~nohz_cpu_mask
Kernel Source Reference: 2.6.24
Classic RCU
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur++ → -299
completed = -300
next_pending = 1 0
signaled = 0 0
lock
cpumask = 0 all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_start_batch()
Kernel Source Reference: 2.6.24
Classic RCU: first timer interrupt
rcu_start_batch()
• A new grace period is started
• All cpus must go through a quiescent state
✓ at least two calls to rcu_check_quiescent_state() are required
➢ The first call: A new grace period is running
➢ The second call: If there was a quiescent state, then
1. Update rcu_ctrlblk.cpumask: Clear the
corresponding CPU bit
2. rcu_ctrlblk.cpumask is empty: the grace period is
completed.
The first call
The second call
Kernel Source Reference: 2.6.24
The first call, then wait for
the next qs (context switch)
The second call
per-cpu
rcu_data
quiescbatch = rcp->cur = -299
passed_quiesc = 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
After the first call
Classic RCU (< 2.6.29): first timer interrupt Kernel Source Reference: 2.6.24
per-cpu
rcu_data
quiescbatch = rcp->cur = -299
passed_quiesc = 0 1
qs_pending = 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
Context switch happens
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
schedule
rcu_qsctr_inc
rdp->passed_quiesc = 1
Classic RCU: first timer interrupt: quiescent state (context switch)
2
1
Kernel Source Reference: 2.6.24
Classic RCU: second timer interrupt
per-cpu
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
Second timer interrupt -> __rcu_pending()
True
Kernel Source Reference: 2.6.24
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
Classic RCU
per-cpu
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 1 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = cur = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
The second call
cpu_clear()
1
2
3
4
5
6 All CPUs pass QS
7
Classic RCU: Second timer interrupt: rcu_start_batch()
rcu_ctrlblk
cur++ → -298
completed = -299
next_pending = 0
signaled = 0 0
lock
cpumask = 0 all_online_cpu_bitmask
rcu_ctrlblk
cur = -299
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = 0
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
Third timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
1
2
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299 = rcp->cur = -298
passed_quiesc = 1 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
1
2
3
4
5
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = rcp->cur = -298
passed_quiesc = 1 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
1
2
3
Invoke callbacks: depend on
rdp->blimit (Check function
‘rcu_do_batch’ for detail)
Kernel Source Reference: 2.6.24
Tree RCU (>= 2.6.29)
CPU 0 CPU 1 CPU N
. . .
• Group cpumask: reduce lock contention
• A grace period is complete if root’s rcu_node->qsmask = 0
• Excellent scalability
Tree RCU
qsmask
struct rcu_node
CPU 2 CPU 4
qsmask
struct rcu_node
CPU M . . .
qsmask
struct rcu_node
Root node
struct rcu_state
Won’t describe implementation detail about tree RCU. (kinda similar to classic RCU)
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: five basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
• Vanilla RCU
• Bottom-half Flavor (Historical)
• Sched Flavor (Historical)
• Sleepable RCU (SRCU)
• Tasks RCU
• Tasks Rude RCU
• Tasks Trace RCU
RCU Flavors
• Reader RCU APIs
✓rcu_read_lock() / rcu_read_unlock()
✓rcu_dereference()
• Writer RCU APIs
✓rcu_assign_pointer()
✓Synchronous grace-period wait primitive: synchronize_rcu()
✓Asynchronous grace-period wait primitive: call_rcu()
RCU Flavors: Vanilla RCU (Classic RCU and Tree RCU)
• Networking data structures that may be subjected to remote denial-
of-service attacks.
✓CPUs never exit softirq execution due to DoS attacks.
➢Prevent CPUs from executing a context switch → prevent grace periods from ever ending.
◼ Out-of-memory and a system hang
✓Disabling bottom halve can prevent the issue.
• Reader RCU APIs
✓rcu_read_lock_bh() / rcu_read_unlock_bh()
➢ local_bh_disable() / local_bh_enable()
✓rcu_dereference_bh()
• Writer: No change
RCU Flavors: Bottom-half Flavor (Historical)
• Before preemptible RCU, context switch is a quiescent state
✓ [A complete grace period] Need to wait for all pre-existing interrupt and NMI handler
• [CONFIG_PREEMPTION=n]
✓ Vanilla RCU and RCU-sched grace period waits for pre-existing interrupt and NMI handlers
✓ Vanilla RCU and RCU-sched have identical implementations
• [CONFIG_PREEMPTION=Y] for RCU-sched
✓ Preemptible RCU does not need to wait for pre-existing interrupt and NMI handler.
✓ The code outside of an RCU read-side critical section → a QS
✓ rcu_read_lock_sched() → disable preemption
✓ rcu_read_unlock_sched() → re-enable preemption
✓ A preemption attempt during the RCU-sched read-side critical section:
✓ rcu_read_unlock_sched() will enter the scheduler
• Reader RCU APIs
✓ rcu_read_lock_sched() / rcu_read_unlock_sched()
✓ preempt_disable() / preempt_enable()
✓ local_irq_save() / local_irq_restore()
✓ hardirq enter / hardirq exit
✓ NMI enter / NMI exit
✓ rcu_dereference_sched()
• Writer: No change
RCU Flavors: Sched Flavor (Historical)
• Classic RCU: blocking or sleeping is strictly prohibited
• SRCU: Allow arbitrary sleeping (or blocking) within RCU read-side critical
section
✓Real-time kernel
➢ Require that spinlock critical section be preemptible
➢ Require that RCU read-side critical section be preemptible
✓Extend grace periods
• Different domains (srcu_struct structure) are defined
• Benefit: a slow SRCU reader in one domain does not delay a SRCU grace period in
another domain
RCU Flavors: Sleepable RCU (SRCU)
struct srcu_struct ss;
int idx;
idx = srcu_read_lock(&ss);
do_something();
srcu_read_unlock(&ss, idx);
• RCU mechanism
✓Keep old version of data structure until no CPU holds a reference to it. → the
structure can be freed (context switch happens).
• Tasks RCU
✓Defer the destruction of an old data structure until it is known that no process holds a reference to it
✓Scenario: Handle the trampolines used in Linux-kernel tracing
➢ Tracer subsystem: ftrace and kprobe
➢ Kprobe: Return probe – Trampoline
✓Reader RCU APIs
➢ No explicit read-side marker
➢ Voluntary context switches separate successive Tasks RCU read-side critical sections.
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks()
RCU Flavors: Tasks RCU
• Tasks RCU does not wait for idle tasks
✓Idle tasks: do not run voluntary context switches
➢ Remain idle for long periods of time
✓Cannot work for tracing of code within idle loop
• Tasks Rude RCU
✓Scenario: Trampoline that might be involved in tracing of code within the idle loop
✓Reader RCU APIs
➢ No explicit read-side marker
➢ Preemption-disabled region of code is a Tasks Rude RCU reader
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_rude()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_rude()
RCU Flavors: Tasks Rude RCU
• Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given
trampoline
• Tasks Rude RCU
✓Scenario: BPF programs need to sleep
✓Reader RCU APIs
➢ Explicit read-side marker: rcu_read_lock_trace() / rcu_read_unlock_trace()
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_trace()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_trace()
RCU Flavors: Tasks Trace RCU
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: five basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
RCU Usage Summary
Usage Description Applicable?
Routing Table 1. Read mostly
2. Stale and inconsistent data is permissible
Work great
Linux kernel’s mapping from user-level
System-V semaphore IDs to in-kernel data
structures
1. Read mostly: Semaphore tends to be used far more
frequently than they are created and destroyed.
2. Need consistent data: perform a semaphore operation
on a semaphore that has already been deleted.
Work well
dentry cache in Linux kernel 1. Read/Write workload
2. Need consistent data
Might be ok
SLAB_TYPESAFE_BY_RCU slab-allocator flag
provides type-safe memory to RCU readers
1. Write mostly
2. Need consistent data
Not best
Case Study: Design Patterns and Lock Granularity
• Code locking: use global locks only
✓ Lock contention & scalability issue
• Data locking
✓ Many data structures may be partitioned
➢ Example: hash table
✓ Each partition of data structure has its own lock
✓ Improve lock contention & better scalability
• Data ownership
✓ Data structure is partitioned over threads or CPUs
➢ Each thread/CPU accesses its subset of data structure without any
synchronization overhead
➢ The most-heavily used data owned by a single CPU → Hot spot in this CPU
➢ No sharing is required -> achieve ideal performance
➢ Example: percpu variables in Linux kernel
Case Study: Code locking: dentry lookup
• v2.5.61 and earlier
• dcache_lock
✓ protect the hash chain (hash table), d_child, d_alias,
d_lru lists and d_inode
.
.
dentry_hashtable
Source code from: v2.5.61
Case Study: Code locking: dentry lookup
• v2.5.61 and earlier
• dcache_lock
✓ protect the hash chain (hash table), d_child, d_alias,
d_lru lists and d_inode
.
.
dentry_hashtable
Code locking by `dcache_lock`
• dcache_lock protects the whole hash table
• More lock contention, poor scalability
Code locking
Source code from: v2.5.61
Case Study: Data locking: dentry lookup
• v2.5.62 and later
• RCU
✓ lock-free for dcache (dentry) look-up
✓ No need to acquire `dcache_lock` when traversing the
hash table in d_lookup
✓ Rely on RCU to ensure the dentry has not been *freed*.
• dcache_lock
✓ dcache_lock must be taken for the followings:
✓ Traversing and updating
✓ Hashtable update
.
.
dentry_hashtable
Data locking by `dentry->d_lock
• Improve lock contention
• Better scalability
Source code from: v2.5.62
Data locking
Case Study: Data locking: dentry lookup - seqlock
Source code from: v2.5.67
Case Study: Data locking: dentry lookup - seqlock
Source code from: v2.5.67
This approach is still used in latest kernel (v6.3)
Reference
• Is Parallel Programming Hard, And, If So, What Can You Do About It?
• What is RCU? – “Read, Copy, Update”
• What Does It Mean To Be An RCU Implementation?
• https://docs.kernel.org/RCU/index.html
• Using RCU for linked lists — a case study
• Scaling dcache with RCU
• Sleepable RCU (SRCU)
• https://lwn.net/Articles/253651/
• https://zhuanlan.zhihu.com/p/90223380
• Preemptible RCU
• https://lwn.net/Articles/253651/
• https://zhuanlan.zhihu.com/p/90223380
Backup
RCU (Read-Copy-Update)
• Non-blocking synchronization
✓Deadlock Immunity: RCU read-side primitives do not block, spin or even do
backwards branches → execution time is deterministic.
➢Exception (programming error):
➢Immunity to priority inversion
◼ Low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the
update-side lock
◼ A low-priority RCU updater cannot prevent high-priority RCU readers from entering read-side
critical section
➢[-rt kernel] RCU is susceptible to priority inversion scenarios:
➢ A High-priority process blocked waiting for an RCU grace period to elapse can be blocked by
low-priority RCU readers. --> Solve by RCU priority boosting.
➢ [RCU priority boosting] Require rcu_read_unclock() do deboosting, which entails acquiring
scheduler locks.
➢ Need to avoid deadlocks within the scheduler and RCU: v5.15 kernel requires RCU to
avoid invoking the scheduler while holding any of RCU’s locks
➢ rcu_read_unlock() is not always lockless when RCU priority boosting is enabled.
RCU Properties
• Reader
✓Reads need not wait for updates
➢Low-cost or even no-cost readers → excellent scalability
✓Each reader has a coherent view of each object within the block of rcu_read_lock()
and rcu_read_unlock()
• Updater
✓synchronized_rcu(): ensure that objects are not freed until after the completion of all
readers that might be using them.
✓rcu_assign_pointer() and rcu_dereference(): efficient and scalable mechanisms for
publishing and reading new versions of an object.
Wait for pre-existing RCU readers
• Wait for the RCU read-side critical section: rcu_read_lock()/rcu_read_unlock()
✓Illegal to sleep within an RCU read-side critical section because a context switch is a
quiescent state
* If any portion of a given critical section precedes the
beginning of a given grace period, then RCU
guarantees that all of that critical section will precede
the end of that grace period.
Wait for pre-existing RCU readers: if-then relationship
• If any portion of a given critical section precedes the beginning of a
given grace period, then RCU guarantees that all of that critical
section will precede the end of that grace period.
✓ r1 = 0 and r2 = 0
• If any portion of a given critical section follows the end of a given
grace period, then RCU guarantees that all of that critical section will
follow the beginning of that grace period.
✓ r1 = 1 and r2 = 1
• What would happen if the order of P()’s two accesses was reversed?
✓ Nothing change because loads from x and y are in the same
RCU read-side critical section.
Wait for pre-existing RCU readers: if-then relationship
• An RCU read-side critical section can be completely overlapped by an
RCU grace period.
✓ r1 = 1 and r2 = 0
✓ Cannot be r1 = 0 and r2 = 1
Wait for pre-existing RCU readers: if-then relationship
RCU Grace-Period Ordering Guarantee
Given a grace period, each readers ends before the end of that grace period,
starts after the beginning of that grace period, or both.
Maintain Multiple Versions of Recently Updated Objects
• RCU accommodates synchronization-free readers (weak temporal
synchronization) by maintaining multiple versions of data
* This slide is referred from page 44 of What is RCU?
* This slide is referred from page 80 of slides
task_struct
cg_list
rcu_node_entry
(CONFIG_PREEMPT_RCU=y)
tasks
thread_node
(linked to signal_struct)
Kernel Source Reference: 6.2
Not use RCU: not read-mostly
data structures (use spinlock)
RCU: Defer destruction
ptraced
ptrace_entry
RCU + spinlock
children
sibling
Case study: Access task_struct linked list
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock);

More Related Content

What's hot (20)

PDF
Memory Mapping Implementation (mmap) in Linux Kernel
Adrian Huang
 
PPTX
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
Adrian Huang
 
PDF
Process Address Space: The way to create virtual address (page table) of user...
Adrian Huang
 
PDF
semaphore & mutex.pdf
Adrian Huang
 
PDF
malloc & vmalloc in Linux
Adrian Huang
 
PDF
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Anne Nicolas
 
PDF
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
Adrian Huang
 
PDF
[232] 성능어디까지쥐어짜봤니 송태웅
NAVER D2
 
PPTX
Linux Initialization Process (1)
shimosawa
 
PDF
Memory Compaction in Linux Kernel.pdf
Adrian Huang
 
PDF
Ixgbe internals
SUSE Labs Taipei
 
PPTX
Linux kernel debugging
Hao-Ran Liu
 
PDF
Performance Analysis Tools for Linux Kernel
lcplcp1
 
PDF
USENIX ATC 2017: Visualizing Performance with Flame Graphs
Brendan Gregg
 
PDF
Page cache in Linux kernel
Adrian Huang
 
PDF
from Binary to Binary: How Qemu Works
Zhen Wei
 
PDF
Memory Management with Page Folios
Adrian Huang
 
PDF
Kernel Recipes 2017: Using Linux perf at Netflix
Brendan Gregg
 
PPTX
Linux Memory Management
Ni Zo-Ma
 
PDF
L4 Microkernel :: Design Overview
National Cheng Kung University
 
Memory Mapping Implementation (mmap) in Linux Kernel
Adrian Huang
 
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
Adrian Huang
 
Process Address Space: The way to create virtual address (page table) of user...
Adrian Huang
 
semaphore & mutex.pdf
Adrian Huang
 
malloc & vmalloc in Linux
Adrian Huang
 
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Anne Nicolas
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
Adrian Huang
 
[232] 성능어디까지쥐어짜봤니 송태웅
NAVER D2
 
Linux Initialization Process (1)
shimosawa
 
Memory Compaction in Linux Kernel.pdf
Adrian Huang
 
Ixgbe internals
SUSE Labs Taipei
 
Linux kernel debugging
Hao-Ran Liu
 
Performance Analysis Tools for Linux Kernel
lcplcp1
 
USENIX ATC 2017: Visualizing Performance with Flame Graphs
Brendan Gregg
 
Page cache in Linux kernel
Adrian Huang
 
from Binary to Binary: How Qemu Works
Zhen Wei
 
Memory Management with Page Folios
Adrian Huang
 
Kernel Recipes 2017: Using Linux perf at Netflix
Brendan Gregg
 
Linux Memory Management
Ni Zo-Ma
 
L4 Microkernel :: Design Overview
National Cheng Kung University
 

Similar to Linux Synchronization Mechanism: RCU (Read Copy Update) (20)

PPT
RCU
bergwolf
 
ODP
Gluster d thread_synchronization_using_urcu_lca2016
Gluster.org
 
PDF
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Anne Nicolas
 
PDF
Userspace RCU library : what linear multiprocessor scalability means for your...
Alexey Ivanov
 
PDF
Yet another introduction to Linux RCU
Viller Hsiao
 
ODP
Glusterd_thread_synchronization_using_urcu_lca2016
Atin Mukherjee
 
PDF
2021 10-13 i ox query processing
Andrew Lamb
 
PDF
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxData
 
ODP
Thread synchronization in GlusterD using URCU
Atin Mukherjee
 
PDF
Kernel Recipes 2015 - So you want to write a Linux driver framework
Anne Nicolas
 
PPT
Using Statspack and AWR for Memory Monitoring and Tuning
Texas Memory Systems, and IBM Company
 
PDF
Streaming replication in practice
Alexey Lesovsky
 
PPT
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
John Kanagaraj
 
PPTX
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
PPT
10 Multicore 07
timcrack
 
PDF
FIFOPt
sumeet jain
 
PDF
rcu dan porter read update linux copy.pdf
Rajeshravi49
 
PDF
Introduction to RCU
Kernel TLV
 
PDF
Bpf performance tools chapter 4 bcc
Viller Hsiao
 
PPT
Advanced Operating System, Distributed Operating System
Shikha Manrai
 
Gluster d thread_synchronization_using_urcu_lca2016
Gluster.org
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Anne Nicolas
 
Userspace RCU library : what linear multiprocessor scalability means for your...
Alexey Ivanov
 
Yet another introduction to Linux RCU
Viller Hsiao
 
Glusterd_thread_synchronization_using_urcu_lca2016
Atin Mukherjee
 
2021 10-13 i ox query processing
Andrew Lamb
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxData
 
Thread synchronization in GlusterD using URCU
Atin Mukherjee
 
Kernel Recipes 2015 - So you want to write a Linux driver framework
Anne Nicolas
 
Using Statspack and AWR for Memory Monitoring and Tuning
Texas Memory Systems, and IBM Company
 
Streaming replication in practice
Alexey Lesovsky
 
Your tuning arsenal: AWR, ADDM, ASH, Metrics and Advisors
John Kanagaraj
 
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
10 Multicore 07
timcrack
 
FIFOPt
sumeet jain
 
rcu dan porter read update linux copy.pdf
Rajeshravi49
 
Introduction to RCU
Kernel TLV
 
Bpf performance tools chapter 4 bcc
Viller Hsiao
 
Advanced Operating System, Distributed Operating System
Shikha Manrai
 
Ad

Recently uploaded (20)

PDF
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
PDF
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
PDF
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
PPTX
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
PDF
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
PDF
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
PDF
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PPTX
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PDF
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PDF
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PDF
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
PPTX
Human Resources Information System (HRIS)
Amity University, Patna
 
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
Human Resources Information System (HRIS)
Amity University, Patna
 
Ad

Linux Synchronization Mechanism: RCU (Read Copy Update)

  • 1. Linux Synchronization Mechanism: RCU (Read- Copy-Update) Adrian Huang | Apr, 2023
  • 2. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 3. task 0 task 1 task N read_unlock() read_lock() Critical Section write_unlock() write_lock() Readers Critical Section [Overview] rwlock (reader-writer spinlock) vs RCU rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer Wait for job completion of readers Deferred destruction: Safe to free memory task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers task 0 task 1 task N Writers/Updaters (Publisher) rwlock_t 1. Mutual exclusion between reader and writer 2. Writer might be starved 1. RCU is a non-blocking synchronization mechanism: No mutual exclusion between readers and a writer 2. No specific *lock* data structure
  • 4. [Overview] rwlock (reader-writer spinlock) vs RCU * Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It? How does RCU achieve concurrent readers and one writer? (No mutual exclusion between readers and one writer) rwlock • Mutual exclusion between reader and writer • Writer might be starved RCU • A non-blocking synchronization mechanism • No specific *lock* data structure
  • 5. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 6. 6 important APIs Readers or Updaters Primitive Purpose Readers rcu_read_lock() Start an RCU read-side critical section rcu_read_unlock() End an RCU read-side critical section rcu_derefernce() Safely load an RCU-protected pointer Updaters synchronize_rcu() Wait for all pre-existing RCU read-side critical sections to complete call_rcu() Invoke the specified function after all pre-existing RCU read-side critical sections complete rcu_assign_pointer() Safely update an RCU-protected pointer RCU manipulates pointers
  • 7. RCU Implementation Overview rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer: removal phase Wait for job completion of readers Safe to free memory: reclamation phase task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) • RCU is often used as a replacement for reader-writer locking • Lockless readers ✓ rcu_read_lock() simply disables preemption: No lock mechanism (Ex: spinlock, mutex and so on) ✓ Readers need not wait for updates (non-blocking): low overhead and excellent scalability Reader
  • 8. RCU Implementation Overview rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer: removal phase Wait for job completion of readers Safe to free memory: reclamation phase task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) • [Writer] Waiting for readers ✓ Removal phase: remove references to data items (possibly by replacing them with references to new versions of these data items) ➢ Can run concurrently with readers: readers see either the old or the new version of the data structure rather than a partially updated reference. ✓ synchronize_rcu() and call_rcu(): wait for readers exiting critical section ✓ Block or register a callback that is invoked after active readers have completed. ✓ Reclamation phase: reclaim data items. • Quiescent State (QS): [per-core] The time after pre-existing readers are done ✓ Context switch: a valid quiescent state • Grace Period (GP): All cores have passed through the quiescent state → Complete a grace period Typical RCU Update Sequence
  • 9. RCU: List Manipulation – Old and New data Head 5 2 9 Allocate/fill a structure 8 Insert to the list 1 2 Head 5 2 9 8
  • 10. RCU: List Manipulation – Old and New data Head 5 2 9 8 Remove 1 Head 5 2 9 rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) 2 Old data 2 8 New data 2 9 Either the old data or the new data is read by readers.
  • 11. RCU: List Manipulation – Old and New data Head 5 2 9 8 Remove 1 Head 5 2 9 2 Old data 2 8 New data 2 9 Either the old data or the new data is read by readers. RCU reader Old data is read New data is read Legend * Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 12. Reader/Writer synchronization rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); synchronize_rcu(); spin_unlock spinlock rcu_assign_pointer(shared_abc, NULL); kfree(tmp); RCU Reader RCU Writer w/ valid pointer assignment RCU Writer w/ NULL pointer assignment and free memory
  • 13. RCU Spatial/Temporal Synchronization: Sample Code RCU Reader RCU Writer/Updater Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 14. RCU Spatial/Temporal Synchronization 5,25 9,81 curconfig Address Space rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (5) *cur_b = mcp->b; (25) rcu_read_unlock(); mcp = kmalloc(…); rcu_assign_pointer(curconfig, mcp); synchronize_rcu(); … … kfree(old_mcp); rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (9) *cur_b = mcp->b; (81) rcu_read_unlock(); Grace Period Readers Readers • Temporal Synchronization ✓ [Reader] rcu_read_lock() / rcu_read_unlock() ✓ [Writer/Update] synchronize_rcu() / call_rcu() • Spatial Synchronization • [Reader] rcu_dereference() • [Writer/Update] rcu_assign_pointer() Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It? Time
  • 15. RCU Spatial/Temporal Synchronization 5,25 9,81 curconfig Address Space rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (5) *cur_b = mcp->b; (25) rcu_read_unlock(); mcp = kmalloc(…); rcu_assign_pointer(curconfig, mcp); synchronize_rcu(); … … kfree(old_mcp); rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (9) *cur_b = mcp->b; (81) rcu_read_unlock(); Grace Period Readers Readers RCU combines temporal and spatial synchronization in order to approximate reader-writer locking • Temporal Synchronization ✓ [Reader] rcu_read_lock() / rcu_read_unlock() ✓ [Writer/Update] synchronize_rcu() / call_rcu() • Spatial Synchronization • [Reader] rcu_dereference() • [Writer/Update] rcu_assign_pointer()
  • 16. Reader: rcu_read_lock() & rcu_read_unlock(): rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment ✓ Simply disable/enable preemption when entering/exiting RCU critical section • [Why] A QS is detected by a context switch. Kernel Source Reference: 2.6.24
  • 17. Reader: rcu_read_lock() & rcu_read_unlock() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment Kernel Source Reference: 2.6.24 • __acquire() / __release(): sparse checker (semantic parser) ✓ Sparse checking of RCU-protected pointers: Add __rcu marker ➢ Reference: PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
  • 18. Reader: rcu_read_lock() & rcu_read_unlock() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment Kernel Source Reference: 2.6.24 • rcu_read_acquire() / rcu_read_release(): lock dep (runtime locking correctness validator) ✓ Programming error (deadlock)
  • 19. rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment It’s illegal to block while in an RCU read-side CS (Exception: CONFIG_PREEMPT_RCU=y) Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
  • 20. rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%dn", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment rcu_assign_pointer() and rcu_dereference() invocations communicate spatial synchronization via stores to and loads from the RCU-protected pointer Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
  • 21. Reader: rcu_dereference() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); printk("%dn", ptr->number); RCU Reader • Preserve order • Load -> Load, Load -> Store and Store -> Store • Might be re-order because of store buffer • Store -> Load Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system Kernel Source Reference: 2.6.24
  • 22. Writer: rcu_assign_pointer() synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory • Preserve order • Load -> Load, Load -> Store and Store -> Store • Might be re-order because of store buffer • Store -> Load Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system RCU Writer Kernel Source Reference: 2.6.24
  • 23. Writer: synchronize_rcu() & call_rcu() synchronize_rcu() or call_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory RCU Writer • Mark the end of updater code and the beginning of reclaimer code • [Synchronous] synchronize_rcu() ✓ Block until all pre-existing RCU read-side critical sections on all CPUs have completed ✓ leverages call_rcu() • [Asynchronous] call_rcu() ✓ Queue a callback for invocation after a grace period ✓ [Scenario] ➢ It’s illegal to block RCU updater ➢ Update-side performance is critically important [Updater] Ensure only one writer: removal phase Wait for job completion of readers [Reclaimer] Safe to free memory: reclamation phase
  • 24. Why the name RCU (read-copy update)? RCU Writer Create a copy
  • 25. Why the name RCU (read-copy update)? RCU Writer Update the copy
  • 26. Why the name RCU (read-copy update)? RCU Writer Replace the old entry with the newly created one
  • 27. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time reader list_del_rcu() synchronize_rcu() kfree()
  • 28. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time Quiescent State (QS) • A point in the code where there can be no references held to RCU-protected data structures, which is normally any point outside of an RCU read-side critical section. ✓ RCU read-side critical section is defined by the range between rcu_read_lock() and rcu_read_unlock(). Grace Period (GP) • All threads (cores) pass through at least one quiescent state. Quotes from the book “Is Parallel Programming Hard, And, If So, What Can You Do About It?” reader list_del_rcu() synchronize_rcu() kfree()
  • 29. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time reader list_del_rcu() synchronize_rcu() kfree() synchronize_rcu(): How to detect a grace period?
  • 30. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers - 6 Approaches
  • 31. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – Reference Counter • Reference counter (share data) updated by rcu_read_lock() and rcu_read_unlock() ✓ Scalability problem due to cache bouncing Reference Counter * Figure reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 32. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – CPU Register • Check CPUs’ register (Prevent from accessing share data such as reference counter) ✓CPUs’ register: Check each CPU’s program counter (PC) ➢The updater polls each relevant PC. If the PC in not within read-side code, the corresponding CPU is within a quiescent state. ➢A complete grace period: All CPU’s PCs have been observed to be outside of read-side code. ✓Challenges ➢Readers might invoke other functions ➢Code-motion optimization CPU Register
  • 33. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – Wait for a fixed period time • Wait the enough time to comfortably exceed the lifetime of any reasonable reader ✓Unreasonable reader → issue! Wait for a fixed period time
  • 34. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – Wait forever • Accommodate the unreasonable reader. • Bad reputation: leaking memory ✓ Memory leaks often require untimely and inconvenient reboots. ✓ Work well in a high-availability cluster where systems were periodically crashed in order to ensure that cluster really remained highly available. Wait forever
  • 35. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – Avoid the period crashes • Covered by stop-the-world garbage collector • In today’s always-connected always-on world, stopping the world can gravely degrade response times. Avoid the period crashes
  • 36. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period time Approach #4: Wait forever Approach #5: Avoid the period crashes Approach #6: Quiescent-state- based reclamation (QSBR) Grace Period: Waiting for readers – QSBR • Numerous applications already have states (termed quiescent states) that can be reached only after all pre-existing readers are done. ✓ Transaction-processing application: the time between a pair of successive transactions might be a quiescent state. ✓ Non-preemptive OS kernel: Context switch can be a quiescent state. ✓ [Non-preemptive OS kernel] RCU reader must be prohibited from blocking while referencing a global data. QSBR
  • 37. * Reference from Is Parallel Programming Hard, And, If So, What Can You Do About It? Grace Period: Waiting for readers – QSBR [Concept] Implementation for non-preemptive Linux kernel * [Not production quality] sched_setaffinity() function causes the current thread to execute on the specified CPU, which forces the destination CPU to execute a context switch.
  • 38. [Concept] QSBR: non-production and non-preemptible implementation Force the destination CPU to execution switch: Completion of RCU reader Reference: Page #142 of Is Parallel Programming Hard, And, If So, What Can You Do About It? synchronize_rcu() rcu_read_lock() & rcu_read_unlock() → disable/enable preemption
  • 39. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: five basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU ✓Will discuss implementation details about classic RCU *only* • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 40. Classic RCU (< 2.6.29) and Tree RCU (>= 2.6.29) struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . • Global cpumask: each bit indicates each core • A grace period is complete if rcu_ctrlblk.cpumask = 0 • Scalability problem happens when the number of cores are increased ✓ Lock contention: QS cores are contended to update rcu_ctrlblk.cpumask ✓ Cache bouncing for frequent writes CPU 0 CPU 1 CPU N . . . Classic RCU • Group cpumask: reduce lock contention • A grace period is complete if root’s rcu_node->qsmask = 0 • Excellent scalability Tree RCU qsmask struct rcu_node CPU 2 CPU 3 qsmask struct rcu_node CPU M . . . qsmask struct rcu_node Root node struct rcu_state Reference code: v2.6.24 Reference code: v2.6.29
  • 41. Classic RCU (< 2.6.29) struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . • Global cpumask: each bit indicates each core • Scalability problem happens when the number of cores are increased ✓ Lock contention ✓ Cache bouncing for frequent writes start a new grace period rcu_ctrlblk.cpumask=0xffff schedule(): clear the corresponding bit in rcu_ctrlblk.cpumask rcu_ctrlblk.cpumask=0? Context switch Y: Finish a grace period N: Wait for all CPUs to pass a QS High-level concept: Suppose it’s a 16-core system Kernel Source Reference: 2.6.24
  • 42. Classic RCU: Data Structure rcu_head struct rcu_head *next rcu_ctrlblk func cur completed next_pending signaled lock cpumask rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu Global control block per-cpu qs_pending: Avoid cacheline thrashing by assessing this percpu instead of cpumask bitmap rcu_head rcu_head Callback supplied by call_rcu() struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . Kernel Source Reference: 2.6.24
  • 43. Classic RCU: start_kernel() -> rcu_init() rcu_ctrlblk cur = -300 completed = -300 next_pending signaled lock cpumask = 0 rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu Global control block: rcu_ctrlblk per-cpu: rcu_data struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . Kernel Source Reference: 2.6.24 struct rcu_ctrlblk rcu_bh_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . rcu_ctrlblk cur = -300 completed = -300 next_pending signaled lock cpumask = 0 Global control block: rcu_bh_ctrlblk rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu per-cpu: rcu_bh_data 1 2 init percpu data init percpu data rcu_tasklet per-cpu rcu_process_callbacks() 3 init rcu_tasklet
  • 44. rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 [Global control block] static struct rcu_ctrlblk rcu_ctrlblk; per-cpu variable initialized by rcu_init_percpu_data() rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) address of address of address of Kernel Source Reference: 2.6.24 Classic RCU: start_kernel() -> rcu_init(): show “rcu_ctrlblk” only
  • 45. Classic RCU: call_rcu() per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func call_rcu(&my_rcu_head, my_rcu_func); address of address of address of Kernel Source Reference: 2.6.24
  • 46. Classic RCU: synchronize_rcu() leverages call_rcu() Kernel Source Reference: 2.6.24
  • 47. Classic RCU: Invoke rcu_pending() when a timer interrupt is triggered rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist tasklet_schedule if rdp->donelist: All callbacks are not invoked yet due to rdp->blimit (limit on a processed batch) per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of address of Kernel Source Reference: 2.6.24
  • 48. Classic RCU: Check if a CPU passes a QS rcu_check_callbacks update_process_times rcu_pending()? Y timer interrupt Kernel Source Reference: 2.6.24 rcu_qsctr_inc schedule Timer interrupt Scheduler: context switch [Note] “rcu_data.passed_quiesc = 1” does not clear the corresponding rcu_ctrlblk.cpumask directly. More checks are performed. See later slides.
  • 49. Classic RCU: first timer interrupt rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 [Global control block] static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of address of Manipulation Kernel Source Reference: 2.6.24
  • 50. Classic RCU per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = rcp->cur + 1 = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of __rcu_process_callbacks() Kernel Source Reference: 2.6.24
  • 51. Classic RCU: first timer interrupt per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = rcp->cur + 1 = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; rcu_start_batch() • A new grace period is started • All cpus must go through a quiescent state ✓ at least two calls to rcu_check_quiescent_state() are required ➢ The first call: A new grace period is running ➢ The second call: If there was a quiescent state, then 1. Update rcu_ctrlblk.cpumask: Clear the corresponding CPU bit 2. rcu_ctrlblk.cpumask is empty: the grace period is completed. Kernel Source Reference: 2.6.24
  • 52. Classic RCU: first timer interrupt rcu_ctrlblk cur = -300 completed = -300 next_pending = 1 signaled = 0 lock cpumask = 0 rcu_ctrlblk cur++ → -299 completed = -300 next_pending = 1 0 signaled = 0 0 lock cpumask = cpu_online_map & ~nohz_cpu_mask Kernel Source Reference: 2.6.24
  • 53. Classic RCU per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur++ → -299 completed = -300 next_pending = 1 0 signaled = 0 0 lock cpumask = 0 all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; rcu_start_batch() Kernel Source Reference: 2.6.24
  • 54. Classic RCU: first timer interrupt rcu_start_batch() • A new grace period is started • All cpus must go through a quiescent state ✓ at least two calls to rcu_check_quiescent_state() are required ➢ The first call: A new grace period is running ➢ The second call: If there was a quiescent state, then 1. Update rcu_ctrlblk.cpumask: Clear the corresponding CPU bit 2. rcu_ctrlblk.cpumask is empty: the grace period is completed. The first call The second call Kernel Source Reference: 2.6.24
  • 55. The first call, then wait for the next qs (context switch) The second call per-cpu rcu_data quiescbatch = rcp->cur = -299 passed_quiesc = 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; After the first call Classic RCU (< 2.6.29): first timer interrupt Kernel Source Reference: 2.6.24
  • 56. per-cpu rcu_data quiescbatch = rcp->cur = -299 passed_quiesc = 0 1 qs_pending = 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; Context switch happens rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu schedule rcu_qsctr_inc rdp->passed_quiesc = 1 Classic RCU: first timer interrupt: quiescent state (context switch) 2 1 Kernel Source Reference: 2.6.24
  • 57. Classic RCU: second timer interrupt per-cpu rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; Second timer interrupt -> __rcu_pending() True Kernel Source Reference: 2.6.24 rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt
  • 58. Classic RCU per-cpu rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 1 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = cur = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; The second call cpu_clear() 1 2 3 4 5 6 All CPUs pass QS 7
  • 59. Classic RCU: Second timer interrupt: rcu_start_batch() rcu_ctrlblk cur++ → -298 completed = -299 next_pending = 0 signaled = 0 0 lock cpumask = 0 all_online_cpu_bitmask rcu_ctrlblk cur = -299 completed = -299 next_pending = 0 signaled = 0 lock cpumask = 0 Kernel Source Reference: 2.6.24
  • 60. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y Third timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist Kernel Source Reference: 2.6.24
  • 61. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of 1 2 Kernel Source Reference: 2.6.24
  • 62. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 = rcp->cur = -298 passed_quiesc = 1 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask 1 2 3 4 5 Kernel Source Reference: 2.6.24
  • 63. Classic RCU: Third timer interrupt rcu_data quiescbatch = rcp->cur = -298 passed_quiesc = 1 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask 1 2 3 Invoke callbacks: depend on rdp->blimit (Check function ‘rcu_do_batch’ for detail) Kernel Source Reference: 2.6.24
  • 64. Tree RCU (>= 2.6.29) CPU 0 CPU 1 CPU N . . . • Group cpumask: reduce lock contention • A grace period is complete if root’s rcu_node->qsmask = 0 • Excellent scalability Tree RCU qsmask struct rcu_node CPU 2 CPU 4 qsmask struct rcu_node CPU M . . . qsmask struct rcu_node Root node struct rcu_state Won’t describe implementation detail about tree RCU. (kinda similar to classic RCU)
  • 65. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: five basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 66. • Vanilla RCU • Bottom-half Flavor (Historical) • Sched Flavor (Historical) • Sleepable RCU (SRCU) • Tasks RCU • Tasks Rude RCU • Tasks Trace RCU RCU Flavors
  • 67. • Reader RCU APIs ✓rcu_read_lock() / rcu_read_unlock() ✓rcu_dereference() • Writer RCU APIs ✓rcu_assign_pointer() ✓Synchronous grace-period wait primitive: synchronize_rcu() ✓Asynchronous grace-period wait primitive: call_rcu() RCU Flavors: Vanilla RCU (Classic RCU and Tree RCU)
  • 68. • Networking data structures that may be subjected to remote denial- of-service attacks. ✓CPUs never exit softirq execution due to DoS attacks. ➢Prevent CPUs from executing a context switch → prevent grace periods from ever ending. ◼ Out-of-memory and a system hang ✓Disabling bottom halve can prevent the issue. • Reader RCU APIs ✓rcu_read_lock_bh() / rcu_read_unlock_bh() ➢ local_bh_disable() / local_bh_enable() ✓rcu_dereference_bh() • Writer: No change RCU Flavors: Bottom-half Flavor (Historical)
  • 69. • Before preemptible RCU, context switch is a quiescent state ✓ [A complete grace period] Need to wait for all pre-existing interrupt and NMI handler • [CONFIG_PREEMPTION=n] ✓ Vanilla RCU and RCU-sched grace period waits for pre-existing interrupt and NMI handlers ✓ Vanilla RCU and RCU-sched have identical implementations • [CONFIG_PREEMPTION=Y] for RCU-sched ✓ Preemptible RCU does not need to wait for pre-existing interrupt and NMI handler. ✓ The code outside of an RCU read-side critical section → a QS ✓ rcu_read_lock_sched() → disable preemption ✓ rcu_read_unlock_sched() → re-enable preemption ✓ A preemption attempt during the RCU-sched read-side critical section: ✓ rcu_read_unlock_sched() will enter the scheduler • Reader RCU APIs ✓ rcu_read_lock_sched() / rcu_read_unlock_sched() ✓ preempt_disable() / preempt_enable() ✓ local_irq_save() / local_irq_restore() ✓ hardirq enter / hardirq exit ✓ NMI enter / NMI exit ✓ rcu_dereference_sched() • Writer: No change RCU Flavors: Sched Flavor (Historical)
  • 70. • Classic RCU: blocking or sleeping is strictly prohibited • SRCU: Allow arbitrary sleeping (or blocking) within RCU read-side critical section ✓Real-time kernel ➢ Require that spinlock critical section be preemptible ➢ Require that RCU read-side critical section be preemptible ✓Extend grace periods • Different domains (srcu_struct structure) are defined • Benefit: a slow SRCU reader in one domain does not delay a SRCU grace period in another domain RCU Flavors: Sleepable RCU (SRCU) struct srcu_struct ss; int idx; idx = srcu_read_lock(&ss); do_something(); srcu_read_unlock(&ss, idx);
  • 71. • RCU mechanism ✓Keep old version of data structure until no CPU holds a reference to it. → the structure can be freed (context switch happens). • Tasks RCU ✓Defer the destruction of an old data structure until it is known that no process holds a reference to it ✓Scenario: Handle the trampolines used in Linux-kernel tracing ➢ Tracer subsystem: ftrace and kprobe ➢ Kprobe: Return probe – Trampoline ✓Reader RCU APIs ➢ No explicit read-side marker ➢ Voluntary context switches separate successive Tasks RCU read-side critical sections. ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks() RCU Flavors: Tasks RCU
  • 72. • Tasks RCU does not wait for idle tasks ✓Idle tasks: do not run voluntary context switches ➢ Remain idle for long periods of time ✓Cannot work for tracing of code within idle loop • Tasks Rude RCU ✓Scenario: Trampoline that might be involved in tracing of code within the idle loop ✓Reader RCU APIs ➢ No explicit read-side marker ➢ Preemption-disabled region of code is a Tasks Rude RCU reader ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_rude() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_rude() RCU Flavors: Tasks Rude RCU
  • 73. • Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given trampoline • Tasks Rude RCU ✓Scenario: BPF programs need to sleep ✓Reader RCU APIs ➢ Explicit read-side marker: rcu_read_lock_trace() / rcu_read_unlock_trace() ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_trace() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_trace() RCU Flavors: Tasks Trace RCU
  • 74. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: five basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 75. RCU Usage Summary Usage Description Applicable? Routing Table 1. Read mostly 2. Stale and inconsistent data is permissible Work great Linux kernel’s mapping from user-level System-V semaphore IDs to in-kernel data structures 1. Read mostly: Semaphore tends to be used far more frequently than they are created and destroyed. 2. Need consistent data: perform a semaphore operation on a semaphore that has already been deleted. Work well dentry cache in Linux kernel 1. Read/Write workload 2. Need consistent data Might be ok SLAB_TYPESAFE_BY_RCU slab-allocator flag provides type-safe memory to RCU readers 1. Write mostly 2. Need consistent data Not best
  • 76. Case Study: Design Patterns and Lock Granularity • Code locking: use global locks only ✓ Lock contention & scalability issue • Data locking ✓ Many data structures may be partitioned ➢ Example: hash table ✓ Each partition of data structure has its own lock ✓ Improve lock contention & better scalability • Data ownership ✓ Data structure is partitioned over threads or CPUs ➢ Each thread/CPU accesses its subset of data structure without any synchronization overhead ➢ The most-heavily used data owned by a single CPU → Hot spot in this CPU ➢ No sharing is required -> achieve ideal performance ➢ Example: percpu variables in Linux kernel
  • 77. Case Study: Code locking: dentry lookup • v2.5.61 and earlier • dcache_lock ✓ protect the hash chain (hash table), d_child, d_alias, d_lru lists and d_inode . . dentry_hashtable Source code from: v2.5.61
  • 78. Case Study: Code locking: dentry lookup • v2.5.61 and earlier • dcache_lock ✓ protect the hash chain (hash table), d_child, d_alias, d_lru lists and d_inode . . dentry_hashtable Code locking by `dcache_lock` • dcache_lock protects the whole hash table • More lock contention, poor scalability Code locking Source code from: v2.5.61
  • 79. Case Study: Data locking: dentry lookup • v2.5.62 and later • RCU ✓ lock-free for dcache (dentry) look-up ✓ No need to acquire `dcache_lock` when traversing the hash table in d_lookup ✓ Rely on RCU to ensure the dentry has not been *freed*. • dcache_lock ✓ dcache_lock must be taken for the followings: ✓ Traversing and updating ✓ Hashtable update . . dentry_hashtable Data locking by `dentry->d_lock • Improve lock contention • Better scalability Source code from: v2.5.62 Data locking
  • 80. Case Study: Data locking: dentry lookup - seqlock Source code from: v2.5.67
  • 81. Case Study: Data locking: dentry lookup - seqlock Source code from: v2.5.67 This approach is still used in latest kernel (v6.3)
  • 82. Reference • Is Parallel Programming Hard, And, If So, What Can You Do About It? • What is RCU? – “Read, Copy, Update” • What Does It Mean To Be An RCU Implementation? • https://docs.kernel.org/RCU/index.html • Using RCU for linked lists — a case study • Scaling dcache with RCU • Sleepable RCU (SRCU) • https://lwn.net/Articles/253651/ • https://zhuanlan.zhihu.com/p/90223380 • Preemptible RCU • https://lwn.net/Articles/253651/ • https://zhuanlan.zhihu.com/p/90223380
  • 84. RCU (Read-Copy-Update) • Non-blocking synchronization ✓Deadlock Immunity: RCU read-side primitives do not block, spin or even do backwards branches → execution time is deterministic. ➢Exception (programming error): ➢Immunity to priority inversion ◼ Low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the update-side lock ◼ A low-priority RCU updater cannot prevent high-priority RCU readers from entering read-side critical section ➢[-rt kernel] RCU is susceptible to priority inversion scenarios: ➢ A High-priority process blocked waiting for an RCU grace period to elapse can be blocked by low-priority RCU readers. --> Solve by RCU priority boosting. ➢ [RCU priority boosting] Require rcu_read_unclock() do deboosting, which entails acquiring scheduler locks. ➢ Need to avoid deadlocks within the scheduler and RCU: v5.15 kernel requires RCU to avoid invoking the scheduler while holding any of RCU’s locks ➢ rcu_read_unlock() is not always lockless when RCU priority boosting is enabled.
  • 85. RCU Properties • Reader ✓Reads need not wait for updates ➢Low-cost or even no-cost readers → excellent scalability ✓Each reader has a coherent view of each object within the block of rcu_read_lock() and rcu_read_unlock() • Updater ✓synchronized_rcu(): ensure that objects are not freed until after the completion of all readers that might be using them. ✓rcu_assign_pointer() and rcu_dereference(): efficient and scalable mechanisms for publishing and reading new versions of an object.
  • 86. Wait for pre-existing RCU readers • Wait for the RCU read-side critical section: rcu_read_lock()/rcu_read_unlock() ✓Illegal to sleep within an RCU read-side critical section because a context switch is a quiescent state * If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period.
  • 87. Wait for pre-existing RCU readers: if-then relationship • If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period. ✓ r1 = 0 and r2 = 0
  • 88. • If any portion of a given critical section follows the end of a given grace period, then RCU guarantees that all of that critical section will follow the beginning of that grace period. ✓ r1 = 1 and r2 = 1 • What would happen if the order of P()’s two accesses was reversed? ✓ Nothing change because loads from x and y are in the same RCU read-side critical section. Wait for pre-existing RCU readers: if-then relationship
  • 89. • An RCU read-side critical section can be completely overlapped by an RCU grace period. ✓ r1 = 1 and r2 = 0 ✓ Cannot be r1 = 0 and r2 = 1 Wait for pre-existing RCU readers: if-then relationship
  • 90. RCU Grace-Period Ordering Guarantee Given a grace period, each readers ends before the end of that grace period, starts after the beginning of that grace period, or both.
  • 91. Maintain Multiple Versions of Recently Updated Objects • RCU accommodates synchronization-free readers (weak temporal synchronization) by maintaining multiple versions of data
  • 92. * This slide is referred from page 44 of What is RCU?
  • 93. * This slide is referred from page 80 of slides
  • 94. task_struct cg_list rcu_node_entry (CONFIG_PREEMPT_RCU=y) tasks thread_node (linked to signal_struct) Kernel Source Reference: 6.2 Not use RCU: not read-mostly data structures (use spinlock) RCU: Defer destruction ptraced ptrace_entry RCU + spinlock children sibling Case study: Access task_struct linked list __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);