Optimize pipelining by parsing and prefetching multiple commands#2092
Conversation
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2092 +/- ##
============================================
+ Coverage 72.02% 72.21% +0.19%
============================================
Files 126 126
Lines 70493 70616 +123
============================================
+ Hits 50771 50994 +223
+ Misses 19722 19622 -100
🚀 New features to boost your workflow:
|
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There was a problem hiding this comment.
Did some benchmarks on AMD EPYC 7713 32 CPUs
- config:
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads *
io-threads-do-reads yes
protected-mode no
maxmemory 2500mb
maxmemory-policy noeviction
hz 1
latency-monitor-threshold 0
- cset shield
- cpu bind
- cache population
valkey-benchmark -h <ip> -c 96 --threads 96 -d 256 -t set -n 3000000 -r 3000000 --sequential - benchmark itself
valkey-benchmark -h <ip> -c 192 --threads 96 -d 256 -t get -n 10000000 -r 3000000 -P 10
io-threads 1 unstable result:
Summary:
throughput summary: 352795.91 requests per second
latency summary (msec):
avg min p50 p95 p99 max
5.396 0.048 5.279 5.727 10.455 124.287
io-threads 1 this pr result:
Summary:
throughput summary: 604375.69 requests per second
latency summary (msec):
avg min p50 p95 p99 max
3.100 0.040 2.935 3.431 5.791 132.991
io-threads 9 unstable result:
Summary:
throughput summary: 383435.59 requests per second
latency summary (msec):
avg min p50 p95 p99 max
4.981 0.056 4.959 5.255 5.503 79.039
io-threads 9 this pr result:
Summary:
throughput summary: 738334.31 requests per second
latency summary (msec):
avg min p50 p95 p99 max
2.553 0.032 2.567 2.751 2.919 36.703
Seems like a solid 2x improvement in pipelining workload! Also did not find any memory leaks or instabilities during testing, LGTM!
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
PingXie
left a comment
There was a problem hiding this comment.
@zuiderkwast, I think our changes are indeed very close, especially around the command queue concept. that being said (and I know I am super biased here :)), I think my PR is closer to the finishing line. Do you want to hop onto my branch and collaborate on mine next?
https://github.com/PingXie/valkey/tree/pipeline
madolson
left a comment
There was a problem hiding this comment.
The overall approach makes sense to me, I have some minor thoughts but I haven't thought to much how to think about this vs Ping's PR yet.
Looking back at the original IO threading, we batch N keys, maybe we should change it to be batch N commands instead, since that will make sure we are fully parsing a command. That way we can probably more tightly unify the code.
Thanks @madolson for taking the time to look at this.
I think we shall prefetch full commands, but we can still configure it as the number of keys. If we go over the number N when adding a command to the batch, then we can allow prefetching a few more, let's say up to N*2. That way we prefetch by default 16 keys, but allow prefetching up to 32 keys to stick to full commands. We should definitely unify the IO-threaded and single-threaded code paths for this code and more of the parse-lookup-prefetch-execute. I avoided refactoring to keep the diff small but I think it's necessary to at least rename some functions to keep this comprehensive. |
my preference, for the sake of simplicity, is that we continue to count keys. as soon as we use up ultimately the L1 data cache on Intel cpus is 32 KiB or 48 KiB, typically, with a cache line size of 64 bytes, which gives us 512 or 768 cache lines, respectively. I haven't counted carefully the number of memory accesses needed to prefetch a key-value pair of strings in the new hashtable but I guess it would be 5 (including both keys and hash table)? Assuming 5, prefetching 32 keys would consume 160 cache lines, which is not a trivial number compared to 512 already. |
|
Hello, it seems that this pull request has some conflicts with the unstable branch. Do you need any assistance working on it? Merging it into 9.0 would be beneficial for many users (a lot of clients use pipelines automatically). |
|
After discussion with @PingXie and others, I wanted to rewrite it a bit and unify the prefetching code paths for io-threaded and single-threaded modes. I probably can't make it to 9.0 though. How would folks (@PingXie, @madolson) feel about merging this more or less as is for 9.0 and improving it later? |
parseCommand -> parseInputBuffer
processMultibulkBuffer -> parseMultibulkBuffer
processInlineBuffer -> parseInlineBuffer
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
|
We're getting de-sync errors in the test: Later in the day update: Just to comment since Viktor messaged me on slack. Apparently there is something getting incorrectly updated for replication offset. This will have impact for other deeply pipelined replication streams, but it looks like only the slot migration tests are triggering that. |
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since slot migration uses chaining replication, if the replication offset is wrong, what is sent to the replicas of the target would be wrong: So that would make sense why we see the desync errors |
Try to fix the failures seen for `test "PSYNC2 #3899 regression: verify consistency"`. This change resets the query buffer parser state in `replicationCachePrimary()` which is called when the connection to the primary is lost. Before #2092, this was done by `resetClient()`. The solution was inspired by the discussion about the regression mentioned (discussion from 2017) and the related commits from that time: 6bc6bd4, 469d6e2, c180bc7. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…key-io#2092) Instead of parsing only one command per client before executing it, parse multiple commands from the query buffer and batch-prefetch the keys accessed by the commands in the queue before executing them. This is an optimization for pipelined commands, both with and without I/O threads. The optimization is currently disabled for the replication stream, due to failures (probably caused by how the replication offset is calculated based on the query buffer offset). * When parsing commands from the input buffer, multiple commands are parsed and stored in a command queue per client. * In single-threaded mode (I/O threads off) keys are batch-prefetched before the commands in the queue are executed. Multi-key commands like MGET, MSET and DEL benefit from this even if pipelining is not used. * Prefetching when I/O threads are used does prefetching for multiple clients in parallel. This code takes client command queues into account, improving prefetching when pipelining is used. The batch size is controlled by the existing config `prefetch-batch-max-size` (default 16), which so far only was used together with I/O threads. The config is moved to a different section in `valkey.conf`. * When I/O threads are used and the maximum number of keys are prefetched, a client's command is executed, then the next one in the queue, etc. If there are more commands in the queue for which the keys have not been prefetched (say the client sends 16 pipelined MGET with 16 keys in each) keys for the next few commands in the queue are prefetched before the commands is executed if prefetching has not been done for the next command. (This utilizes the code path used in single-threaded mode.) Code improvements: * Decoupling of command parser state and command execution state: * The variables reqtype, multibulklen and bulklen refer to the current position in the query buffer. These are no longer reset in resetClient (which runs after each command being executed). Instead, they are reset in the parser code after each completely parsed command. * The command parser code is partially decoupled from the client struct. The query buffer is still one per client, but the resulting argument vector is stored in caller-defined variables. Fixes valkey-io#2044 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Try to fix the failures seen for `test "PSYNC2 valkey-io#3899 regression: verify consistency"`. This change resets the query buffer parser state in `replicationCachePrimary()` which is called when the connection to the primary is lost. Before valkey-io#2092, this was done by `resetClient()`. The solution was inspired by the discussion about the regression mentioned (discussion from 2017) and the related commits from that time: 6bc6bd4, 469d6e2, c180bc7. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Instead of parsing only one command per client before executing it,
parse multiple commands from the query buffer and batch-prefetch the
keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without
I/O threads. The optimization is currently disabled for the replication
stream, due to failures (probably caused by how the replication offset
is calculated based on the query buffer offset).
* When parsing commands from the input buffer, multiple commands are
parsed and stored in a command queue per client.
* In single-threaded mode (I/O threads off) keys are batch-prefetched
before the commands in the queue are executed. Multi-key commands like
MGET, MSET and DEL benefit from this even if pipelining is not used.
* Prefetching when I/O threads are used does prefetching for multiple
clients in parallel. This code takes client command queues into account,
improving prefetching when pipelining is used. The batch size is
controlled by the existing config `prefetch-batch-max-size` (default
16), which so far only was used together with I/O threads. The config is
moved to a different section in `valkey.conf`.
* When I/O threads are used and the maximum number of keys are
prefetched, a client's command is executed, then the next one in the
queue, etc. If there are more commands in the queue for which the keys
have not been prefetched (say the client sends 16 pipelined MGET with 16
keys in each) keys for the next few commands in the queue are prefetched
before the commands is executed if prefetching has not been done for the
next command. (This utilizes the code path used in single-threaded
mode.)
Code improvements:
* Decoupling of command parser state and command execution state:
* The variables reqtype, multibulklen and bulklen refer to the current
position in the query buffer. These are no longer reset in resetClient
(which runs after each command being executed). Instead, they are
reset in the parser code after each completely parsed command.
* The command parser code is partially decoupled from the client struct.
The query buffer is still one per client, but the resulting argument
vector is stored in caller-defined variables.
Fixes #2044
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Try to fix the failures seen for `test "PSYNC2 #3899 regression: verify consistency"`. This change resets the query buffer parser state in `replicationCachePrimary()` which is called when the connection to the primary is lost. Before #2092, this was done by `resetClient()`. The solution was inspired by the discussion about the regression mentioned (discussion from 2017) and the related commits from that time: 6bc6bd4, 469d6e2, c180bc7. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…key-io#2092) Instead of parsing only one command per client before executing it, parse multiple commands from the query buffer and batch-prefetch the keys accessed by the commands in the queue before executing them. This is an optimization for pipelined commands, both with and without I/O threads. The optimization is currently disabled for the replication stream, due to failures (probably caused by how the replication offset is calculated based on the query buffer offset). * When parsing commands from the input buffer, multiple commands are parsed and stored in a command queue per client. * In single-threaded mode (I/O threads off) keys are batch-prefetched before the commands in the queue are executed. Multi-key commands like MGET, MSET and DEL benefit from this even if pipelining is not used. * Prefetching when I/O threads are used does prefetching for multiple clients in parallel. This code takes client command queues into account, improving prefetching when pipelining is used. The batch size is controlled by the existing config `prefetch-batch-max-size` (default 16), which so far only was used together with I/O threads. The config is moved to a different section in `valkey.conf`. * When I/O threads are used and the maximum number of keys are prefetched, a client's command is executed, then the next one in the queue, etc. If there are more commands in the queue for which the keys have not been prefetched (say the client sends 16 pipelined MGET with 16 keys in each) keys for the next few commands in the queue are prefetched before the commands is executed if prefetching has not been done for the next command. (This utilizes the code path used in single-threaded mode.) Code improvements: * Decoupling of command parser state and command execution state: * The variables reqtype, multibulklen and bulklen refer to the current position in the query buffer. These are no longer reset in resetClient (which runs after each command being executed). Instead, they are reset in the parser code after each completely parsed command. * The command parser code is partially decoupled from the client struct. The query buffer is still one per client, but the resulting argument vector is stored in caller-defined variables. Fixes valkey-io#2044 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Try to fix the failures seen for `test "PSYNC2 valkey-io#3899 regression: verify consistency"`. This change resets the query buffer parser state in `replicationCachePrimary()` which is called when the connection to the primary is lost. Before valkey-io#2092, this was done by `resetClient()`. The solution was inspired by the discussion about the regression mentioned (discussion from 2017) and the related commits from that time: 6bc6bd4, 469d6e2, c180bc7. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Multi-command parsing in parseMultibulkBuffer (introduced #2092) was disabled for replicated clients because the per-command replication offset relied on c->qb_pos as the right boundary of the just-applied command. Replication stream is actually a big pipeline, so if it can be supported, the processing speed of the replica can be improved. Decouple parsing position from application position by introducing a single per-client field that tracks the currently processed command's right boundary in querybuf: client.qb_applied: right boundary of the current command in querybuf. Set by the parsers (for c->argv, = c->qb_pos) and advanced incrementally by consumeCommandQueue() using each queued command's input_bytes (qb_applied += p->input_bytes). parsedCommand.input_bytes already records the number of querybuf bytes consumed to parse one command (multibulk header + each bulk-length line + arg bytes + per-arg CRLFs), see parseMultibulk for mode details. It is a relative quantity, independent of where querybuf has been trimmed since parsing, so it is exactly what's needed to advance qb_applied across multi-command parsing, so no extra per-command snapshot field is required. commandProcessed() now uses c->qb_applied instead of c->qb_pos to advance reploff by exactly the bytes of the just-applied command. beforeNextClient shifts only c->qb_applied (a single variable) when the replicated client's querybuf is trimmed; pending queue entries carry only relative input_bytes and therefore need no fix-up. qb_applied is populated unconditionally on the first-command path so that a client transitioning to replicated mid-command (e.g. SYNCSLOTS ESTABLISH installs slot_migration_job inside its own handler) still has a valid value when commandProcessed() runs. Signed-off-by: Binbin <binloveplay1314@qq.com>
Instead of parsing only one command per client before executing it, parse multiple commands from the query buffer and batch-prefetch the keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without I/O threads. The optimization is currently disabled for the replication stream, due to failures (probably caused by how the replication offset is calculated based on the query buffer offset).
prefetch-batch-max-size(default 16), which so far only was used together with I/O threads. The config is moved to a different section invalkey.conf.Code improvements:
Benchmarking with and without I/O threads, see below.
Fixes #2044