
Thursday, May 9, 2024

CRC64 in Valkey got faster

As teased in my last blog post, CRC64 in Redis was supposed to get faster, but it didn't. About a year after I posted my PR, Redis the company (formerly Garantia Data, then Redis Labs, for whom I've worked a booth at AWS, written articles, given several talks, etc.) changed the license of Redis the software. I don't like the new license. Subsequently, Madelyn Olson asked me to port my changes over to Valkey, and I want to make the world better, so I did.

When I originally ported my changes to Valkey, the tests lived behind the generally unused `SERVER_TEST` define, which was not exposed in a convenient manner. Since then, Madelyn has migrated those tests over to the new C-level unittest framework, so the instructions have changed since my original port. Like most software engineers, I like to be able to test my things, and to tell you how to test them too.

Testing updated crc64 performance in Valkey

The first thing you'll want is to get the most recent unstable branch of Valkey (if you want to get all of Valkey / Redis history, omit the '--depth 1' argument), which will require (at minimum) git, make, and a recent gcc or clang:


% git clone --depth 1 https://github.com/valkey-io/valkey.git
% cd valkey/src
% make valkey-unit-tests
% ./valkey-unit-tests --single test_crc64combine.c --crc 10000000
% ./valkey-unit-tests --single test_crc64combine.c --crc 10000000 --combine

The first unit test should output something like...

[START] - test_crc64combine.c
test size=10000000 algorithm=crc_1byte 375 M/sec matches=1
test size=10000000 algorithm=crcspeed 1345 M/sec matches=1
test size=10000000 algorithm=crcdual 2246 M/sec matches=1
test size=10000000 algorithm=crctri 2428 M/sec matches=1
test size=6093750 algorithm=crc_1byte 402 M/sec matches=1
test size=6093750 algorithm=crcspeed 1343 M/sec matches=1
test size=6093750 algorithm=crcdual 2254 M/sec matches=1
test size=6093750 algorithm=crctri 2423 M/sec matches=1
...

Examining the output, wherever crcdual and crctri start being slower than crcspeed is where you should set your dual / tri cutoffs. At present, these cutoffs are set at defaults that seem to work well with Intel processors in the Sandy Bridge era or later (OEM shipments starting in 2013). I don't know about performance on AMD CPUs, or even the ARM CPUs in the new Macs, but if someone wants to test and let me know, I'd be appreciative.

Similarly, we can see how our crc extension performs with the additional --combine argument, producing something like:

[START] - test_crc64combine.c
init_64 size=18446744073709551615 in 689000 nsec
hash_as_size_combine size=10000000 in 1036 nsec
largest_combine size=18446744073709551615 in 10187 nsec
combine size=10000000 in 1049 nsec
combine size=6093750 in 2333 nsec
combine size=3713381 in 1576 nsec
combine size=2262843 in 1560 nsec
combine size=1378922 in 1217 nsec
combine size=840282 in 1431 nsec
combine size=512048 in 1030 nsec
...

We have generally set our cutoffs higher for non-x86 platforms (arm64 on macOS / Raspberry Pi) because we don't have an explicitly vectorized combine function for them. The combine function is mostly a matter of curiosity, but it is important for longer-term scaling, as it sets the minimum size at which combining makes sense versus doing the regular calculation.

How did we get here, and can we go faster?

I originally added combine functionality in 2019 as I was building my own threaded Redis fork called StructD (now defunct). Because I had threads, spawning multiple threads in addition to optimizing individual thread performance would allow me to process entire files in parallel, chunked out to workers, merging and verifying the results at the end.

While it might not seem necessary to perform CRC64 verification with such speed, we were both creating and loading 10-100 gigabyte snapshots. As I was building StructD, the crcspeed changes from Matt Stancliff had not yet been merged in, and we were seeing 4+ minutes to verify our largest snapshots.

With 1.4 gigs/second after merging crcspeed, that pushed timing down to ~75 seconds just to verify. But with my updated 2.4 gigs/second crctri, plus being able to go multi-threaded, my record for verifying a 40 gig snapshot on my local machine (file stored in shared memory to eliminate disk IO) was about 2.5 seconds, compared with about 30 seconds for the April 25, 2020 crcspeed Redis, and 4+ minutes with the classic one-byte-at-a-time code. Overall, that's about a 100x performance improvement just in verifying snapshots.

Testing updated performance in smhasher

Clocking in at about 15 gigs/second of CRC64 computation using 8 threads is not quite a linear speedup, but it beats every other real CRC of any bit width available in the smhasher suite of hashes, hardware or software. To compare properly, our single-threaded CRC64 variant is about half the speed of hardware CRC64 on my machine, using only software. The variants are available as crc64_jones, crc64_jones1, crc64_jones2, and crc64_jones3 for the Valkey automatic 1-3 pipeline, 1, 2, and 3-pipeline versions, respectively (you will need git, make, cmake, gcc, and g++ to build smhasher via 'sh build.sh', then run './SMHasher --test=Speed crc64_jones').
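For reference, the build-and-benchmark steps look something like the following (assuming a smhasher checkout that already includes the crc64_jones variants):

% sh build.sh
% ./SMHasher --test=Speed crc64_jones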

On the whole, my changes demonstrate a path for single-threaded CRC seeing near-hardware performance for any polynomial on modern CPUs with enough spare pipelines, L1/L2 cache, and memory bandwidth. When combined with threads directly, we become limited by data source IO speeds and memory bandwidth - not CRC algorithm performance.

For more free software, please feel free to sponsor me over on GitHub.

Thursday, April 6, 2023

CRC64 in Redis is getting even faster

On Friday last week, I submitted this PR to Redis: https://github.com/redis/redis/pull/11988 .

In the PR, I make single-threaded CRC64 53-73% faster for a CPU that is 13 years old, and ~67% faster for a CPU that is 5 years old. I'll get into the details more as to how / why, but modern CPUs could see 2x+, and an update to the algorithm should allow for leveraging more CPU hardware as it scales. This is also generally applicable to all CRC polynomials, and brings software CRC within 2x of hardware CRC performance.

Oh, then I also made CRC concatenation >10,000x faster, but I suspect that was previously known, and just not reported for what it was.

I know, this is going to feel like "learn this one simple trick to make all your programs faster" clickbait, but for some algorithms, 2x faster is just the start (and the approach extrapolates to other CRCs).

Computer Architecture Background

For folks who already know about the history of computers, and how / why CPUs do what they do, jump ahead to "Making it faster".

Decades before I was born, Alan Turing and John von Neumann developed competing thoughts about how to conceptualize and build computers. Ultimately, the von Neumann architecture won out. It is best characterized by a central processing unit that handles control flow and arithmetic operations, along with an interface to memory, input devices, and output devices. To go a little further for our specific example, processors have been built to basically do a few things really well: read instructions, decode those instructions into sub-instructions, wait for data to arrive, then finally execute on any data that is available.

Fetching things can sometimes take a while, and execution isn't an instant operation either. Even with your 3 or 4 gigahertz PC, what's really going on is that most CPU instructions, especially complex mathematical operations like square root, division, and modulo, take multiple steps. So if you are performing a division, you may only get a few bits of the result each cycle, and you may need to wait tens of cycles to get your result. Normally this is all hidden, so you only get your final results, not incomplete ones.

In the specification manuals for most CPUs, you can look up information about how many instructions of what types can be dispatched in a cycle, and how many cycles each operation will take. Depending on the operation, and other things going on at the time, sometimes you can start multiple operations of the same type in adjacent "pipelines", like having multiple lines in a factory. Sometimes you can add items to the lines every cycle, and sometimes each step takes longer than a cycle and items to be added to the pipeline have to wait. Sometimes, there are groups of 2, 4, or 8 items that need to be processed at once, and we have instructions specifically designed to process these groups in what is called vectorization.
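To make that concrete, here is a toy C sketch (mine, not anything from Redis or Valkey) of the idea that independent work can share the same cycles: the two running sums below have no dependency on each other, so a CPU with spare arithmetic pipelines can advance both chains at once.

#include <stdint.h>
#include <stddef.h>

/* Sum an array using two independent accumulators.  Each addition in chain
 * "a" depends only on the previous "a", and likewise for "b", so the CPU can
 * overlap the two chains instead of waiting on a single serial chain. */
uint64_t sum_two_chains(const uint64_t *v, size_t n) {
    uint64_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];     /* chain 1 */
        b += v[i + 1]; /* chain 2, independent of chain 1 */
    }
    if (i < n)
        a += v[i];     /* odd leftover element */
    return a + b;
}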

Making it faster

Back in 2013, Mark Adler released performance improvements to the CRC class of algorithms. I didn't learn of them until several years later, through Matt Stancliff's Redis work and optimizations around 2017 or 2018. Huge thank you to Mark for writing the code, Matt for making it work with Redis, and Madelyn Olson for finally getting it merged in 2020.

A friend and former coworker sent me a reference to the "hacker's delight CRC", which was posted in 2014. That looks to use the same techniques that Mark Adler released in 2013.

I don't know who originated the work, but I'm going to guess it was Mark Adler or one of his colleagues, given that he is the creator of the Adler-32 checksum, and co-creator of the Zip format, among others.

What was done? Mark transformed the typically bit- or byte-wise operations of CRC into operations that could be done up to 8 bytes at a time. CPUs are built to do many such operations at the same time, so this naturally sped up CRC from around 400-408 MB/second to ~1420 MB/second, at least on one of my daily-workhorse Intel Xeon E5-2670s. That's a ~3.5x speedup just by switching from 1 byte at a time to 8 bytes. Quite incredible; I wouldn't have believed it if I hadn't compiled, run, and compared the outputs myself.
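To give a flavor of the technique, here is a minimal slice-by-8 sketch for a reflected CRC-64, assuming a little-endian host; the polynomial constant below is the bit-reversed form of the CRC-64/Jones polynomial (0xad93d23594c935a9), and the code is illustrative rather than the exact crcspeed / Valkey implementation:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bit-reversed form of the CRC-64/Jones polynomial 0xad93d23594c935a9;
 * substitute whichever polynomial (and convention) you actually need. */
#define CRC64_POLY 0x95ac9329ac4bc9b5ULL

static uint64_t table[8][256];

static void crc64_init(void) {
    /* table[0] is the classic byte-at-a-time table... */
    for (int b = 0; b < 256; b++) {
        uint64_t crc = b;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? CRC64_POLY : 0);
        table[0][b] = crc;
    }
    /* ...and table[n][b] is the CRC of byte b followed by n zero bytes. */
    for (int n = 1; n < 8; n++)
        for (int b = 0; b < 256; b++)
            table[n][b] = (table[n - 1][b] >> 8) ^ table[0][table[n - 1][b] & 0xff];
}

/* One 8-byte fold: XOR the next 8 data bytes into the CRC, then combine one
 * table lookup per byte. */
static inline uint64_t crc64_fold8(uint64_t crc, const unsigned char *p) {
    uint64_t word;
    memcpy(&word, p, 8);            /* little-endian host assumed */
    crc ^= word;
    return table[7][crc & 0xff]         ^ table[6][(crc >> 8) & 0xff]  ^
           table[5][(crc >> 16) & 0xff] ^ table[4][(crc >> 24) & 0xff] ^
           table[3][(crc >> 32) & 0xff] ^ table[2][(crc >> 40) & 0xff] ^
           table[1][(crc >> 48) & 0xff] ^ table[0][crc >> 56];
}

static uint64_t crc64(uint64_t crc, const unsigned char *buf, size_t len) {
    while (len >= 8) {              /* main loop: 8 bytes per iteration */
        crc = crc64_fold8(crc, buf);
        buf += 8;
        len -= 8;
    }
    while (len--)                   /* finish the tail a byte at a time */
        crc = table[0][(crc ^ *buf++) & 0xff] ^ (crc >> 8);
    return crc;
}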

Not used, and rarely mentioned, Mark had also provided a method for CRC combining. Normally, if you have some data, one function has to run from start to finish over that whole block of data. You can pause, but you have to do the first part first. This limitation is very common among hashing algorithms.

Again, I don't know the origination of the idea, but Mark determined a way to merge the CRCs of conceptually adjacent segments. So you can cut one segment into two, compute over the segments in parallel, combine the results, and get the same result as if you had computed the value serially.
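In pseudo-usage form (hypothetical function names mirroring the sketches above, not the exact Valkey API), the property looks like this:

/* Hash two halves independently (possibly on different threads), then
 * staple the results back together.  crc64_combine(c1, c2, len2) is the
 * hypothetical combine: it returns the CRC of the concatenation, given the
 * CRC of the first piece, the CRC of the second piece, and the second
 * piece's length. */
size_t half = len / 2;
uint64_t a = crc64(0, buf, half);                 /* bytes [0, half)   */
uint64_t b = crc64(0, buf + half, len - half);    /* bytes [half, len) */
uint64_t whole = crc64_combine(a, b, len - half); /* == crc64(0, buf, len) */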

Normally, most folks would add some threads, build a thread scheduler, and work out how to make this faster with threads. I did this, but only because I already had a threaded snapshot engine for Redis (not a part of this PR), so my CRCs were naturally faster. But I wasn't just interested in threads; Mark's improvements didn't come from thread shuffling on one CPU, they came from putting more operations into the CPU from one thread.

Armed with the knowledge that I could extend a CRC, I pulled the 8-bytes-at-a-time CRC operation into a group of macros, then had a single thread run over 2 or 3 segments at once, producing 2 or 3 concurrent 8-byte segment CRCs over the data, all in one thread.
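A rough two-stream sketch of that idea, reusing the hypothetical crc64(), crc64_fold8(), and crc64_combine() helpers from the earlier sketches (the real code uses macros and also has a three-stream variant):

/* Split the buffer into two 8-byte-aligned chunks, advance both CRCs inside
 * the same loop so their independent table lookups can overlap in the CPU's
 * pipelines, then staple stream 2 onto stream 1 and handle any leftover. */
uint64_t crc64_dual(uint64_t crc, const unsigned char *buf, size_t len) {
    size_t half = (len / 16) * 8;   /* bytes per stream, a multiple of 8 */
    uint64_t c1 = crc, c2 = 0;

    for (size_t i = 0; i < half; i += 8) {
        c1 = crc64_fold8(c1, buf + i);         /* stream 1: bytes [0, half)      */
        c2 = crc64_fold8(c2, buf + half + i);  /* stream 2: bytes [half, 2*half) */
    }
    crc = crc64_combine(c1, c2, half);         /* merge stream 2 after stream 1  */
    return crc64(crc, buf + 2 * half, len - 2 * half);   /* leftover tail bytes  */
}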

I had difficulty making CRC64 faster for data smaller than about 2-3 megabytes until I noticed something important: the original CRC combine method was taking ~270 microseconds to staple the CRCs of ~1 meg segments together. After some vectorization, manual and automatic, I settled on a compiler-vectorized method that got me down to ~29 microseconds for the CRC combine operation, back in spring 2018.

On Wednesday, March 29, after looking at the algorithm further while preparing the patch for what ultimately became this pull request, I noticed that if you were to cache a certain set of data, much like we had done with the CRC computation in the first place, you could eliminate much of that remaining ~29 microseconds. So I did.

After a quick cache initialization that takes around 140 microseconds, or just more than half the time of the original staple over 1 meg, a 32 kilobyte cache is populated for your polynomial that allows CRC64 stapling. Due to an apparent cycle in the CRC after 64 bits of merge, we can actually merge segments of unlimited length. After initialization, calls to combine about a megabyte now take roughly 25 nanoseconds. Yes, nanoseconds. From 270 microseconds without vectorization and caching, to 29 microseconds with vectorization, then all the way to 25 nanoseconds with caching.
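For intuition, here is a minimal sketch of what a cached combine can look like. The shape of the cache (one 64x64 GF(2) operator per bit of the length, at 64 rows x 8 bytes = 512 bytes each, times 64 bits = 32 KB) matches the size mentioned above, but the names and layout are illustrative rather than the exact Valkey code:

/* zero_ops[k] is the CRC operator for appending 2^k zero bytes, stored as a
 * 64x64 bit-matrix (64 rows of 64 bits).  Populating the cache, e.g. by
 * repeatedly squaring the one-zero-byte operator over GF(2), is done once at
 * startup and omitted here for brevity. */
static uint64_t zero_ops[64][64];

/* Multiply a 64x64 GF(2) matrix by a 64-bit vector: XOR together the rows
 * selected by the set bits of the vector.  This inner loop is where the
 * vectorization mentioned above pays off. */
static uint64_t gf2_matrix_times(const uint64_t *mat, uint64_t vec) {
    uint64_t sum = 0;
    for (int i = 0; vec; i++, vec >>= 1)
        if (vec & 1)
            sum ^= mat[i];
    return sum;
}

/* crc64_combine(c1, c2, len2): advance c1 past len2 zero bytes by applying
 * the cached operator for each set bit of len2, then XOR in c2. */
uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uint64_t len2) {
    for (int k = 0; len2; k++, len2 >>= 1)
        if (len2 & 1)
            crc1 = gf2_matrix_times(zero_ops[k], crc1);
    return crc1 ^ crc2;
}

With the operators cached, each combine call is at most 64 small matrix-vector products, no matter how large the segments are.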

So at least in Redis-land, CRC64 combining is about 10,800 times faster overall with the use of a 32 kilobyte cache and some auto-vectorization.

Wow.

I am pretty sure that the caching portion is a well-known optimization, at least to Mark Adler, as I see a similar set of CRC combine optimizations in the most recent version of his codebase; that makes me think I should do a literature check before I write software. That said, Mark does not provide the additional vector operations, and without them, the cache speedup (at least in my version) is limited to getting us to 8 microseconds, instead of the 50 nanoseconds typical worst-case.

So overall, the Redis CRC combine is about 160x faster than crcany's combine, and nets a >10,000x speedup over what I was originally working with, compared to around a 60x speedup with caching alone, or only 10x with vectorization alone (the combination is super effective).

Some benchmark outputs are available in a GitHub gist (linked in the references below).

Overall, that gets us from 400-408 megs/second for 1 byte at a time, to ~1420 megs/second for 8 bytes at a time, up to ~2325 megs/second with the 24-byte-at-a-time code. That's roughly 63% faster than the 8-byte version on an Intel Xeon E5-2670 v0 @ 2.60GHz with 20MB of cache.

That is an old CPU, originally marketed in Q1 2012, so it has relatively fewer load, store, and arithmetic units compared to a more modern processor. I happen to have a slightly newer Intel Core i3-8130U @ 2.20GHz, which was first listed for sale in Q1 2018, and which I got in the summer of 2019, inside the laptop in which it currently runs.

Again, we see 400-408 megs/second for 1-byte-at-a-time processing. The 8-byte-at-a-time version gets us up to ~1400 megs/second, or 3.5x faster, with the 24-byte-at-a-time version seeing ~2350 megs/second on larger data. That's an approximate 68% speedup at 24 bytes over the 8-byte-at-a-time method.

In some benchmarks on ~10k of data (fitting inside L1 cache), I saw crc64 achieve upwards of 5670 megs/second on the 8130U.

Beating 1 byte/cycle

On my 5-year-old mobile CPU, we see the 2.2GHz processor handling 2350 megs/second, which is 1.068 bytes/cycle. If the data happens to already be loaded into cache after being created by another step, as is the case in many Redis snapshot operations, we see 2.6 bytes/cycle processed.

This speed puts us on par with the performance of "properly engineered" hashes that don't use any tables, and we are within a factor of 2 of the hardware CRC64 available in some more modern CPUs. For those about to ask: no, we can't use that hardware CRC in Redis or generally; it is not the right polynomial, and if you run the two functions on the same data, you get different outputs.

I do not know if you can convert the output of CRC from one polynomial to another, but I suspect the answer is no. Might be an interesting question to ask a good mathematician.

Similar optimizations for 16- and 24-way (or more) processing can be done on CRC16 in Redis, or any other CRC anywhere, as the 8-bytes-at-a-time processing has already been implemented in crcany. We skip the 16/24-byte optimization for CRC16 in Redis because CRC16 there is typically used for cluster slot identification, which is performed on keys that are generally tens to maybe hundreds of bytes in size, and there isn't a benefit to the optimization until about 128 bytes for 16-way, and ~2k for 24-way. CRC64, on the other hand, is typically performed over snapshots of keys and data, which can reach tens of gigabytes in size.

I'll try to get the optimizations upstream to crcany soon.

Summary

That's a lot of information, but CRC64 in Redis is getting about 60-70% faster for old machines, and likely 2x or faster on even newer hardware. For CRC combining, we achieve ~50ns worst-case time.

References

Pull request: https://github.com/redis/redis/pull/11988

Benchmark outputs: https://gist.github.com/josiahcarlson/9a147d39cc492951faffb723149f7926

Mark Adler's crcany project: https://github.com/madler/crcany/