Showing posts with label redis.

Thursday, May 9, 2024

CRC64 in Valkey got faster

As teased in my last blog post, CRC64 in Redis was supposed to get faster, but it didn't. About a year after I posted my PR, Redis the company (formerly Garantia Data, then Redis Labs, with whom I've worked a booth at AWS, written articles, given several talks, etc.) changed the license of Redis the software. I don't like the new license. Subsequently, Madelyn Olson asked me to port my changes over to Valkey, and I want to make the world better, so I did.

When I originally ported my changes to Valkey, the tests used the generally unused `SERVER_TEST` define, which was not exposed in a convenient manner. Since then, Madelyn has migrated those tests over to the new C-level unittest framework, so the instructions have changed since my changes first landed. Like most software engineers, I like to be able to test my things, and to tell you how to test them too.

Testing updated crc64 performance in Valkey

The first thing you'll want is to get the most recent unstable branch of Valkey (if you want to get all of Valkey / Redis history, omit the '--depth 1' argument), which will require (at minimum) git, make, and a recent gcc or clang:

% git clone --depth 1 https://github.com/valkey-io/valkey.git
% cd valkey/src
% make valkey-unit-tests
% ./valkey-unit-tests --single test_crc64combine.c --crc 10000000
% ./valkey-unit-tests --single test_crc64combine.c --crc 10000000 --combine

The first unit test should output something like...

[START] - test_crc64combine.c
test size=10000000 algorithm=crc_1byte 375 M/sec matches=1
test size=10000000 algorithm=crcspeed 1345 M/sec matches=1
test size=10000000 algorithm=crcdual 2246 M/sec matches=1
test size=10000000 algorithm=crctri 2428 M/sec matches=1
test size=6093750 algorithm=crc_1byte 402 M/sec matches=1
test size=6093750 algorithm=crcspeed 1343 M/sec matches=1
test size=6093750 algorithm=crcdual 2254 M/sec matches=1
test size=6093750 algorithm=crctri 2423 M/sec matches=1
...

Examining the output, wherever crcdual and crctri start being slower than crcspeed is where you should set your dual / tri cutoffs. At present, these cutoffs are set at default levels that seem to work well with Intel processors in the Sandy Bridge era or later (OEM shipments starting in 2013). I don't know about performance on AMD CPUs, or even the ARM CPUs in the new Macs, but if someone wants to test and let me know, I'd be appreciative.

Similarly, we can see how our crc extension performs with the additional --combine argument, producing something like:

[START] - test_crc64combine.c
init_64 size=18446744073709551615 in 689000 nsec
hash_as_size_combine size=10000000 in 1036 nsec
largest_combine size=18446744073709551615 in 10187 nsec
combine size=10000000 in 1049 nsec
combine size=6093750 in 2333 nsec
combine size=3713381 in 1576 nsec
combine size=2262843 in 1560 nsec
combine size=1378922 in 1217 nsec
combine size=840282 in 1431 nsec
combine size=512048 in 1030 nsec
...

We have generally set our cutoffs higher for non-x86 (arm-64 on MacOS / Raspberry Pi) because we don't have an explicitly vectorized combine function for them. Mostly the combine function is a matter of curiosity, but for longer-term scaling, it is important, as it sets the minimum size where combining makes sense vs. doing the regular calculation.

How did we get here, and can we go faster?

I originally added combine functionality in 2019 as I was building my own threaded Redis fork called StructD (now defunct). Because I had threads, spawning multiple threads in addition to optimizing individual thread performance would allow me to process entire files in parallel, chunked out to workers, merging and verifying the results at the end.

While it might not seem necessary to perform CRC64 verification with such speed, we were both creating and loading 10-100 gigabyte snapshots. As I was building StructD, the crcspeed changes from Matt Stancliff had not yet been merged in, and we were seeing 4+ minutes to verify our largest snapshots.

With 1.4 gigs/second after merging crcspeed, that pushed timing down to ~75 seconds just to verify. But with my updated 2.4 gig/second crctri, plus being able to go multi-threaded, my record for verifying a 40 gig snapshot on my local machine (file stored in shared memory to eliminate disk IO) was about 2.5 seconds, compared with about 30 seconds for April 25, 2020 crcspeed Redis, and 4+ minutes with the one-byte-at-a-time classic. Overall, that's roughly a 100x performance improvement just in verifying snapshots.
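
To make the chunk-and-merge idea concrete, here is a minimal Python sketch of the approach (not the StructD code); crc64() and crc64_combine() are hypothetical stand-ins for the C-level functions discussed in this post, with crc64_combine(crc_a, crc_b, len_b) returning the CRC of the concatenated data:

from concurrent.futures import ProcessPoolExecutor

CHUNK = 64 * 1024 * 1024  # 64 megabyte chunks

def parallel_crc64(data, workers=8):
    # crc64 / crc64_combine are hypothetical stand-ins (see above)
    # split the snapshot into chunks and CRC them in parallel
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        crcs = list(pool.map(crc64, chunks))

    # stitch each chunk's CRC onto the CRC of everything before it
    total = crcs[0]
    for crc, chunk in zip(crcs[1:], chunks[1:]):
        total = crc64_combine(total, crc, len(chunk))
    return total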

Testing updated performance in smhasher

Clocking in at about 15 gigs/second of CRC64 computation using 8 threads is not quite a linear speedup, but it beats every other real CRC of any bit width available in the smhasher suite of hashes, hardware or software. To compare properly, though: our single-threaded CRC64 variant is about half the speed of hardware CRC64 on my machine, using only software. The variants are available as crc64_jones, crc64_jones1, crc64_jones2, and crc64_jones3 - the Valkey automatic 1-3 pipeline version, and the fixed 1, 2, and 3-pipeline versions, respectively (you will need git, make, cmake, gcc, and g++ to build smhasher via 'sh build.sh', then run './SMHasher --test=Speed crc64_jones').

On the whole, my changes demonstrate a path for single-threaded CRC seeing near-hardware performance for any polynomial on modern CPUs with enough spare pipelines, L1/L2 cache, and memory bandwidth. When combined with threads directly, we become limited by data source IO speeds and memory bandwidth - not CRC algorithm performance.

For more free software, please feel free to sponsor me over on Github.

Thursday, April 6, 2023

CRC64 in Redis is getting even faster

On Friday last week, I submitted this PR to Redis: https://github.com/redis/redis/pull/11988 .

In the PR, I make single-threaded CRC64 53-73% faster for a CPU that is 13 years old, and ~67% faster for a CPU that is 5 years old. I'll get into the details more as to how / why, but modern CPUs could see 2x+, and an update to the algorithm should allow for leveraging more CPU hardware as it scales. This is also generally applicable to all CRC polynomials, and brings software CRC within 2x of hardware CRC performance.

Oh, then I also made CRC concatenation >10,000x faster, but I suspect that was previously known, and just not reported for what it was.

I know, this is going to feel like "learn this one simple trick to make all your programs faster" clickbait, but for some algorithms, 2x faster is just the start (and the same ideas extrapolate to other CRCs).

Computer Architecture Background

For folks who already know about the history of computers, and how / why CPUs do what they do, jump ahead to "Making it faster".

Decades before I was born, Alan Turing and John von Neumann developed competing thoughts about how to conceptualize and build computers. Ultimately, the von Neumann architecture won out. It is best characterized by a central processing unit with control flow and arithmetic operations, along with an interface to memory, input devices, and output devices. To go a little further for our specific example, processors have been built to basically do a few things really well: read instructions, decode those instructions into sub-instructions, wait for data to arrive, then finally execute on any data that is available.

Fetching things can sometimes take a while, and execution isn't nearly an instant operation. Even with your 3 or 4 gigahertz PC, what's really going on is that most CPU instructions, especially complex mathematical functions like square root, division, and modulo, all take multiple steps. So perhaps if you are performing a division operation, you may only get a few bits each cycle, and you may need to wait tens of cycles to get your result. Normally this is all hidden, so you only get your final results, and not incomplete results.

In the specification manuals for most CPUs, you can look up information about how many instructions of what types can be dispatched in a cycle, and how many cycles each operation will take. Depending on the operation, and other things going on at the time, sometimes you can start multiple operations of the same type in adjacent "pipelines", like having multiple lines in a factory. Sometimes you can add items to the lines every cycle, and sometimes each step takes longer than a cycle and items to be added to the pipeline have to wait. Sometimes, there are groups of 2, 4, or 8 items that need to be processed at once, and we have instructions specifically designed to process these groups in what is called vectorization.

Making it faster

Back in 2013, Mark Adler released performance improvements to the CRC class of algorithms. I didn't learn of them until several years later, through Matt Stancliff's Redis work and optimizations around 2017 or 2018. Huge thank you to Mark for writing the code, Matt for making it work with Redis, and Madelyn Olson for finally getting it merged in 2020.

A friend and former coworker sent me a reference to the "Hacker's Delight CRC", which was posted in 2014. It looks to use the same techniques that Mark Adler released in 2013.

I don't know who originated the work, but I'm going to guess it was Mark Adler or one of his colleagues, given that he is the creator of the Adler-32 checksum and a co-author of zlib and gzip, among others.

What was done? Mark transformed the typically bit- or byte-wise operations of CRC into operations that could be done up to 8 bytes at a time. CPUs are built to do many such operations at the same time, so this naturally sped up CRC from around 400-408 MB/second to ~1420 MB/second, at least on one of my daily-workhorse Intel Xeon 2670's. That's a ~3.5x speedup from switching from 1 byte at a time to 8 bytes. It would seem quite incredible, had I not compiled, run, and compared the outputs myself.

I noticed that Mark had also provided a method to do CRC combining, which was not being used and is rarely mentioned. Normally, if you have some data, you have to run one function from start to finish over that one whole block of data. You can pause, but you have to do the first part first. This limitation is very common among hashing algorithms.

Again, I don't know the origination of the idea, but Mark determined a way to merge the crcs of conceptually adjacent segments. So you can cut your 1 segment into 2, compute over the segments in parallel, combine the results, and get the same result as if you had computed the value serially.

Normally, most folks would add some threads, build a thread scheduler, and work out how to make this faster with threads. I did this, but only because I already had a threaded snapshot engine for Redis (not a part of this PR), so my CRCs were naturally faster. But I wasn't just interested in threads, and Mark's improvements didn't come from thread shuffling on one CPU; they came from putting more operations into the CPU from one thread.

Armed with the knowledge that I could extend a CRC, I pulled the CRC operation over 8 bytes at a time into a group of macros, then had the single thread run over 2 or 3 segments producing 2 or 3 concurrent 8-byte segment CRCs over the data, all in one thread.

I had difficulty making CRC64 faster for data smaller than about 2-3 megabytes until I noticed something important. The original CRC combine method was taking ~270 microseconds to staple ~1 meg CRCs together. After some vectorization, manual and automatic, I settled on a compiler-vectorized method that got me down to ~29 microseconds for the CRC combine operation back in spring 2018.

On Wednesday, March 29, after looking at the algorithm further while preparing the patch for what ultimately became this pull request, I noticed that if you were to cache a certain set of data, much like we had done with the CRC computation in the first place, you could eliminate much of that remaining ~29 microseconds. So I did.

After a quick cache initialization that takes around 140 microseconds, or just more than half the time of the original staple over 1 meg, a 32 kilobyte cache is populated for your polynomial that allows CRC64 stapling. Due to an apparent cycle in the CRC after 64 bits of merge, we can actually merge unlimited length segments. After initialization, calls to combine about a megabyte now take roughly 25 nanoseconds. Yes, nanoseconds. From 270 microseconds without vectorization and caching, to 29 microseconds with vectorization, then all the way to 25 nanoseconds with caching.

So at least in Redis-land, CRC64 combining is about 10,800 times faster overall with the use of a 32 kilobyte cache and some auto-vectorization.

Wow.

I am pretty sure that the caching portion is a well-known optimization, at least to Mark Adler, as I see a similar set of CRC combine optimizations in the most recent version of his codebase; that makes me think I should do a literature check before I write software. That said, Mark does not provide the additional vector operations, and without them, the cache speedup (at least in my version) only gets us to 8 microseconds, instead of the 50 nanoseconds typical worst case.

So overall, the Redis CRC combine is about 160x faster than CRCany's combine, and nets >10,000x speedup from what I was originally working with, compared to around a 60x speedup with caching, or only 10x with vectorization (the combination is super effective).

Here are some benchmark outputs in a Github gist.

Overall, that gets us from 400-408 megs/second for 1 byte at a time, to ~1420 megs/second for 8 bytes at a time, and up to ~2325 megs/second. So roughly 63% faster on an Intel Xeon E5-2670 v0 @ 2.60GHz with 20MB of cache.

That is an old CPU, originally marketed Q1 2012. So it has a relatively smaller number of load, store, and arithmetic units compared to a more modern processor. I happen to have a slightly newer Intel Core i3-8130U @ 2.20 GHz, which was first listed for sale Q1 2018, and which I got in the summer of 2019, inside the laptop in which it currently runs.

Again, we see 400-408 megs/second for 1-byte-at-a-time processing. The 8-byte-at-a-time version gets us up to ~1400 megs/second, or 3.5x faster, with the 24-byte-at-a-time version seeing ~2350 megs/second on larger data. That's an approximate 68% speedup at 24 bytes over the 8-byte-at-a-time method.

In some benchmarks on ~10k of data (fitting inside L1 cache), I saw crc64 achieve upwards of 5670 megs/second on the 8130U.

Beating 1 byte/cycle

On my 5 year old mobile CPU, we see the 2.2ghz processor handling 2350 Megs/second, which is 1.068 bytes/cycle. If the data happens to already be loaded into cache after being created by another step, as is the case in many Redis snapshot operations, we see 2.6 bytes/cycle processed.

This speed puts us on par with the performance of "properly engineered" hashes that don't use any tables, and we are within a factor of 2 of the hardware CRC64 available in some more modern CPUs. For those about to ask: no, we can't use that hardware CRC in Redis or generally, because it is not the right polynomial; if you run the functions on the same data, you get different outputs.

I do not know if you can convert the output of CRC from one polynomial to another, but I suspect the answer is no. Might be an interesting question to ask a good mathematician.

Similar optimizations on 16 and 24-way (or more) processing can be done on CRC16 in Redis, or any other CRCs anywhere, as the 8 bytes at a time processing has already been implemented in CRCany. We skip this 16/24 byte optimization in Redis because CRC16 is typically used there for cluster slot identification, which is performed on keys that are generally in the tens to maybe hundreds of bytes in size, and there isn't a benefit to the optimization until 128 bytes for 16-way, and ~2k for 24-way. On the other hand, CRC64 is typically performed over snapshots of keys and data, which can reach to tens of gigabytes in size.

I'll try to get the optimizations upstream to crcany soon.

Summary

That's a lot of information, but CRC64 in Redis is getting about 60-70% faster for old machines, and likely 2x or faster on even newer hardware. For CRC combining, we achieve ~50ns worst-case time.

References

Pull request: https://github.com/redis/redis/pull/11988

Benchmark outputs: https://gist.github.com/josiahcarlson/9a147d39cc492951faffb723149f7926

Mark Adler's crcany project: https://github.com/madler/crcany/

Sunday, May 15, 2016

rom Indexes and Search

So... some updates on rom.

The end of January, I posted a teaser picture about "Interleaved Indexes" in rom/Redis via tweet. If you haven't seen the picture, it's here:


Interleaved index?

I ended up building an interleaved index in Redis using a bit of Lua scripting and Python. No ZRANGEBYLEX, surprisingly. What is an interleaved index? It's what I was calling an indexing mode that has the same ordering and filtering options as a few different multi-dimensional indexes stored as tree structures; specifically, some KD trees, Redshift interleaved sort keys, BSP trees, a specific implementation of crit-bit trees, and several others offer the specific functionality I was after.

Why

The simple reason I was implementing an interleaved index is that I see some intersection operations on data in Redis as potentially less efficient than a proper multi-dimensional index would or could be. Long story short, it worked, but not without issues. I mentioned some of these issues in a series of tweets 1, 2, and 3, semi-abandoned the code in late February, and am now ultimately not releasing it. Why? Because it doesn't actually add anything. It was complex, borrowed about 750 lines of code I wrote 5 1/2 years ago, and ... no.

A better option

There were a few very simple wins that I knew could be made with the query optimizer, including a fix on my side for a bug when calling TYPE from within a Lua script (which returns a table instead of a string). The ultimate result of that work is that some queries on numeric ranges can be hundreds or thousands of times faster on large indexes, in theory. That is partly due to starting with the correct set/sorted set, but also due to implementing a direct scan of an index instead of an intersect/union plus deleting entries outside the requested ranges.

Sometimes being more specific for optimizations is worth it. Definitely is in this case. For one of my use-cases involving search, I'm seeing 10-20x faster queries in practice, and 150x faster in a simple canned test.

I also removed the non-Lua writing-mode code. Sorry for those of you living in pre-2.6 days, but you'll have to upgrade. Hell, even with Lua scripting turned off, the query optimizer still used Lua, so if this worked on Redis 2.4 recently, I'd be surprised.

So that's what's going on right now.

Rebuild your indexes

Yeah. And rebuild your indexes. I'm sorry. Whenever I'm using rom as a cache or an index of some kind, I re-cache and re-index daily, so things like this always eventually resolve themselves, especially immediately after a library upgrade. Not a new idea; Google did it with their Bigtables for cleanup, and Postgres does auto-vacuum. Call this a manual vacuum via re-indexing.

Once/day or however often, maybe during off hours:

# import all of your models first
# then...
from rom import columns, util
for model in columns.MODELS.values():
    util.show_progress(util.refresh_indices(model))

That will rebuild all of the indexes on all of your models.

Almost P.S. - Loadable modules in Redis?

Redisconf 2016 happened last week and loadable modules were announced. I think that for people who host their own Redis, it could be awesome. Think of it like an answer to Postgres plugins. Hope I can pay someone to run my loadable modules, if I ever get around to building any :P

Wednesday, July 29, 2015

Transactions in Redis

Over the last few months, I've been thinking about and implementing transactions for Lua scripting in Redis. Not everyone understands why I'm doing this, so let me explain with a bit of history.

MySQL and Postgres

In 1998-2003, if you wanted to start a serious database-driven web site/service and didn't have money to pay Microsoft or Oracle for their databases, you picked either MySQL or Postgres. A lot of people picked MySQL because it was faster, and much of that was due to the MyISAM storage engine, which traded away transaction capability for performance - speed is speed. Some people went with Postgres because, despite its measurably slower performance on the same hardware, you could rely on Postgres to not lose your data (to be fair, data loss with MySQL was relatively rare, but data loss is never fun).

A lot of time has passed since then; MySQL moved on from MyISAM as the default storage engine to InnoDB (which has been available for a long time now), gained full transaction support in the storage engine, and more. At the same time, Postgres got faster, and added a continually expanding list of features to distinguish itself in the marketplace. And now the choice of whether to use MySQL or Postgres usually boils down to experience and preference, though occasionally business or regulatory needs dictate other choices.

TL;DR: data integrity

In a lot of ways, Redis up to now is a lot like MySQL was back before InnoDB was an option. There is already a reasonable best-effort to ensure data integrity (replication, AOF, etc.), and the introduction of Lua scripting in Redis 2.6 has helped Redis grow up considerably in its capabilities and the overall simplification of writing software that uses Redis.

Comparatively, Lua scripting operates very much like stored procedures in other databases, but script execution itself has a few caveats. The most important caveat for this post is that once a Lua script has written to the database, it will execute until any one of the following occurs:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script hits an error and exits in the middle, all writes that were done up to the error have occurred, but no more writes will be done from the script
  3. Redis is shut down without saving via SHUTDOWN NOSAVE
  4. You attach a debugger and "fix" your script to get it to do #1 or #2 (or some other heroic deed that allows you to not lose data)
To anyone who is writing software against a database, I would expect that you agree that only case #1 in that list is desirable. Cases #2, #3, and #4 are situations where you can end up with data corruption (cases #2 and #4) and/or data loss (cases #3 and #4). If you care about your data, you should be doing just about anything possible to prevent data corruption and loss. This is not philosophy, this is doing your job. Unfortunately, current Redis doesn't offer a lot of help here. I want to change that.

Transactions in Lua

I am seeking to eliminate cases #2, #3, and #4 above, replacing the entire list with:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script exits with an error, no changes have been made (all writes were rolled back)
No data loss. Either everything is written, or nothing is written. This should be the expectation of any database, and I intend to add it to the expectations that we all have about Redis.

The current pull request is a proof of concept. It does what it says it does, removing the need to lose data as long as you either a) explicitly run your scripts using the transactional variants, or b) force all Lua script calls to have transactional semantics with a configuration option.

There are many ways the current patch can be made substantially better, and I hope for help from Salvatore (the author of Redis) and the rest of the community.

Wednesday, November 26, 2014

Introduction to rate limiting with Redis [Part 2]

This article first appeared on November 3, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

In Introduction to rate limiting with Redis [Part 1], I described some motivations for rate limiting, as well as provided some Python and Lua code for offering basic and intermediate rate limiting functionality. If you haven’t already read it, you should, because I’m going to discuss several points from that article. In this post, I will talk about and address some problems with the previous methods, while also introducing sliding window functionality and variable-cost requests.

Problems with previous methods

The last rate limiting function that we wrote was over_limit_multi_lua(), which used server-side Lua scripting in Redis to do the heavy lifting of actually performing the rate limiting calculations. It is included below with the Python wrapper as a reference.

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count = redis.call('INCR', key)
        redis.call('EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

Hidden inside this code are several problems that can limit its usefulness and correctness when used for its intended purpose. These problems and their solutions are listed below.

Generating keys in the script

One of the first problems you might notice was mentioned by a commenter named Tobias on the previous post: we are constructing keys inside the Lua script. If you’ve read the Redis documentation about Lua scripting, you should know that we are supposed to pass all keys used in the script in from the outside when calling it.

The requirement to pass keys into the script is how Redis attempts to future-proof Lua scripts that are being written, as Redis Cluster (currently in beta) distributes keys across multiple servers. By having your keys known in advance, you can calculate which Redis Cluster server the script should run on, and, if the keys are on multiple Cluster servers, determine that the script can’t run properly.

Our first problem is that generating keys inside the script can make the script violate Redis Cluster assumptions, which makes it incompatible with Redis Cluster, and generally makes it incompatible with most key-based sharding techniques for Redis.

To address this issue for Redis Cluster and other client-sharded Redis setups, we must use a method that handles rate limiting with a single key. Unfortunately, this can prevent atomic execution for multiple identifiers for Redis Cluster, but you can either rely on a single identifier (user id OR IP address, instead of both), or stick with non-clustered and non-sharded Redis in those cases.

What we count matters

Looking at our function definition, we can see that our default limits were 10 requests per second, 120 requests per minute, and 240 requests per hour. If you remember from the “Counting correctly” section, in order for our rate limiter to complete successfully, we needed to only increment one counter at a time, and we needed to stop counting if that counter went over the limit.

But if we were to reverse the order that the limits were defined, resulting in us checking our per-hour, then per-minute, then per-second limits (instead of per-second, minute, then hour), we would have our original counting problem all over again. Unfortunately, due to details too involved to explain here, just sorting by bucket size (smallest to largest) doesn’t actually solve the problem, and even the original order could result in requests failing that should have succeeded. Ultimately our problem is that we are counting all requests, both successful and unsuccessful (those that were prevented due to being over the limit).

To address the issue with what we count, we must perform two passes while rate limiting. Our first pass checks to see if the request would succeed (cleaning out old data as necessary), and the second pass increments the counters. In previous rate limiters, we were basically counting requests (successful and unsuccessful). With this new version, we are going to only count successful requests.

Stampeding elephants

One of the most consistent behaviors that can be seen among APIs or services that have been built with rate limiting in mind is that usually request counts get reset at the beginning of the rate limiter’s largest (and sometimes only) time slice. In our example, at every hour on the hour, every counter that had been incremented is reset.

One common result for APIs with these types of limits and limit resets is what’s sometimes referred to as the “stampeding elephants” problem. Because every user has their counts reset at the same time, when an API offers access to in-demand data, many requests will occur almost immediately after limits are reset. Similarly, if the user knows that they have outstanding requests that they can make near the end of a time slice, they will make those requests in order to “use up” their request credit that they would otherwise lose.

We partially addressed this issue by introducing multiple bucket sizes for our counters, specifically our per-second and per-minute buckets. But to fully address the issue, we need to implement a sliding-window rate limiter, where the count for requests that come in at 6:01PM and 6:59PM aren’t reset until roughly an hour later at 7:01PM and 7:59PM, respectively, not at 7:00PM. Further details about sliding windows are a little later.

Bonus feature: variable-cost requests

Because we are checking our limits before incrementing our counts, we can actually allow for variable-cost requests. The change to our algorithm is minor: we increment by a variable weight instead of by 1.
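
For example, once the finished wrapper later in this post exists, a heavier API call simply passes a larger weight (the specific costs here are made up for illustration):

# an ordinary request costs 1 token against every configured limit
blocked = over_limit_sliding_window(conn, weight=1)

# a hypothetical expensive request (say, a bulk export) costs 5 tokens
blocked = over_limit_sliding_window(conn, weight=5)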

Sliding Windows

The biggest change is actually the move from individual buckets to sliding windows. One way of understanding sliding window rate limiting is that each user is given a number of tokens that can be used over a period of time. When you run out of tokens, you don't get to make any more requests. And when a token is used, that token is restored (and can be used again) after the time period has elapsed.

As an example, if you have 240 tokens that can be used in an hour, and you used 20 tokens at 6:05PM, you would only be able to make up to another 220 requests until 7:04PM. At 7:05PM, you would get those 20 tokens back (and if you made any other requests between 6:06PM and 7:05PM, those tokens would be restored later).

With our earlier rate limiting, we basically incremented counters, set an expiration time, and compared our counters to our limits. With sliding window rate limiting, incrementing a counter isn’t enough; we must also keep history about requests that came in so that we can properly restore request tokens.

One way of keeping a history, which is the method that we will use, is to imagine the whole window as being one large bucket with a single count (the window has a ‘duration’), similar to what we had before, with a bunch of smaller buckets inside it, each of which has their own individual counts. As an example, if we have a 1-hour window, we could use smaller buckets of 1 minute, 5 minutes, or even 15 minutes, depending on how precise we wanted to be, and how much memory and time we wanted to dedicate (more smaller buckets = more memory + more cleanup work). We will call the sizes of the smaller buckets their “precision.” You should notice that when duration is the same as precision, we have regular rate limits. You can see a picture of various precision buckets in a 1 hour window below.


As before, we can consider the smaller buckets to be labeled with individual times, say 6:00PM, 6:01PM, 6:02PM, etc. But as the current time becomes 7:00PM, what we want to do is to reset the count on the 6:00PM bucket to 0, adjust the whole window’s count, and re-label the bucket to 7:00PM. We would do the same thing to the 6:01PM bucket at 7:01PM, etc.
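
Before getting into the data layout, here is a small standalone sketch of the bucket bookkeeping (a hypothetical helper, not part of the final script); the Lua script further down performs the same arithmetic:

import math
import time

def bucket_labels(duration=3600, precision=60, now=None):
    blocks = int(math.ceil(duration / precision))  # sub-buckets per window
    now = time.time() if now is None else now
    block_id = int(now // precision)               # label of the current sub-bucket
    trim_before = block_id - blocks + 1            # oldest label still inside the window
    return block_id, trim_before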

Data representation

We’ve now gotten to the point where we need to start talking about data representation. We didn’t really worry about representation before simply because we were storing a handful of counters per identifier. But now, we are no longer just storing 1 count for a 1 hour time slice, we could store 60 counts for a 1 hour time slice (or more if you wanted more precision), plus a timestamp that represents our oldest mini-bucket label.

For a simpler version of sliding windows, I had previously used a Redis LIST to represent the whole window, with each item in the LIST including both a time label, as well as the count for the smaller buckets. This can work for limited sliding windows, but restricts our flexibility when we want to use multiple rate limits (Redis LISTs have slow random access speeds).

Instead, we will use a Redis HASH as a miniature keyspace, which will store all count information related to rate limits for an identifier in a single HASH. Generally, for a sliding window of a specified duration and precision for an identifier, we will have the HASH stored at the key named by the identifier, with contents of the form:

<duration>:<precision>:o --> <timestamp of oldest entry>
<duration>:<precision>: --> <count of successful requests in this window>
<duration>:<precision>:<ts> --> <count of successful requests in this bucket>

For sliding windows where more than one sub-bucket has had successful requests, there can be multiple <duration>:<precision>:<ts> entries that would each represent one of the smaller buckets. For regular rate limits (not sliding window), the in-Redis schema is the same, though there will be at most one <duration>:<precision>:<ts> key, and duration is equal to precision for regular rate limits (as we mentioned before).

Because of the way we named the keys in our HASH, a single HASH can contain an arbitrary number of rate limits, both regular and windowed, without colliding with one another.

Putting it all together

And finally, we are at the fun part; actually putting all of these ideas together. First off, we are going to use a specification for our rate limits to simultaneously support regular and sliding window rate limits, which looks a lot like our old specification.

One limit is: [duration, limit, precision], with precision being optional. If you omit the precision option, you get regular rate limits (same reset semantics as before). If you include the precision option, then you get sliding window rate limits. To pass one or more rate limits to the Lua script, we just wrap the series of individual limits in a list: [[duration 1, limit 1], [duration 2, limit 2, precision 2], ...], then encode it as JSON and pass it to the script.
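
As a quick example, the specification for a 10/second regular limit, a 120/minute regular limit, and a 240/hour sliding window with 1-minute precision would be built and encoded like this:

import json

limits = [[1, 10], [60, 120], [3600, 240, 60]]
encoded = json.dumps(limits)   # becomes ARGV[1] inside the Lua script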

Inside the script we need to make two passes over our limits and data. Our first pass cleans up old data while checking whether this request would put the user over their limit, the second pass increments all of the bucket counters to represent that the request was allowed.

To explain the implementation details, I will be including blocks of Lua that can be logically considered together, describing generally what each section does after. Our first block of Lua script will include argument decoding, and cleaning up regular rate limits:

local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
local weight = tonumber(ARGV[3] or '1')
local longest_duration = limits[1][1] or 0
local saved_keys = {}
-- handle cleanup and limit checks
for i, limit in ipairs(limits) do

    local duration = limit[1]
    longest_duration = math.max(longest_duration, duration)
    local precision = limit[3] or duration
    precision = math.min(precision, duration)
    local blocks = math.ceil(duration / precision)
    local saved = {}
    table.insert(saved_keys, saved)
    saved.block_id = math.floor(now / precision)
    saved.trim_before = saved.block_id - blocks + 1
    saved.count_key = duration .. ':' .. precision .. ':'
    saved.ts_key = saved.count_key .. 'o'
    for j, key in ipairs(KEYS) do

        local old_ts = redis.call('HGET', key, saved.ts_key)
        old_ts = old_ts and tonumber(old_ts) or saved.trim_before
        if old_ts > now then
            -- don't write in the past
            return 1
        end

        -- discover what needs to be cleaned up
        local decr = 0
        local dele = {}
        local trim = math.min(saved.trim_before, old_ts + blocks)
        for old_block = old_ts, trim - 1 do
            local bkey = saved.count_key .. old_block
            local bcount = redis.call('HGET', key, bkey)
            if bcount then
                decr = decr + tonumber(bcount)
                table.insert(dele, bkey)
            end
        end

        -- handle cleanup
        local cur
        if #dele > 0 then
            redis.call('HDEL', key, unpack(dele))
            cur = redis.call('HINCRBY', key, saved.count_key, -decr)
        else
            cur = redis.call('HGET', key, saved.count_key)
        end

        -- check our limits
        if tonumber(cur or '0') + weight > limit[2] then
            return 1
        end
    end
end

Going section by section through the code visually, where a blank line distinguishes individual sections, we can see 6 sections in the above code:
  1. Argument decoding, and starting the for loop that iterates over all rate limits
  2. Prepare our local variables, prepare and save our hash keys, then start iterating over the provided user identifiers (yes, we still support multiple identifiers for non-clustered cases, but you should only pass one identifier for Redis Cluster)
  3. Make sure that we aren’t writing data in the past
  4. Find those sub-buckets that need to be cleaned up
  5. Handle sub-bucket cleanup and window count updating
  6. Finally check the limit, returning 1 if the limit would have been exceeded
Our second and last block of Lua operates under the precondition that the request should succeed correctly, so we only need to increment a few counters and set a few timestamps:

-- there is enough resources, update the counts
for i, limit in ipairs(limits) do
    local saved = saved_keys[i]

    for j, key in ipairs(KEYS) do
        -- update the current timestamp, count, and bucket count
        redis.call('HSET', key, saved.ts_key, saved.trim_before)
        redis.call('HINCRBY', key, saved.count_key, weight)
        redis.call('HINCRBY', key, saved.count_key .. saved.block_id, weight)
    end
end

-- We calculated the longest-duration limit so we can EXPIRE
-- the whole HASH for quick and easy idle-time cleanup :)
if longest_duration > 0 then
    for _, key in ipairs(KEYS) do
        redis.call('EXPIRE', key, longest_duration)
    end
end

return 0

Going section by section one last time gets us:
  1. Start iterating over the limits and grab our saved hash keys
  2. Set the oldest data timestamp, and update both the window and buckets counts for all identifiers passed
  3. To ensure that our data is automatically cleaned up if requests stop coming in, set an EXPIRE time on the keys where our hash(es) are stored
  4. Return 0, signifying that the user is not over the limit

Optional fix: use Redis time

As part of our process for checking limits, we fetch the current unix timestamp in seconds. We use this timestamp as part of the sliding window start and end times and which sub-bucket to update. If clients are running on servers with reasonably correct clocks (within 1 second of each other at least, within 1 second of the true time optimally), then there isn’t much to worry about. But if your clients are running on servers with drastically different system clocks, or on systems where you can’t necessarily fix the system clock, we need to use a more consistent clock.

While we can’t always be certain that the system clock on our Redis server is necessarily correct (just like we can’t for our other clients), if every client uses the time returned by the TIME command from the same Redis server, then we can be reasonably assured that clients will have fairly consistent behavior, limited to the latency of a Redis round trip with command execution.

As part of our function definition, we will offer the option to use the result of the TIME command instead of system time. This will result in one additional round trip between the client and Redis to fetch the time before passing it to the Lua script.

Add in our Python wrapper, which handles the optional Redis time and request weight parameters, and we are done:

def over_limit_sliding_window(conn, weight=1, limits=[(1, 10), (60, 120), (3600, 240, 60)], redis_time=False):
    if not hasattr(conn, 'over_limit_sliding_window_lua'):
        conn.over_limit_sliding_window_lua = conn.register_script(over_limit_sliding_window_lua_)

    now = conn.time()[0] if redis_time else time.time()
    return conn.over_limit_sliding_window_lua(
        keys=get_identifiers(), args=[json.dumps(limits), now, weight])

If you would like to see all of the rate limit functions and code in one place, including the over_limit_sliding_window() Lua script with wrapper, you can visit this Github gist.

Wrap up and conclusion

Congratulations on getting this far! I know, it was a slog through problems and solutions, followed by a lot of code, and now after seeing all of it I get to tell you what you should learn after reading through all of this.

Obviously, the first thing you should get out of this article is an implementation of sliding window rate limiting in Python, which is trivially ported to other languages -- all you need to do is handle the wrapper. Just be careful when sending timestamps, durations, and precision values to the script, as the EXPIRE call at the end expects all timestamp values to be in seconds, but some languages natively return timestamps as milliseconds instead of seconds.

You should also have learned that performing rate limiting with Redis can range from trivial (see our first example in part 1) to surprisingly complex, depending on the features required, and how technically correct you want your rate limiting to be. It also turns out that the problems that were outlined at the beginning of this article aren’t necessarily deal-breakers for many users, and I have seen many implementations similar to the over_limit_multi_lua() method from part 1 that are perfectly fine for even heavy users*. Really it just means that you have a choice about how you want to rate limit.

And finally, you may also have learned that you can use Redis hashes as miniature keyspaces to collect data together. This can be used for rate limiting as we just did, as well as a DB row work-alike (the hash keys are like named columns, with values the row content), unique (but unsorted) indexes (i.e. email to user id lookup table, id to encoded data lookup table, ...), sharded data holders, and more.
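
As a tiny, hypothetical example of the unique-index case, using the same conn as in the rest of this post (the key name and id are made up):

# one hash acts as an email -> user id lookup table
conn.hset('email:to:id', 'user@example.com', 727)
user_id = conn.hget('email:to:id', 'user@example.com')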

For more from me on Redis and Python, you can check out the rest of my blog at dr-josiah.com.

* When Twitter first released their API, they had a per-hour rate limit that was reset at the beginning of every hour, just like our most basic rate limiter from part 1. The current Twitter API has a per-15 minute rate limit, reset at the beginning of every 15 minute interval (on the hour, then 15, 30, and 45 minutes after the hour) for many of their APIs. (I have no information on whether Twitter may or may not be using Redis for rate limiting, but they have admitted to using Redis in some capacity by virtue of their release of Twemproxy/Nutcracker).

Monday, November 3, 2014

Introduction to rate limiting with Redis [Part 1]

This article first appeared on October 9, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

Over the years, I've written several different rate limiting methods using Redis for both commercial and personal projects. This two-part tutorial intends to cover two different but related methods of performing rate limiting in Redis using standard Redis commands and Lua scripting. Each method expands the number of use-cases for rate limiting, and cleans up some of the rougher edges of previous rate limiters.

This post assumes some experience with Python and Redis, and to a lesser extent Lua, but new users who are willing to read the docs should be okay.

Why rate limit?


Most uses of rate limiting on the web today are generally intended to limit the effect that someone can have on a given platform. Whether it is API limits at Twitter, posting limits at Reddit, or posting limits at StackOverflow, some limit resource utilization, and others limit the effect a spammer account can have. Whatever the reason, let's start with saying that we need to count actions as they happen, and we need to prevent an action from happening if the user has reached or gone over their limit. Let's start with the plan of building a rate limiter for an API where we need to restrict users to 240 requests per hour per user.

We know that we need to count and limit a user, so let's get some utility code out of the way. First, we need to have a function that gives us one or more identifiers for the user performing an action. Sometimes that is just a user id, other times it's the remote IP address; I usually use both when available, and at least IP address if the user hasn't logged in yet. Below is a function that gets the IP address and user id (when available) using Flask with the Flask-Login plugin.

from flask import g, request

def get_identifiers():
    ret = ['ip:' + request.remote_addr]
    if g.user.is_authenticated():
        ret.append('user:' + g.user.get_id())
    return ret

Just use a counter

Now that we have a function that returns a list of identifiers for an action, let's start counting and limiting. One of the simplest rate limiting methods available in Redis starts by taking the times of the actions as they happen, and buckets actions into ranges of times, counting them as they occur. If the number of actions in a bucket exceeds the limit, we don't allow the action. Below is a function that performs the rate limiting using an automatically-expiring counter that uses 1 hour buckets.

import time

def over_limit(conn, duration=3600, limit=240):
    bucket = ':%i:%i'%(duration, time.time() // duration)
    for id in get_identifiers():
        key = id + bucket

        count = conn.incr(key)
        conn.expire(key, duration)
        if count > limit:
            return True

    return False

This function shouldn't be too hard to understand; for each identifier we increment the appropriate key in Redis, set the key to expire in an hour, and if the count is more than the limit, we return True, signifying that we are over the limit. Otherwise we return False.

And that's it. Well, sort of. This gets us past our initial goal of having a basic rate limiter to limit each user to 240 requests per hour. But reality has a tendency to catch us when we aren't looking, and clients using the API have noticed that their limit is reset at the top of every hour. Now users have started making all 240 requests in the first few seconds they can, so all of our work limiting requests is wasted, right?

Multiple bucket sizes

Our initial rate limiting on a per-hour basis was successful in that it limited users on an hourly basis, but users started using all of their API requests as soon as they could (at the beginning of the hour). Looking at the problem, it seems almost obvious that in addition to a per-hour rate limit, we should probably also have a per-second and/or per-minute rate limit to smooth out peak request rates.

Let's say that we determined that 10 requests per second, 120 requests per minute, and 240 requests per hour were fair enough to our users, and let us better distribute requests over time. We could simply re-use our earlier over_limit() function to offer this functionality.

def over_limit_multi(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    for duration, limit in limits:
        if over_limit(conn, duration, limit):
            return True
    return False

This will work for our intended use, but each of the 3 rate limit calls can result in two counter updates and two expire calls (one of each for the IP and user keys), so we may need to perform 12 total round trips to Redis just to say whether someone is over their limit. One common method of minimizing the number of round trips to Redis is to use what is called 'pipelining'. Pipelining in the Redis context will send multiple commands to Redis in a single round trip, which can reduce overall latency.

Coincidentally, our over_limit() function is written in such a way that we could easily replace our INCR and EXPIRE calls with a single pipelined request to increment the count and update the key expiration. The updated function can be seen below, and cuts our number of round trips from 12 to 6 when combined with over_limit_multi().

def over_limit(conn, duration=3600, limit=240):
    # Replaces the earlier over_limit() function and reduces round trips with
    # pipelining.
    pipe = conn.pipeline(transaction=True)
    bucket = ':%i:%i'%(duration, time.time() // duration)
    for id in get_identifiers():
        key = id + bucket

        pipe.incr(key)
        pipe.expire(key, duration)
        if pipe.execute()[0] > limit:
            return True

    return False

Halving the number of round trips to Redis is great, but we are still performing 6 round trips just to say whether a user can make an API call. We could write a replacement over_limit_multi() that makes all increment and expire operations at once, checking the limits after, but the obvious implementation actually has a counting bug that can prevent users from being able to make 240 successful requests in an hour (in the worst case, a client may experience only 10 successful requests in an hour, despite making over 100 requests per second for the entire hour). This counting bug can be fixed with a second round trip to Redis, but let's instead shift our logic into Redis.

Counting correctly

Instead of trying to fix a fully pipelined version, we can use the ability to execute Lua scripts inside Redis to perform the same operation while also keeping to one round trip. The specific operations we are going to perform in Lua are almost the exact same operations as we were originally performing in Python. We are going to iterate over the limits themselves, and for each identifier, we are going to increment a counter, update the expiration time of the updated counter, then check to see if we are over the limit. We will also use a small Python wrapper around our Lua to handle argument conversion and to hide the details of script loading.

import json

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count = redis.call('INCR', key)
        redis.call('EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

With the section of code starting with 'local bucket', you will notice that our Lua looks very much like and performs the same operations as our original over_limit() function, with the remaining code handling argument unpacking and iterating over the individual limits.

Conclusion

At this point we have built a rate limiting method that handles multiple levels of timing granularity, can handle multiple identifiers for a single user, and can be performed in a single round trip between the client and Redis. We started from a single-bucket rate limiter to a rate limiter that can evaluate multiple limits simultaneously.

Any of the rate limiting functions discussed in this post are usable for many different applications. In part two, I'll cover a different way of approaching rate limiting, which rounds out the remaining rough edges in our rate limiter. Read it over on Binpress.

More detailed information on Lua scripting can be found in the help for the EVAL command at Redis.io.

Thursday, October 24, 2013

Multi-column (SQL-like) sorting in Redis

Recently, I received an email from a wayward Redis user asking about using Redis Sets and Sorted Sets to sort multiple columns of data, with as close to the same semantics as a traditional SQL-style "order by" clause as possible. Well, it is possible, with limitations; keep reading to find out how.

What is Redis?


For those people who don't quite know what Redis is already, the TLDR version is: an in-memory data structure server that maps from string keys to one of 5 different data structures, providing high-speed remote access to shared data, and optional on-disk persistence. In a lot of ways, you can think of Redis like a version of Memcached where your data doesn't disappear if your machine restarts, and which supports a wider array of commands to store, retrieve, and manipulate data in different ways.

The setup


With that out of the way, our intrepid Redis user had come to me with a pretty reasonable problem: he needed to build an application to display a listing of businesses, sorted by several criteria. In his case, he had "price", "distance[1]", and "rating". As we have all seen in recent years with individual retailer searches, never mind restaurant searches on Yelp and similar applications, when searching for something in the physical world there are a few things you care about primarily. These usually break down preferentially as lowest distance, lowest price, highest rating. In a relational database/SQL world, these fields would all be columns in a table (or spread out over several tables, or calculated in real time), so we are going to be referring to them as "sort columns" from here on.

Now, depending on preferences, you can sometimes get column preference and ascending/descending changes, which is why we need to build a system that can support reordering columns *and* switching the order of each individual column. Say that we really want the highest rating, lowest distance, lowest price? We need to support that too, and we can.

The concept


Because we are dealing with sort orders, we have two options. We can either use the Redis SORT command, or we can use sorted sets. There are ways of building this using the SORT command, but it is much more complicated and requires quite a bit of precomputation, so we'll instead use sorted sets.

We will start by making sure that every business has an entry in each of 3 different sorted sets representing price, distance, and rating. If a business has an "id" of 5, has a price of 20, distance of 4, and a rating of 8, then at some point the commands "ZADD price 20 5", "ZADD distance 4 5", and "ZADD rating 8 5" will have been called.

Once all of our data is in Redis, we then need to determine the maximum value of each of our sort columns. If you have ranges that you know are fixed, like say that you know that all of your prices and distance will all be 0 to 100, and your rating will always be 0 to 10, then you can save yourself a round-trip. We'll go ahead and build this assuming that you don't know your ranges in advance.

We are trying to gather our data range information in advance in order to carve up the 53 bits of precision [2] available in the floating-point doubles used for sorted set scores. If we know our data ranges, then we know how many "bits" to dedicate to each column, and we know whether we can actually sort our data exactly, without losing precision.

If you remember our price, distance, and range information, you can imagine that (borrowing our earlier data) if we have price=20, distance=4, rating=8, and we want to sort by price, distance, -rating, we want to construct a "score" that will sort the same as the "tuple" comparison (20, 4, -8). By gathering range information, we could (for example) translate that tuple into a score of "20042", which you can see is basically the concatenation of "20", "04", and 10-8 (we subtract from 10 here because the rating column is reversed, and it helps to understand how we got the values).
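
Here is a minimal sketch of that packing idea, assuming fixed, known ranges (price and distance in [0, 100), rating in [0, 10)); the real code below derives the multipliers from the observed ranges instead of hard-coding them:

MAX_DISTANCE, MAX_RATING = 100, 10

def packed_score(price, distance, rating):
    # sort by price, then distance, then rating descending
    return (price * MAX_DISTANCE * MAX_RATING +
            distance * MAX_RATING +
            (MAX_RATING - rating))

print(packed_score(20, 4, 8))   # 20042, matching the "20042" above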

Note: because of our construction, scores that are not whole numbers may not produce completely correct sorts.

The code


Stepping away from the abstract and into actual code, we are going to perform computationally what I just did above with some string manipulation.  We are going to numerically shift our data into columns, accounting for the magnitude of the data, as well as negative values in the columns (which won't affect our results). As a bonus, this method will even tell you if it believes that you could have a lower-quality sort because your data range is too wide[3].

import math
import warnings

def sort_zset_cols(conn, result_key, sort=('dist', 'price', '-score')):
    current_multiplier = 1
    keys = {result_key: 0}
    sort = list(reversed(sort))

    # Gets the max/min scores in each sort column (strip any leading '-'
    # so we query the actual key name)
    pipe = conn.pipeline(True)
    for sort_col in sort:
        pipe.zrange(sort_col.lstrip('-'), 0, 0, withscores=True)
        pipe.zrange(sort_col.lstrip('-'), -1, -1, withscores=True)
    ranges = pipe.execute()

    for i, sort_col in enumerate(sort):
        # Auto-scaling for negative values; zrange with withscores returns
        # a list of (member, score) pairs, so pull out just the scores
        low, high = ranges[i*2][0][1], ranges[i*2+1][0][1]
        maxv = int(math.ceil(max(abs(low), abs(high))))

        # Adjusts the weights based on the magnitude and sort order of the
        # column
        old_multiplier = current_multiplier
        desc = sort_col.startswith('-')
        sort_col = sort_col.lstrip('-')
        current_multiplier *= maxv

        # Assign the sort key a weight based on all of the lower-priority
        # sort columns
        keys[sort_col] = -old_multiplier if desc else old_multiplier

    if current_multiplier >= 2**53:
        warnings.warn("The total range of your values is outside the "
            "available score precision, and your sort may not be precise")

    # result_key is included in the intersection with a weight of 0, and
    # the sort results are available in the passed result_key
    return conn.zinterstore(result_key, keys)
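
Putting it together, my reading of the intended calling pattern is below; result_key doubles as the set of ids you want sorted (it participates in the ZINTERSTORE with a weight of 0), and the key names and the zadd() mapping form are assumptions on my part:

import redis

conn = redis.Redis()

# the candidate businesses to sort; only membership matters here, so the
# scores are irrelevant
conn.zadd('to_sort', {5: 0, 6: 0, 7: 0})

sort_zset_cols(conn, 'to_sort', sort=('price', 'distance', '-rating'))

# ids now come back ordered by lowest price, then lowest distance, then
# highest rating
print(conn.zrange('to_sort', 0, -1))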

If you prefer to check the code out at Github, here is the gist. Two notes about this code:
  • If the maximum or minimum values in any of the indexed columns become more extreme between the data range check and the actual query execution, some entries may have incorrect ordering (this can be fixed by translating the above to Lua and using the Lua scripting support available in Redis 2.6 and later)
  • If any of your data is missing in any of the indexes, then that entry will not appear in the results
Within the next few weeks, I'll be adding this functionality to rom, my Python Redis object mapper.

Interested in more tips and tricks with Redis? My book, Redis in Action (Amazon link), has dozens of other examples for new and seasoned users alike.


[1] For most applications, the distance criteria is something that would need to be computed on a per-query basis, and our questioning developer already built that part, so we'll assume that is available already.
[2] Except for certain types of extreme-valued doubles, you get 52 bits of actual precision, and 1 bit of implied precision. We'll operate under the assumption that we'll always be within the standard range, so we'll always get the full 53 bits.
[3] There are ways of adjusting the precision of certain columns of the data (by scaling values), but that can (and very likely would) result in scores with fractional components, which may break our sort order (as mentioned in the notes).

Edit on 2015-08-09: Changed the "maxv =" assignment above to be correct in all cases, and made sure the reversed sort (for calculating ranges) is a list for repeated iteration.

Monday, June 3, 2013

What's going on with rom?

This post will talk about the open source Redis object mapper that I've released called rom. I will talk about what it is, why I wrote it, and what I'm planning on doing with it. I've posted several articles about Redis in the past, and you can buy my book, Redis in Action, now (hard copy will be available on/around June 10, 2013 - about a week from now) - enter the code dotd0601au at checkout for half off!

What is rom?


Rom is an "active record"-style object mapper intended as an interface between somewhat-intelligent Python objects and behavior, and data stored in Redis. Early versions (everything available now, and available for the coming few months) are purposefully simplified with respect to what is possible with Redis so that rom's capabilities can grow into what is necessary/desired, rather than trying to build functionality to support any/all possible use-cases up front.

An example use for storing users* can be seen below.

from hashlib import sha256
import os

import rom

def hash_pw(password, salt=None):
    salt = salt or os.urandom(16)
    hash = sha256(salt + password).digest()
    for i in xrange(32768):
        hash = sha256(hash + password).digest()
    return salt, hash

class User(rom.Model):
    name = rom.String(indexed=True)
    email = rom.String(required=True, unique=True, indexed=True)
    salt = rom.String()
    hash = rom.String()

    def update_password(self, password):
        self.salt, self.hash = hash_pw(password)

    def check_password(self, password):
        hash = hash_pw(password, self.salt)[1]
        pairs = zip(map(ord, hash), map(ord, self.hash or ''))
        p1 = sum(x ^ y for x, y in pairs)
        p2 = (len(hash) ^ len(self.hash or ''))
        if p1 | p2:
            raise Exception("passwords don't match")

    @property
    def contact(self):
        return '"%s" <%s>'%(self.name, self.email)

Aside from the coding overhead of handling the hashing of passwords in a somewhat secure manner, setting up attributes and adding simple behaviors on top of models is pretty much what you should expect in an object mapper used in Python.
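
For completeness, using the model looks roughly like the following. Hedging a bit: the exact query API evolved over rom's early releases, and get_by() for unique columns is from memory, so treat this as a sketch rather than gospel:

user = User(name='Jane Doe', email='jane@example.com')
user.update_password('correct horse battery staple')
user.save()

# later, look the user up via the unique email index
found = User.get_by(email='jane@example.com')
found.check_password('correct horse battery staple')
print(found.contact)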

Why did I write rom?


As is the case with many things I build, it was meant to scratch an itch. I was working on an as-of-yet unreleased personal project and I needed a database. This database would only ever need to hold a few megabytes of data (maybe up to the tens of megabytes), but it also may need to perform several thousand reads/second and several hundred writes/second on an under-powered machine, and would need to persist any updated data. Those requirements eliminated a standard database in a typical configuration. However, a relational database instructed to store data in-memory would work, except for the persisting to disk part (and Postgres with some fsync tuning wouldn't be unreasonable, except maybe for the read volume). Given that I have a lot of experience with Redis, and my requirements fit with many of the typical use-cases for Redis, my project seemed to be begging to use Redis.

But after deciding to use Redis for my project, then what? I've built ad-hoc data storage methods on Redis before (roughly 2 dozen different mechanisms that are running or have run in production, never mind the dozens that I've advised others to use), and I feel bad every time I do it. So I started by looking through several of the object mappers for Redis in Python (some of which call themselves 'object Redis mappers' to stick with the 'orm' theme), but I didn't like the way they either exposed or hid the Redis internals. In particular, most of the time, users shouldn't care what some database is doing under the covers, and they shouldn't need to change the way they think about the world to get the job done.

Now, I know what you are thinking. You're thinking: "Josiah, Redis is a database that requires that you change the way you think about the world in order to make it work." Ahh, but that's where you are wrong. The purpose of rom is to abstract away about 90% of the strangeness of Redis to a new user (at least on the Python side). Almost everything works the way most users of SQLAlchemy, Django's ORM, or Appengine's datastore would expect. About the only things that rom doesn't do that those other libraries offer on top of relational databases are: 1) composite indices and 2) ordered indices on string columns.

With Redis and the way rom handles its indices, there might be some performance advantages to offering composite indices for certain queries. But those queries are very limited, and there is just under 64 bits of usable space in any index entry. Ordered indices on string columns are also tough, running into a limit of just under 64 bits to offer ordering there.

There is a method that would increase the limit beyond 64 bits, but that method would be incompatible with the other indices that rom already uses. So, long story short, don't expect composite indices, and don't expect ordered indices on strings that play nicely with other rom indices.

What is in rom's future?


If you've read everything above, you know that composite indices and sorted indices on strings that play nicely with the other indices is not going to happen. But what if you don't care about playing nicely with other indices? Well, that is a whole other story.

With ordered indices on strings, one really nice feature is that you can perform prefix lookups on strings - which makes autocomplete-like problems very easy. Expect that in the future.

At some point I'll also be switching to using Lua scripting to handle updating the data in Redis. That will offer fast and easy support for multiple unique index columns, while simultaneously offering point-in-time atomic updates without retries. All of the major logic would stay on the Python side, leaving simple updates to be done by Lua. I haven't done it yet because the performance and feature advantages aren't yet drastic enough to necessitate it. With a little work, it would even be possible to implement "check and update" behavior to ensure that data hasn't been manipulated by other clients.

I've also been thinking about deferring attribute validation until just before data is serialized and sent to Redis, as it can be a significant performance advantage. The only reason I didn't do that from the start is because a TypeError on commit() can be a nightmare. Hunting down the answer to, "how did that data get there?" some time after the write occurred can be an exercise in futility. By performing the validation on attribute write (and on data load from Redis), you are notified right away, at the moment you write the wrong data. As such, deferring validation until commit() may become a documented but discouraged feature in the future.

Redis-native structure access


I know that by now, some of you are looking at your screen asking, "When can I get raw access to lists, sets, hashes, and sorted sets? Those are really what my application needs!" And my answer to that is: at some point down the line. I know, that is a wishy-washy answer. And it's wishy-washy because there are three ways of building the functionality: 1) copy data out of Redis, manipulate in Python, then write changes back on commit(), 2) create objects that manipulate the in-Redis data directly, or 3) offer smart objects like #2, but write data back on commit().

If we keep with the database-style, then #1 is the right answer, as it allows us to perform all writes at commit() time. But if people have large lists, sets, hashes, or sorted sets in Redis, then #1 is definitely not the right answer, as applications might be copying out a lot of information. But with #2, updates to other structures step outside the somewhat expected commit() behavior (this entity and all related data have been written). Really, the right answer is #3. Direct access to reading, but masking write operations until commit().
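
To make option #3 concrete, here is a hypothetical sketch (not rom code) of a list wrapper that reads through to Redis immediately but queues writes until commit():

class DeferredList(object):
    """Reads hit Redis immediately; writes are queued until commit()."""
    def __init__(self, conn, key):
        self.conn = conn
        self.key = key
        self._pending = []            # queued (command, args) pairs

    # reads go straight to Redis
    def __len__(self):
        return self.conn.llen(self.key)

    def __getitem__(self, index):
        return self.conn.lindex(self.key, index)

    # writes are masked until commit()
    def append(self, value):
        self._pending.append(('rpush', (self.key, value)))

    def prepend(self, value):
        self._pending.append(('lpush', (self.key, value)))

    def commit(self, pipe):
        # replay the queued writes inside the entity's commit pipeline
        for command, args in self._pending:
            getattr(pipe, command)(*args)
        self._pending = []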

Talking about building native access is easy, but actually building it to be robust is no small task. For now, you're probably better suited writing manual commands against your structures if you need native structure access.

How can you help?


Try rom out. Use it. If you find bugs, report them at Github. If you have fixes for bugs, post the bug with a pull request. If you have a feature request, ask. If your feature request is in the range of things that are reasonable to do and which fit in with what I want rom to be, I'll build it (possibly with a delay). Tell your colleagues about it. And if you are feeling really generous, buy my book: Redis in Action. If you are feeling really generous (or lost), I also do phone consultations.

Thank you for taking the time to read, and I hope that rom helps you.

* Whether or not you should store users in rom and Redis is a question related to whether Redis is suitable for storing such data in your use-case. This is used as a sample scenario that most people that have developed an application with users should be able to recognize (though not necessarily use).

Tuesday, April 30, 2013

Want to get a deal on Redis in Action?

As we come into the home stretch of the release of Redis in Action, a lot of work has been going on behind the scenes. Easily 3-4 complete read-throughs over the last few months, a crew of editors poking and prodding, countless major and minor edits, and for those of you who have already downloaded version 11 of the MEAP (released last week), you will notice that it has taken a pass through the typesetters. This pass has resulted in a level of spit and polish that I couldn't imagine a year ago when the first MEAP went live.

A few issues have turned up in the week since version 11 of the MEAP was released and still need to be fixed, but they are getting cleaned up as I type this.

To commemorate this final push and to give everyone a chance to learn more about their data, Manning is offering 50% off Redis in Action, Lucene in Action 2nd Edition, and Machine Learning in Action. How do you get 50% off? Follow those links and enter dotd0501au on checkout.


When I find time in the coming weeks, you can look forward to two blog posts. The first will be on Steganography in Python, which I presented as a lightning talk on April 1 at the Southern California Python Interest Group meetup. In the second, I'll talk about the whys, whats, and hows of building rom, a Redis Object Mapper in Python.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Thursday, January 12, 2012

Creating a Lock with Redis

This post is a partially rewritten excerpt from Redis in Action, the book that I am writing for Manning Publications, which does not currently have a release date. Over time, I hope to include other excerpts from the book as time allows.

Redis Transactions

In the world of concurrent and parallel programming, there will be some point where you will write something that requires access to a resource without any other thread or process accessing it. In Redis this is actually very common; multiple readers, multiple writers, all dealing with the same data structures. Redis includes a combination of five commands for handling optimistic data locking. Those commands are WATCH, UNWATCH, MULTI, EXEC, and DISCARD.

The typical path for code that intends to modify data in Redis that requires checking values before updating will have a 4-step process. First you WATCH data that you want to modify/check on. Next you check your data. If it isn't what you need to continue, you UNWATCH and return. If it was what you were looking for, you start a MULTI transaction, send your commands, then EXEC them. Below is a simple example that transfers money between two accounts, making sure to prevent overdrafts.

import redis

def transfer_funds(conn, sender, recipient, amount):
    pipe = conn.pipeline(True)
    while 1:
        try:
            pipe.watch(sender)
            # hget returns a string (or None), so convert before comparing
            if int(pipe.hget(sender, 'money') or 0) < amount:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.hincrby(sender, 'money', -amount)
            pipe.hincrby(recipient, 'money', amount)
            pipe.execute()
            return True
        except redis.exceptions.WatchError:
            pass
The only command not used above is the DISCARD command, which will discard any commands passed after MULTI. As you can see, we will attempt to loop through the above money transfer until it either fails due to no money, or succeeds. This is fine, but if you have some account that is updated often, or if you have some shared data structure that is constantly being updated, you can fall into the except clause and have to retry the transaction repeatedly.

Locking

In other software, rather than using WATCH/MULTI/EXEC, there is a primitive called a Lock, which allows us to gain exclusive access to a resource. You acquire the lock, perform your operation, then release the lock. Because of the variety of commands in Redis, you can build a lock using the available tools. Unfortunately, using those tools correctly is not easy. In my research about locks in Redis, I haven't found a single implementation available online that is 100% correct (until now). Some of the problems with locks that I have seen are as follows:
  1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
  2. A process acquired a lock for an operation that takes a long time, and crashed. Other processes that want the lock don't know what process had the lock, so cannot detect that the process failed, and waste time waiting for the lock to be released.
  3. One process had a lock, but it timed out. Other processes try to acquire the lock simultaneously, and multiple processes are able to get the lock.
  4. Because of a combination of #1 and #3, many processes now hold the believed exclusive lock, leading to data corruption or other incorrect behavior.
Even if each of these problems had a 1 in a million chance of occurring, Redis' high performance (typically 100,000 to 225,000 operations/second) can cause those problems to occur under heavy load surprisingly often (up to a few times per minute), so it is important to get locking right.

Building a mostly correct lock in Redis is pretty easy. Building a completely correct lock in Redis isn't actually much more difficult, but it requires being extra careful about the operations we use to build it (in the book, we first build a simple lock to show the basics, then add in the full functionality, here we will jump to the fully-featured lock).

The first part of making sure that no other code can run is to "acquire" the lock. The natural building block to use for acquiring a lock is the SETNX command, which will only set a value if the value doesn't already exist. We will set the value to be a unique identifier to ensure that no other process can get the lock. If we were able to set the value (we have acquired the lock), then we immediately set the expiration of the key to ensure that if we take too long with our operation, the lock is eventually released. But if our client happens to crash (and the worst place for it to crash for us is between SETNX and EXPIRE), we still want the lock to eventually time out. To handle that situation, any time a client fails to get the lock, the client will check the expiration on the lock, and if it's not set, set it. Because clients are going to be checking and setting timeouts if they fail to get a lock, the lock will always have a timeout, and will eventually expire, letting other clients get a timed out lock.

But what if multiple clients set the expiration time at the same time? That is fine, they will be running essentially at the same time, so the expiration will be set for the same time.

import time

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    end = time.time() + atime
    while end > time.time():
        if conn.setnx(lockname, identifier):
            conn.expire(lockname, ltime)
            return identifier
        elif not conn.ttl(lockname):
            # no timeout was set on the lock, so set one ourselves
            conn.expire(lockname, ltime)

        time.sleep(.001)

    return False
To release the lock, we have to be at least as careful as when acquiring the lock. Between the time when we acquired the lock and when we are trying to release it, someone may have done bad things to the lock. To release the lock, we actually need to WATCH the lock key, then check to make sure that the value is still the same as what we set it to before we delete it. This also prevents us from releasing a lock multiple times.
def release_lock(conn, lockname, identifier):
    pipe = conn.pipeline(True)

    while True:
        try:
            pipe.watch(lockname)
            if pipe.get(lockname) == identifier:
                pipe.multi()
                pipe.delete(lockname)
                pipe.execute()
                return True

            pipe.unwatch()
            break

        except redis.exceptions.WatchError:
            pass

    # we lost the lock
    return False
And as a bonus, a Python context manager with the updated transfer_funds() code using it...
import contextlib, uuid

class LockException(Exception):
    pass

@contextlib.contextmanager
def redis_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    if acquire_lock(**locals()) != identifier:
        raise LockException("could not acquire lock")
    try:
        yield identifier
    finally:
        if not release_lock(conn, lockname, identifier):
            raise LockException("lock was lost")

def transfer_funds(conn, sender, recipient, amount):
    with redis_lock(conn, 'lock:' + sender):
        # hget returns a string (or None), so convert before comparing
        if int(conn.hget(sender, 'money') or 0) < amount:
            return False

        pipe = conn.pipeline(True)
        pipe.hincrby(sender, 'money', -amount)
        pipe.hincrby(recipient, 'money', amount)
        pipe.execute()
        return True
If you generated your identifier correctly (UUID like we did, or IP address + process id + thread id, etc.), this lock is correct. Not almost correct, not mostly correct, not a little bit wrong. Completely correct. Other locks that I've seen for Redis have one of a couple different mistakes, usually either accidentally resetting the timeout when you shouldn't, or deleting the lock when you shouldn't. Those kinds of errors lead to the 4 problems I listed earlier.

Simplifying Locking

After reading this, you may think that this is a little more work than should be necessary to build a basic lock with timeouts. You are not alone. Two requests have been made to try to help the situation. The first is allowing SETNX to take an optional 3rd argument that is the expiration time if the item was set. That reduces the Redis commands in acquire_lock() to one command that is looped over (the 10 lines above turn into 4 lines). The second is a new command DELC <key> <value>, which will only delete the given key if the current value is the same as the passed value. This reduces the commands in release_lock() to one command that is executed once in the body of the function (the 15 lines above turn into 2 lines). You can read and follow the discussion on the Redis mailing list.
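
For readers on a newer Redis: version 2.6.12 added NX and EX options to SET, and 2.6 added Lua scripting, which together get you essentially those two one-liners today. A sketch of what the lock collapses to (not the book's code):

import time
import uuid

import redis

RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def acquire_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    end = time.time() + atime
    while time.time() < end:
        # SET ... NX EX performs the SETNX and EXPIRE atomically
        if conn.set(lockname, identifier, nx=True, ex=ltime):
            return identifier
        time.sleep(.001)
    return False

def release_lock(conn, lockname, identifier):
    # delete the lock only if it still holds our identifier
    return bool(conn.eval(RELEASE_SCRIPT, 1, lockname, identifier))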


Thank You

If you liked this article about Redis, the section on locks that will be included in the book expands the discussion and includes more examples. If Redis changes to incorporate the new commands and features to make locking easier, you can look forward to another article revisiting classic Redis locking behavior. If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Thursday, September 15, 2011

Improving Performance by 1000x

Whenever possible at Adly, we like to use existing software to solve as many problems as we can. From PostgreSQL for our database, to syslog-ng and Flask for our logging, to Redis for search, caching, and other fast-data operations, ActiveMQ for our message broker, etc. But there are times when existing software just isn't fast enough, takes too much memory, or just isn't suited for the task. This is one of those times.


In our Adly Analytics product, we calculate a number for pairs of Twitter users called "Also Follows". For two twitter users, it is the number of followers that follow both of those users. We had tried using Redis for this directly, but we were looking at roughly 700 megs of memory to store one set of 7.5 million followers. Ugh! Straight Python did a bit better at 500 megs, but that was still far more than we could reasonably use to store the data. Combine that with Redis intersections taking multiple seconds to run, and the fact that we needed to intersect 1,000 sets against 10,000 sets every week (these were due to the nature of the kinds of updates we were performing), and we'd need to have 33 processors plus roughly 300 gigs of memory to just perform this operation fast enough. That's just too expensive (over $6k/month just for machines in EC2 for the memory).

When faced with seemingly insurmountable costs, we did what any tech-directed company would do; we wrote custom software to solve our problem. What did we do?


First, we got rid of hash tables. They are great for random lookups, but in the case of fixed-size data (we had 32 bit integers for Twitter user ids), a sorted array would be roughly 1/20th the size of the set in Redis, while still offering a method for random lookup via binary search (if necessary).

When we are fetching the follower lists, we get them in chunks of 5000 at a time, unsorted. We write them to temporary files on disk while we are reading them, then when done, we read the list into memory, then sort it using the C standard library quicksort. After sorting, we scan the sequence for any duplicate values and discard them. Once we have a sorted sequence of unique integers, we write them to disk.
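
A rough Python equivalent of that prepare step (ours was C with the standard library's quicksort, but the shape is the same):

import array

def prepare_follower_file(follower_ids, path):
    # sort and drop duplicates, then write as packed 32-bit integers;
    # 'I' is an unsigned int, 32 bits on common platforms
    packed = array.array('I', sorted(set(follower_ids)))
    with open(path, 'wb') as out:
        packed.tofile(out)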

Once they are on disk, we signal our intersection daemon to memory map the file*, letting the operating system cache the file as necessary. When we need to perform the "also follows" calculation, we run a custom intersection algorithm that is tuned to rely on memory bandwidth, which allows us to intersect sets of 5 million and 7.5 million integers (with roughly 2 million shared values) in roughly 6 milliseconds on a 3ghz Core 2 Duo (using one processor). That is roughly 2 billion integers examined every second. Compare that to Redis taking roughly 7 seconds to perform that same operation (due to the random hash table lookups, allocating and copying data, etc.), and we've improved our performance by 1000x.
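
The production version was C operating over memory-mapped arrays, but the core of the intersection is just a two-pointer walk over two sorted arrays; a minimal Python sketch of the idea:

def intersect_sorted(a, b):
    # count the values present in both sorted integer sequences
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            common += 1
            i += 1
            j += 1
    return common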

Our results:
For the data we need to store in memory, we went from over 300 gigs to under 15 gigs. We also went from taking around 45 minutes to intersect 1 item against 10,000, to taking 3 seconds**.  While we still believe in using existing software as much as possible, sometimes it is just worth it to write what you need to in C/C++ to get the speed you need.

Along the way, we tried to use scipy.weave.inline to inline C into Python. Sadly, due to some threading issues, we got some nasty behavior and segfaults. I didn't dig too far into it and just wrote what we needed as a plain Python C extension. It's pretty easy.

* Memory mapped files are an interesting and crazy thing, available in Posix (Linux, BSD, OS X, etc.), Windows, and just about any other mature system out there. They allow you to open a file in such a way that you can access the contents of the file as if it were a C array. You can map the entire file or parts of the file. The operating system caches the data as necessary, and if you have multiple processes memory mapping the same file, the memory is shared between them.
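
In Python, the same trick is close to a one-liner with numpy (an illustration with a made-up filename, not what we shipped; ours was a plain C extension):

import numpy as np

# view a file of packed 32-bit follower ids as an array without reading
# it all into our process; the OS pages it in and shares it between
# processes mapping the same file
followers = np.memmap('followers_12345.u32', dtype=np.uint32, mode='r')
print(len(followers), followers[:5])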

** Theoretically it was dropped to 3 seconds, but due to processor caching behavior, Python overhead, etc., this is actually around 45-50 seconds. In an hour, we sustain roughly 2.4 trillion integers examined, or roughly 670 million integers intersected against each other every second on average.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Update: You can read why we didn't use a bloom filter in a followup post.

Friday, September 2, 2011

Building for Uptime, Expecting Failure talk now available

Way back in June, I posted an article on building Adly's real-time ad network. Well, on August 1, I presented it at the Los Angeles Dev Ops meetup, and I just now got the video uploaded to YouTube. You can view Building for Uptime, Expecting Failure now.

The full slides are available via Google Docs.

Thursday, June 9, 2011

Building an Ad Network Ready for Failure

A little over a year ago in May of 2010, I was tasked with the job of creating a self-serve location-targeted ad network for distributing ads to web, desktop, and mobile applications that typically displayed Twitter and Facebook status messages. This post will describe the architecture and systems that our team put together for an ad network that ran 24/7 for over 6 months, serving up to 200 ad calls per second while targeting under 200ms response time, despite network hiccups and VPS failures, and how you can do the same with your architecture.

Our Rambo Architecture

Before we had started on our adventure to build an ad network, we had suffered hiccups with Amazon's EC2 systems. While most of the time the VPSes that we had created months prior had suffered no issues, occasionally we would have a VPS disappear from the network for some time, EBS volumes would hang on us, EBS volumes would be unable to be re-assigned to another VPS, VPS instances would be stuck in a state of being restarted: all of the things that people have come to expect from EC2. Given these intermittent failures, we understood that our entire system needed to be able to recover from a hang/failure of just about any set of machines. Some months after running the system, we read about Netflix's experience, their Rambo Architecture, and their use of Chaos Monkey. We never bothered to implement a Chaos Monkey.

Load Balancing

Initially, we had used Apache as a load balancer behind an Amazon elastic IP. Before switching live, we decided to go with HAProxy due to its far better performance characteristics, its ability to detect slow/failing hosts, and its ability to send traffic to different hosts based on weights (useful for testing a new configuration, pushing more to beefier hosts, etc.). The *only* unfortunate thing about HAProxy was that when you reloaded the configuration (to change host weights, etc.), all counters were reset. Being able to pass SIGHUP to refresh the configuration would have been nice. Other than that minor nit, HAProxy was phenomenal.

Worker Machines

After running through our expected memory use on each of our worker machines (I will describe their configurations in a bit), we determined that we could get by with a base of 750 megs + 300 megs/processor for main memory to max out the throughput on any EC2 instance. We were later able to reduce this to 450 + 300/processor, but either way, our ad serving scaled (effectively) linearly as a function of processor speed and number. In EC2, this meant that we could stick with Amazon's Hi-CPU Medium VPS instances, which offered the best bang for the $ in terms of performance. We had considered the beefier processors in Hi-Memory instances, but we had literally zero use for the extra memory. If our service took off, we had planned on moving to fewer High CPU Extra Large instances, but for the short term, we stuck with the flexibility and experimentation opportunities that Medium instances offered (thanks to HAProxy's ability to send different traffic volumes to different hosts).

The actual software stack running on each box was relatively slim: Redis (ad targeting and db cache), ActiveMQ (task queues), Apache + mod_wsgi + Python (to serve the ads and interact with everything), and a custom Geocode server written in Python (for handling ip -> location, lat,lon -> location, city/state/country -> location).

Software Stack

The geocoding server was the biggest memory user of them all, starting out at 600 megs when we first pushed everything out. Subsequent refinements on how data was stored dropped that by 300 megs. This process was, in my opinion, the most interesting of the technology that we developed for the ad targeting system, as it was able to take a latitude,longitude pair and determine what country, what state, and what zipcode that location was in (as applicable) in under 1 ms, yet do so in 300 megs of memory using pure Python. It had a simple threaded HTTP server interface. Each VPS ran an instance to minimize round-trip time and to minimize dependencies on other machines.

We had originally run with RabbitMQ, but after running for a week or two, ran into the same crashing bug that Reddit ran into. Our switch to ActiveMQ took a little work (ActiveMQ wasn't disconnecting some of its sockets, forcing us to write a Twisted server to act as a broker to minimize connections, but one of our engineers sent a patch upstream to fix ActiveMQ), but we've been pretty happy with it. In addition to the queue, we also had a task processor written in Python that processed impressions and clicks, updating the database and the ad index as necessary if/when an ad ran out of money.

Our Apache + mod_wsgi + Python server mostly acted as a broker between encoding/decoding requests/responses, ad calls to Redis, cache pulls from Redis, geocoding requests, and logging results. This is where we scaled per processor, and the memory use was primarily the result of running so many threads to maximize IO between our servers and Redis. For a brief time, we were also performing content analysis for further targeting (simple bayesian categorization across 25 categories, matched against pre-categorized ads, calculated in Python). This was consistently the slowest part of our calls, accounting for ~40ms, which we later dropped to 5ms with a little bit of numpy. Sadly, we found that content analysis was less effective for the bottom line than just paying attention to a calculated CPM on individual ads (calculated via CPC * CTR), so we tossed it for the sake of simplicity.

The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple "slaveof" calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.

The master Redis was merely Redis that was populated from the database by a simple script, and kept up-to-date by the various distributed tasks processes, which syndicated off to the slaves automatically.

After Responding to a Request

After an ad call was completed, an impression tracking pixel was requested, or a click occurred, we would throw a message into our task queue to report on what happened. No database connection was ever established from the web server, and the web server only ever made requests to local resources. The task server would connect to the database, update rows there, handle logging (initially to the database, then briefly to mongodb, and finally to syslog-ng, which we use today for everything), and update the Redis master as necessary. In the case of database or Redis master failure, the task queue would merely stop processing the tasks, batching them up for later.

The Weak Points

Looking at this setup, any individual VPS could go down except for the load balancer, and all other parts would continue to work. Early on we had experimented with Amazon's load balancer, but found out that it wouldn't point to an elastic IP (I can't remember why this was important at the time), so we used a VPS with HAProxy. Thankfully, the load balancer VPS never went down, and we had a hot spare ready to go with an elastic IP update.

Any worker box could go down, and it wouldn't affect serving except for a small set of requests being dropped. Our Redis master or master DB could even go down, and some ads may/may not be served when they should/shouldn't. We did lose the Redis master once, due to a network hiccup (which caused replication to one of the slaves to hang, blow up memory use on the master, and subsequently get the master killed by the Linux OOM killer). But this caused zero downtime, zero loss of $, and was useful in testing our emergency preparedness. We had some API workers go down on occasion due to EBS latency issues, but we always had 2-3x the number necessary to serve the load.

Our true weakness was our use of PostgreSQL 8.4 in a single-server setup. Our write load (even with API calls coming in at a high rate) was low enough to not place much of a load on our database, so we never felt pressured to switch to something with more built-in options (PostgreSQL 9 came out about 3-4 months after we had started running the system). But really, for this particular data, using Amazon's RDS with multi-AZ would have been the right thing to do.

Where is it now?

After 6 months of development, getting application developers to start using our API, getting reasonable traffic, getting repeat ad spends, etc., we determined that the particular market segment we were going for was not large enough to validate continuing to run the network at the level of hardware and personnel support necessary for it to be on 24/7, so we decided to end-of-life the product.

I'll not lie, it was a little painful at first (coincidentally, a project in which I was the only developer, YouTube Groups, had been end-of-lifed a couple weeks before), but we all learned quite a bit about Amazon as a platform, and my team managed to build a great product.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!