
Thursday, October 24, 2013

Multi-column (SQL-like) sorting in Redis

Recently, I received an email from a wayward Redis user asking about using Redis Sets and Sorted Sets to sort multiple columns of data, with semantics as close as possible to a traditional SQL-style "order by" clause. It is possible, with limitations; keep reading to find out how.

What is Redis?


For those people who don't quite know what Redis is already, the TLDR version is: an in-memory data structure server that maps from string keys to one of 5 different data structures, providing high-speed remote access to shared data, and optional on-disk persistence. In a lot of ways, you can think of Redis as a version of Memcached where your data doesn't disappear if your machine restarts, and which supports a wider array of commands to store, retrieve, and manipulate data in different ways.

The setup


With that out of the way, our intrepid Redis user had come to me with a pretty reasonable problem to have: he needed to build an application to display a listing of businesses, sorted by several criteria. In his case, he had "price", "distance[1]", and "rating". As we have all seen in recent years with individual retailer searches, never mind restaurant searches on Yelp and similar applications, when searching for something in the physical world there are a few things you care about most, and those preferences usually break down as lowest distance, then lowest price, then highest rating. In a relational database/SQL world, these fields would all be columns in a table (or spread out over several tables, or calculated in real-time), so we are going to refer to them as "sort columns" from here on.

Now, depending on user preferences, both the column priority and the ascending/descending direction can change, which is why we need to build a system that supports reordering columns *and* switching the direction of each individual column. Say that we really want the highest rating, then the lowest distance, then the lowest price? We need to support that too, and we can.

The concept


Because we are dealing with sort orders, we have two options. We can either use the Redis SORT command, or we can use sorted sets. There are ways of building this using the SORT command, but it is much more complicated and requires quite a bit of precomputation, so we'll instead use sorted sets.

We will start by making sure that every business has an entry in each of 3 different sorted sets representing price, distance, and rating. If a business has an "id" of 5, has a price of 20, distance of 4, and a rating of 8, then at some point the commands "ZADD price 20 5", "ZADD distance 4 5", and "ZADD rating 8 5" will have been called.
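For example, loading a few hypothetical businesses might look like the following minimal sketch (assuming the redis-py client with its 3.x zadd(name, mapping) signature; the ids and values are made up for illustration):

import redis

conn = redis.Redis()  # assumes a Redis server on localhost:6379

# Hypothetical business data: id -> (price, distance, rating)
businesses = {
    5: (20, 4, 8),
    6: (35, 2, 9),
    7: (10, 7, 6),
}

# One sorted set per sort column: member = business id, score = column value
# (this mirrors "ZADD price 20 5", "ZADD distance 4 5", "ZADD rating 8 5")
pipe = conn.pipeline(True)
for business_id, (price, distance, rating) in businesses.items():
    pipe.zadd('price', {business_id: price})
    pipe.zadd('distance', {business_id: distance})
    pipe.zadd('rating', {business_id: rating})
pipe.execute()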

Once all of our data is in Redis, we then need to determine the maximum value of each of our sort columns. If you have ranges that you know are fixed, like say that you know that all of your prices and distance will all be 0 to 100, and your rating will always be 0 to 10, then you can save yourself a round-trip. We'll go ahead and build this assuming that you don't know your ranges in advance.

We are trying to gather our data range information in advance in order to carve up the 53 bits of precision [2] available in the floating-point doubles used for sorted set scores. If we know our data ranges, then we know how many "bits" to dedicate to each column, and we know whether we can actually sort our data exactly, without losing precision.

If you remember our price, distance, and rating information, you can imagine (borrowing our earlier data) that if we have price=20, distance=4, rating=8, and we want to sort by price, distance, -rating, we want to construct a "score" that will sort the same as the "tuple" comparison (20, 4, -8). By gathering range information, we could (for example) translate that tuple into a score of "20042", which you can see is basically the concatenation of "20", "04", and "2" (that is, 10-8; we subtract from 10 here because the rating column is reversed, and it helps to understand how we got the values).

Note: because of our construction, scores that are not whole numbers may not produce completely correct sorts.
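To make the packing concrete, here is a minimal sketch of the idea (separate from the general-purpose code below), assuming we already know each column's range; the ranges of price 0-99, distance 0-99, and rating 1-10 are made up for illustration:

def pack_score(price, distance, rating):
    # Assumed ranges (made up): price 0-99, distance 0-99, rating 1-10.
    # The highest-priority column gets the largest multiplier; the reversed
    # rating column is flipped by subtracting it from its maximum.
    return price * (100 * 10) + distance * 10 + (10 - rating)

# Borrowing the earlier data: price=20, distance=4, rating=8
print(pack_score(20, 4, 8))  # -> 20042, which sorts the same as (20, 4, -8)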

The code


Stepping away from the abstract and into actual code, we are going to perform computationally what I just did above with some string manipulation.  We are going to numerically shift our data into columns, accounting for the magnitude of the data, as well as negative values in the columns (which won't affect our results). As a bonus, this method will even tell you if it believes that you could have a lower-quality sort because your data range is too wide[3].

import math
import warnings

def sort_zset_cols(conn, result_key, sort=('dist', 'price', '-score')):
    current_multiplier = 1
    keys = {result_key: 0}
    sort = list(reversed(sort))

    # Gets the min/max scores in each sort column (strip the '-' prefix used
    # to mark descending columns before touching the actual Redis key)
    pipe = conn.pipeline(True)
    for sort_col in sort:
        pipe.zrange(sort_col.lstrip('-'), 0, 0, withscores=True)
        pipe.zrange(sort_col.lstrip('-'), -1, -1, withscores=True)
    ranges = pipe.execute()

    for i, sort_col in enumerate(sort):
        # Auto-scaling for negative values; zrange with withscores=True
        # returns a list of (member, score) pairs, so pull out the scores
        low, high = ranges[i*2][0][1], ranges[i*2+1][0][1]
        maxv = int(math.ceil(max(abs(low), abs(high))))

        # Adjusts the weights based on the magnitude and sort order of the
        # column
        old_multiplier = current_multiplier
        desc = sort_col.startswith('-')
        sort_col = sort_col.lstrip('-')
        current_multiplier *= maxv

        # Assign the sort key a weight based on all of the lower-priority
        # sort columns
        keys[sort_col] = -old_multiplier if desc else old_multiplier

    if current_multiplier >= 2**53:
        warnings.warn("The total range of your values is outside the "
            "available score precision, and your sort may not be precise")

    # Intersect with the passed result_key (weight 0), which should already
    # hold the ids to be sorted; the sorted results end up in result_key
    return conn.zinterstore(result_key, keys)
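A hypothetical call, continuing the data-loading sketch from earlier (the 'matched' result set, assumed to already hold the ids that survived any earlier filtering, is made up for illustration):

# 'matched' is assumed to already hold the ids to sort (e.g. the output of a
# radius search); zinterstore gives it weight 0, so its scores don't matter
conn.zadd('matched', {5: 0, 6: 0, 7: 0})

sort_zset_cols(conn, 'matched', sort=('price', 'distance', '-rating'))

# Lowest price first, ties broken by lowest distance, then highest rating
print(conn.zrange('matched', 0, -1))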

If you prefer to check the code out at Github, here is the gist. Two notes about this code:
  • If the maximum or minimum values in any of the indexed columns become more extreme between the data range check and the actual query execution, some entries may have incorrect ordering (this can be fixed by translating the above to Lua and using the Lua scripting support available in Redis 2.6 and later)
  • If any of your data is missing in any of the indexes, then that entry will not appear in the results
Within the next few weeks, I'll be adding this functionality to rom, my Python Redis object mapper.

Interested in more tips and tricks with Redis? My book, Redis in Action (Amazon link), has dozens of other examples for new and seasoned users alike.


[1] For most applications, the distance criteria is something that would need to be computed on a per-query basis, and our questioning developer already built that part, so we'll assume that is available already.
[2] Except for certain types of extreme-valued doubles, you get 52 bits of actual precision, and 1 bit of implied precision. We'll operate under the assumption that we'll always be within the standard range, so we'll always get the full 53 bits.
[3] There are ways of adjusting the precision of certain columns of the data (by scaling values), but that can (and very likely would) result in scores with fractional components, which may break our sort order (as mentioned in the notes).

Edit on 2015-08-09: Changed the "maxv =" assignment above to be correct in all cases, and made sure the reversed sort (for calculating ranges) is a list for repeated iteration.

Monday, August 2, 2010

Databases on SSDs, Initial Ideas on Tuning


In the pair of papers on the RethinkDB web site [1] [2], RethinkDB gives a straightforward description of what they have built as a MySQL storage backend, and alludes to a paper regarding append-only indexes which hasn't yet been posted.  I'm not going to guess at what that paper includes, but I will describe what an applied theorist (like myself) would implement given the descriptions that they provide.  Note: I am unaffiliated with RethinkDB.

Overall, they have designed a database storage backend around a few primary principles regarding the performance and durability of commercially-available SSDs. 1) Re-writing an existing block of data is slow, 2) writing to an unused block is fast, 3) reading randomly is pretty fast, 4) SSDs have limited lifetimes (so writing/rewriting should be avoided as much as is reasonable), 5) the data storage system should be effectively impervious to failure, 6) recovery from failure should be essentially instantaneous, and 7) backup should be simple.  They go further into the details of their testing on their blog [7], and especially into IO performance based on block size [8] and alignment [9].

Given these principles (some driving their choices, some derived from their choices), they seem to have chosen to use a persistent B-Tree to store their indexes, likely descended from the Sarnak/Tarjan paper regarding persistent search trees [3].  In particular, when writing a row to disk, they also write the B-Tree nodes that would have been modified in the indexes (the primary key -> row index, plus whatever other indexes are available), thus offering "copy on write" semantics for the indexes.  This is what allows the full history of the database to be queried at any point in time; it works around items 1 and 4, exploits 2 and 3, and offers 5-7.  The cost is that as new data and index blocks are written, a fairly large portion of previously-written data becomes historical (compared to more traditional db storage systems) and, for most situations, effectively useless.  This wastes space, doesn't allow the OS to issue "TRIM" commands to the disk to free that historical data, and causes other annoyances.

One other issue that RethinkDB will likely run into when it starts filling disks (or contending with disks being used for more than just the DB) is write amplification; as blocks are re-used, they must be erased, and on nearly-full SSDs, a simple 4k (or 1 byte) write can result in 128-512k of data writes on the device (depending on the SSD).

To understand what is going on, and why RethinkDB uses 2 gigs to store the equivalent of 50 megs of data and indexes, we first need to look at a toy setup.  Let us imagine that we have 1 million rows of data, each row consisting of two 64 bit integers; a primary key and a secondary data column.  Further, there are two indexes constructed over this data; the primary key index, and an index on the data column.  Given 4k block sizes, back-of-the-envelope calculations would put this at 1 million * 16 bytes for data, and an additional ~1 million * 16 bytes for each of the indexes* (ignoring some metadata and index block headers).

* Leaf nodes of the indexes are the dominant factor here, requiring the value to be indexed, as well as a pointer to the data, for 16 bytes (for a degree 256 B+Tree).  Ancestors which point to leaf nodes account for a little over 1/2 of a bit for every leaf node in a full B+Tree.  To keep our calculations simple, we assume as full a B+Tree as is possible on 1 million nodes.

We now have a database that is 48 megs, consisting of data and indexes.  Let's update every key in the database by adding a random integer value to the data column.  For each update, we need to write the new row, update the primary key index to point to the new data row, and update the data index.  In order to update each index, we must copy and update all ancestors of the leaf node we are updating in the B+Tree.  To understand why, see the image below.

[Figure: example B+Tree, with a root node pointing to leaf blocks holding data pointers d1/d2, d3/d4, and d5/d6/d7 (image [4])]

Ignoring sibling pointers (the d1/d2 block pointing to the d3/d4 block *), in order to write a new d3/d4 block to point to a new d3, we must write a new root block to point to the new d3/d4 block.  For a full tree of 1 million data pointers, the tree is 3 nodes deep, so we must write 3 new nodes for each B+tree index.  We'll give this tree the benefit of the doubt, and say that our increments of the data index are small enough so that, for example, d3/d4 swap places, instead of having d3 migrate to the d1/d2 block.  This still results in 6 block updates for every updated row, even if we assume we can find a place for the new data row of 16 bytes somewhere.

* We can alter the B+Tree format to not have sibling pointers, knowing that any traversal of the index itself will necessarily keep context about the path used to arrive at any block.
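To make the copy-on-write cost concrete, here is a minimal sketch (my own illustration, not RethinkDB's code) of updating one leaf entry in a persistent tree: every node on the path from the root to that leaf is copied, and the old nodes are left behind as queryable history.

class Node:
    """Immutable-style B+Tree node: leaves hold (key, value) pairs,
    internal nodes hold (separator_key, child) pairs."""
    def __init__(self, entries, is_leaf):
        self.entries = entries
        self.is_leaf = is_leaf

def cow_update(node, key, value):
    """Return a new root reflecting key -> value, copying every node on the
    path from the root to the affected leaf; the old nodes are untouched,
    which is what keeps old versions of the tree queryable."""
    if node.is_leaf:
        entries = [(k, value if k == key else v) for k, v in node.entries]
        return Node(entries, True)
    # Pick the last child whose separator key is <= key, copy it, and copy
    # this node so that it points at the fresh child
    idx = 0
    for i, (k, _) in enumerate(node.entries):
        if key >= k:
            idx = i
    new_child = cow_update(node.entries[idx][1], key, value)
    entries = list(node.entries)
    entries[idx] = (entries[idx][0], new_child)
    return Node(entries, False)

# Tiny two-level tree mirroring the figure: a root over the d1/d2 and d3/d4 leaves
leaf12 = Node([(1, 'd1'), (2, 'd2')], True)
leaf34 = Node([(3, 'd3'), (4, 'd4')], True)
root = Node([(1, leaf12), (3, leaf34)], False)

new_root = cow_update(root, 3, 'd3-v2')   # copies leaf34 *and* the root
assert root.entries[1][1].entries[0][1] == 'd3'         # old version intact
assert new_root.entries[1][1].entries[0][1] == 'd3-v2'  # new version visible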

After 1 million updates, we will have updated 6 million 4-kilobyte blocks, for a total of 24 gigs of data written to update the index and data.  In a standard database, while more or less the same volume of data would have been written to disk, it would have over-written existing blocks on disk (modulo a transaction or intent log), and would still only use around 48 megs of data.


Reducing Data Writes, Part 1

Because of the read-only nature of the persistent search trees, effectively all updates to the database are transient; data and indexes will be later replaced with some other write.  To reduce wasted space, we should minimize the total amount of data written, which will also minimize the amount of wasted space later.  First, let's reduce the amount of data we write to the index itself, by reducing the size of a logical index block from 4k.  This will result in a deeper tree, but because we can store multiple tree nodes in a single 4k disk block, we can end up updating fewer disk blocks per index update.  

To get an idea of how far this can go, instead of using B+Trees, let's start with a standard binary tree: 64 bit key, 64 bit data pointer (we'll leave our data pointers internal to the tree), 64 bit left/right pointers, for 32 bytes per tree node.  A balanced tree with 1 million nodes is 20 levels deep, which is 640 bytes of index that needs to be updated for every row update for each index.  Heck, we could fit 6x index updates in a single 4k block if we used a binary tree... at the cost of requiring potentially 20 IOs per index lookup to determine our row pointer (10 with caching, based on random tree traversals), vs. 3 for a B+Tree (or 1-2 with caching).

Let's find a balance between a 4k block and a 32 byte block; 512 bytes.  If we squeeze our index into a single 64 bit value, combined with a 64 bit pointer, a B+tree index with 512 byte blocks would be 4 levels deep for our 1 million rows.  Those 4 index blocks of 512 bytes would be 2k, with the data index taking up the second 2k.  Assuming we could find a place for the 16 bytes of data somewhere, those 1 million row updates now only use 4 gigs of storage instead of 24 gigs, and will generally result in 2 IOs when using 16k of cache for each index.
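The depths and per-update write volumes above can be sanity-checked with a few lines of arithmetic (a sketch using the 16-byte index entries and the two block sizes discussed here):

import math

ROWS = 1_000_000
ENTRY_BYTES = 16       # 64-bit key + 64-bit pointer per index entry
NUM_INDEXES = 2        # primary key index + data column index

for block_bytes in (4096, 512):
    fanout = block_bytes // ENTRY_BYTES             # entries per index block
    depth = math.ceil(math.log(ROWS, fanout))       # index blocks copied per update
    per_update = depth * NUM_INDEXES * block_bytes  # bytes written per row update
    print(block_bytes, fanout, depth, per_update)

# 4096-byte blocks: fanout 256, depth 3 -> 24576 bytes/update (~24 gigs over 1M updates)
#  512-byte blocks: fanout  32, depth 4 ->  4096 bytes/update ( ~4 gigs over 1M updates)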


Reducing Data Writes, Part 2

On the one hand, not having or using a transaction log is very useful; it reduces IO, and makes recovery trivially easy: your database is your transaction log.  But let's bring it back, and let's do something crazy: let's either put it on a spinning disk so we don't worry about writing 16 bytes + metadata to the end of a file whenever we want, or let's put it on an inexpensive flash drive that we write and clear in cycles (alternating between a few of them, to allow for replacement over time), and only ever write full blocks. Let's also include arbitrary "checkpoint" regions (which I'll describe more later).  We'll assume that we're using ZFS with RAIDZ, a RAID 5 array, or some other storage setup where we can rely on the data not being corrupted, while still offering high write throughput.  This is our append-only log.

Whenever we have a row update/insert/delete, instead of modifying the main SSD database, let's instead merely append the modification we want to perform (row with primary key X becomes Y, insert row Y with primary key X, delete row with primary key X, etc.), along with a pointer to the most recent checkpoint, to our append-only log.  Our database engine will keep a cache of all of the rows that have been changed, and for any index rows that we would have updated (as is necessary during writes), we'll instead update them in-memory, without flushing them to disk.

After a fixed number of inserts/updates/deletes, or after a fixed amount of data has been written to the disk (say 1000 rows or 1 meg of data, whichever comes first, which should be tuned on a per-system basis), we dump our row updates and our updated indexes to the SSD with a checkpoint block, and append a new checkpoint to the append-only log.

Whenever the system starts up (after a failure, or otherwise), it finds the most recent checkpoint in the append-only log, reads all valid operations after that checkpoint, and replays them as if they were just happening.  By tuning the number of rows/maximum amount of data, we can tune the level of pain we are willing to suffer at system startup.  With the suggested 1000 rows or 1 meg of data, even with 512 byte blocks, that is 8000 index read IOs on the SSD to replay the worst-case log size, which is under 2 seconds on a modern SSD.  Reading 1 meg of data from our append-only log on a spinning disk would be on the order of 10-30 ms, depending on the latency of our initial seek (assuming a non-fragmented file).
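Here is a toy sketch of that write path and recovery path (my own illustration: the dict standing in for the SSD database, the JSON log format, and the class name are all made up, while the 1000-operation / 1 meg thresholds come from above):

import json
import os

MAX_OPS = 1000            # checkpoint after this many operations...
MAX_BYTES = 1_000_000     # ...or this much logged data, whichever comes first

class ToyEngine:
    def __init__(self, log_path):
        self.db = {}          # stand-in for the SSD database file
        self.pending = []     # (op, key, row) tuples since the last checkpoint
        self.pending_bytes = 0
        self.log = open(log_path, 'a+', encoding='utf-8')

    def write(self, op, key, row=None):
        # Append the operation to the log; the SSD database isn't touched yet
        record = json.dumps({'op': op, 'key': key, 'row': row})
        self.log.write(record + '\n')
        self.log.flush()
        os.fsync(self.log.fileno())
        self.pending.append((op, key, row))
        self.pending_bytes += len(record)
        if len(self.pending) >= MAX_OPS or self.pending_bytes >= MAX_BYTES:
            self.checkpoint()

    def checkpoint(self):
        # Dump the buffered row (and, in a real engine, index) updates to the
        # SSD database, then append a checkpoint marker to the log
        for op, key, row in self.pending:
            if op == 'delete':
                self.db.pop(key, None)
            else:                  # 'insert' or 'update'
                self.db[key] = row
        self.pending, self.pending_bytes = [], 0
        self.log.write(json.dumps({'op': 'checkpoint'}) + '\n')
        self.log.flush()
        os.fsync(self.log.fileno())

    def recover(self):
        # On startup, replay only the operations after the newest checkpoint;
        # everything before it already lives in the (persistent) SSD database
        self.log.seek(0)
        tail = []
        for line in self.log:
            record = json.loads(line)
            if record['op'] == 'checkpoint':
                tail = []
            else:
                tail.append((record['op'], record['key'], record['row']))
        self.pending = tail
        self.pending_bytes = sum(
            len(json.dumps({'op': o, 'key': k, 'row': r})) for o, k, r in tail)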

What is to be gained with this append-only log?  After 1000 updates, we reduce the number of root node copy/writes from 1000 to 1.  With 512 byte blocks, we also reduce second-level block updates from 1000 to 32.  Third and fourth level block updates may also be reduced, but we won't count those.  That takes us from 1000 4k writes down to 533 4k writes, or really 4 megs down to ~2.3 megs.  What if we had stuck with 4k index blocks?  Root node copy/writes go from 1000 to 1, second-level updates go from 1000 to 256.  Overall, we go from 6000 4k writes to 2514 4k writes, or 24 megs to 10 megs.  Remember, this is for 1000 writes; we'll scale them back up in a minute.

There is a larger proportional difference for the 4k blocks because there are only 3 levels of index, the first of which gets writes reduced by a factor of 1000x, the second which gets reduced by a factor of 4, the third we don't count.  For 512 byte blocks, we get the same 1000x reduction in the root, a better 32x reduction in the second level, but no reduction in the 3rd and 4th levels.

Over the course of those 1 million rows, using this second method on 512 byte blocks, we go from the non-append log 4k block writes of 24 gigs, down to 2.3 gigs, for a factor of better than 10x reduction in data writes to our SSD.

This example was chosen specifically to maximize the ratio of index to data to show effectively worst-case performance on a toy example.  As your data columns grow relative to your index columns, index updates are reduced for insert/update/delete operations.  Generally speaking, we are able to get a 10x or better reduction in data write overhead by combining smaller block sizes with an append-only log file for this toy example.  Real-world use-cases will have different improvements.


Reclaiming Wasted Space

During the course of writing data to the SSD for inserts/updates/deletes, index blocks will become unused (as they are no longer being referenced by the newest index nodes).  Assuming that we are using both earlier described optimizations, whenever we write a checkpoint to the append-only log file, we can include the current number of wasted index blocks.  When applying the updates to create a new checkpoint, any time we are replacing an existing index block, we increment the wasted index block count, and after performing all of our updates, write the wasted index block count as part of the checkpoint.

Remembering our update of the d3/d4 block from before, when we replace the d3/d4 block, we increment the count by 1, and because we need to update the root as well, we increment the count by 1 again.  Generally, this will be the depth of the B+Tree, except in merging nodes and some cases of splitting nodes.
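A small sketch of that bookkeeping (the class, the threshold, and the names are my own invention; a real engine would track physical block ids):

class SpaceTracker:
    """Tracks how many previously written index blocks have been superseded."""

    def __init__(self, vacuum_ratio=0.5):
        self.wasted_blocks = 0
        self.total_blocks = 0
        self.vacuum_ratio = vacuum_ratio

    def block_written(self, replaces_existing):
        # Call once per index block written during a checkpoint
        self.total_blocks += 1
        if replaces_existing:
            # The old copy of this block is now unreachable history
            self.wasted_blocks += 1

    def checkpoint_record(self):
        # The wasted-block count stored alongside each checkpoint
        return {'wasted': self.wasted_blocks, 'total': self.total_blocks}

    def needs_vacuum(self):
        return (self.total_blocks and
                self.wasted_blocks / self.total_blocks > self.vacuum_ratio)

# Updating the d3/d4 leaf copies the leaf and the root: two superseded blocks
tracker = SpaceTracker()
tracker.block_written(replaces_existing=True)   # new d3/d4 block
tracker.block_written(replaces_existing=True)   # new root block
print(tracker.checkpoint_record())              # {'wasted': 2, 'total': 2}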

We now know how much wasted space we have, and we also know the size of our database file (which is trivially monitored at runtime and during startup).  Whenever we get to a point where we have too much wasted space, we can reclaim it as follows.

1. Disable checkpointing in the main process.  This prevents us from having to re-apply the same set of operations on two SSD database files (thus reducing our wasted space and writes to the SSD).
2. In a second process, first copy all of the standard database metadata, then for each table, traverse our primary key index for that table, and write to a new database file, alternating writing data and index blocks (write all the data for a leaf node in the index, then write that index leaf node, etc.).  In the case of 512 byte blocks, we batch up 8 (or more) data or index blocks to fill our disk block.
3. Continue in the second process to copy all other B+Tree indexes.
4. Signal to the main process that the copy has completed.
5. In the main process, when there are spare cycles, re-apply all of the changes in the append-only log file to the new db in-memory (as per Part 2 above).  Optionally, pre-cache hot rows based on what is in the old database cache.
6. When all of the updates have been performed for the new database file (in memory), perform an atomic filesystem move (discarding the old database file), discard the old database's index cache (from Part 2 above), and discard all other cache.
7. Re-enable checkpointing.

At the point of step #7, the new database file will have everything the old database file had, serving as an online VACUUM.  During this VACUUM operation, system failure is not catastrophic.  At any time before step 5, we can discard the secondary db file.  At any time on or after step 5, restarting the database can use the new database file instead of the old, thus minimizing wasted work.

If we have a database system with relatively low amounts of memory, large amounts of writes, and the memory used by the index cache (from Part 2 above) grows too large, the first process can internally apply a checkpoint to the SSD database (which may fragment the new database file), but not update the append-only file with checkpoint information.  The copying process can then ignore any data written to the main DB file after it has started its VACUUM, and the main process just needs to work a little harder in step 5.
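Condensing steps 1 through 7 into a toy sketch (my own illustration, reusing the (op, key, row) tuples from the earlier ToyEngine sketch; the one-JSON-row-per-line file format and the function name are made up, and a real engine would copy physical data and index blocks instead):

import json
import os

def vacuum(db_path, pending_ops):
    """Toy condensation of the 7 steps: rewrite only the live rows into a
    fresh file, re-apply operations logged since the copy began, then
    atomically swap the files."""
    # Steps 2-3: read the live rows out of the old, fragmented file
    rows = {}
    with open(db_path, encoding='utf-8') as old:
        for line in old:
            record = json.loads(line)
            rows[record['key']] = record['row']

    # Step 5: re-apply everything that arrived while the copy was running
    for op, key, row in pending_ops:
        if op == 'delete':
            rows.pop(key, None)
        else:
            rows[key] = row

    # Write the rows contiguously, in primary-key order (a real engine would
    # interleave fresh index blocks here)
    new_path = db_path + '.vacuum'
    with open(new_path, 'w', encoding='utf-8') as new:
        for key in sorted(rows):
            new.write(json.dumps({'key': key, 'row': rows[key]}) + '\n')

    # Step 6: atomic move; the old file, and all its wasted blocks, goes away
    os.replace(new_path, db_path)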


In Closing, and Other Notes

I hope that you have a better understanding of the kinds of enhancements that could be done to improve a database storage system to make something like RethinkDB work as well as it does.  I'm guessing that they are using at least one or two of the ideas described in this post, and I hope that this post gets other people thinking about ways to make such a system even better.  After having seen the performance numbers for a tuned SSD system (flash, drive controller, drive-level caching, kernel drivers, os-level caching, and application), it makes the current 10k IOPS numbers for consumer-level devices look like a kid's toy.  That RethinkDB is looking for people with heavy DB and kernel experience makes me believe that they will have a great product soon.

Incidentally, after outlining the content of this post, and between writing Part 2 and Reclaiming Wasted Space, I noticed that SQLite 3.7.0 had been released [5], which includes support for WAL files [6].  The WAL file in SQLite 3.7.0 is the rough equivalent of the append-only log that I describe in Part 2, and is more or less an Intent Log [10] in the world of database systems.


EDIT on Nov. 25, 2014: While looking around for some related information, I went digging through the RethinkDB blog and ran into this post describing methods very similar to what I described above for reducing data writes. Also, in the time since this post was published in August 2010, RethinkDB has gone from being a better storage backend for MySQL to a distributed, non-relational JSON document database.


References
[4] Image was created by Grundprinzip at Wikipedia, and was licensed under the Creative Commons Attribution 3.0 Unported license, originally downloaded from: http://upload.wikimedia.org/wikipedia/commons/3/37/Bplustree.png