SQL or NoSQL, that is the question!

SQL or NoSQL, that is the question! October 2011 Andraž Tori, CTO at Zemanta @andraz andraz@zemanta.com

Answering - Why NoSQL? - What is NoSQL? - How does it work?

SQL is awesome! - Structured Query Language - ACID: Atomicity, Consistency, Isolation, Durability - Predictable - Schema - Based on relational algebra - Standardized

No, really, it's awesome! - Hardened - Free and commercial choices - MySQL, PostgreSQL, Oracle, DB2, MS SQL... - Commercial support - Tooling - Everyone knows it - It's mature!

Why the heck would someone not want SQL?

Why not to use SQL? - Clueless self-taught programmers who use text files - NIH (Not Invented Here) syndrome. And I want to design my own CPU! - Because it's hard! - I can't afford it - “This app was first ported from Clipper to DBase”

You are a big tech company, located on the west coast of the USA

You are... - a big international web company based in San Francisco - 5 data centers around the world - Petabytes of data behind the service - A day of downtime costs you at least millions - And it's not a question of if, but when

You want to - keep the service up no matter what - have it fast - deal with humongous amounts of data - enable your engineers to make great stuff

Some interesting constraints - Amazon claims that just an extra one tenth of a second on their response times will cost them 1% in sales.

So... - Some pretty big and important problems - And the brightest engineers in the world - Who loooove to build stuff - Sooner or later even an Oracle RAC cluster is not enough

Numbers everybody should know! (Jeff Dean, at his famous Stanford talk) - L1 cache reference: 0.5 ns - Branch mispredict: 5 ns - L2 cache reference: 7 ns - Mutex lock/unlock: 25 ns - Main memory reference: 100 ns - Compress 1K bytes w/ cheap algorithm: 3,000 ns - Send 2K bytes over 1 Gbps network: 20,000 ns - Read 1 MB sequentially from memory: 250,000 ns - Round trip within same datacenter: 500,000 ns - Disk seek: 10,000,000 ns - Read 1 MB sequentially from disk: 20,000,000 ns - Send packet CA->Netherlands->CA: 150,000,000 ns

Facebook circa 2009 - from 200GB (March 2008) to 4 TB of compressed new data added per day - 135TB of compressed data scanned per day - 7500+ Database jobs on production cluster per day - 80K compute hours per day - And that's just for data warehousing/analysis - plus thousands of MySQL machines acting as Key/Value stores

Big Data - The Internet generates huge amounts of data - First encountered by the big guys: AltaVista, Google, Amazon … - It needs to be handled - Classical storage solutions just don't fit/behave/scale anymore

So smart guys create solutions to these internal challenges

And then? - Papers: The Google File System (Google, 2003); MapReduce: Simplified Data Processing on Large Clusters (Google, 2004); Bigtable: A Distributed Storage System for Structured Data (Google, 2006); Amazon Dynamo (Amazon, 2007) - Projects (all open source): Hadoop (coming out of Nutch, Yahoo, 2008); Memcached (LiveJournal, 2003); Voldemort (LinkedIn, 2008); Hive (Facebook, 2008); Cassandra (Facebook, 2008); MongoDB (2007); Redis, Tokyo Cabinet, CouchDB, Riak...

Four papers to rule them all - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System”, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. - Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. - Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, “Bigtable: A Distributed Storage System for Structured Data”, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006. - Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

Total Sites Across All Domains August 1995 - October 2011, NetCraft

Yesterday's problem of the biggest guys is today's problem of the garden-variety startup

And so we end up with a Cambrian explosion

These solutions don't have much in common, Except...

That's a hard question... - There is no standard - This is a new technology - new research - survival of the fittest - experimenting - They obviously fulfill some new needs - but we don't yet know which are real and which superficial - Most are extremely use-case specific

Example use-cases - Shopping cart on Amazon - PageRank calculation at Google - Streams stuff at Twitter - Extreme K/V store at bit.ly - Analytics at Facebook

At the core, it's a different set of trade-offs and operational constraints

Trade-offs and operational constraints - Consistent? Eventually consistent? - Highly available? Distributed across continents? - Fault tolerant? Partition tolerant? Tolerant to consumer grade hardware? - Distributed? Across 10, 100, 1000, 10000 machines?

More possibilities - All in memory? (disk is the new tape) - Batch processing? - tolerant to node failures? - Graph oriented? - No transactions? Programmer deals with inconsistencies? - No schemas? - BASE? (Basically Available, Soft state, Eventually Consistent) - Horizontal scaling, with no downtime? - Self healing?

A consistent topic: CAP Theorem

CAP theorem (Eric Brewer, 2000, Symposium on Principles of Distributed Computing) - CAP = Consistency, Availability, Partition tolerance - Pick any two! - Distributed systems have to sacrifice something to be fast - Usually you drop either consistency (all clients see the same data) or availability (the service returns something) - Sometimes you can even tune the trade-offs!

“There is no free lunch with distributed data” – HP

Eventual Consistency - Different clients can read and write the data concurrently, with no locking, possibly on partitioned nodes - What we know is that, given enough time, the data is synchronized to the same state across all replicas

… you already are eventually consistent! :) If your database stores how many vases you have in your shop...

Eventual consistency - Conflict resolution can happen at: read time, write time, or asynchronously - Possibilities: client timestamps; vector clocks (when writing, say what your original data version was) - Conflict resolution can be server or client based
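The vector clock idea from the slide above can be sketched in a few lines. This is a minimal illustration, not the implementation used by Dynamo, Cassandra or any other store; the node names and the helper functions `increment`, `compare` and `merge` are hypothetical names chosen for this example.

```python
from typing import Dict

# A vector clock maps a node name to the number of writes that node has
# coordinated for this piece of data.
VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if version `a` has seen every write that version `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions: one supersedes the other, or they conflict."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent writes: someone has to reconcile them


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Clock for a reconciled value: the element-wise maximum of both."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# Two clients start from the same version and write through different nodes.
base = increment({}, "node_a")
left = increment(base, "node_b")
right = increment(base, "node_c")
print(compare(left, base))   # a is newer
print(compare(left, right))  # conflict
```

Whether the "conflict" case is reconciled by the server or handed back to the client is exactly the design choice the slide mentions.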

There are different kinds of consistencies - Read-your-writes consistency - Monotonic write / monotonic read consistency - Session consistency - Causal consistency

There's not even a proper taxonomy of features different NoSQL solutions offer

And this presentation is too short to present the whole breadth of possibilities

Usual taxonomy of NoSQL - Key/Value stores - Column stores - Document stores - Graph stores

Other attributes - In-memory / on-disk - Latency / throughput (batch processing) - Consistency / Availability

Key/Value stores - a.k.a. Distributed hashtables! - Amazon Dynamo - Redis, Voldemort, Cassandra, Tokyo Cabinet, Riak

Document databases - Similar to Key/Value, but value is a document - JSON or something similar, flexible schema - CouchDB, MongoDB, SimpleDB... - May support indexing or not - Usually support more complex queries

Column stores - one key, multiple attributes - hybrid row/column - BigTable, HBase, Cassandra, Hypertable

Graph Databases - Neo4J, Maestro OpenLink, InfiniteGraph, HyperGraphDB, AllegroGraph - Whole semantic web shebang!

To make the situation even more confusing... - Fast pace of development - In-memory stores gain on-disk support overnight - Indexing capabilities are added

Two examples - Cassandra - Hadoop - Hive - Mahout

Cassandra - BigTable + Dynamo - P2P, horizontally scalable - No SPOF - Eventually consistent - Tunable tradeoffs between consistency and availability - number of replicas, writes, reads
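The "number of replicas, writes, reads" knob is usually reasoned about with the Dynamo-style rule R + W > N: if the replicas that must acknowledge a write (W) and the replicas consulted on a read (R) together exceed the replication factor (N), every read set overlaps the latest write set. The sketch below only does that arithmetic; the function names are hypothetical and it does not talk to a real cluster.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return r + w > n


def tolerated_failures(n: int, w: int, r: int) -> dict:
    """How many replicas may be down while writes / reads still succeed."""
    return {"write": n - w, "read": n - r}


# N = 3 replicas, with three typical settings of the trade-off.
for w, r in [(1, 1), (2, 2), (3, 1)]:
    print(f"W={w} R={r} overlap={quorums_overlap(3, w, r)} "
          f"failures tolerated={tolerated_failures(3, w, r)}")
```

With W=1, R=1 the store stays available when two of three replicas are down but reads may be stale; with W=3, R=1 reads are cheap and overlap is guaranteed, but a single dead replica blocks writes.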

Cassandra – writes - No reads - No seeks - Log oriented writes - Fast, atomic inside ColumnFamily - Always available for writing

Cassandra - Billions of rows - MySQL: ~300 ms write, ~350 ms read - Cassandra: ~0.12 ms write, ~15 ms read

Not enough time to go into data model...

Cassandra - In production at: Facebook, Digg, Rackspace, Reddit, Cloudkick, Twitter - Largest production cluster: over 150 TB and over 150 machines - Other stuff: pluggable partitioner (Random/OrderPreserving); rack aware, datacenter aware

Experiences? - Works pretty well at Zemanta - user preferences store - extending to new use-cases - Digg had some problems - Don't necessarily use it as your primary store - Not very easy to back up, but the situation is improving

Cassandra - queries - Column by key - Slices (of columns/supercolumns) - Range queries (use OrderPreservingPartitioner for these to be efficient)

Hadoop - GFS + MapReduce - Fault tolerant - (massively) distributed - massive datasets - batch-processing (non real-time responses) - Written in Java - A whole ecosystem

Hadoop: Why? (Owen O’Malley, Yahoo Inc!, omalley@apache.org) • Need to process 100TB datasets with multi-day jobs • On 1 node: – scanning @ 50MB/s = 23 days – MTBF = 3 years • On 1000 node cluster: – scanning @ 50MB/s = 33 min – MTBF = 1 day • Need framework for distribution – Efficient, reliable, easy to use
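The numbers on this slide can be reproduced with back-of-the-envelope arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes), perfectly parallel scanning, and independent node failures, so that the cluster-wide MTBF is the per-node MTBF divided by the node count; the function names are made up for this example.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def scan_time_seconds(dataset_bytes: float, nodes: int, bytes_per_second_per_node: float) -> float:
    """Time to read the whole dataset once, assuming perfect parallelism."""
    return dataset_bytes / (nodes * bytes_per_second_per_node)


def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time until *some* node fails, assuming independent failures."""
    return node_mtbf_days / nodes


one_node_days = scan_time_seconds(100 * TB, 1, 50 * MB) / 86_400
cluster_minutes = scan_time_seconds(100 * TB, 1000, 50 * MB) / 60
print(f"1 node:     {one_node_days:.1f} days")        # about 23 days
print(f"1000 nodes: {cluster_minutes:.1f} minutes")   # about 33 minutes
print(f"cluster MTBF: {cluster_mtbf_days(3 * 365, 1000):.1f} days")  # about 1 day
```

The last line is the reason for the framework: at a thousand nodes something breaks roughly every day, so failure handling has to be built in rather than handled by hand.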

Hadoop @ Facebook - Use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. - Currently 2 major clusters: A 1100-machine cluster with 8800 cores and about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage. - Heavy users of both streaming as well as the Java APIs. They built a higher level data warehousing framework using these features called Hive (see http://hadoop.apache.org/hive/).

But also at smaller startups - Zemanta: 2 to 4 node cluster, 7TB - log processing - Hulu 13 nodes - log storage and analysis - GumGum 9 nodes - image and advertising analytics - Universities: Cornell – Generating web graphs (100 nodes) - It's almost everywhere

Hadoop Architecture - HDFS - HDFS provides a single distributed filesystem - Managed by a NameNode (SPOF) - Append-only filesystem - distributed by blocks (for example 64MB) - It's like one big RAID over all the machines - tunable replication - Rack aware, datacenter aware - It just works, really!

Hadoop Architecture - MapReduce - Based on an old concept from Lisp - Generally it's not just map-reduce, it's: map -> shuffle (sort) -> merge -> reduce - Jobs can be partitioned - Jobs can be run and restarted independently (parallelization, fault tolerance) - Aware of data-locality of HDFS - Speculative execution (toward the end of a job, tasks on machines that stall are also launched elsewhere)

Infamous word counting example - Input: “One and one is two and one is three” - Two mappers: “One and one is”, “two and one is three” - Pretty “stupid” mappers, just output word and “1” - Output Mapper1: One 1, And 1, One 1, Is 1 - Output Mapper2: Two 1, And 1, One 1, Is 1, Three 1 - Sorter: And 1, And 1, Is 1, Is 1, One 1, One 1, One 1, Two 1, Three 1 - Reducer: And 2, Is 2, One 3, Two 1, Three 1
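The same word count can be simulated in a single process to make the map -> sort/shuffle -> reduce flow concrete. This is a toy sketch in plain Python, not Hadoop code; `mapper`, `shuffle`, `reducer` and `word_count` are illustrative names, and words are lower-cased so that “One” and “one” count as the same key.

```python
from collections import defaultdict
from itertools import chain


def mapper(text):
    """A 'stupid' mapper: emit (word, 1) for every word in its input split."""
    for word in text.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all values by key and sort the keys, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Receive every value emitted for one key and sum them."""
    return key, sum(values)


def word_count(splits):
    mapped = chain.from_iterable(mapper(split) for split in splits)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


print(word_count(["One and one is", "two and one is three"]))
# {'and': 2, 'is': 2, 'one': 3, 'three': 1, 'two': 1}
```

In a real cluster each call to `mapper` and `reducer` would run on a different machine, and `shuffle` is the part the framework does for you.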

Important to know - Mappers can output more than one output per input (or none) - Bucketing for reducers happens immediately after mapping output - Every reducer gets all input records for a certain “key” - All parts are highly pluggable: readers, mapping, sorting, reducing … it's Java

Hadoop - You can write your jobs in Java - You get used to thinking inside the constraints - You can use “Hadoop Streaming” to write jobs in any language - It's great not to have to think about the machines, but you can “peep” if you want to see how your job is doing.
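With Hadoop Streaming, the mapper and the reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output; the reducer sees its input already sorted by key. The sketch below shows that contract for word counting. The function names are hypothetical, and `run` is only a helper showing how either step would be wired to stdin/stdout in an actual script.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper step: one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer step: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(step, source=sys.stdin, sink=sys.stdout):
    """Wire a step to stdin/stdout, which is all a streaming job expects."""
    for out in step(source):
        sink.write(out + "\n")


# Local simulation of the job: map, sort (the framework's shuffle), reduce.
mapped = sorted(map_lines(["One and one is\n", "two and one is three\n"]))
for line in reduce_lines(mapped):
    print(line)
```

Because the contract is just text over pipes, the same two steps could be written in any language, which is the point of the slide.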

Now, this is a bit wonky, right? - Word counting is a really bad example - However it's like “Hello world”, so get used to it - When you get to real problems it gets much more logical

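Benchmarks, 2009 - This doesn't help me much, but... - 500000000000 bytes (500 GB): 1406 nodes, 8000 maps, 2600 reduces, replication 1, 59 seconds - 1000000000000 bytes (1 TB): 1460 nodes, 8000 maps, 2700 reduces, replication 1, 62 seconds - 100000000000000 bytes (100 TB): 3452 nodes, 190000 maps, 10000 reduces, replication 2, 173 minutes - 1000000000000000 bytes (1 PB): 3658 nodes, 80000 maps, 20000 reduces, replication 2, 975 minutes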

Hive - A system built on top of Hadoop that mimics SQL - Hive Query Language - Built at Facebook, since writing MapReduce jobs in Java is tedious for basic tasks - Every operation is one or multiple full index scans - Bunch of heuristics, query optimization

Hive – Why we love it at Zemanta - Don't need to transform your data on “load time” - Just copy your files to HDFS (preferably compressed and chunked) - Write your own deserializer (50 lines in Java) - And use your file as a table - Plus custom User Defined Functions

Mahout - Bunch of algorithms implemented: - Collaborative Filtering - User and Item based recommenders - K-Means, Fuzzy K-Means clustering - Mean Shift clustering - Dirichlet process clustering - Latent Dirichlet Allocation - Singular value decomposition - Parallel Frequent Pattern mining - Complementary Naive Bayes classifier - Random forest decision tree based classifier - High performance Java collections (previously Colt collections) - A vibrant community, and much more cool stuff to come by this summer thanks to Google Summer of Code

Some observations - Non-fixed schemas are a blessing when you have to adapt constantly - that doesn't mean you should not have documentation and be thoughtful! - Denormalization is the way to scale - sorry guys - Clients get to manage things more precisely, but also have to manage things more precisely

Some internals, “fun” tricks - Bloom filter: Is data on this node? Maybe / Definitely not Maybe -> let's go to disk to check out - Vector clocks - Consistent hashing
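A Bloom filter answers exactly the "maybe / definitely not" question from the slide. The sketch below is a deliberately simple version: it derives its hash positions from SHA-256 with a seed prefix and stores one byte per bit for readability. Those choices, the default sizes and the class name are assumptions of this example, not how any particular store implements it.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        """Derive `num_hashes` bit positions for a key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means maybe, go check the disk."""
        return all(self.bits[pos] for pos in self._positions(key))


on_this_node = BloomFilter()
on_this_node.add("user:1234")
print(on_this_node.might_contain("user:1234"))  # True: maybe here, read from disk
```

A key that was added is never reported as absent; a key that was never added is usually reported as absent, with a false-positive rate that depends on the filter size and the number of keys.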

Consistent hashing - key -> hash -> “coordinator node” - depending on replication, the key is then stored on the next N nodes in sequence around the ring - When a new node gets added to the ring, only its neighbours have to hand over data, so re-replication is relatively easy
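The ring lookup described above can be sketched as follows. It is a bare-bones version with one position per node (no virtual nodes) and SHA-256 as the hash; the class name `Ring`, the method names and the node names are hypothetical.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Place a key or a node name on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replicas: int = 3):
        self.replicas = replicas
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self.ring, (ring_position(node), node))

    def preference_list(self, key: str):
        """Coordinator node first, then the next nodes clockwise on the ring."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
before = {f"key-{i}": ring.preference_list(f"key-{i}")[0] for i in range(1000)}
ring.add_node("node-e")
moved = sum(1 for key, owner in before.items() if ring.preference_list(key)[0] != owner)
print(f"{moved} of 1000 keys changed coordinator")
```

Only keys that now fall into the new node's arc change coordinator; every other key stays where it was, which is why adding a node does not reshuffle the whole data set.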

And if you don't take anything else from this presentation...

This is the edge today - Tons of interesting research waiting to be done - Ability to leverage these solutions to process terabytes of data cheaply - Ability to seize new opportunities - Innovation is the only thing keeping you/us ahead - Are you preparing yourself for tomorrow's technologies? Tomorrow's research?

Images http://www.flickr.com/photos/60861613@N00/3526232773/sizes/m/in/photostream/ http://www.zazzle.com/sql_awesome_me_tshirt-235011737217980907 http://geekandpoke.typepad.com/geekandpoke/2011/01/nosql.html http://hadoop.apache.org/common/docs/current/hdfs_design.html http://www.flickr.com/photos/unitednationsdevelopmentprogramme/4273890959/
