Cassandra
Introduction & Key Features
Meetup Vienna Cassandra Users
13th of January 2014
philipp.potisk@geroba.com
Definition
Apache Cassandra is an open source, distributed,
decentralized, elastically scalable, highly available,
fault-tolerant, tuneably consistent, column-oriented
database that bases its distribution design on Amazon’s
Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it is now used at some of the most
popular sites on the Web. [Cassandra: The Definitive Guide,
Eben Hewitt, 2010]
13/01/2014

Cassandra Introduction & Key Features by Philipp Potisk

History
• Dynamo (Amazon), 2007
• Bigtable (Google), 2006
• Cassandra released as open source, 2008
Key Features

• Distributed and decentralized
• Elastic scalability
• High availability and fault tolerance
• Tunable consistency
• Column-oriented key-value store
• CQL, an SQL-like query interface
• High performance
Distributed and Decentralized
• Distributed: capable of running on multiple machines
• Decentralized: no single point of failure
• No master-slave issues, thanks to the peer-to-peer architecture (nodes communicate via the "gossip" protocol)
• A single Cassandra cluster may run across geographically dispersed data centers
• Read and write requests can be sent to any node

[Figure: nodes 1 to 12 spread across Datacenter 1 and Datacenter 2]
Elastic Scalability

• Cassandra scales horizontally: capacity grows by adding machines that each hold all or part of the data
• Adding nodes increases throughput linearly
• Nodes can be added or removed seamlessly
• Scales linearly to terabytes and petabytes of data

[Figure: a 4-node ring with performance throughput = N, and an 8-node ring with performance throughput = N x 2]
Scaling Benchmark By Netflix*
Setup: 48, 96, 144 and 288 instances, driven by 10, 20, 30 and 60 clients
respectively. Each client generated ~20,000 writes/s of 400 bytes each.

“Cassandra scales linearly far beyond our current capacity requirements,
and very rapid deployment automation makes it easy to manage. In particular,
benchmarking in the cloud is fast, cheap and scalable, …”

* http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
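As a back-of-the-envelope check of the setup above, the aggregate write load offered by the clients can be computed from the quoted figures. This is a sketch of offered load only (clients times ~20,000 writes/s), not the throughput Netflix measured; the constant and function names are illustrative.

```python
# Offered write load in the Netflix benchmark setup, using only the figures
# quoted above. This is client-side offered load, not measured throughput.
INSTANCES = [48, 96, 144, 288]
CLIENTS = [10, 20, 30, 60]
WRITES_PER_CLIENT = 20_000   # ~20,000 writes/s per client
BYTES_PER_WRITE = 400


def offered_load(clients: int) -> int:
    """Total writes per second offered by `clients` load generators."""
    return clients * WRITES_PER_CLIENT


rows = []
for instances, clients in zip(INSTANCES, CLIENTS):
    total = offered_load(clients)
    rows.append((instances, clients, total, total / instances))

for instances, clients, total, per_node in rows:
    mb_per_s = total * BYTES_PER_WRITE / 1e6
    print(f"{instances:>3} instances, {clients:>2} clients: "
          f"{total:>9,} writes/s ({mb_per_s:.0f} MB/s), "
          f"{per_node:,.0f} writes/s per instance")
```

The per-instance figure comes out the same in every configuration (about 4,167 writes/s), which shows that the client load was scaled in proportion to the cluster size.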
High Availability and Fault Tolerance
• What does high availability require?
  - Multiple networked computers operating in a cluster
  - A facility for recognizing node failures
  - Failing over requests to another part of the system
• Cassandra is highly available: there is no single point of failure, thanks to the peer-to-peer architecture

[Figure: ring of 6 nodes]
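The failover idea can be sketched as follows: a request is tried against the replicas for a key, skipping any node currently marked as down. This is a toy model and not Cassandra's implementation; `Node`, `send_request` and the `is_up` flag are hypothetical names, and a real cluster detects failures through gossip rather than a boolean flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node holding a replica."""
    name: str
    is_up: bool = True
    data: dict = field(default_factory=dict)


def send_request(replicas: list, key: str) -> tuple:
    """Return (node name, value) from the first live replica that has the key.

    Down nodes are skipped, which models failing a request over to
    another part of the system.
    """
    for node in replicas:
        if not node.is_up:
            continue  # failure recognized: try the next replica
        if key in node.data:
            return node.name, node.data[key]
    raise RuntimeError(f"no live replica could serve {key!r}")


# Three replicas of the same row; the first one has failed.
replicas = [Node(f"node{i}", data={"jsmith": "profile"}) for i in (1, 2, 3)]
replicas[0].is_up = False

served_by, value = send_request(replicas, "jsmith")
print(served_by, value)  # node2 profile
```

Because every replica can answer, losing one node only changes which peer serves the request.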
Tunable Consistency
• Choose between strong and eventual
consistency
• Adjustable separately for read and write operations
• Conflicts are resolved during reads, because the design
focuses on write performance

[Figure: "TUNABLE" trade-off between Available and Consistency]

Use case dependent
level of consistency
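
One way to picture conflict resolution at read time: each replica returns its value together with a write timestamp, and the read keeps the newest one. The sketch below assumes a last-write-wins rule based on timestamps; `Versioned` and `resolve_read` are hypothetical names and the data is made up for illustration.

```python
from typing import List, NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: int  # write time; larger means more recent


def resolve_read(responses: List[Versioned]) -> Versioned:
    """Pick the most recently written value among replica responses."""
    if not responses:
        raise ValueError("no replica responded")
    return max(responses, key=lambda r: r.timestamp)


# Two replicas answer a read; one still holds the older write (t1).
responses = [
    Versioned("jsmith@old.example", timestamp=1),  # written at t1
    Versioned("jsmith@new.example", timestamp=2),  # written at t2
]
winner = resolve_read(responses)
print(winner.value)  # jsmith@new.example
```

Writes stay cheap under this scheme because no replica has to coordinate before accepting a value; the comparison cost is paid by the reader. Bringing the stale replica up to date afterwards is omitted here.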
When do we have strong consistency?
• Simple formula:

  (nodes_written + nodes_read) > replication_factor

• Example: NW = 2, NR = 2, RF = 3, so 2 + 2 > 3 and reads are strongly consistent
• If the inequality holds, a read always reflects the most recent write, because the nodes read overlap the nodes written in at least one replica
• If not: weak consistency → eventually consistent

[Figure: replicas of row "jsmith" holding versions t1 and t2, with NW: 2, NR: 2, RF: 3]
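The formula can be checked mechanically. In this minimal sketch the function names are illustrative, and the ONE / QUORUM / ALL labels are an addition that borrows common consistency-level names, with QUORUM taken as a majority of replicas (floor(RF / 2) + 1).

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(nodes_written: int, nodes_read: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    return nodes_written + nodes_read > replication_factor


RF = 3
# (label, nodes written, nodes read)
cases = [
    ("slide example", 2, 2),
    ("write ONE, read ONE", 1, 1),
    ("write ALL, read ONE", RF, 1),
    ("write QUORUM, read QUORUM", quorum(RF), quorum(RF)),
]
for label, nw, nr in cases:
    verdict = "strong" if is_strongly_consistent(nw, nr, RF) else "eventual"
    print(f"{label}: NW={nw} NR={nr} RF={RF} -> {verdict}")
```

Only the ONE/ONE combination fails the inequality for RF = 3, so it is the only case here that falls back to eventual consistency.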
Column-oriented Key-Value Store
Row Key1 | Column Key1: Column Value1 | Column Key2: Column Value2 | Column Key3: Column Value3 | …
…        | …

• Data is stored in sparse multidimensional hash tables
• A row can have multiple columns; the number of columns may differ from row to row
• Each row has a unique key, which also determines partitioning
• No relations!
• Rows are stored sorted by row key *
• Columns within a row are stored sorted by column key

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
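
The nested-map model and the hashing of row keys can be illustrated with plain dictionaries. This is a sketch under stated assumptions: the sample rows are invented, `columns_sorted` and `node_for` are hypothetical helpers, and MD5 modulo the node count is a simplification (a real cluster maps keys to tokens on a ring through its partitioner).

```python
import hashlib

# Map<RowKey, SortedMap<ColumnKey, ColumnValue>> as nested dicts.
# Rows are sparse: each row carries only the columns it actually has.
table = {
    "jsmith": {"email": "jsmith@example.com", "city": "Vienna"},
    "mjones": {"email": "mjones@example.com"},
}


def columns_sorted(row_key: str) -> list:
    """Columns of one row, ordered by column key."""
    return sorted(table[row_key].items())


def node_for(row_key: str, node_count: int) -> int:
    """Map a row key to a node index by hashing it.

    MD5 modulo node_count is a simplification chosen for this sketch.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


print(columns_sorted("jsmith"))  # [('city', 'Vienna'), ('email', 'jsmith@example.com')]
for key in table:
    print(key, "-> node", node_for(key, node_count=4))
```

Hashing decouples placement from the key's natural ordering, so keys that sort next to each other do not pile up on the same node.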
CQL – An SQL-like query interface
• “CQL 3 is the default and primary interface into the Cassandra DBMS” *
• Familiar SQL-like syntax that maps to Cassandra's storage engine and
simplifies data modelling
CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);

INSERT INTO songs
  (id, title, artist,
   album, tags)
VALUES (
  'a3e64f8f...',
  'La Grange',
  'ZZ Top',
  'Tres Hombres',
  {'cool', 'hot'});

SELECT *
FROM songs
WHERE id = 'a3e64f8f...';

“SQL-like” but NOT
relational SQL

* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf
High Performance
• Optimized from the ground up
for high throughput
• All disk writes are sequential,
append-only operations
• No reading before writing
• Cassandra's threading model is
optimized for running on
multiprocessor/multicore
machines

Optimized for writing,
but fast reads are
possible as well
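
The write path described above can be sketched as a toy store in which every write is appended to a log and nothing is read or updated in place. The class below is illustrative only: `AppendOnlyStore` is a hypothetical name, an in-memory `StringIO` stands in for a sequential file on disk, and the `memtable` attribute borrows a Cassandra term for the in-memory view without modelling the real component.

```python
import io
import json
from typing import Optional


class AppendOnlyStore:
    """Toy write path: every write is appended to a log, never updated in place."""

    def __init__(self) -> None:
        self.log = io.StringIO()   # stands in for a sequential file on disk
        self.memtable = {}         # in-memory view of the latest values

    def write(self, key: str, value: str) -> None:
        # No read-before-write: the record is appended blindly, even if
        # the key already exists.
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.memtable[key] = value

    def read(self, key: str) -> Optional[str]:
        return self.memtable.get(key)

    def log_records(self) -> list:
        return [json.loads(line) for line in self.log.getvalue().splitlines()]


store = AppendOnlyStore()
store.write("jsmith", "v1")
store.write("jsmith", "v2")   # an overwrite is just another append

print(store.read("jsmith"))       # v2
print(len(store.log_records()))   # 2
```

Both versions remain in the log while reads see only the latest one, which is why an overwrite costs no more than an insert.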

Benchmark from 2011 (Cassandra 0.7.4)*
Cassandra showed outstanding throughput in the “INSERT-only” test, with 20,000 ops.

• Insert: enter 50 million 1K-sized records
• Read: search key for a one hour period + optional update
• Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory

* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
Benchmark from 2013 (Cassandra 1.1.6)*

* Benchmarking Top NoSQL Databases by End Point Corporation,
http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf
Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB
When do we need these features?
• Lots of writes, statistics, and analysis
• Geographical distribution
• Large deployments
• Evolving applications
Who is using Cassandra?

eBay Data Infrastructure*

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

• Thousands of nodes
• The world's largest cluster, with 2K+ nodes

Not replacing RDBMS but complementing!

* by Jay Patel, Cassandra Summit June 2013 San Francisco
Cassandra Use Case at eBay
Application/Use Case
• Time-series data and real-time insights
• Fraud detection & prevention
• Quality Click Pricing for affiliates
• Order & Shipment Tracking
•…
• Server metrics collection
• Taste graph-based next-gen recommendation
system
• Social Signals on eBay Product & Item pages

Why Cassandra?
• Multi-Datacenter (active-active)
• No SPOF
• Easy to scale
• Write performance
• Distributed Counters

Cassandra/Hadoop Deployment

Summary
• History
• Key features of Cassandra
  • Distributed and Decentralized
  • Elastic Scalability
  • High Availability and Fault Tolerance
  • Tunable Consistency
  • Column-oriented key-value store
  • CQL interface
  • High Performance
• eBay Use Case

Apache project: http://cassandra.apache.org
Community portal: http://planetcassandra.org
Documentation: http://www.datastax.com/docs
