Apache Cassandra at the Geek2Geek Berlin

DataStax EMEA
Apache Cassandra and DataStax Enterprise

Agenda
1. Introduction
2. Apache Cassandra
3. Cassandra Query Language
4. Internet of Things / Data Modeling
5. DataStax Enterprise
6. What's New

About me
Christian Johannsen
Solutions Engineer @ DataStax
@cjohannsen81

Introduction
A short introduction into the NoSQL Space

CAP Theorem
• In distributed systems, consistency, availability and partition tolerance stand in a mutually dependent relationship
• Enhancing any two of these will diminish the third

What is Apache Cassandra
• Apache Cassandra is a massively scalable and highly available NoSQL database.
• Cassandra is designed to handle big data workloads across multiple data centers, with no single point of failure, providing enterprise performance
• Design influences: Dynamo and BigTable
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

What is Apache Cassandra
• Masterless Architecture with read/write anywhere design
• Continuous Availability with no single point of failure
• Multi-Data Center and Zone support
• Flexible data model for unstructured, semi-structured and structured data
• Linearly scalable performance with online expansion (scale-out and scale-up)
• Security with integrated authentication
• Operationally simple
• CQL - Cassandra Query Language
[Chart: throughput growing as the cluster expands: 100,000, 200,000 and 400,000 txns/sec]

Cassandra Adoption
Source: db-engines.com, Feb. 2014

Apache Cassandra - Important
• Cluster - A ring of Cassandra nodes
• Node - A Cassandra instance
• Replication Factor (RF) - How many copies of your data?
• Replication Strategy - SimpleStrategy vs. NetworkTopologyStrategy
• Consistency Level (CL) - What consistency should be ensured for reads/writes?
• Partitioner - Decides which node stores which rows (Murmur3Partitioner by default)
• Tokens - Hash values assigned to nodes
Follow-Up: http://planetcassandra.org/blog/introduction-to-cassandra-clusters/
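
To make the terms above concrete, here is a minimal Python sketch of a token ring: the partitioner hashes a partition key to a token, and the replicas are the next RF nodes clockwise on the ring, roughly what SimpleStrategy does. It rests on several assumptions: MD5 stands in for Murmur3, every node gets exactly one token derived from its name, and the node names and the key are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes):
    """Give each node one token (assumption: hash of its name) and sort by token."""
    return sorted((token_for(name), name) for name in nodes)


def replicas_for(partition_key, ring, rf):
    """Walk the ring clockwise from the key's token and take the next rf nodes."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


ring = build_ring(["node1", "node2", "node3", "node4", "node5"])
owners = replicas_for("weatherstation-1234", ring, rf=3)
print(owners)  # three distinct node names; which ones depends on the hash
```

Because the placement depends only on the key and the ring, any node can work out where a row lives without asking a master.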

Cassandra - Locally Distributed
• Client reads or writes to any node
• Node coordinates with others (gossip protocol)
• Data read or replicated in parallel
• RF = 3 in this example
• Each node stores 60% of the cluster's data, i.e. 3/5
[Diagram: five-node ring; Node 1 holds the 1st copy, Node 2 the 2nd copy and Node 3 the 3rd copy]

Cassandra - Rack/Zone aware
[Diagram: five nodes spread over Rack 1, Rack 2 and Rack 3; Node 1 holds the 1st copy, Node 3 the 2nd copy and Node 5 the 3rd copy]
• Cassandra is aware of which rack or
zone each node resides in
• It will attempt to place each data copy in
a different rack
• RF=3 in this example
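
The rack-aware placement described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual NetworkTopologyStrategy code: it walks the ring, skips nodes whose rack already holds a copy, and falls back to the skipped nodes only if there are fewer racks than copies. The rack assignment in `rack_of` is an assumption made for the example.

```python
def rack_aware_replicas(ring_order, rack_of, rf):
    """Pick rf replicas, preferring nodes in racks that hold no copy yet.

    ring_order: nodes in ring order, starting at the key's first replica.
    rack_of:    mapping of node name to rack name.
    """
    chosen, seen_racks, skipped = [], set(), []
    for node in ring_order:
        if len(chosen) == rf:
            break
        if rack_of[node] in seen_racks:
            skipped.append(node)  # rack already has a copy; keep as fallback
        else:
            chosen.append(node)
            seen_racks.add(rack_of[node])
    # Fewer racks than rf: fill the remaining copies from the skipped nodes.
    chosen.extend(skipped[: rf - len(chosen)])
    return chosen


# Hypothetical layout: five nodes on three racks.
rack_of = {
    "node1": "rack1",
    "node2": "rack1",
    "node3": "rack2",
    "node4": "rack2",
    "node5": "rack3",
}
ring_order = ["node1", "node2", "node3", "node4", "node5"]
replicas = rack_aware_replicas(ring_order, rack_of, rf=3)
print(replicas)
```

With this layout the three copies land on three different racks, so losing any single rack still leaves two copies reachable.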

Cassandra - DC/Region aware
• Active Everywhere - reads/writes in multiple data centers
• Client writes locally
• Data syncs across WAN
• Replication Factor per DC
• A different number of nodes per data center is possible
[Diagram: two five-node rings, DC: USA and DC: EUROPE; in each, Node 1 holds the 1st copy, Node 2 the 2nd copy and Node 3 the 3rd copy]

Cassandra - Tuneable Consistency
• Consistency Level (CL)
• Client specifies per operation
• Handles multi-data center operations
• ALL = All replicas ack
• QUORUM = A majority (more than 50%) of replicas ack
• LOCAL_QUORUM = A majority (more than 50%) of replicas in the local DC ack
• ONE = Only one replica acks
• Plus more (see docs)
• Blog: Eventual Consistency != Hopeful Consistency
http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis/
[Diagram: parallel write with CL=QUORUM to the replicas on Nodes 1, 2 and 3; ack times shown are 5 μs, 12 μs, 500 μs and 12 μs]
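
As a worked example of the levels listed above, the following Python sketch computes how many replica acks each consistency level waits for. A quorum is a majority, i.e. floor(RF / 2) + 1. The function names are made up for illustration; the data center names and replication factors are borrowed from the keyspace example in the CQL Basics slide.

```python
def quorum(replica_count: int) -> int:
    """Smallest majority of replica_count replicas."""
    return replica_count // 2 + 1


def required_acks(level: str, rf_per_dc: dict, local_dc: str) -> int:
    """Number of replica acks the coordinator waits for at a given level."""
    total = sum(rf_per_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)  # majority across all data centers
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])  # majority in the local DC only
    if level == "ALL":
        return total
    raise ValueError(f"unsupported consistency level: {level}")


rf_per_dc = {"DataCentre1": 3, "DataCentre2": 2}
for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL"):
    print(level, required_acks(level, rf_per_dc, local_dc="DataCentre1"))
```

With RF = 3 in a single data center, QUORUM needs 2 acks, which is why one slow or unavailable replica does not hold up the request.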

Cassandra - Node failure
• A single node failure should not cause a request to fail.
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM
• A majority of replicas ack, so the request is a success
[Diagram: parallel write with CL=QUORUM to the replicas on Nodes 1, 2 and 3, one of which is down; ack times shown are 5 μs, 12 μs and 12 μs]

Cassandra - Node Recovery
• When a write is performed and a replica node for the row is unavailable, the coordinator stores a hint locally (for up to 3 hours by default)
• When the node recovers, the coordinator replays the missed writes.
• Note: a hinted write does not count towards the consistency level
• Note: you should still run repairs across your cluster
[Diagram: five-node ring with copies on Nodes 1, 2 and 3; the coordinator stores hints while Node 3 is offline]
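
The interplay of node failure, consistency level and hints can be sketched as a small in-memory simulation. This is a simplified model under stated assumptions, not Cassandra's implementation: replicas are plain dicts, time is passed in as seconds, and the `Coordinator` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

HINT_WINDOW_SECONDS = 3 * 60 * 60  # the 3-hour window from the slide


@dataclass
class Coordinator:
    replicas: dict                                  # node name -> dict used as storage
    down_since: dict = field(default_factory=dict)  # node name -> time it went down
    hints: list = field(default_factory=list)       # (target node, key, value)

    def write(self, key, value, required_acks, now):
        """Write to all live replicas; True if enough of them acknowledged."""
        acks = 0
        for node, storage in self.replicas.items():
            if node not in self.down_since:
                storage[key] = value
                acks += 1
            elif now - self.down_since[node] <= HINT_WINDOW_SECONDS:
                self.hints.append((node, key, value))  # kept for later, not an ack
            # Otherwise the node has been down too long: no hint, repair needed.
        return acks >= required_acks

    def node_recovered(self, node):
        """Replay the hints stored for a node that is back online."""
        self.down_since.pop(node, None)
        for target, key, value in self.hints:
            if target == node:
                self.replicas[node][key] = value
        self.hints = [hint for hint in self.hints if hint[0] != node]


coordinator = Coordinator(replicas={"node1": {}, "node2": {}, "node3": {}})
coordinator.down_since["node3"] = 0
succeeded = coordinator.write("station-1", "72F", required_acks=2, now=60)
coordinator.node_recovered("node3")
print(succeeded, coordinator.replicas["node3"])
```

The write succeeds at QUORUM (2 of 3 acks) although Node 3 is down, and the hint brings Node 3 up to date once it returns. Writes that arrive after the hint window has passed leave no hint, which is one reason regular repairs are still needed.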

Cassandra Rack/Zone Failure
• Cassandra will place the data in as many different racks or availability zones as it can.
• This example:
• RF = 3
• CL = QUORUM
• AZ/Rack 2 fails
• Data copies still available in Node 1 and Node 5
• Quorum can still be honored (a majority of replicas ack), so the request is a success
[Diagram: five nodes spread over Rack 1, Rack 2 and Rack 3; Node 1 holds the 1st copy, Node 3 the 2nd copy and Node 5 the 3rd copy]

Cassandra is fast!
• University of Toronto study:

Why is Cassandra so fast?
• Write-optimised: sequential writes to disk
• Fast merging: when an SSTable is big enough, it is merged with existing SSTables
Operational Simplicity
• Cassandra is a complete product – there is not a multitude
of components to install, set-up and monitor.
• Extremely simple to administer and deploy
• Backups are instantaneous and simple to restore
• Supports snapshots, incremental backups and point-in-time recovery.
• Cassandra can handle non-uniform hardware and disks.
o This enables the mixing of solid state and spinning disks in a single cluster and pinning tables
to workload-appropriate disks.
• No downtime is required in Cassandra for upgrades or
adding/removing servers from the cluster. Scale-Up and
Scale-Out are easy to manage.

CQL
• Cassandra Query Language
• CQL is intended to provide a common, simpler and easier to use
interface into Cassandra - and you probably already know it!
• e.g. SELECT * FROM users
• Usual statements:
• CREATE / DROP / ALTER TABLE / SELECT

CQLSH
• Command line interface that comes with Cassandra
• Supports some additional statements:
Command     | Description
------------+----------------------------------------------------------------
CAPTURE     | Captures command output and appends it to a file
CONSISTENCY | Shows the current consistency level, or given a level, sets it
COPY        | Imports and exports CSV (comma-separated values) data
DESCRIBE    | Provides information about a Cassandra cluster or data objects
EXIT        | Terminates cqlsh
SHOW        | Shows the Cassandra version, host, or data type

CQL Basics
CREATE KEYSPACE league WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DataCentre1': 3, 'DataCentre2': 2};
USE league;
CREATE TABLE teams (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);
SELECT * FROM teams WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';
INSERT INTO teams (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);

Internet of Things / Data Models

It's about the data
• Sensors
• CPU, Network Card, Electronic Power Meter, Resource Utilization,
Weather
• Clickstream data
• Historical trends
• Stock Ticker
• Anything that varies on a temporal basis
• Top Ten Most Popular Videos

Data Modeling
• Data modeling is a process that involves
• Collection and analysis of data requirements in an information
system
• Identification of participating entities and relationships among
them
• Identification of data access patterns
• A particular way of organizing and structuring data
• Design and specification of a database schema
• Schema optimization and data indexing techniques
• Data modeling = Science + Art

Why Cassandra for time series data
• Cassandra is based on the BigTable storage model
• One row key and lots of (variable) columns
• Single layout on disk

Time series example
• Storing weather data
• One weather station
• Temperature measurement every minute

Time series example - query data
• Weather station ID = data locality: all readings of one station live on a single node

Table Definition
• Data is partitioned by weather station ID and ordered by time
• The timestamp goes in the clustering column
• The measurement is stored in the non-clustering column(s)
CREATE TABLE temperature (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
);

INSERT and QUERY data
• Simple to insert:
INSERT INTO temperature (weatherstation_id, event_time, temperature)
VALUES ('1234abcd', '2013-12-11 07:01:00', '72F');
• Simple to query:
SELECT temperature FROM temperature WHERE weatherstation_id = '1234abcd'
AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

Time Series Partitioning
• With the previous table, i.e. PRIMARY KEY (weatherstation_id, event_time), all readings of a station end up in one very large row in a single partition
• This would have to fit on 1 node.
• Cassandra can store up to 2 billion columns per storage row.
• The solution is a composite partition key that splits the data up:
CREATE TABLE temperature (
  weatherstation_id text,
  date text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY ((weatherstation_id, date), event_time)
);
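
With a composite partition key, the application has to derive the `date` bucket itself and know which buckets a time range touches. A minimal Python sketch, assuming the `date` column holds an ISO day string such as 2013-04-03 (the slide does not specify the format) and using hypothetical helper names:

```python
from datetime import datetime, timedelta


def partition_key(weatherstation_id: str, event_time: datetime) -> tuple:
    """Composite partition key (weatherstation_id, date) for one reading."""
    return (weatherstation_id, event_time.date().isoformat())


def partitions_for_range(weatherstation_id: str, start: datetime, end: datetime) -> list:
    """Every (station, date) partition a query from start to end has to visit."""
    day_count = (end.date() - start.date()).days
    return [
        (weatherstation_id, (start.date() + timedelta(days=offset)).isoformat())
        for offset in range(day_count + 1)
    ]


print(partition_key("1234abcd", datetime(2013, 12, 11, 7, 1)))
print(partitions_for_range("1234abcd",
                           datetime(2013, 4, 3, 23, 0),
                           datetime(2013, 4, 5, 1, 0)))
```

The trade-off: partitions stay bounded at one day of readings per station, but a query spanning several days becomes one query per day bucket.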

Compound Keys
The Primary Key
• The key uniquely identifies a row.
• A compound primary key consists of:
• A partition key
• One or more clustering columns
e.g. PRIMARY KEY (partition key, clustering columns, ...)
• The partition key determines on which node the partition resides
• Data is ordered by the clustering columns within the partition
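
The two roles of a compound key can be modeled in Python: the partition key selects a bucket, and rows inside the bucket are kept sorted by the clustering value, which makes range queries a simple slice. This is an in-memory illustration only; the `Table` class is hypothetical and timestamps are ISO strings so that they sort chronologically.

```python
import bisect
from collections import defaultdict


class Table:
    """Partition key -> rows kept sorted by their clustering value."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key -> [(clustering, value)]

    def insert(self, partition_key, clustering, value):
        rows = self.partitions[partition_key]
        keys = [c for c, _ in rows]
        i = bisect.bisect_left(keys, clustering)
        if i < len(rows) and keys[i] == clustering:
            rows[i] = (clustering, value)       # same primary key: overwrite
        else:
            rows.insert(i, (clustering, value))  # keep clustering order

    def select_range(self, partition_key, low, high):
        """Rows of one partition with low < clustering < high."""
        rows = self.partitions.get(partition_key, [])
        keys = [c for c, _ in rows]
        return rows[bisect.bisect_right(keys, low):bisect.bisect_left(keys, high)]


table = Table()
for minute, reading in [("03", "74F"), ("01", "72F"), ("04", "75F"), ("02", "73F")]:
    table.insert("1234abcd", f"2013-04-03 07:{minute}:00", reading)

result = table.select_range("1234abcd", "2013-04-03 07:01:00", "2013-04-03 07:04:00")
print(result)
```

Rows come back in clustering order regardless of insertion order, and the range query never looks at another partition.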

Data Modeling
• Any questions?
• Feel free to learn more about data modeling online:
Part 1: The Data Model is Dead, Long Live the Data Model
http://www.youtube.com/watch?v=px6U2n74q3g
Part 2: Become a Super Modeler
http://www.youtube.com/watch?v=qphhxujn5Es
Part 3: The World's Next Top Data Model
http://www.youtube.com/watch?v=HdJlsOZVGwM

DataStax at a glance
• Founded in April 2010
• Santa Clara, Austin, New York, London, Sydney
• 330+ Employees
• ~25 Percent
• 500+ Customers

DataStax delivers value
[Diagram: certified, enterprise-ready Cassandra together with Security, Analytics, Search, Visual Monitoring, Management Services, In-Memory, Dev. IDE & Drivers, Professional Services, and Support & Training, labeled Enterprise Functionality and Commercial Confidence]

Enterprise Integrations
• DataStax adds enterprise features such as Hadoop, Solr and Spark integration

DataStax OpsCenter
• DataStax OpsCenter is a browser-based, visual management and
monitoring solution for Apache Cassandra and DataStax Enterprise
• Functionality is also exposed via HTTP APIs

Native Drivers
• Different native drivers available: Java, Python etc.
• Load balancing policies (the client driver receives cluster updates)
  o Data center aware
  o Latency aware
  o Token aware
• Reconnection policies
• Retry policies
  o Downgrading consistency
• Plus others
• http://www.datastax.com/download/clientdrivers
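
To illustrate what "token aware" and "data center aware" mean for a driver, here is a small Python sketch of how a client might order the hosts it tries for one request. It models the idea only and does not use the API of any DataStax driver; the function name, the node names and the data center layout are hypothetical.

```python
def query_plan(replicas, dc_of, local_dc, all_nodes):
    """Order in which a client tries hosts for a request.

    Token aware: replicas of the requested key come first.
    DC aware:    only hosts in the client's local data center are used.
    """
    local_nodes = [node for node in all_nodes if dc_of[node] == local_dc]
    local_replicas = [node for node in local_nodes if node in replicas]
    other_local = [node for node in local_nodes if node not in replicas]
    return local_replicas + other_local


dc_of = {"eu1": "EUROPE", "eu2": "EUROPE", "eu3": "EUROPE", "us1": "USA", "us2": "USA"}
plan = query_plan(
    replicas={"eu2", "us1"},
    dc_of=dc_of,
    local_dc="EUROPE",
    all_nodes=["eu1", "eu2", "eu3", "us1", "us2"],
)
print(plan)
```

Sending the request straight to a local replica saves a network hop, because the contacted node can serve the data itself instead of forwarding the request as coordinator.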

DevCenter 1.1
• Visual Query Tool for Developers and Administrators
• Easily create and run Cassandra Queries
• Visually navigate database objects
• Context-based suggestions

DataStax Office Demo
• 32 Raspberry Pis
• 16 per DataStax Enterprise 4.5 cluster
• Managed in OpsCenter 5.0
• "Red Button" takes down one data center
• Not a performance demo, but a demonstration of
• Availability
• Commodity hardware

DataStax Enterprise
Feature                                | Open Source                | DataStax Enterprise
---------------------------------------+----------------------------+-------------------------------
Database Software                      |                            |
Data Platform                          | Latest Community Cassandra | Production Certified Cassandra
Core security features                 | Yes                        | Yes
Enterprise security features           | No                         | Yes
Built-in automatic management services | No                         | Yes
Integrated analytics                   | No                         | Yes
Integrated enterprise search           | No                         | Yes
Workload/Workflow Isolation            | No                         | Yes
Easy migration of RDBMS and log data   | No                         | Yes
Certified Service Packs                | No                         | Yes
Certified platform support             | No                         | Yes
Management Software                    |                            |
OpsCenter                              | Basic functionality        | Advanced functionality
Services                               |                            |
Community Support                      | Yes                        | Yes
DataStax 24x7x365 Support              | No                         | Yes
Quarterly Performance Reviews          | No                         | Yes

DataStax Comparison
Feature                           | Standard       | Pro            | Max
----------------------------------+----------------+----------------+---------------
Server Data Management Components |                |                |
Production-certified Cassandra    | Yes            | Yes            | Yes
Advanced security option          | Yes            | Yes            | Yes
Repair service                    | Yes            | Yes            | Yes
Capacity planning service         | Yes            | Yes            | Yes
Enterprise search (built-in Solr) | No             | Yes            | Yes
Analytics (built-in Hadoop)       | No             | No             | Yes
Management Tools                  |                |                |
OpsCenter Enterprise              | Yes            | Yes            | Yes
Support Services                  |                |                |
Expert Support                    | 24x7x1         | 24x7x1         | 24x7x1
Partner Development Support       | Business hours | Business hours | Business hours
Certified service packs           | Yes            | Yes            | Yes
Hot fixes                         | Yes            | Yes            | Yes

Use-Cases
• Netflix
• preference data captured by Cassandra
• Comcast
• app messaging to track a favourite team's score while watching a movie, playlists and recommendations
• Weather Channel
• stat tracking, caching data mashups and content generation system powered by Cassandra

What is Spark?
• Open source since 2010 (an Apache project since 2013) - Analytics Framework
• Up to 10-100x faster than Hadoop MapReduce
• In-memory storage for read and write data
• Single JVM process per node
• Rich Scala, Java and Python APIs
• 2x-5x less code
• Interactive Shell

Why Spark on Cassandra?
• Data model independent queries
• cross-table operations (JOIN, UNION, etc.)
• complex analytics (e.g. machine learning)
• data transformation, aggregation etc.
• stream processing (coming soon)
• all nodes are Spark workers
• by default resilient to worker failures
• first node promoted as Spark Master
• Standby Master promoted on failure
• Master HA available in DataStax Enterprise

2.1 Release - User Defined Types
CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);
CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  address frozen<address>
);
SELECT id, name, address.city, address.phones FROM users;
 id       | name  | address.city | address.phones
----------+-------+--------------+------------------------------
 63bf691f | chris | Berlin       | {'0201234567', '0796622222'}

2.1 Release - Secondary Indexes on collections
CREATE TABLE songs (
id uuid PRIMARY KEY,
artist text,
album text,
title text,
data blob,
tags set<text>
);
CREATE INDEX song_tags_idx ON songs(tags);
SELECT * FROM songs WHERE tags CONTAINS 'blues';
id | album | artist | tags | title
----------+---------------+-------------------+-----------------------+------------------
5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

How to start in production?
• DataStax Enterprise or Community
• Hardware:
• min. 8GB RAM - optimal price-performance sweet spot is 16GB to 64GB
• 8-core CPU - Cassandra is so efficient at writing that the CPU is the limiting factor
• SSDs - commit log, plus 50% free disk space for compaction; ext3/4 or XFS file system
• Nodes - the recommended minimum cluster size is 3 nodes
• Alternative: Use the Amazon Images
(http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html)
