Couchbase and Apache Spark

Efficient data crunching in a fast-moving world

©2015 Couchbase Inc.
Matt Ingenthron
• Worked on large site scalability problems at previous company…
• memcached contributor
• Joined Couchbase very early and helped define key parts of the system

A Quick Architectural Introduction to Couchbase

Couchbase is a Document Oriented Database
• High availability cache
• Key-value store
• Document database
• Embedded database
• Sync management
Couchbase can be used in a number of ways. Developers often need a simple distributed hash table, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.

What makes Couchbase unique?
Enterprises choose Couchbase for several key advantages:
• Performance & scalability leader: sub-millisecond latency with high throughput; memory-centric architecture
• Multi-purpose: cache, key-value store, document database, and local/mobile database in a single platform
• Simplified administration: easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
• Always-on availability (24x365): data replication across nodes, clusters, and data centers

Multi-purpose database supports many uses
• Tunable built-in cache
– Consolidated cache and database
– Tune memory required based on application requirements
• Flexible schemas with JSON
– Represent data with varying schemas using JSON on the server or on the device
– Index and query data with JavaScript views
• Couchbase Lite
– Lightweight embedded DB for always-available apps
– Sync Gateway syncs data seamlessly with Couchbase Server

Couchbase leads in performance and scalability
• Auto sharding
– No manual sharding
– The database, not the user, manages data movement to scale out
• Memory-to-memory XDCR
– Market’s only memory-to-memory database replication across clusters and geos
– Provides disaster recovery / data locality
• Single node type
– Hugely simplifies management of clusters
– Easy to scale clusters by adding any number of nodes

Couchbase delivers always-on availability (24x365)
• High availability
– In-memory replication with manual or automatic failover
– Rack-zone awareness to minimize data unavailability
• Disaster recovery
– Memory-to-memory cross-cluster replication across data centers or geos
– Active-active topology with bidirectional setup
• Backup & restore
– Full backup or incremental backup with online restore
– Delta node catch-ups for faster recovery after failures

Simplified administration for exceptional ease of use
• Online upgrades and operations
– Online software, hardware and DB upgrades
– Indexing, compaction, rebalance, backup & restore
• Built-in enterprise-class admin console
– Perform all administrative tasks with the click of a button
– Monitor the status of the system visually at cluster level, database level, server level
• RESTful APIs
– All admin operations available via UI, REST APIs or CLI commands
– Integrate third-party monitoring tools easily using REST

Couchbase Server Architecture
Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
– C/C++
– Layer consolidation of caching and persistence
• Cluster manager:
– Erlang/OTP
– Administration UIs
– Out-of-band for data requests

Write Operation
[Diagram: APPLICATION SERVER writes DOC 1 into the MANAGED CACHE; REPLICATION QUEUE, DISK QUEUE and DISK sit behind the cache]
• Writes are async by default
• The application gets an acknowledgement when the write is successfully in RAM, and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based, so extremely fast
• Off-node replication is the primary level of HA
• Disk is written to as fast as possible, with no waiting
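
The write path above can be sketched as a toy model in Python. This is only an illustration of the idea (acknowledge once the document is in RAM, then replicate and persist in the background); the `Node` class, its attribute names and the `drain` method are hypothetical and are not Couchbase APIs.

```python
from collections import deque


class Node:
    """Toy model of one node: a RAM cache plus two outbound queues."""

    def __init__(self):
        self.cache = {}                   # managed cache (RAM)
        self.disk = {}                    # persisted documents
        self.replica = {}                 # stands in for a replica on another node
        self.disk_queue = deque()         # keys waiting to be persisted
        self.replication_queue = deque()  # keys waiting to be replicated

    def set(self, key, doc):
        # The write is acknowledged as soon as the document is in RAM.
        self.cache[key] = doc
        self.replication_queue.append(key)
        self.disk_queue.append(key)
        return "ack"

    def drain(self):
        # Background work: replicate and persist from the cached copies.
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replica[key] = self.cache[key]
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]


node = Node()
status = node.set("doc1", {"name": "DOC 1"})
# At this point the write is acknowledged but only lives in RAM.
in_ram_only = "doc1" in node.cache and "doc1" not in node.disk
node.drain()
```

A caller that needs stronger guarantees for a particular write would wait until the key has left the replication queue or the disk queue before continuing, which is the per-write trade-off the slide describes.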

Basic Operation
[Diagram: Couchbase Server 1, Couchbase Server 2 and Couchbase Server 3, each holding three ACTIVE shards and three REPLICA shards. Shards 1 to 9 are spread across the servers, and each shard's replica sits on a different server than its active copy]
• Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vbucket is replicated 1, 2 or 3 times (“peer-to-peer” replication)
• Docs are automatically hashed by the client to a shard
• The cluster map tells the client which server a shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access (“read your own writes”)
• A single Couchbase node can achieve hundreds of thousands of ops/sec, so there is no need to scale reads
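
A minimal Python sketch of the lookup described above: the client hashes the key to a shard (vbucket), then consults the cluster map to find the server. The CRC32-modulo hash, the 1024-shard count, the round-robin layout and all function names are assumptions for illustration; a real client library may differ in the details.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed shard count, for illustration only


def vbucket_for(key, num_vbuckets=NUM_VBUCKETS):
    # Hash the document key to a shard; the same key always gives the same shard.
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_cluster_map(nodes, num_vbuckets=NUM_VBUCKETS):
    # Hypothetical layout: active copies assigned round-robin,
    # with the single replica placed on the next node.
    return {
        vb: (nodes[vb % len(nodes)], nodes[(vb + 1) % len(nodes)])
        for vb in range(num_vbuckets)
    }


def node_for(key, cluster_map):
    # All operations on a key go to the node holding the active copy.
    active, _replica = cluster_map[vbucket_for(key, len(cluster_map))]
    return active


nodes = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(nodes)
target = node_for("user::42", cluster_map)
```

Because the hash is computed on the client and the map is cached there, no lookup hop through a coordinator is needed before talking to the right server.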

Cache Ejection
[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE and DISK, with DOC 1 to DOC 5]
• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes from RAM data that has already been persisted
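
The ejection rule can be sketched in Python as follows. The slide only states that persisted data may be removed from RAM; the oldest-first order, the item-count capacity and the `ManagedCache` class are assumptions made for this illustration.

```python
class ManagedCache:
    """Toy cache that only ejects documents that are already on disk."""

    def __init__(self, capacity):
        self.capacity = capacity  # max number of documents kept in RAM
        self.ram = {}             # key -> doc, in insertion order
        self.disk = {}            # persisted documents
        self.dirty = set()        # keys not yet persisted

    def set(self, key, doc):
        self.ram[key] = doc
        self.dirty.add(key)

    def persist(self):
        # Flush every dirty document to disk.
        for key in self.dirty:
            self.disk[key] = self.ram[key]
        self.dirty.clear()

    def eject(self):
        # Drop the oldest persisted documents until RAM is within capacity.
        # Dirty documents are never ejected, so no write can be lost.
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]


cache = ManagedCache(capacity=2)
for i in range(1, 5):
    cache.set(f"doc{i}", {"id": i})
cache.persist()                # doc1..doc4 are now on disk
cache.set("doc5", {"id": 5})   # doc5 is still dirty
cache.eject()
```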

Cache Miss
[Diagram: APPLICATION SERVER issues GET DOC 1; DOC 1 is not in the MANAGED CACHE and is read back from DISK]
• Layer consolidation means a single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
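
A read-through lookup along these lines might look like the Python sketch below. The `ReadThroughCache` class and its `misses` counter are hypothetical, and the parallel disk fetch mentioned on the slide is left out to keep the example short.

```python
class ReadThroughCache:
    """Toy read-through cache: one interface, RAM first, disk on a miss."""

    def __init__(self, disk):
        self.ram = {}
        self.disk = disk   # dict standing in for persisted documents
        self.misses = 0

    def get(self, key):
        if key in self.ram:
            return self.ram[key]   # served straight from RAM
        self.misses += 1
        doc = self.disk[key]       # raises KeyError if the document does not exist
        self.ram[key] = doc        # keep it in RAM for the next read
        return doc


store = ReadThroughCache(disk={"doc1": {"name": "DOC 1"}})
first = store.get("doc1")    # cache miss, loaded from disk
second = store.get("doc1")   # cache hit
```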

Add Nodes to Cluster
[Diagram: Couchbase Server 4 and Couchbase Server 5 join the cluster and receive ACTIVE and REPLICA shards from the existing servers; READ/WRITE/UPDATE traffic continues during the move]
• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
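
The idea of moving only part of the data when nodes are added can be sketched in Python. This is not the actual rebalance algorithm: the `rebalance` function is hypothetical, it handles active copies only, and it simply caps every node at its fair share and moves the excess to the least-loaded nodes.

```python
def rebalance(vbucket_map, nodes):
    """Return a new {vbucket: node} map in which no node holds more than
    its fair share; only the excess vbuckets change owner."""
    fair_share = -(-len(vbucket_map) // len(nodes))  # ceiling division
    load = {node: 0 for node in nodes}
    new_map, pending = {}, []
    for vb, owner in sorted(vbucket_map.items()):
        if owner in load and load[owner] < fair_share:
            new_map[vb] = owner      # stays where it is
            load[owner] += 1
        else:
            pending.append(vb)       # has to move
    for vb in pending:
        dest = min(nodes, key=lambda node: load[node])
        new_map[vb] = dest
        load[dest] += 1
    return new_map


# Nine active shards on three servers, as in the Basic Operation slide.
old_map = {5: "s1", 2: "s1", 9: "s1",
           4: "s2", 7: "s2", 8: "s2",
           1: "s3", 3: "s3", 6: "s3"}
new_map = rebalance(old_map, ["s1", "s2", "s3", "s4", "s5"])
moved = sorted(vb for vb in old_map if old_map[vb] != new_map[vb])
```

In this example only three of the nine shards move, and all of them land on the two new servers; clients pick up the change through the updated cluster map.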

Node Unresponsive / Lost

Fail Over Node
[Diagram: one Couchbase Server has been lost; REPLICA shards on the remaining servers, including Couchbase Server 4 and Couchbase Server 5, are promoted to ACTIVE]
• Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas are not recreated, to preserve stability
• Best practice is to replace the node and rebalance
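
Failover as described above can be sketched in Python: replicas of the lost node's active shards are promoted, and lost replicas are not recreated until a later rebalance. The `fail_over` function and the single-replica map layout are assumptions for illustration.

```python
def fail_over(cluster_map, failed):
    """cluster_map: {vbucket: {"active": node, "replica": node or None}}.
    Returns a new map that no longer references the failed node."""
    new_map = {}
    for vb, entry in cluster_map.items():
        if entry["active"] == failed:
            # Promote the replica; no new replica is created here.
            new_map[vb] = {"active": entry["replica"], "replica": None}
        elif entry["replica"] == failed:
            # The active copy survives, but its replica is gone for now.
            new_map[vb] = {"active": entry["active"], "replica": None}
        else:
            new_map[vb] = dict(entry)
    return new_map


# Three-server layout from the Basic Operation slide.
active = {"s1": [5, 2, 9], "s2": [4, 7, 8], "s3": [1, 3, 6]}
replica = {"s1": [4, 1, 8], "s2": [6, 3, 2], "s3": [7, 9, 5]}
cluster_map = {}
for node, vbs in active.items():
    for vb in vbs:
        cluster_map[vb] = {"active": node, "replica": None}
for node, vbs in replica.items():
    for vb in vbs:
        cluster_map[vb]["replica"] = node

after = fail_over(cluster_map, "s3")
```

After the call, every shard still has an active copy, but six of the nine shards have no replica, which is why replacing the node and rebalancing is the recommended next step.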

Big Data = Operational + Analytic (NoSQL + Hadoop)
• Online
– Web/Mobile/IoT apps
– Millions of customers/consumers
• Offline
– Analytics apps
– Hundreds of business analysts

[Diagram: data flows from TRACKING and COLLECTION to ANALYSIS AND VISUALIZATION along a REAL-TIME TRACK and a BATCH TRACK. Labels: REST, FILTER, METRICS, COMPLEX EVENT PROCESSING, Real Time REPOSITORY, PERPETUAL STORE, ANALYTICAL DB, BUSINESS INTELLIGENCE, MONITORING, CHAT/VOICE SYSTEM, DASHBOARD]

Apache Spark: The Big Picture

Apache Spark
… is a fast and general-purpose engine for small- and large-scale data processing …

Components: Spark Core
Resilient Distributed Datasets
Clustering
Execution

Components: Spark SQL
Structured through DataFrames
Distributed querying with SQL

Components: Spark Streaming
Fault-tolerant streaming applications

Components: Spark MLlib
Built-In Machine Learning Algorithms

Components: Spark GraphX
Graph processing and graph-parallel computations

How does it work?
Source: http://spark.apache.org/docs/latest/cluster-overview.html

Spark Benefits
• Linearly scalable to 1000+ worker nodes
• Simpler to use than Hadoop MR
• Only partial recompute on failure
• For developers and data scientists
– machine learning
– R integration
• Tight but not mandatory Hadoop integration
– Sources, Sinks
– Scheduler
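
The "only partial recompute on failure" point can be illustrated without Spark itself. The Python sketch below is a toy stand-in for lineage-based recovery: each derived dataset remembers its parent and the function that produced it, so a lost partition is rebuilt alone. The `Dataset` class is hypothetical and is not the Spark API.

```python
class Dataset:
    """Toy stand-in for an RDD: partitions plus the recipe that produced them."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent
        self.fn = fn
        self.computed = 0  # number of partitions computed or recomputed
        if parent is None:
            self.partitions = list(partitions)
        else:
            # Derived partitions are filled in lazily.
            self.partitions = [None] * len(parent.partitions)

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def get(self, i):
        if self.partitions[i] is None:
            # Rebuild just this partition from the matching parent partition.
            source = self.parent.get(i)
            self.partitions[i] = [self.fn(x) for x in source]
            self.computed += 1
        return self.partitions[i]


base = Dataset(partitions=[[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)
everything = [doubled.get(i) for i in range(3)]  # computes 3 partitions

doubled.partitions[1] = None   # simulate losing one partition
recovered = doubled.get(1)     # only this partition is recomputed
```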

Spark vs Hadoop
• Spark is mainly RAM-bound, while Hadoop is mainly HDFS (disk) bound
• Fully compatible with Hadoop Input/Output
• Easier to develop against thanks to functional composition
• Hadoop is certainly more mature, but the Spark ecosystem is growing fast

Couchbase in the Spark Landscape
• Transparent generation and persistence of
– RDDs
– DataFrames
– DStreams
• Spark SQL and N1QL are a natural fit
• Linearly scale your data and application layer
• Share data between Spark applications
The perfect storage companion for your Spark applications.
Source: http://spark.apache.org/docs/latest/cluster-overview.html

Cluster Communication
[Diagram: Couchbase Server 1 to Couchbase Server 6, each with a Cluster Manager, Managed Cache, Storage, Data Service and Index Service holding shards; Couchbase Server 1 and Couchbase Server 6 also run the Query Service. Two Spark Workers communicate with the cluster]

Ecosystem Flexibility
[Diagram labels: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data]

Infrastructure Consolidation

The Connector

Couchbase Connector
• Spark Core
– Automatic Cluster and Resource Management
– Creating and Persisting RDDs
– Java APIs in addition to Scala
• Spark SQL
– Easy JSON handling and querying
– Tight N1QL Integration
• Spark Streaming
– Persisting DStreams
– DCP source (experimental)

Facts
• Current version: 1.0.0-beta
• Code: https://github.com/couchbaselabs/couchbase-spark-connector
• Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html

Connection Management

Creating RDDs

Persisting RDDs

Spark SQL Integration

Spark Streaming with DCP

What's next?

Couchbase Connector
• Learn More:
– Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
• 1.1.0 plans
– Upgrade to Spark 1.5
– Stabilize DCP Support
– Extend, Optimize, Fix bugs…
• We need your feedback!
