Couchbase and Apache Spark
Efficient data crunching in a fast-moving world
©2015 Couchbase Inc. 2
Matt Ingenthron
Worked on large site scalability problems at previous company…
memcached contributor
Joined Couchbase very early and helped define key parts of system
A Quick Architectural Introduction to Couchbase
Couchbase is a Document Oriented Database
High availability cache
Key-value store
Document database
Embedded database
Sync management
Couchbase can be used in a number of ways.
Developers often need a simple distributed hashtable, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.
What makes Couchbase unique?
Enterprises choose Couchbase for several key advantages
• Performance & scalability leader: sub-millisecond latency with high throughput; memory-centric architecture
• Multi-purpose: cache, key-value store, document database, and local/mobile database in a single platform
• Simplified administration: easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
• Always-on availability (24x365): data replication across nodes, clusters, and data centers
Multi-purpose database supports many uses
Tunable built-in cache
• Consolidated cache and database
• Tune memory required based on application requirements
Flexible schemas with JSON
• Represent data with varying schemas using JSON on the server or on the device
• Index and query data with JavaScript views
Couchbase Lite
• Lightweight embedded DB for always-available apps
• Sync Gateway syncs data seamlessly with Couchbase Server
Couchbase leads in performance and scalability
Auto Sharding
• No manual sharding
• Database manages data movement to scale out – not the user
Memory-to-memory XDCR
• Market's only memory-to-memory database replication across clusters and geos
• Provides disaster recovery / data locality
Single Node Type
• Hugely simplifies management of clusters
• Easy to scale clusters by adding any number of nodes
Couchbase delivers always-on availability (24x365)
High Availability
• In-memory replication with manual or automatic failover
• Rack-zone awareness to minimize data unavailability
Disaster Recovery
• Memory-to-memory cross-cluster replication across data centers or geos
• Active-active topology with bi-directional setup
Backup & Restore
• Full backup or incremental backup with online restore
• Delta node catch-ups for faster recovery after failures
Simplified administration for exceptional ease of use
Online upgrades and operations
• Online software, hardware and DB upgrades
• Indexing, compaction, rebalance, backup & restore
Built-in enterprise-class admin console
• Perform all administrative tasks with the click of a button
• Monitor status of the system visually at cluster level, database level, server level
RESTful APIs
• All admin operations available via UI, REST APIs or CLI commands
• Integrate third-party monitoring tools easily using REST
Couchbase Server Architecture
Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
– C/C++
– Layer consolidation of caching and persistence
• Cluster manager:
– Erlang/OTP
– Administration UIs
– Out-of-band for data requests
Write Operation
[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; DOC 1]
Single-node type means easier administration and scaling
• Writes are async by default
• Application gets acknowledgement when the write is successfully in RAM, and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is primary level of HA
• Disk written to as fast as possible – no waiting
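The write path above can be sketched as a toy model. This is an illustrative simulation only, not Couchbase code: the `Node` class, its queues, and the `persist_to` / `replicate_to` parameters are hypothetical names chosen to mirror the slide, and the queues are drained synchronously to keep the example small.

```python
from collections import deque


class Node:
    """Toy model of one node's write path (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}                   # managed cache (RAM)
        self.disk = {}                    # persisted documents
        self.disk_queue = deque()         # keys waiting to be written to disk
        self.replication_queue = deque()  # keys waiting to be sent to replicas
        self.replicas = []                # other Node objects holding replica copies

    def set(self, key, value, persist_to=0, replicate_to=0):
        """Store a document and acknowledge once it is in RAM.

        persist_to / replicate_to let the caller wait for more durability
        on this one write; both default to 0 (acknowledge from RAM).
        """
        if replicate_to > len(self.replicas):
            raise ValueError("not enough replica nodes configured")
        self.cache[key] = value
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        if replicate_to > 0:
            self.drain_replication()
        if persist_to > 0:
            self.drain_disk()
        return "ack"

    def drain_replication(self):
        """Copy queued documents into the RAM of every replica node."""
        while self.replication_queue:
            key = self.replication_queue.popleft()
            for replica in self.replicas:
                replica.cache[key] = self.cache[key]

    def drain_disk(self):
        """Write queued documents to disk."""
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]


node_a, node_b = Node("a"), Node("b")
node_a.replicas.append(node_b)

# Default write: acknowledged from RAM, disk and replica still pending.
node_a.set("doc1", {"n": 1})
pending = ("doc1" in node_a.cache, "doc1" in node_a.disk, "doc1" in node_b.cache)

# Durable write: the caller waits for one replica and for disk.
node_a.set("doc2", {"n": 2}, persist_to=1, replicate_to=1)
durable = ("doc2" in node_a.disk, "doc2" in node_b.cache)
```

The point of the sketch is the per-write trade-off: the default call returns as soon as the document is in RAM, while the second call pays extra latency to wait for replication and persistence.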
Basic Operation
[Diagram: three nodes (Couchbase Server 1, 2 and 3), each holding ACTIVE and REPLICA shards; shards 1 to 9 each appear twice across the cluster]
Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vbucket replicated 1, 2 or 3 times ("peer-to-peer" replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides location of which server a shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access ("read your own writes")
• A single Couchbase node can achieve 100k's of ops/sec, so no need to scale reads
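The client-side routing described above (hash the key to a shard, then look the shard up in the cluster map) can be sketched as follows. The CRC32-based hash and the count of 1024 shards are assumptions made for this sketch, and the round-robin `build_cluster_map` helper is hypothetical; a real cluster hands its map to the client.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed shard count for this sketch


def vbucket_for(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Hash a document key to a shard (vbucket) id using CRC32."""
    checksum = zlib.crc32(key.encode("utf-8"))
    return ((checksum >> 16) & 0x7FFF) % num_vbuckets


def build_cluster_map(servers, num_vbuckets=NUM_VBUCKETS):
    """Assign each vbucket an active server, round-robin (illustrative only)."""
    return [servers[vb % len(servers)] for vb in range(num_vbuckets)]


def server_for(key, cluster_map):
    """Look up which server holds the active copy for a key."""
    return cluster_map[vbucket_for(key, len(cluster_map))]


servers = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(servers)
first = server_for("user::1001", cluster_map)
second = server_for("user::1001", cluster_map)
```

Because the hash is deterministic, every operation on a given key lands on the same node, which is what makes "read your own writes" possible without coordination between nodes.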
Cache Ejection
[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; DOC 1 to DOC 5]
Single-node type means easier administration and scaling
• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes from RAM data that has already been persisted
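A minimal sketch of the ejection rule above: only documents that are already on disk may be dropped from RAM. The `ManagedCache` class, its document-count `capacity`, and the oldest-first ejection order are assumptions for illustration; the slide does not say which persisted documents are chosen first.

```python
class ManagedCache:
    """Toy cache that only ejects documents already written to disk."""

    def __init__(self, capacity):
        self.capacity = capacity  # max documents held in RAM (hypothetical limit)
        self.ram = {}             # insertion-ordered: oldest entries first
        self.disk = {}
        self.dirty = set()        # keys in RAM that are not yet persisted

    def set(self, key, value):
        self.ram.pop(key, None)   # re-insert so the key counts as recently written
        self.ram[key] = value
        self.dirty.add(key)
        self._eject_if_needed()

    def flush(self):
        """Persist every dirty document, making it eligible for ejection."""
        for key in list(self.dirty):
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject_if_needed()

    def _eject_if_needed(self):
        """Drop the oldest persisted documents until RAM is within capacity."""
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]


cache = ManagedCache(capacity=2)
for i in range(1, 6):
    cache.set(f"doc{i}", i)
before_flush = len(cache.ram)   # nothing persisted yet, so nothing can be ejected
cache.flush()
after_flush = sorted(cache.ram)
```

Before the flush all five documents stay in RAM even though the cache is over capacity, because ejecting an unpersisted document would lose data.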
Cache Miss
[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; GET DOC 1 while DOC 2 to DOC 5 are in the cache]
Single-node type means easier administration and scaling
• Layer consolidation means one single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
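The cache-miss behaviour can be sketched as a read-through lookup behind a single `get` call. `ReadThroughStore` and its counters are hypothetical names, and the sketch is single-threaded, so it does not model disk fetches running in parallel with RAM reads.

```python
class ReadThroughStore:
    """Toy single interface over RAM and disk (illustrative names only)."""

    def __init__(self, disk):
        self.ram = {}
        self.disk = dict(disk)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Serve from RAM when possible; on a miss, load from disk and cache it."""
        if key in self.ram:
            self.hits += 1
            return self.ram[key]
        self.misses += 1
        value = self.disk[key]   # raises KeyError if the document does not exist
        self.ram[key] = value    # later reads of this key are served from RAM
        return value


store = ReadThroughStore(disk={"doc1": {"title": "hello"}})
first_read = store.get("doc1")    # cache miss: fetched from disk
second_read = store.get("doc1")   # cache hit: served from RAM
```

The application calls the same method in both cases and never needs to know whether the document came from RAM or disk.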
Add Nodes to Cluster
[Diagram: five nodes (Couchbase Server 1 to 5), each holding ACTIVE and REPLICA shards; shards 1 to 9 spread across the nodes; application traffic labeled READ/WRITE/UPDATE]
Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
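The idea of incremental movement can be sketched as a function that evens out a shard map while moving as few shards as possible. This is a simplified illustration under stated assumptions: the `rebalance` helper is hypothetical, it handles active copies only, and it is not the algorithm the cluster manager actually uses.

```python
from collections import Counter


def rebalance(vbucket_map, nodes):
    """Return a new vbucket -> node map that is even across `nodes`.

    Only the vbuckets that must move are reassigned; everything else
    stays where it is, which is the 'incremental movement' idea.
    """
    nodes = list(nodes)
    base, extra = divmod(len(vbucket_map), len(nodes))
    # Target number of vbuckets per node; the first `extra` nodes take one more.
    target = {node: base + (1 if i < extra else 0) for i, node in enumerate(nodes)}

    new_map = list(vbucket_map)
    counts = Counter()
    to_move = []
    for vb, owner in enumerate(vbucket_map):
        if owner in target and counts[owner] < target[owner]:
            counts[owner] += 1        # this vbucket stays on its current node
        else:
            to_move.append(vb)        # owner removed or over its target

    for node in nodes:
        while counts[node] < target[node]:
            vb = to_move.pop()
            new_map[vb] = node
            counts[node] += 1
    return new_map


old_nodes = ["server1", "server2", "server3"]
old_map = [old_nodes[vb % 3] for vb in range(9)]   # 9 shards, 3 per node
new_map = rebalance(old_map, old_nodes + ["server4", "server5"])
moved = sum(1 for a, b in zip(old_map, new_map) if a != b)
per_node = Counter(new_map)
```

Going from three to five nodes with nine shards moves only three shards; the other six stay where they were, so most traffic is unaffected while the new map is published to clients.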
Node Unresponsive / Lost
Fail Over Node
[Diagram: five nodes (Couchbase Server 1 to 5), each holding ACTIVE and REPLICA shards; shards 1 to 9 spread across the nodes]
Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas not recreated to preserve stability
• Best practice to replace node and rebalance
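Failover as described above amounts to promoting replica copies to active without creating new replicas. A minimal sketch, assuming one replica per shard and using a hypothetical `fail_over` helper over plain dictionaries:

```python
def fail_over(active, replica, failed_node):
    """Promote replicas for every vbucket whose active copy was on `failed_node`.

    `active` and `replica` map vbucket id -> node name. New replicas are
    deliberately not created here; that is left to a later rebalance.
    """
    new_active = dict(active)
    new_replica = dict(replica)
    for vb, node in active.items():
        if node == failed_node:
            promoted = new_replica.pop(vb, None)
            if promoted is None or promoted == failed_node:
                raise RuntimeError(f"vbucket {vb} has no surviving replica")
            new_active[vb] = promoted
    # Replica copies that lived on the failed node are gone as well.
    for vb, node in list(new_replica.items()):
        if node == failed_node:
            del new_replica[vb]
    return new_active, new_replica


active = {1: "server1", 2: "server2", 3: "server3"}
replica = {1: "server2", 2: "server3", 3: "server1"}
new_active, new_replica = fail_over(active, replica, "server3")
```

After the failover every shard has an active copy again, but shards 2 and 3 no longer have a replica, which is why the slide recommends replacing the node and rebalancing.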
Demo
What about Hadoop?
Big Data = Operational + Analytic (NoSQL + Hadoop)
• Online
• Web/Mobile/IoT apps
• Millions of customers/consumers
• Offline
• Analytics apps
• Hundreds of business analysts
[Diagram labels: tracking and collection; analysis and visualization; batch track; real-time track; complex event processing; real-time repository; perpetual store; analytical DB; business intelligence; monitoring; chat/voice system; dashboard; REST; filter; metrics]
Apache Spark: The Big Picture
Apache Spark
… is a fast and general purpose engine for small and large scale data processing …
Components: Spark Core
Resilient Distributed Datasets
Clustering
Execution
Components: Spark SQL
Structured through DataFrames
Distributed querying with SQL
Components: Spark Streaming
Fault-tolerant streaming applications
Components: Spark MLlib
Built-in machine learning algorithms
Components: Spark GraphX
Graph processing and graph-parallel computations
How does it work?
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark Benefits
• Linearly scalable to 1000+ worker nodes
• Simpler to use than Hadoop MR
• Only partial recompute on failure
• For developers and data scientists
– machine learning
– R integration
• Tight but not mandatory Hadoop integration
– Sources, Sinks
– Scheduler
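"Only partial recompute on failure" comes from each dataset remembering how it was derived. The sketch below illustrates that idea in plain Python; the `Dataset` class and its methods are hypothetical and are not the Spark API. Unlike Spark, this toy computes results eagerly and keeps the parent data in memory.

```python
class Dataset:
    """Toy partitioned dataset that remembers how each partition was derived."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = list(partitions)  # materialized data, one list per partition
        self.parent = parent                # lineage: the dataset this one came from
        self.fn = fn                        # lineage: per-partition transformation
        self.recomputed = 0                 # how many partitions were rebuilt

    def map(self, f):
        """Apply f to every element, recording the step as lineage."""
        fn = lambda part: [f(x) for x in part]
        return Dataset([fn(p) for p in self.partitions], parent=self, fn=fn)

    def lose(self, index):
        """Simulate a worker failure that drops one partition."""
        self.partitions[index] = None

    def get(self, index):
        """Return a partition, rebuilding only that partition if it was lost."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self.partitions[index] = self.fn(self.parent.get(index))
            self.recomputed += 1
        return self.partitions[index]


source = Dataset([[1, 2], [3, 4], [5, 6]])
squares = source.map(lambda x: x * x)
squares.lose(1)                        # one partition disappears
recovered = squares.get(1)
untouched = squares.get(0)
```

Only the lost partition is rebuilt from its parent; the two surviving partitions are returned as they are.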
Spark vs Hadoop
• Spark is mainly RAM-bound while Hadoop is mainly HDFS (disk) bound
• Fully compatible with Hadoop Input/Output
• Easier to develop against thanks to functional composition
• Hadoop certainly more mature, but Spark ecosystem growing fast
Couchbase in the Spark Landscape
• Transparent generation and persistence of
– RDDs
– DataFrames
– DStreams
• Spark SQL and N1QL are a natural fit
• Linearly scale your data and application layer
• Share data between Spark applications
The perfect storage companion for your Spark applications.
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Cluster Communication
[Diagram: six nodes (Couchbase Server 1 to 6), each with a Cluster Manager, Managed Cache, Storage, and Data, Index and Query Services, shown alongside two Spark Workers]
Ecosystem Flexibility
[Diagram labels: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data]
Infrastructure Consolidation
The Connector
Couchbase Connector
• Spark Core
– Automatic Cluster and Resource Management
– Creating and Persisting RDDs
– Java APIs in addition to Scala
• Spark SQL
– Easy JSON handling and querying
– Tight N1QL Integration
• Spark Streaming
– Persisting DStreams
– DCP source (experimental)
Facts
• Current version: 1.0.0-beta
• Code: https://github.com/couchbaselabs/couchbase-spark-connector
• Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html
Connection Management
Creating RDDs
Persisting RDDs
Spark SQL Integration
Spark Streaming with DCP
What's next?
Couchbase Connector
• Learn more:
– Couchbase and Spark at Couchbase Connect 2015:
http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
• 1.1.0 plans
– Upgrade to Spark 1.5
– Stabilize DCP support
– Extend, optimize, fix bugs…
• We need your feedback!


Couchbase and Apache Spark

  • 1. Couchbase and Apache Spark efficient data crunching in a fast moving world
  • 2. ©2015 Couchbase Inc. 2 Matt Ingenthron Worked on large site scalability problems at previous company… memcached contributor Joined Couchbase very early and helped define key parts of system
  • 4. ©2015 Couchbase Inc. 4 Couchbase is a Document Oriented Database High availability cache Key-value store Document database Embedded database Sync management Couchbase can be used a number of ways. Developers often need a simple distributed hashtable, then grow to need secondary indexing and are either mobile-first or need to address mobile deployment.
  • 5. ©2015 Couchbase Inc. 5 What makes Couchbase unique? 5 Performance & scalability leader Sub millisecond latency with high throughput; memory-centric architecture Multi- purpose Simplified administration Easy to deploy & manage; integrated Admin Console, single- click cluster expansion & rebalance Cache, key value store, document database, and local/mobile database in single platform Always-on availability Data replication across nodes, clusters, and data centers Enterprises choose Couchbase for several key advantages 24x365
  • 6. ©2015 Couchbase Inc. 6  Consolidated cache and database  Tune memory required based on application requirements Multi-purpose database supports many uses 6 6 Tunable built-in cache Flexible schemas with JSON Couchbase Lite  Represent data with varying schemas using JSON on the server or on the device  Index and query data with Javascript views  Light weight embedded DB for always available apps  Sync Gateway syncs data seamlessly with Couchbase Server
  • 7. ©2015 Couchbase Inc. 7 Couchbase leads in performance and scalability Auto Sharding Memory-memory XDCR Single NodeType  No manual sharding  Database manages data movement to scale out – not the user  Market’s only memory-to- memory database replication across clusters and geos  Provides disaster recover / data locality  Hugely simplifies management of clusters  Easy to scale clusters by adding any number of nodes
  • 8. ©2015 Couchbase Inc. 8 24x365 Couchbase delivers always-on availability 8 High Availability Disaster Recovery Backup & Restore  In-memory replication with manual or automatic fail over  Rack-zone awareness to minimize data unavailability  Memory-to-memory cross cluster replication across data centers or geos  Active-active topology with bi- directional setup  Full backup or Incremental backup with online restore  Delta node catch-ups for faster recovery after failures
  • 9. ©2015 Couchbase Inc. 9 Simplified administration for exceptional ease of use Online upgrades and operations Built-in enterprise class admin console RestfulAPIs  Online software, hardware and DB upgrades  Indexing, compaction, rebalance, backup & restore  Perform all administrative tasks with the click of a button  Monitor status of the system visual at cluster level, database level, server level  All admin operations available via UI, REST APIs or CLI commands  Integrate third party monitoring tools easily using REST
  • 10. ©2015 Couchbase Inc. 10 Couchbase Server Architecture Single-node type means easier administration and scaling  Single installation  Two major components/processes: Data manager cluster manager  Data manager:  C/C++  Layer consolidation of caching and persistence  Cluster manager:  Erlang/OTP  Administration UI’s  Out-of-band for data requests
  • 11. ©2015 Couchbase Inc. 11 APPLICATION SERVER MANAGED CACHE DISK DISK QUEUE REPLICATION QUEUE Write Operation 11 DOC 1 DOC 1DOC 1 Single-node type means easier administration and scaling  Writes are async by default  Application gets acknowledgement when successfully in RAM and can trade- off waiting for replication or persistence per-write  Replication to 1, 2 or 3 other nodes  Replication is RAM-based so extremely fast  Off-node replication is primary level of HA  Disk written to as fast as possible – no waiting
  • 12. ©2015 Couchbase Inc. 12 ACTIVE ACTIVE ACTIVE REPLICA REPLICA REPLICA Couchbase Server 1 Couchbase Server 2 Couchbase Server 3 Basic Operation 12 SHARD 5 SHARD 2 SHARD 9 SHARD SHARD SHARD SHARD 4 SHARD 7 SHARD 8 SHARD SHARD SHARD SHARD 1 SHARD 3 SHARD 6 SHARD SHARD SHARD SHARD 4 SHARD 1 SHARD 8 SHARD SHARD SHARD SHARD 6 SHARD 3 SHARD 2 SHARD SHARD SHARD SHARD 7 SHARD 9 SHARD 5 SHARD SHARD SHARD Application has single logical connection to cluster (client object) • Data is automatically sharded resulting in even document data distribution across cluster • Each vbucket replicated 1, 2 or 3 times (“peer-to-peer” replication) • Docs are automatically hashed by the client to a shard • Cluster map provides location of which server a shard is on • Every read/write/update/delete goes to same node for a given key • Strongly consistent data access (“read your own writes”) • A single Couchbase node can achieve 100k’s ops/sec so no need to scale reads
  • 13. ©2015 Couchbase Inc. 13 Cache Ejection 13 APPLICATION SERVER MANAGED CACHE DISK DISK QUEUE REPLICATION QUEUE DOC 1 DOC 2DOC 3DOC 4DOC 5 DOC 1 DOC 2 DOC 3 DOC 4 DOC 5 Single-node type means easier administration and scaling  Layer consolidation means read through and write through cache  Couchbase automatically removes data that has already been persisted from RAM
  • 14. ©2015 Couchbase Inc. 14 APPLICATION SERVER MANAGED CACHE DISK DISK QUEUE REPLICATION QUEUE DOC 1 Cache Miss 14 DOC 2 DOC 3 DOC 4 DOC 5 DOC 2 DOC 3 DOC 4 DOC 5 GET DOC 1 DOC 1 DOC 1 Single-node type means easier administration and scaling  Layer consolidation means 1 single interface for App to talk to and get its data back as fast as possible  Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
  • 15. ©2015 Couchbase Inc. 15 Add Nodes to Cluster 15 ACTIVE ACTIVE ACTIVE REPLICA REPLICA REPLICA Couchbase Server 1 Couchbase Server 2 Couchbase Server 3 ACTIVE ACTIVE REPLICA REPLICA Couchbase Server 4 Couchbase Server 5 SHARD 5 SHARD 2 SHARD SHARD SHARD 4 SHARD SHARD SHARD 1 SHARD 3 SHARD SHARD SHARD 4 SHARD 1 SHARD 8 SHARD SHARD SHARD SHARD 6 SHARD 3 SHARD 2 SHARD SHARD SHARD SHARD 7 SHARD 9 SHARD 5 SHARD SHARD SHARD SHARD 7 SHARD SHARD 6 SHARD SHARD 8 SHARD 9 SHARD READ/WRITE/UPDATE Application has single logical connection to cluster (client object)  Multiple nodes added or removed at once  One-click operation  Incremental movement of active and replica vbuckets and data  Client library updated via cluster map  Fully online operation, no downtime or loss of performance
  • 16. ©2015 Couchbase Inc. 16 Node Unresponsive / Lost
  • 17. ©2015 Couchbase Inc. 17 Fail Over Node 17 ACTIVE ACTIVE ACTIVE REPLICA REPLICA REPLICA Couchbase Server 1 Couchbase Server 2 Couchbase Server 3 ACTIVE ACTIVE REPLICA REPLICA Couchbase Server 4 Couchbase Server 5 SHARD 5 SHARD 2 SHARD SHARD SHARD 4 SHARD SHARD SHARD 1 SHARD 3 SHARD SHARD SHARD 4 SHARD 1 SHARD 8 SHARD SHARD SHARDSHARD 6 SHARD 2 SHARD SHARD SHARD SHARD 7 SHARD 9 SHARD 5 SHARD SHARD SHARD SHARD 7 SHARD SHARD 6 SHARDSHARD 8 SHARD 9 SHARD SHARD 3 SHARD 1 SHARD 3 SHARD Application has single logical connection to cluster (client object)  When node goes down, some requests will fail  Failover is either automatic or manual`  Client library is automatically updated via cluster map  Replicas not recreated to preserve stability  Best practice to replace node and rebalance
  • 18. Demo
  • 20. ©2015 Couchbase Inc. 20 Big Data = Operational + Analytic (NoSQL + Hadoop) 20  Online  Web/Mobile/IoT apps  Millions of customers/consumers  Offline  Analytics apps  Hundreds of business analysts
  • 23. ©2015 Couchbase Inc. 23 Apache Spark:The Big Picture
  • 24. ©2015 Couchbase Inc. 24 Apache Spark … is a fast and general purpose engine for small and large scale data processing …
  • 25. ©2015 Couchbase Inc. 25 Components: Spark Core Resilient Distributed Datasets Clustering Execution
  • 26. ©2015 Couchbase Inc. 26 Components: Spark SQL Structured through DataFrames Distributed querying with SQL
  • 27. ©2015 Couchbase Inc. 27 Components: Spark Streaming Fault-tolerant streaming applications
  • 28. ©2015 Couchbase Inc. 28 Components: Spark MLib Built-In Machine Learning Algorithms
  • 29. ©2015 Couchbase Inc. 29 Components: Spark GraphX Graph processing and graph-parallel computations
  • 30. ©2015 Couchbase Inc. 30 How does it work? Source: http://spark.apache.org/docs/latest/cluster-overview.html
  • 31. ©2015 Couchbase Inc. 31 Spark Benefits  Linearly scalable to 1000+ worker nodes  Simpler to use than Hadoop MR  Only partial recompute on failure  For developers and data scientists – machine learning – R integration  Tight but not mandatory Hadoop integration – Sources, Sinks – Scheduler
  • 32. ©2015 Couchbase Inc. 32 Spark vs Hadoop  Spark is RAM while Hadoop is mainly HDFS (disk) bound  Fully compatible with Hadoop Input/Output  Easier to develop against thanks to functional composition  Hadoop certainly more mature, but Spark ecosystem growing fast
  • 33. ©2015 Couchbase Inc. 33 Couchbase in the Spark Landscape  Transparent generation and persistence of – RDDs – DataFrames – Dstreams  Spark SQL and N1QL are a natural fit  Linearly scale your data and application layer  Share data between SparkApplications The perfect storage companion for your spark applications. Source: http://spark.apache.org/docs/latest/cluster-overview.html
  • 34. ©2015 Couchbase Inc. 34 Cluster Communication STORAGE Couchbase Server 1 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service STORAGE Couchbase Server 2 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service STORAGE Couchbase Server 3 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service STORAGE Couchbase Server 4 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service STORAGE Couchbase Server 5 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service STORAGE Couchbase Server 6 SHARD 7 SHARD 9 SHARD 5 SHARDSHARDSHARD Managed Cache Cluster Manager Cluster Manager Managed Cache Storage Data Service Index Service Query Service Spark Worker Spark Worker
  • 35. ©2015 Couchbase Inc. 35 Ecosystem Flexibility RDBMS Streams Web APIs DCP KV N1QL Views Batching Data Archive OLTP Data
  • 36. ©2015 Couchbase Inc. 36 Infrastructure Consolidation
  • 37. ©2015 Couchbase Inc. 37 The Connector
  • 38. ©2015 Couchbase Inc. 38 Couchbase Connector  Spark Core – Automatic Cluster and Resource Management – Creating and Persisting RDDs – Java APIs in addition to Scala  Spark SQL – Easy JSON handling and querying – Tight N1QL Integration  Spark Streaming – Persisting DStreams – DCP source (experimental)
  • 39. ©2015 Couchbase Inc. 39 Facts  CurrentVersion: 1.0.0-beta  Code: https://github.com/couchbaselabs/couchbase-spark-connector  Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark -1.0/spark-intro.html
  • 40. Connection Management
  • 41. Connection Management
  • 42. Creating RDDs
  • 43. Persisting RDDs
  • 44. Spark SQL Integration
  • 45. Spark Streaming with DCP
  • 46. What's next?
  • 47. Couchbase Connector. Learn more: Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/ ; 1.1.0 plans: upgrade to Spark 1.5; stabilize DCP support; extend, optimize, fix bugs… We need your feedback!

Editor's Notes

  • #3: Slide 2 – About Me
  • #5: KEY POINT: COUCHBASE PROVIDES A SET OF MULTI-PURPOSE, CORE CAPABILITIES THAT SUPPORT A BROAD RANGE OF APPLICATIONS AND USE CASES, ALL IN A SINGLE DATA MANAGEMENT PLATFORM. Couchbase provides a set of technology capabilities to support a broad range of applications and use cases: High Availability Cache: Couchbase provides an integrated managed object cache, so you can start out using Couchbase as a high availability cache on top of your existing relational database. For example, you can use Couchbase as a session store in front of your relational database, if your relational DB is struggling to keep up with the load required for online interactive applications. Key-Value Store: Many customers start with Couchbase as a cache and then broaden their usage to other capabilities, like using Couchbase as a Key-Value Store for things like Profile Management. Document Database: From there, you can grow into using Couchbase as a Document Database, where you can do more with capabilities like indexing and Cross Data Center Replication. Embedded Database: Couchbase also provides an embedded database called Couchbase Lite. It’s a purpose-built database for the device, so you can build applications that are always available and always work, whether offline or online. Sync Management: Finally, as part of our solution for mobile applications, we provide Couchbase Sync Gateway, which automatically synchronizes data on the device with Couchbase Server in the cloud so your developer doesn’t have to write code to manage the complex sync process. Starting with cache and then expanding to other capabilities is often a good way to learn the technology and get comfortable with Couchbase for a wider set of use cases.
  • #6: Couchbase has emerged as a leading NoSQL provider for a number of reasons. Best in performance and scalability: We’ve engineered Couchbase from the ground up for high performance and scalability. Couchbase is designed to deliver sub-millisecond responsiveness with very high throughput for both reads and writes. We consistently outperform competitors like MongoDB and DataStax in multiple independent benchmarks. Our performance advantage is driven in large part by our memory-centric architecture, which includes an integrated managed object cache and stream-based replication. Broad use case support: We’re the only NoSQL provider that has consolidated distributed cache, key-value store, and a JSON-based document database in a single platform. This means customers can use Couchbase for a much broader range of applications. Integrated mobile solution: We’re the only vendor that provides an end-to-end NoSQL mobile solution, which allows customers to easily build mobile apps that run great online or offline. It includes a JSON database embedded on the device, along with a prebuilt syncing tier, so apps run great on the device even without a network connection. Data on the device auto-syncs with the backend server when a connection is available. Simplified administration: We’ve designed Couchbase to be exceptionally easy to deploy and manage. Features such as an integrated Admin Console and single-click cluster expansion and rebalance dramatically increase admin efficiency.
  • #11: Each Couchbase node is exactly the same. All nodes are broken down into two components: a data manager (on the left) and a cluster manager (on the right). It’s important to realize that these are separate processes within the system, specifically designed so that a node can continue serving its data even in the face of cluster problems like network disruption. The data manager is written in C and C++ and is responsible for the object caching layer, the persistence layer, and the querying engine. It is based on memcached and so provides a number of benefits: - The very low lock contention of memcached allows for extremely high throughput and low latencies, both to a small set of documents (or just one) and across millions of documents. - Being compatible with the memcached protocol means we are not only a drop-in replacement but also inherit support for automatic item expiration (TTL) and atomic increment. - We’ve increased the maximum object size to 20 MB, but still recommend keeping objects much smaller. - It supports binary objects as well as native JSON documents. - All of the metadata for the documents and their keys is kept in RAM at all times. While this does add a bit of overhead per item, it also allows for extremely fast “miss” speeds, which are critical to the operation of some applications: we don’t have to scan a disk to know when we don’t have some data. The cluster manager is based on Erlang/OTP, which was developed by Ericsson to deal with managing hundreds or even thousands of distributed telco switches. This component is responsible for configuration, administration, process monitoring, statistics gathering, and the UI and REST interface. Note that there is no data manipulation done through this interface.
  • #14: Now, as you fill up memory (click), some data that has already been written to disk will be ejected from RAM to make room for new data. (click) Couchbase supports holding much more data than you have RAM available. It’s important to size the RAM capacity appropriately for your working set: the portion of data your application is working with at any given point in time and needs very low latency, high throughput access to. In some applications this is the entire data set, in others it is much smaller. As RAM fills up, we use a “not recently used” algorithm to determine the best data to be ejected from cache.
  • #15: Should a read now come in for one of those documents that has been ejected (click), it is copied back from disk into RAM and sent back to the application. The document then remains in RAM as long as there is space and it is being accessed.
  • #21: KEY POINTS: BIG DATA IS NOT ONE THING – IT’S A COMBINATION OF OPERATIONAL (NOSQL) AND ANALYTICAL DATABASES. YOU NEED BOTH. COUCHBASE PROVIDES THE OPERATIONAL SOLUTION. Big data has two major pieces: operational and analytical. Operational is about: real time; online, interactive; customer/consumer facing; processing data at high velocity. Analytical is about: offline analytics; often batch oriented; takes time processing; directly touches relatively few users (business analysts). These two pieces together form “Big Data”. There’s some overlap: NoSQL can deliver some analytics, and Hadoop can deliver some operational capability. But in general, each technology is designed for a separate purpose. Couchbase fits on the operational side, Hadoop on the analytics side.
  • #22: The data generated by users is published to Apache Kafka. Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop. Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  • #23: The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop. It is combined with additional data, and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data as well as meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL and Hadoop to do so.
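
As a rough illustration of the cache behaviour described in the notes for slides 14 and 15 (persisted data is ejected from RAM when memory fills up, using a "not recently used" policy, and an ejected document is copied back from disk when it is read), here is a minimal Python sketch. It is a toy model and not Couchbase's implementation: the class `ManagedCache`, its method names, the capacity counted in documents rather than bytes, and the single reference bit per key are all assumptions made for illustration.

```python
class ManagedCache:
    """Toy model: a RAM cache in front of disk with not-recently-used ejection.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity  # max documents resident in RAM (assumed unit)
        self.ram = {}             # key -> document currently held in RAM
        self.disk = {}            # key -> document persisted to disk
        self.referenced = {}      # key -> NRU bit, set whenever the key is touched

    def persist(self):
        """Write every resident document to disk, making it eligible for ejection."""
        self.disk.update(self.ram)

    def age(self):
        """Periodic pass that clears all NRU bits."""
        for key in self.referenced:
            self.referenced[key] = False

    def set(self, key, value):
        if key not in self.ram:
            self._make_room()
        self.ram[key] = value
        self.referenced[key] = True

    def get(self, key):
        if key not in self.ram:
            if key not in self.disk:
                return None  # the document does not exist anywhere
            # The document was ejected earlier: copy it back from disk into RAM.
            self._make_room()
            self.ram[key] = self.disk[key]
        self.referenced[key] = True
        return self.ram[key]

    def _make_room(self):
        if len(self.ram) < self.capacity:
            return
        # Only documents whose current value is already on disk may be ejected.
        candidates = [k for k in self.ram if self.disk.get(k) == self.ram[k]]
        if not candidates:
            raise MemoryError("no persisted document available to eject")
        victim = next((k for k in candidates if not self.referenced[k]), None)
        if victim is None:
            # Every candidate was used recently: clear their bits, eject the first.
            for k in candidates:
                self.referenced[k] = False
            victim = candidates[0]
        del self.ram[victim]
        del self.referenced[victim]


cache = ManagedCache(capacity=2)
cache.set("a", "doc-a")
cache.set("b", "doc-b")
cache.persist()          # both documents are now on disk
cache.age()              # clear the NRU bits
cache.get("a")           # "a" is recently used again
cache.set("c", "doc-c")  # RAM is full, so the not-recently-used "b" is ejected
```

Reading "b" afterwards goes to disk and brings the document back into RAM, ejecting another persisted document if needed. The sketch refuses to eject documents that have not been persisted yet, which mirrors the note's wording that only data "already written to disk" is ejected.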