1 Atigeo Confidential
Lessons learned from embedding
Cassandra in the xPatterns Platform
Seattle Cassandra Users
April 2014
2 Atigeo Confidential
Agenda
• Cassandra use within xPatterns
• What we had to build
• Data model optimization
• Robust REST APIs
• Geo-Replication
• Demo: Export to NoSQL API
3 Atigeo Confidential
xPatterns
The Cloud-based, Big Data Analytics Platform
Benefits
Intelligent apps in man-days
Differentiators
End-to-End Big Data Platform
Cutting-Edge Intelligence
Real-time unsupervised analytics
Hybrid Intelligence System
Learning & Feedback Automated repair & inductive reasoning
Measurably, best-ever analytical performance
4 Atigeo Confidential
Tools / Roles
Data Scientist
connect: IaaS (INFRASTRUCTURE as a SERVICE)
Cooperative Distributed Inferencing (CDI)
Neural Network
Inference
Natural Language
Topic Modeling
Data Mining
Prediction
Optimization
Machine Learning
Relevance
Meta Learning
discover: AaaS (ANALYTICS as a SERVICE)
Dashboards
• 40+ report types
• Live dashboards
• Self-serve Studio
Visualization
• 2D & 3D Viewer
• Interactive explorer
• Search & Connect
Web Services
• Rich query language
• Add & edit content
act: SaaS (SOFTWARE as a SERVICE)
Admin Consoles
Data Integration Studio
Data Analyst
Application Engineer
Dashboard Studio
REST APIs
Experimentation Platform
Ad-Hoc Queries
Virtual Private Cloud
Hadoop NoSQL Search
Streaming Batch / ELT Federated
Interactive
Metadata
Processing
Framework
Labeling Tools
Extrapolation Platform
5 Atigeo Confidential
Provider Referral Network: An interactive big data visualization tool for investigating
upstream and downstream referral patterns among physicians, connecting physicians to
specialties and to other physicians’ practice details.
6 Atigeo Confidential
Cassandra multi DC ring – read latency
7 Atigeo Confidential
Cassandra multi DC ring – read latency
8 Atigeo Confidential
What we’d like to share tonight
• Export to NoSQL demo
• Data model optimization
o Publishing from HDFS/Hive/Shark to Cassandra
• Robust REST APIs
o Instrumentation
o Throttling & auto-retries
• Geo-Replication
o Cross-data-center replication, encryption & failover
• Lessons learned from Cassandra 0.6 through 2.0.6
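The "throttling & auto-retries" item can be illustrated with a small retry wrapper. This is only a sketch in plain Python, not the AOP-based mechanism used in xPatterns: the names `TransientStoreError` and `with_retries`, and all default values, are assumptions made for the example. It retries a call that fails with a transient error and waits longer after each failure, which also slows the caller down while the store is overloaded.

```python
import functools
import random
import time


class TransientStoreError(Exception):
    """Hypothetical stand-in for a timeout or overload error from the store."""


def with_retries(max_attempts=5, base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Retry a call on TransientStoreError, backing off exponentially with jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientStoreError:
                    if attempt == max_attempts:
                        raise  # give up and surface the error to the caller
                    # Double the wait after each failure, capped at max_delay.
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Jitter spreads out retries coming from many clients.
                    sleep(delay * random.uniform(0.5, 1.0))
        return wrapper
    return decorator
```

A reader or writer function would be wrapped with `@with_retries(max_attempts=4)`. The `sleep` parameter exists only so that the wait can be replaced in tests.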
9 Atigeo Confidential
10 Atigeo Confidential
11 Atigeo Confidential
VPC-to-VPC IPSEC Tunnel
12 Atigeo Confidential
Export to NoSQL API
• Datasets in the warehouse need to be exposed through high-throughput, low-latency, real-time
APIs. Each application requires extra processing on top of the core datasets, so additional
transformations are executed to build data marts inside the warehouse.
• The exporter tool builds an efficient data model and exports data from a Shark/Hive
table to a Cassandra column family through a custom Spark job with configurable
throughput (a configurable number of Spark processors writing against a Cassandra ring). An
instrumentation dashboard is embedded; logs, progress and instrumentation events are pushed
to the browser through SSE (server-sent events).
• Data modeling is driven by the read access patterns provided by an application engineer
building dashboards and visualizations: lookup key, columns (record fields to
read), paging, sorting, filtering.
• The end result of a job run is a REST API endpoint (instrumented, monitored, resilient,
geo-replicated) that uses the generated Cassandra data model and feeds the data to
the dashboards.
• A configuration API is provided for creating export jobs and executing them (ad hoc or
scheduled).
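To make "data modeling driven by read access patterns" concrete, here is a minimal sketch that turns an access pattern into a table definition. It is not the xPatterns exporter: the `AccessPattern` class, the `to_cql` function and the example names are hypothetical, every column is typed as `text` for brevity, and the layout is written in CQL3 for readability even though the slides do not say whether the exporter used CQL or Thrift. The lookup key becomes the partition key, and the sort column becomes a clustering column so that sorting and paging happen inside one row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AccessPattern:
    """Hypothetical description of how an API endpoint reads a dataset."""
    table: str
    lookup_key: str          # partition key: one wide row per lookup value
    sort_column: str         # clustering column: gives sorting and paging
    columns: List[str]       # remaining record fields returned to the caller
    descending: bool = False


def to_cql(p: AccessPattern) -> str:
    """Render a CQL3 table definition that serves the access pattern."""
    fields = [p.lookup_key, p.sort_column] + p.columns
    cols = ",\n".join(f"  {name} text" for name in fields)
    order = "DESC" if p.descending else "ASC"
    return (
        f"CREATE TABLE {p.table} (\n{cols},\n"
        f"  PRIMARY KEY (({p.lookup_key}), {p.sort_column})\n"
        f") WITH CLUSTERING ORDER BY ({p.sort_column} {order});"
    )


if __name__ == "__main__":
    pattern = AccessPattern(
        table="referrals",
        lookup_key="provider_id",
        sort_column="referred_id",
        columns=["specialty", "patient_count"],
    )
    print(to_cql(pattern))
```

Filtering on any other field would need either a second table keyed on that field or filtering in the API layer; the sketch leaves that choice open.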
13 Atigeo Confidential
14 Atigeo Confidential
Cassandra multi DC ring – write latency
15 Atigeo Confidential
Mesos/Spark cluster
16 Atigeo Confidential
Nagios monitoring
17 Atigeo Confidential
Lessons learned 0.6 - 2.0.6
• NTP: synchronize ALL clocks (servers and clients)
• Reduce the number of CFs (avoid OOM)
• Rows not too skinny and not too wide (avoid OOM)
o Less memory pressure during high-throughput writes
o Reduced network I/O: fewer rows, more column slices
o Key cache & bloom filter index sizes affect perf
o Efficient compaction, avoid hot spots
• Custom serialization and dynamic columns for maximum perf gain
• Do not drop CFs before emptying them (truncate/compact first)
• Monitoring, instrumentation, automatic restarts
• ConsistencyLevel: ONE is best … for our use cases
• Key cache, Snappy compression
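One common way to keep rows "not too skinny and not too wide" is to split a logical row into a fixed number of buckets. The sketch below shows that general technique and is not taken from the talk: the function names, the `entity:bucket` key format and the default of 16 buckets are assumptions. More buckets give narrower rows at the cost of more reads per lookup.

```python
import zlib


def bucketed_row_key(entity_id: str, item_id: str, buckets: int = 16) -> str:
    """Spread one logical wide row over a fixed number of physical rows."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(item_id.encode("utf-8")) % buckets
    return f"{entity_id}:{bucket}"


def all_row_keys(entity_id: str, buckets: int = 16) -> list:
    """Row keys a reader must fetch to reassemble the logical row."""
    return [f"{entity_id}:{b}" for b in range(buckets)]
```

A writer calls `bucketed_row_key` for every column it stores; a reader that needs the whole logical row fetches every key returned by `all_row_keys` and merges the results.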
18 Atigeo Confidential
Q & A
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this
presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided
after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.



Editor's Notes

  • #4: Introduce Atigeo. Why am I telling you about xPatterns?
  • #5: Do not repeat the last slide; jump to highlighting where Cassandra fits in.
  • #6: Referral Provider Network (RPN): one of the applications we built for our healthcare customer using the xPatterns APIs and tools on the new beyond BDAS infrastructure: ELT Pipeline, Export to NoSQL API. The dashboard for the RPN application was built using D3.js and Angular against the generic API published by the export tool. The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty and with computed aggregates such as patient counts, claim counts and total charged amounts. RPN is used both for fraud detection and for aiding a clinic buying decision, by following the busiest graph paths. The dataset behind the app consists of 6.5 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra). While we demo the graph building, we will also look at the Graphite instrumentation dashboard to analyze the runtime performance of the geo-replicated Cassandra read operations during the demo.
  • #7: Instrumentation dashboard showcasing the read latency measured during peak (40 ms average, 60 ms peak).
  • #8: Instrumentation dashboard showcasing the read latency measured after a few runs of a stress test, when the key cache and OS buffer cache hit rates are high (20 ms max; the spike indicates a slower node, maybe compacting?).
  • #10: Cassandra in xPatterns: real-time database for user-facing APIs and dashboard applications, system of record for real-time analytics use cases (Kafka/Storm/Cassandra), distributed in-memory cache store for configuration data, persistence store for user feedback in semantic search and dynamic ontology use cases (SolrCloud/Cassandra/Zookeeper).
  • #11: The physical architecture diagram for our largest customer deployment, demonstrating the enterprise-grade attributes of the platform: scalability, high availability, performance, resilience and manageability, while providing means for geo-failover (warehouse), geo-replication (real-time DB), data and system monitoring, instrumentation, backup & restore. Cassandra rings are DC-replicated across the EC2 east and west coast regions, with data between geo-replicas synchronized in real time through an IPsec tunnel (VPC-to-VPC). Geo-replicated APIs behind an AWS Route 53 DNS service (latency-based resource record sets) and ELBs ensure that user requests are served from the closest geographical location. Failure of an entire region (it happened to us during a big conference!) does not affect our availability and SLAs. User-facing dashboards are served from Cassandra (real-time store), with data being exported from a data warehouse (Shark/Hive) built on top of a Mesos-managed Spark/Hadoop cluster. Export jobs are instrumented and provide a throttling mechanism to control throughput. Export jobs run on the east coast only; data is synchronized in real time with the west coast ring. Generated APIs are automatically instrumented (Graphite) and monitored (Nagios).
  • #12: Security architecture for the VPC-to-VPC connection hosting the DC-replicated rings. Openswan is used on the VPN instances in the public subnets for the IPsec tunnel encryption: http://aws.amazon.com/articles/5472675506466066
  • #13: Datasets in the warehouse need to be exposed through high-throughput, low-latency, real-time APIs. Each application requires extra processing on top of the core datasets, so additional transformations are executed to build data marts inside the warehouse. Pre-optimization Shark/Hive queries are required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). The resulting data model is efficient for both read (dashboard/API) and write (export/updates) requests. The exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra column family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring). Data modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering. The data access pattern is used for automatically publishing a REST API that uses the underlying generated Cassandra data model and feeds the data to the dashboards. Execution logs behind workflows, progress reports and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers are used for synchronization).
  • #14: Same notes as #13.
  • #15: Instrumentation dashboard showcasing the write latency measured during the export to NoSQL job (7 ms max). Writes are performed against the east-coast DC and propagated to the west coast; however, the exposed JMX metric (Write.Latency.OneMinuteRate) does not reflect the propagation, so we need to build a new dashboard with different metrics!
  • #16: Mesos/Spark context (CoarseGrainedMode) with a fixed 120 cores spread out across 4 nodes
  • #17: Nagios monitoring for the geo-replicated, instrumented generated APIs. The APIs (readers) and the Spark executors (writers) have a retry mechanism (AOP aspects) that implements throttling when Cassandra is under siege.
  • #18: Lessons learned over the past 3 years of operating Cassandra rings at scale. Custom serialization of objects, instead of individually serializing column names/column values for object field names/field values, yields the most performance gains! Describe each tip in detail.
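The custom serialization tip in the last note can be illustrated by comparing the two encodings of one record. This is a sketch under assumptions, not the serializer used in xPatterns: the record fields, the JSON encoding of the per-field values and the fixed binary `LAYOUT` are all invented for the example. Storing one column per field repeats every field name with every record, while a single packed value stores only the data.

```python
import json
import struct

# Hypothetical record; the field names are illustrative only.
record = {
    "provider_id": 1234567,
    "patient_count": 812,
    "claim_count": 4021,
    "total_charged": 1593220.75,
}


def as_columns(rec):
    """One (name, value) column per field: names are stored with every record."""
    return [
        (name.encode("utf-8"), json.dumps(value).encode("utf-8"))
        for name, value in rec.items()
    ]


# Fixed big-endian layout: 64-bit id, two 32-bit counts, one 64-bit float.
LAYOUT = struct.Struct(">QIId")


def as_blob(rec):
    """Pack the whole record into one column value."""
    return LAYOUT.pack(
        rec["provider_id"],
        rec["patient_count"],
        rec["claim_count"],
        rec["total_charged"],
    )


def from_blob(blob):
    """Inverse of as_blob."""
    provider_id, patient_count, claim_count, total_charged = LAYOUT.unpack(blob)
    return {
        "provider_id": provider_id,
        "patient_count": patient_count,
        "claim_count": claim_count,
        "total_charged": total_charged,
    }


if __name__ == "__main__":
    per_field = sum(len(n) + len(v) for n, v in as_columns(record))
    print("bytes as separate columns:", per_field)
    print("bytes as one packed value:", len(as_blob(record)))
```

In Cassandra each column also carries its own metadata such as a timestamp, so the real gap is larger than the byte counts printed here. The trade-off is that a packed value can only be read and written as a whole.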