Vitalii Bondarenko
Eleks
Fast Data Platform
for Real-Time Analytics. Architecture And Approaches
Agenda
• Fast data vs Big Data
• Kafka overview
• Cassandra Internals and Programming
• Architectures and Approaches
• Lessons learned
Big Data Approach
RDBMS Approach
• Massive Parallel Processing (Scalability)
• In-memory DB (Streaming and
compressing)
• Column stores (BI)
Big Data Approach
• Hadoop (HDFS + MapReduce)
• SQL on HDFS
• Scalable NoSQL
• Batch issue
HDFS (Hadoop Distributed File System)
• Data is split into blocks and
distributed across the nodes
• Nodes are cheap
• Block size is 64 or 128 MB
• Replication
• Files are typically not updated
• Read data from the beginning
to the end
• Smaller number of larger files
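The block and replication bullets above can be turned into a small arithmetic sketch. The function name `hdfs_block_layout` and the sample sizes are illustrative assumptions, not part of the talk; the 128 MB block size and replication factor 3 are common defaults.

```python
import math

def hdfs_block_layout(file_size_mb: float, block_size_mb: int = 128, replication: int = 3):
    """Return (number of blocks, raw storage in MB) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partially filled last block only uses the space it needs,
    # so raw usage is simply file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# One 1000 MB file versus 1000 files of 1 MB each (hypothetical sizes).
one_large = hdfs_block_layout(1000)
many_small_blocks = sum(hdfs_block_layout(1)[0] for _ in range(1000))
print(one_large, many_small_blocks)
```

The same amount of data needs 8 blocks as one file and 1000 blocks as small files, which is why a smaller number of larger files is preferred.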
Lambda architecture
Batch & Stream Processing
• Batch layer
• Stores master dataset
• Compute arbitrary views
• Horizontally Scalable
• Speed layer (Streaming)
• Fast, incremental algorithms
• Batch layer eventually overrides speed
layer
• Serving layer
• Random access to batch views
• Updated by batch and Streaming layer
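A minimal sketch of how a serving layer might answer a query by combining a batch view with a speed-layer view. The page-count data and the `merge_views` helper are hypothetical; real Lambda deployments differ in how the two views are stored and reconciled.

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine precomputed batch counts with incremental speed-layer counts."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch view: computed from the master dataset up to the last batch run.
batch_view = {"page_a": 100, "page_b": 40}
# Speed view: only the events that arrived after that batch run.
speed_view = {"page_a": 3, "page_c": 1}

print(merge_views(batch_view, speed_view))
```

Once the next batch run has absorbed those events, the speed view is discarded, which is what "batch layer eventually overrides speed layer" refers to.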
Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
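"Windows on Streaming Layer" can be illustrated with a tumbling window, one of several window types. This is a plain-Python sketch with made-up events and a hypothetical `tumbling_window_counts` helper; it does not use any streaming framework's API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key).

    events: iterable of (timestamp_in_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Each event belongs to exactly one fixed-size, non-overlapping window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (12, "click"), (59, "order"), (60, "click"), (61, "click")]
print(tumbling_window_counts(events, 60))
```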
Azure Streaming Analytics
• Easy to use
• Scalable
• Connectivity
• SQL, UDF, Reference Data
Fast Data Platform
• Real-time processing
• In-memory analytics
[Diagram labels: Fog Computing / Service Bus; Kafka Cluster (raw data fast writing, scalable); Connectors (Source), Connectors (Sink); Stream Tasks; Hadoop Cluster (cheap data storage); BI Tools; Cassandra Cluster (write-heavy, Cassandra replication) with Spark on nodes 1…n; Stream Analytics; Raw Data; Analytics]
Apache Kafka
• From LinkedIn, open-sourced in 2011, Apache top-level project since 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• Messages are not removed when read; they expire by retention
• Streaming processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second
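"Parallelism on partitions" depends on each message key mapping to one partition. A rough sketch of that idea follows; the CRC32 hash is a stand-in chosen for this example (Kafka's Java client hashes keys with murmur2), and the key names are made up.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index."""
    # Stand-in hash (assumption): the real client uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions

def spread(keys, num_partitions):
    """Group keys by the partition they land on."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[choose_partition(key, num_partitions)].append(key)
    return partitions

keys = [f"machine-{i % 4}".encode() for i in range(12)]
for partition, assigned in spread(keys, 6).items():
    print(partition, len(assigned))
```

All messages with the same key land in the same partition, so per-key ordering is kept while different partitions are written and read in parallel.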
Brokers, Topics
• Distributed Service Bus
• Broker as virtual servers
• Topics as logical data storage
Writes and reads
• Append Only
• Commit log
• Consumer offset (from beginning)
• Consumed offsets are committed to the Kafka topic __consumer_offsets
• Commit after the data has been read
• Retention period: 7 days (default)
• Parallel read with consumer groups
• Confluent Platform
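The append-only log and per-group offsets can be modelled in a few lines. This is an in-memory toy, not the Kafka client API; `TopicPartition`, `poll`, and `commit` here are illustrative names for this sketch only.

```python
class TopicPartition:
    """Toy model of one partition: an append-only log plus committed offsets."""

    def __init__(self):
        self.log = []          # messages are only ever appended
        self.committed = {}    # consumer group id -> next offset to read

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1   # offset of the new message

    def poll(self, group_id, max_records=10):
        # A group with no committed offset starts from the beginning.
        start = self.committed.get(group_id, 0)
        return start, self.log[start:start + max_records]

    def commit(self, group_id, next_offset):
        self.committed[group_id] = next_offset


partition = TopicPartition()
for message in ["a", "b", "c"]:
    partition.append(message)

start, records = partition.poll("analytics", max_records=2)
partition.commit("analytics", start + len(records))
print(records, partition.poll("analytics")[1], partition.poll("audit")[1])
```

Reading does not remove anything from the log: the second group still sees every message, and only retention would eventually drop old entries.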
ML Cluster on Streaming Data
[Diagram: Enterprise Analytic Suite
Data sources: Clickstream, Shopping Behaviour, Orders, Purchases, Inventory, Routes, Catalog, Prices, Campaigns, Customers, Locations (Data Lake, Fast Data, ESB, DWH)
Service Bus: Kafka, Kafka Streams API, data connectors and mapping
Machine Learning Cluster on Kubernetes / OpenShift (Docker containers): Capacity Planning, Product Recommendation, Customer Segmentation, Price Recommendation, Route Optimization, Demand Prediction
Data destinations: Cost Optimization, Profit Maximization, Utility Maximization (Data Lake, Fast Data, ESB, DWH); Data Visualization Dashboard]
Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency
Nodes and distributions
• Data Centers and Racks, Gossip (each 1 sec)
• Distributed by Tokens from -2^63 to 2^63-1
• Hash from partition key. Murmur3
• Virtual Nodes
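A sketch of token-based distribution with virtual nodes. Murmur3 is not in the Python standard library, so this example substitutes a truncated MD5 digest as the hash (an assumption made only for illustration); node names and the vnode count are hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_for(value: str) -> int:
    """Hash a string to a signed 64-bit token."""
    # Stand-in for Murmur3: first 8 bytes of MD5, read as a signed integer.
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def build_ring(nodes, vnodes_per_node=8):
    """Give every node several tokens (virtual nodes) and sort them into a ring."""
    return sorted(
        (token_for(f"{node}#{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )

def owner(ring, partition_key: str) -> str:
    """Find the node whose token is the first one at or after the key's token."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]   # wrap around past the last token

ring = build_ring(["node1", "node2", "node3"])
print(owner(ring, "10.0.0.1:0"))
```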
Coordinator Nodes
• Replication Strategy (SimpleStrategy,
NetworkTopologyStrategy)
• Replication factor (usually 3)
• Consistency Levels (One, Two, Three, Any, All,
Quorum, Local_Quorum, Local_One…)
• Tunable consistency, strong and eventual
• Strong consistency when R + W > N (R = replicas read, W = replicas written, N = replication factor)
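The consistency rule can be checked with simple arithmetic. The helper names below are illustrative; the quorum formula (a majority of replicas) is the standard definition.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def is_strongly_consistent(reads: int, writes: int, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return reads + writes > replication_factor

n = 3
q = quorum(n)
print(q, is_strongly_consistent(q, q, n))   # QUORUM reads and QUORUM writes
print(is_strongly_consistent(1, 1, n))      # ONE / ONE is only eventually consistent
print(is_strongly_consistent(1, n, n))      # ONE reads with ALL writes
```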
Cassandra Objects
Column, which is a name/value pair
Row, which is a container for columns referenced
by a primary key
Table, which is a container for rows
Keyspace, which is a container for tables
Cluster, which is a container for keyspaces that
spans one or more nodes
CQL (Cassandra Query Language)
CREATE TABLE loads (
machine inet,
cpu int,
mtime timeuuid,
load float,
PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
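To make the primary key of the `loads` table concrete, the following sketch groups rows by the partition key `(machine, cpu)` and keeps each partition sorted by `mtime` descending. Integer timestamps stand in for `timeuuid` values, and the sample rows are invented for this example.

```python
from collections import defaultdict

def build_partitions(rows):
    """Group rows by partition key and sort each partition by clustering column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["machine"], row["cpu"])].append(row)
    for partition in partitions.values():
        # CLUSTERING ORDER BY (mtime DESC): newest measurement first.
        partition.sort(key=lambda r: r["mtime"], reverse=True)
    return dict(partitions)

rows = [
    {"machine": "10.0.0.1", "cpu": 0, "mtime": 100, "load": 0.4},
    {"machine": "10.0.0.1", "cpu": 0, "mtime": 200, "load": 0.9},
    {"machine": "10.0.0.2", "cpu": 1, "mtime": 150, "load": 0.1},
]

partitions = build_partitions(rows)
latest = partitions[("10.0.0.1", 0)][0]
print(latest["mtime"], latest["load"])
```

Reading the latest load for one machine and CPU touches a single partition and its first row, which is the access pattern this table design is built for.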
Internals
Memtable: in-memory write buffer, one per CQL table
Commit log: records every write so data can be restored after a crash
SSTables: immutable data files on disk
Key cache: caches the map of partition keys to their positions in SSTables
Row cache: speeds up read access
Hints: store write requests destined for failed nodes
Tombstones: markers for deleted rows
TTL: expires rows automatically
Updates are inserts, and inserts are updates (both are upserts)
Compaction: merges SSTables
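The write path described above can be sketched as a toy log-structured table. This is a simplification under stated assumptions: there is no commit log, newer data wins by position rather than by write timestamp, and tombstones are dropped at compaction without a grace period. The class `TinyLsmTable` is hypothetical.

```python
TOMBSTONE = object()   # marker written instead of physically deleting a row

class TinyLsmTable:
    def __init__(self, memtable_limit=2):
        self.memtable = {}       # in-memory buffer for recent writes
        self.sstables = []       # flushed, never-modified snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Insert and update are the same operation: an upsert.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write, of a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Check the newest data first: memtable, then SSTables newest to oldest.
        for source in [self.memtable] + self.sstables[::-1]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        merged = {}
        for sstable in self.sstables:   # later SSTables overwrite earlier ones
            merged.update(sstable)
        self.sstables = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]


table = TinyLsmTable()
table.write("a", 1)
table.write("b", 2)      # second write fills the memtable and triggers a flush
table.write("a", 10)
table.delete("b")        # flushes again: two SSTables now exist
print(table.read("a"), table.read("b"), len(table.sstables))
table.compact()
print(table.sstables)
```

Until compaction runs, the old value of "a" and the deleted row "b" still occupy space in the first SSTable, which is one reason disk usage can be surprising.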
CQL (Cassandra Query Language)
• Similar to SQL
• No Joins
• Keyspaces with replication factor
• Inserts vs Updates
• TTL: INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /* 24 hours */
• DELETE is an INSERT (it writes a tombstone)
• Ordering and filtering do not always work (always query by partition key)
/* Select Data within a range */
SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
Data Modeling: Query-First Design
Cassandra is not
A Data Lake
A Data Ocean
A Data Pond
A Data Warehouse
An In-memory Database
A Key-value store
A Magic database unicorn that farts rainbows
Cassandra in ML Cluster
[Diagram:
Producers: Crawl/Fetch apps on a Kubernetes Cluster, collecting unstructured data (text), structured data, and events from the Internet
Kafka Cluster with Confluent Schema Registry and Confluent connectors: Raw Data Stream, Structured Data Stream, Scored Data Stream
Kubernetes Cluster: Kafka Streams API transformation (transformation rules in Redis); Kafka Streams API ML scoring through a REST API (Flask) with trained models (Python, stored in Redis); classification, forecasting, clusterization
Data Center, Cassandra + Spark Cluster: unstructured and structured data, operational data, Spark SQL, dashboards (Pentaho), model training framework (Python, Anaconda, Jupyter)
Hadoop Cluster: historical data, HDFS, Spark, Hive
Scalable DWH: ML results, Redshift / Postgres-XL, BI tools (Power BI)
Enterprise integration: ESB (WSO2), ML results and scored data for enterprise applications]
Cluster and results
Kafka Cluster
Nodes: 6
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 100 GB
Topics: 3
Partitions: 6
Replication Factor: 3
Producers: 6
Average message size: 1 KB
440 000 messages / second
Cassandra Cluster
Nodes: 12
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 800 GB
Replication Factor: 3
Average write latency: 9 ms
Average read latency: 52 ms
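For a sense of scale, the Kafka figures above can be converted to bandwidth. This is back-of-the-envelope arithmetic under assumptions not stated in the talk: 1 KB is read as 1024 bytes, load is spread evenly across brokers, and every message is written to all three replicas.

```python
def ingest_mb_per_second(messages_per_second: int, message_size_bytes: int) -> float:
    """Producer-side throughput in MB/s (1 MB = 1024 * 1024 bytes)."""
    return messages_per_second * message_size_bytes / 1024**2

def write_mb_per_broker(ingest_mb_s: float, replication_factor: int, brokers: int) -> float:
    """Disk write rate per broker once replication is included (even spread assumed)."""
    return ingest_mb_s * replication_factor / brokers

ingest = ingest_mb_per_second(440_000, 1024)
per_broker = write_mb_per_broker(ingest, replication_factor=3, brokers=6)
print(f"{ingest:.1f} MB/s ingested, {per_broker:.1f} MB/s written per broker")
```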
Lessons learned
Design your DB carefully at the beginning, around the queries
Cassandra is not an RDBMS; select by partition keys
Deep understanding of internals
Compaction is Hell.
Eventual consistency.
Where is my Disk Space?
Very Expensive! (lots of nodes)
www.eleks.com
Q&A
Vitalii Bondarenko
vitaliy.bondarenko@eleks.com
