DATA ARCHITECTURE &
ROAD MAP
NEXT GENERATION DATA PLATFORM AND ARCHITECTURAL PATTERNS
BY SUDHEER KONDLA
SENIOR DATA PLATFORM ARCHITECT
OVERVIEW
• Define the problem
• Understand the problem
• Articulate the problem
• Craft a solution
DATA ARCHITECTURE SOLUTION
• To solve a real-time, high-volume data problem with low-latency response times, we need a data
platform capable of capturing, ingesting and streaming data, and optionally storing it for batch
analytics. In most real-time streaming platforms the data is short-lived: it is processed to build the
predictive models that let marketing offer real-time recommendations, and can then be discarded.
The following characteristics are expected:
• Fast data
  • Requires fast ingestion
  • Real-time analytics
  • Fast action
  • Time to value
• Benefits
  • Capture and use (or discard – time to live or purge)
  • Insights in real or near-real time
  • Agile and responsive
  • Expressive
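The "capture and use (or discard – time to live or purge)" idea can be illustrated with a small in-memory store. This is a minimal sketch under stated assumptions, not part of the original deck: the `TTLStore` class, its method names and the injectable `clock` parameter are all hypothetical, and a production platform would rely on the TTL features of its data store instead.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Hypothetical in-memory key/value store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Store the value together with the time at which it expires.
        self._items[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Lazily discard an expired entry when it is read.
            del self._items[key]
            return None
        return value

    def purge(self) -> int:
        """Drop every expired entry and return how many were removed."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._items.items() if now >= expires_at]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Expired entries are dropped lazily on `get` and in bulk by `purge`, which mirrors the two options named in the bullet: let data time out, or purge it explicitly.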
ECOSYSTEM & INFRASTRUCTURE
• Multiple data sources
  • Web/app logs, Twitter (trending) and other social media, blogs, SOR (internal systems), HDFS
• Ingestion/streaming
  • Apache Flume (log capture/aggregation), Kafka (event streaming, data pipelines and messaging)
• Stream analytics
  • Spark/Storm API
• Data store/persistence
  • HDFS, Cassandra, S3, Hive
• Infrastructure
  • IaaS (cloud), on-premises, or hybrid private cloud
• Orchestration
  • Mesos
STREAM DATA ANALYTICS DATA FLOW
REAL-TIME DATA PIPELINES
Real-time data pipeline (Kafka → Spark → Cassandra):
• Kafka: collect data into Kafka (channel data)
• Spark: process micro-batches (aggregate, predict and act)
• Cassandra: persist data for later use (historical, analytics)
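The collect, process and persist stages of this pipeline can be sketched in plain Python. This is an illustrative simulation only, under stated assumptions: a `deque` stands in for the Kafka channel, a list stands in for the Cassandra table, and the function names (`micro_batches`, `aggregate`, `run_pipeline`) and the `page` field are hypothetical. Batches here are cut by size for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterable, Iterator, List

Event = Dict[str, str]


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group an event stream into fixed-size micro-batches."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def aggregate(batch: List[Event]) -> Counter:
    """Count page views per page within one micro-batch."""
    return Counter(event["page"] for event in batch)


def run_pipeline(channel: Deque[Event], batch_size: int) -> List[Dict[str, int]]:
    """Drain the channel, aggregate each micro-batch, and persist the results."""
    store: List[Dict[str, int]] = []  # stand-in for the persistence layer

    def drain() -> Iterator[Event]:
        while channel:
            yield channel.popleft()

    for batch in micro_batches(drain(), batch_size):
        store.append(dict(aggregate(batch)))
    return store
```

Calling `run_pipeline(channel, batch_size=2)` on a channel of page-view events returns one aggregated record per micro-batch, which is the shape of data the persistence stage would keep for historical analytics.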
DATA GOVERNANCE & DATA LIFE CYCLE
CHOOSING THE RIGHT ECOSYSTEM
• Kafka
  • What it is: a distributed pub-sub messaging and data pipeline system
  • Designed for processing real-time activity streams (logs, metrics)
  • When to use: real-time decision making, working with continuous streams of data
  • Why Kafka: persistent messaging, high throughput, fault tolerance
• Spark
  • What it is: a distributed computing framework that scales and integrates real-time data from many
    event streams (Kafka, Flume, HDFS, S3, Twitter and other sources)
  • Event-driven, asynchronous, scalable, type-safe and fault tolerant
  • Where it fits: when you need real-time decision making (recommendation, fraud detection,
    real-time forecasting)
  • Why Spark Streaming
    • Provides high throughput and reliable processing of live data streams
    • Batch, iterative and streaming workloads on the same platform
    • A good fit for machine learning
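The "persistent messaging" point made for Kafka above can be pictured as an append-only log in which records stay available after being read and each consumer group tracks its own position. The following is a toy model under stated assumptions, not Kafka's client API: the `TopicLog` class and its `publish` and `poll` methods are hypothetical names chosen for illustration.

```python
from typing import Dict, List


class TopicLog:
    """Toy append-only log with an independent read offset per consumer group."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, record: str) -> int:
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        """Return the next unread records for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch
```

Because reading does not remove records, a second consumer group (for example a batch analytics job) can replay the same stream from the beginning while the real-time consumer continues from its own offset.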
CHOOSING THE RIGHT ECOSYSTEM
• Cassandra
  • What it is: a distributed database with high availability (multi-master, high write throughput)
  • When to use: scaling out, data needed in multiple data centers (geo-locations), always-on
    availability and fast response times
  • Why Cassandra: easy to scale out, high throughput, continuous availability, no single points of
    failure (SPOFs). Easy to integrate with Spark; supports Spark Streaming and Solr search.
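The "no SPOFs" and "easy to scale out" claims for Cassandra rest on spreading each key across several nodes arranged on a hash ring. The sketch below is a simplified illustration under stated assumptions: the `Ring` class and `token` helper are hypothetical, MD5 is used only to get a deterministic hash, and real clusters add virtual nodes, data-center-aware placement and tunable consistency, none of which are modelled here.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Deterministic position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring that assigns each key to replication_factor distinct nodes."""

    def __init__(self, nodes: List[str], replication_factor: int) -> None:
        self.replication_factor = replication_factor
        self._ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key: str) -> List[str]:
        """Walk clockwise from the key's token and take the next nodes as replicas."""
        tokens = [t for t, _ in self._ring]
        start = bisect.bisect_right(tokens, token(key))
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def remove(self, node: str) -> None:
        """Simulate a node failure by taking the node off the ring."""
        self._ring = [(t, n) for t, n in self._ring if n != node]
```

With four nodes and a replication factor of three, removing any one replica of a key still leaves two of the original replicas holding it, which is the intuition behind continuous availability without a single master.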
Q & A
• Questions?
