Mladen Kovacevic, Senior Solutions Architect
Cloudera Inc.
Storage Engine
Considerations for your
Apache Spark Applications
#EUdev10
Outline
• Motivation – store your data – where exactly?
• Storage Capabilities:
– HDFS
– HBase
– Kudu
– Solr
• Asking the right questions
• Decide on the right storage solution
Motivation
• Spark, SparkStreaming, SparkSQL – great for
processing – need a place to store content
• Integration with a variety of storage systems
• Ingest and consumption requirements – use
case!
Design patterns
Choosing the right storage
for the use case
(Figure: year labels 2006, 2007, 2016, 2008)
HDFS
• Distributed file system – cheap, scalable storage
• Immutable – “record” changes are painful
• Columnar file formats - ideal for analytics
• SQL overlays (SparkSQL, Hive Metastore, more) to
define schema
Highlights
Very high throughput, painful random IO, batch oriented,
coding overhead (i.e. dealing with small-files problems), any
file
HDFS design pattern
df.write.parquet("/data/person_table")
• Small files accumulate
• External processes, or additional
application logic to manage these files
• Partition management
• Manage metadata carefully (depends
on ecosystem)
• Considerations: changing dimensions
(fast/slow)
• Late arriving data
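The small-files point can be made concrete with a little arithmetic. The sketch below is an illustration only, not part of the talk: it assumes a target of roughly 128 MiB per output file (a common HDFS block size, chosen here as an assumption) and computes how many files a write should produce. The names `CompactionPlan` and `targetFileCount` are hypothetical.

```scala
object CompactionPlan {
  // Assumption: aim for about one 128 MiB file per HDFS block.
  val DefaultTargetBytes: Long = 128L * 1024 * 1024

  /** Number of output files so that each holds at most `targetBytes`; never less than 1. */
  def targetFileCount(totalBytes: Long, targetBytes: Long = DefaultTargetBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Ceiling division using integer arithmetic only.
    val files = (totalBytes + targetBytes - 1) / targetBytes
    math.max(1L, files).toInt
  }
}
```

In a Spark job the result could be passed to `df.coalesce(n)` or `df.repartition(n)` before `df.write.parquet(...)`, so that each batch lands as a few large files instead of many small ones.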
HBase
• NoSQL engine, manages files on HDFS
• Key-value, distributed storage engine
• No data types – just binary fields
• Thousands to millions of columns
• Store entity data (profiles of people, devices, accounts)
Highlights
Very fast random IO, low throughput, NRT oriented,
challenging BI, no strict data types
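Because HBase stores only bytes, the application owns the encoding and decoding of every field. The sketch below shows one possible convention using only the JDK (UTF-8 strings, 8-byte big-endian longs); the object name `CellCodec` is hypothetical, and the HBase client also ships its own `Bytes` utility for the same purpose.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object CellCodec {
  private val LongBytes = 8

  def encodeString(s: String): Array[Byte] = s.getBytes(UTF_8)
  def decodeString(b: Array[Byte]): String = new String(b, UTF_8)

  // ByteBuffer writes big-endian by default, so the low-order byte comes last.
  def encodeLong(v: Long): Array[Byte] =
    ByteBuffer.allocate(LongBytes).putLong(v).array()

  def decodeLong(b: Array[Byte]): Long = {
    require(b.length == LongBytes, s"expected $LongBytes bytes, got ${b.length}")
    ByteBuffer.wrap(b).getLong
  }
}
```

Readers and writers have to agree on this convention out of band, which is one reason BI tools find untyped HBase tables challenging.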
HBase design pattern
• HBase Connection anywhere in
Spark/SparkStreaming app
• SparkSQL/DataFrames, Bulk Load
• Primary storage for ingestion, or
complementary storage for preserving state
• NoSQL store vs. structured
• Near-real-time
CDH hbase-spark: https://github.com/cloudera/hbase/tree/cdh5-1.2.0_5.13.0/hbase-spark
CDH HBase and Spark docs: http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.13.0/book.html#spark
Upstream hbase-spark (watch for updates in HBase 2.x release): https://github.com/apache/hbase/tree/master/hbase-spark/
Upstream HBase and Spark book: http://hbase.apache.org/book.html#spark
Analytic Gap
Kudu
• Storage system for tables of structured data
• Bring-your-own-SQL (SparkSQL, Impala), NoSQL-like
API, integration with Spark, MapReduce, and more
• Columnar, key partitioning by range and/or hash
• Limited number of columns (strongly typed)
Highlights
Fast random IO, fast throughput, NRT oriented, terrific for
BI, structured data
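To illustrate what "key partitioning by range and/or hash" means, the sketch below maps a row to a pair (hash bucket, range index). It is a conceptual model only: the hash used here is the JVM's `String.hashCode`, not Kudu's actual hash function, and the names `PartitionSketch`, `hashBucket`, `rangeIndex` and `tabletFor` are hypothetical.

```scala
object PartitionSketch {
  /** Bucket in [0, buckets) for a key. Illustrative hash only, not Kudu's. */
  def hashBucket(key: String, buckets: Int): Int = {
    require(buckets > 0, "buckets must be positive")
    java.lang.Math.floorMod(key.hashCode, buckets)
  }

  /** Index of the range partition holding `value`, given sorted exclusive upper bounds.
    * Values at or above the last bound fall into a final unbounded partition. */
  def rangeIndex(value: Long, upperBounds: Seq[Long]): Int = {
    val i = upperBounds.indexWhere(value < _)
    if (i >= 0) i else upperBounds.length
  }

  /** Combined placement: hash on the entity id, range on the event time. */
  def tabletFor(deviceId: String, eventTime: Long,
                buckets: Int, upperBounds: Seq[Long]): (Int, Int) =
    (hashBucket(deviceId, buckets), rangeIndex(eventTime, upperBounds))
}
```

Hashing on the entity key spreads concurrent writes across buckets, while the range component keeps time-bounded scans confined to a few partitions.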
Kudu design pattern
df.write.options(kuduOptions).mode("append").kudu
OR
kuduContext.insertRows()
• DataFrame perfect match for Kudu
(structured)
• Data available immediately to SQL
engines (Impala, SparkSQL)
• Ideal case is append with moderate
updates
Kudu Integration with Spark: http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
Up and running with Apache Spark on Apache Kudu: https://blog.cloudera.com/blog/2017/02/up-and-running-with-apache-spark-on-apache-kudu/
Analytic Gap Filled
Solr
• Distributed index enabling search capabilities (Lucene)
• Typed, REST API based, search index query processing
• Search interface, faceting, integration with HBase storing
content (typically) in HDFS
Highlights
High random IO, low throughput, multi-faceted use cases,
NRT oriented, terrific for BI with the right tools (non-SQL),
loose schema, data types
Solr design pattern
• Prepare Solr document, add to
SolrCloud directly OR
• Write to HBase, leverage Lily
HBase Indexer service to update
Solr
• Store complete record in HBase,
while indexed fields for search in
Solr
• NRT availability (short soft
commits)
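The "complete record in HBase, indexed fields in Solr" pattern can be sketched as a simple split of each record. This is an in-memory illustration under assumptions, not the Lily HBase Indexer's behaviour: records are modelled as `Map[String, String]`, and `RecordSplit`, `idField` and `indexedFields` are hypothetical names.

```scala
object RecordSplit {
  /** Returns (full record for the primary store, smaller document for the search index).
    * The id field is always kept in the search document so a hit can be
    * resolved back to the full record. */
  def split(record: Map[String, String],
            idField: String,
            indexedFields: Set[String]): (Map[String, String], Map[String, String]) = {
    require(record.contains(idField), s"record is missing id field '$idField'")
    val keep = indexedFields + idField
    val searchDoc = record.filter { case (field, _) => keep.contains(field) }
    (record, searchDoc)
  }
}
```

Keeping the search document small limits index size, while the full record stays retrievable by key from the primary store.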
Questions we ask (1)
• How many voters have cast their ballots by city
thus far in the election, by the second?
– streaming data into ‘voter’ table, aggregate query,
immediate data availability : Kudu
• How many people watched last night's game
compared to the night before?
– daily batch, aggregate query : HDFS parquet
Questions we ask (2)
• What version is my device running and how
many dropped packets do I have?
– streaming entity profile data, metrics may change per
release, many updates, specific device, NRT: HBase
• Which tweets talk about the housing market, in the
21-30 age group?
– streaming, keyword search, facet filtering : Solr
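For readers new to faceting, the tweet question boils down to "filter by keyword, then count per category". The sketch below shows that shape over an in-memory collection; it is not Solr code, and `FacetSketch`, `Tweet` and `facetCounts` are hypothetical names. A search engine answers the same question from its index rather than by scanning.

```scala
object FacetSketch {
  final case class Tweet(text: String, ageGroup: String)

  /** Counts tweets containing `keyword` (case-insensitive), grouped by age group. */
  def facetCounts(tweets: Seq[Tweet], keyword: String): Map[String, Int] = {
    val needle = keyword.toLowerCase
    tweets
      .filter(_.text.toLowerCase.contains(needle))
      .groupBy(_.ageGroup)
      .map { case (group, matches) => group -> matches.size }
  }
}
```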
Use case questionnaire
• Consumption interface: SQL (JDBC/ODBC) vs. API
• Near-real-time requirement for consumers
• Ingestion rate (can we keep up?)
• Entity vs. Events (time-based)
• Append-only vs moderate updates vs many updates
• Distinct values in dataset
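The questionnaire can be read as a rough decision procedure. The sketch below encodes only the four worked examples from the "Questions we ask" slides; it is a deliberate simplification, the rule order is an assumption, and `StorageChoice`, `UseCase` and `suggest` are hypothetical names. Real decisions also weigh ingestion rate, distinct values, security and cost.

```scala
object StorageChoice {
  sealed trait Engine
  case object HdfsParquet extends Engine
  case object HBase extends Engine
  case object Kudu extends Engine
  case object Solr extends Engine

  final case class UseCase(
    needsKeywordSearch: Boolean, // keyword search or facet filtering
    nearRealTime: Boolean,       // consumers need data shortly after ingest
    manyUpdates: Boolean,        // entity data that is mostly updated
    aggregateQueries: Boolean    // SQL-style aggregates over many rows
  )

  /** First matching rule wins; the ordering is an assumption of this sketch. */
  def suggest(u: UseCase): Engine =
    if (u.needsKeywordSearch) Solr
    else if (u.nearRealTime && u.manyUpdates && !u.aggregateQueries) HBase
    else if (u.nearRealTime) Kudu
    else HdfsParquet
}
```

For example, the voter-count question (streaming, aggregate query, immediate availability) maps to `Kudu`, while the daily batch comparison maps to `HdfsParquet`.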
Storage considerations (1)
Criteria
SQL interface
API interface
Near-real-time ingestion
Append-only + available for query
Appends with moderate updates
Mostly updates
Storage considerations (2)
Criteria
Entity based data
Event based data (time-series)
High distinct values
Many and unknown attributes
Binary data (Images, PDFs, etc)
Analytics
Wrap-up
• Review entire use-case end-to-end early
• Understand storage capabilities
• Ask the right questions (upstream/consumers)
• Consider security, architecture and development
costs
• Decide on the right storage solution
