Arvind Heda, Kapil Malik
Indicium: Interactive
Querying at Scale
#EUeco9
What’s in the session …
• Unified Data Platform on Spark
– Single data source for all scheduled / ad hoc jobs and interactive
lookup / queries
– Data Pipeline
– Compute Layer
– Interactive Queries?
• Indicium: Part 1 (managed context pool)
• Indicium: Part 2 (smart query scheduler)
Unified Data Platform
Unified Data Platform…(for anything / everything)
• Common Data Lake for storing
– Transactional data
– Behavioral data
– Computed data
• Drives all decisions / recommendations / reporting / analysis from the
same store.
• Single data source for all Decision Edges, Algorithms, BI tools and
Ad Hoc and interactive Query / Analysis tools
• Data Platform needs to support
– Scale – Store everything from summary to raw data.
– Concurrency – Handle multiple requests in acceptable user response time.
– Ad Hoc Drill down to any level – query, join, correlation on any dimension.
Unified Data Platform
[Architecture diagram. Labels shown: Query UI; Interactive Query; two Spark Contexts (Yarn); Scheduled Jobs; Compute jobs; BI scheduled reports; Data Collection service; Real time lookup; HDFS / S3. Layers: Compute Layer, Data Pipeline.]
Features
• Data Persistence
– Details: Store large data volume of Txn, Behavioural and Computed data
– Approach: Spark – Parquet format on S3 / HDFS
• Data Transformations
– Details: Transformation / Aggregation – correlations and enrichments
– Approach: Batch Processing – Kafka / Java / Spark Jobs
• Algorithmic Access
– Details: Aggregated / Raw Data Access for scheduled Algorithms
– Approach: Spark Processes with SQL Context based data access
• Decision Making
– Details: Aggregated Data Access for decision in real time
– Approach: In memory cache of aggregated data
• Reporting BI / Ad Hoc Query
– Details: Aggregated / Raw Data Access for scheduled reports (BI); Aggregated / Raw Data Access for Ad Hoc Queries
– Approach: BI tool with defined scheduled Spark SQL queries on Data store
• Interactive Queries
– Details: Drill down data access on BI tools for concurrent users; Ad hoc Query / Analysis on data for concurrent users
– Approach: Scaling challenges for Spark SQL?
Data Pipeline
• Kafka / Sqoop based data collection
• Live lookup store for real time decisions
• Tenant / Event and time based data partition
• Time based compaction to optimize query on sparse data
• Summary Profile data to reduce Joins
• Shared compute resources but different context for Scheduled / Ad
Hoc jobs or for Algorithmic / Human touchpoints
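The tenant / event / time partitioning above can be pictured as a directory layout that lets a query touch only the partitions it needs. The following is a minimal sketch in plain Python; the `tenant=/event=/date=` naming, the `s3://datalake` root and both function names are assumptions made for illustration, not the layout described in the talk.

```python
from datetime import date, timedelta

# Hypothetical root of the data lake; the real location is not given in the talk.
ROOT = "s3://datalake"

def partition_path(tenant: str, event: str, day: date) -> str:
    """Build the directory for one tenant / event type / day partition."""
    return f"{ROOT}/tenant={tenant}/event={event}/date={day.isoformat()}"

def partitions_for_range(tenant: str, event: str, start: date, end: date) -> list[str]:
    """List the daily partitions a query over [start, end] has to scan."""
    days = (end - start).days
    return [partition_path(tenant, event, start + timedelta(days=i))
            for i in range(days + 1)]

paths = partitions_for_range("acme", "page_view", date(2017, 10, 1), date(2017, 10, 3))
for p in paths:
    print(p)
```

A query restricted to three days of one event type for one tenant would read three directories here, which is the kind of pruning the partitioning scheme is meant to enable.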
Compute Layer
• No real ‘real time’ queries -- FIFO scheduling for user
tasks
• Static or rigid resource allocation between scheduled
and ad hoc queries / jobs
• Short-lived and stateless contexts - no stickiness for user-
defined views like temp tables.
• Interactive queries ?
What was needed for Interactive query…
• SQL like Query Tool for Ad Hoc Analysis.
• Scalability for concurrent users
– Fair Scheduling
– Responsiveness
• High Availability
• Performance – specifically for scans and Joins
• Extensibility – User Views / Datasets / UDFs
Indicium ?
Indicium: Part 1
Managed Context Pool
Managed Context Pool
[Architecture diagram. Labels shown: Apache Zeppelin; Spark Job-server; SQL Context (Yarn); HDFS (three nodes).]
Managed Context Pool
Apache Zeppelin 0.6
• SQL like Query tool and a notebook
• Custom interpreter
– Configuration: Spark Job-Server (SJS) server + context
– Statement execution: Make asynchronous REST calls to SJS
• Concurrency - Multiple interpreters and notebooks
Spark Job-Server 0.6.x
• Custom SQL context with catalog override
• Custom application to execute queries
• High Availability: Multiple SJS servers and multiple contexts per server
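The asynchronous statement execution described above amounts to a submit-then-poll loop against one SJS server and context. The sketch below shows that loop in plain Python. The `/jobs` endpoints and the `appName`, `classPath` and `context` parameters follow the public spark-jobserver README; the response field names, the status values, the `sql` config key, the class path and the injected `http` transport are assumptions, since response shapes differ between SJS versions.

```python
import json
import time
from urllib.parse import urlencode

def submit_url(base, app, class_path, context):
    """URL for an asynchronous job submission (POST) to one SJS server + context."""
    query = urlencode({"appName": app, "classPath": class_path, "context": context})
    return f"{base}/jobs?{query}"

def status_url(base, job_id):
    """URL for polling the state of a previously submitted job (GET)."""
    return f"{base}/jobs/{job_id}"

def run_statement(http, base, app, class_path, context, sql, poll_secs=0.0, max_polls=100):
    """Submit one SQL statement, then poll until the job leaves the running state.

    `http(method, url, body)` is an injected transport returning a JSON string,
    so the sketch can be exercised without a live server.
    """
    body = f"sql = {json.dumps(sql)}"            # hypothetical job config key
    reply = json.loads(http("POST", submit_url(base, app, class_path, context), body))
    job_id = reply["jobId"]                      # assumed field name
    for _ in range(max_polls):
        state = json.loads(http("GET", status_url(base, job_id), None))
        if state["status"] not in ("STARTED", "RUNNING"):   # assumed status values
            return state
        time.sleep(poll_secs)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

def fake_http(method, url, body):
    """Stand-in for a real HTTP client: the job finishes on the second poll."""
    fake_http.calls.append((method, url))
    if method == "POST":
        return json.dumps({"jobId": "job-1", "status": "STARTED"})
    polls = sum(1 for m, _ in fake_http.calls if m == "GET")
    status = "RUNNING" if polls < 2 else "FINISHED"
    return json.dumps({"jobId": "job-1", "status": status, "result": [[42]]})
fake_http.calls = []

result = run_statement(fake_http, "http://sjs-1:8090", "indicium",
                       "com.example.SqlJob", "ctx-a", "select count(*) from events")
print(result["status"], result["result"])
```

Injecting the transport keeps the interpreter logic separate from the choice of server, which matters once more than one SJS server and context are available.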
Managed Context Pool
Features
• Familiar SQL interface on notebooks
• Concurrent multi-user support
• Visualization Dashboards
• Long running Spark Job – to support User Defined Views
• Access control on Spark APIs
• Custom SQL context with custom catalog
– Intercept lookupTable calls to query actual data
– Table wrappers for time windows - like select count(*) from `lastXDays(table)`
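One way to picture the table wrappers is as a small parsing step inside the catalog lookup: the wrapper name is split into a table and a time window before the actual data is resolved. The sketch below assumes the `X` in `lastXDays(table)` is an integer, as in `last7Days(events)`, and that the window includes the current day; the grammar and the function name are illustrative assumptions, not the authors' catalog code.

```python
import re
from datetime import date, timedelta

# Matches identifiers such as last7Days(events); the exact grammar is an assumption.
WINDOW = re.compile(r"^last(\d+)Days\((\w+)\)$")

def resolve_table(identifier: str, today: date):
    """Return (table, first_day, last_day); plain names get no time restriction."""
    match = WINDOW.match(identifier)
    if match is None:
        return identifier, None, None
    days, table = int(match.group(1)), match.group(2)
    return table, today - timedelta(days=days - 1), today

print(resolve_table("last7Days(events)", date(2017, 10, 26)))
print(resolve_table("users", date(2017, 10, 26)))
```

The resolved date range can then be turned into a partition filter, so the SQL text itself stays free of date predicates.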
Managed Context Pool
Issues
• Interpreter hard wired to a context
• FIFO scheduling: Single statement per interpreter-context pair –
across notebooks / across users
• No automated failure handling
– Detecting a dead context / SJS server
– Recovery from the context / server failure
• No dynamic scheduling / load balancing
– No way to identify an overloaded context
• Incompatible with Spark 2.x
Indicium: Part 2
Smart Query Scheduler
Smart Query Scheduler
[Architecture diagram. Labels shown: Apache Zeppelin; Smart Query Scheduler; Spark Job-server; SQL Context (Yarn); HDFS (three nodes).]
Smart Query Scheduler
Zeppelin 0.7
• Supports per notebook statement execution
SJS 0.7 Custom Fork
• Support for Spark 2.x
Smart Query Scheduler:
• Scheduling: API to dynamically bind SJS server + context for every job / query
Other Optimizations:
• Monitoring: Monitor jobs running per context
• Availability: Track health of SJS servers and contexts and ensure healthy contexts in the
pool
Smart Query Scheduler
Dynamic scheduling for every query
• Zeppelin interpreter agnostic of actual SJS / context
• Load balancing of jobs per context
• Query Classification and intelligent routing
• Dynamic scaling / de-scaling the pool size
• Shared Cache
• User Defined Views
• Workspaces or custom time window view for every interpreter
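Binding a context per query, combined with load balancing, can be sketched as choosing the healthy context with the fewest running jobs at submission time. The `Context` fields and the `bind` function below are hypothetical names for illustration; the talk does not describe the actual selection rule.

```python
from dataclasses import dataclass

@dataclass
class Context:
    server: str        # SJS server hosting the context
    name: str          # context name on that server
    running_jobs: int  # jobs currently executing, as reported by monitoring
    healthy: bool = True

def bind(pool: list[Context]) -> Context:
    """Pick the healthy context with the fewest running jobs and reserve a slot."""
    candidates = [c for c in pool if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy context in pool")
    chosen = min(candidates, key=lambda c: c.running_jobs)
    chosen.running_jobs += 1
    return chosen

pool = [Context("sjs-1", "ctx-a", 3), Context("sjs-1", "ctx-b", 1),
        Context("sjs-2", "ctx-c", 0, healthy=False)]
first = bind(pool)
second = bind(pool)
print(first.name, second.name)
```

Because the choice is made per query, the Zeppelin interpreter never needs to know which SJS server or context ends up running a statement.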
Query Classification / routing
Custom resource configurations for contexts dedicated to
complex or asynchronous queries / jobs:
• Classify queries based on heuristics / historic data into
light / heavy queries and route them to different contexts.
• Separate contexts for interactive vs background queries
– An export table call does not starve an interactive SQL query
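A classifier of this kind can be sketched as a few rules applied in order: known background calls first, then historic runtime, then a syntactic heuristic. The threshold, the `export` prefix check, the join-count rule and the pool names below are all hypothetical; the talk does not give the actual heuristics.

```python
import re

# Hypothetical threshold; the talk does not give the actual rules.
HEAVY_RUNTIME_SECS = 60.0

# Hypothetical pool names for the three query classes.
ROUTES = {"light": "interactive-pool", "heavy": "heavy-pool", "background": "batch-pool"}

def classify(sql: str, history: dict[str, float]) -> str:
    """Label a query 'background', 'heavy' or 'light'."""
    text = sql.strip().lower()
    if text.startswith("export "):                 # assumed syntax for export calls
        return "background"
    past = history.get(text)                       # average runtime of earlier runs
    if past is not None:
        return "heavy" if past > HEAVY_RUNTIME_SECS else "light"
    joins = len(re.findall(r"\bjoin\b", text))     # heuristic when there is no history
    return "heavy" if joins >= 2 else "light"

def route(sql: str, history: dict[str, float]) -> str:
    """Map a query to the pool of contexts that should run it."""
    return ROUTES[classify(sql, history)]

print(route("select count(*) from events", {}))
print(route("export table events", {}))
```

Keeping background work in its own pool is what prevents a long export from occupying a context that interactive users are waiting on.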
Spark Dynamic Context
Elastic scaling of contexts, co-existing on same cluster as
scheduled batch jobs
• Scale up during the day, when user load is high
• Scale down at night, when overnight batch jobs are
running
• Scaling also helped to create reserved bandwidth for any
set of users, if needed.
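The day / night scaling can be sketched as a schedule that maps the time of day to a target pool size, from which the scheduler derives how many contexts to start or stop. The hours and sizes below are made-up values for illustration only.

```python
from datetime import time

# Hypothetical schedule: (start, end, number of contexts). Values are illustrative only.
SCHEDULE = [
    (time(8, 0), time(20, 0), 8),   # day time: user load is high
]
NIGHT_SIZE = 2                      # night: leave room for overnight batch jobs

def target_pool_size(now: time) -> int:
    """Return how many contexts the pool should hold at the given time of day."""
    for start, end, size in SCHEDULE:
        if start <= now < end:
            return size
    return NIGHT_SIZE

def scaling_action(current: int, now: time) -> int:
    """Positive result: contexts to start; negative result: contexts to stop."""
    return target_pool_size(now) - current

print(scaling_action(2, time(9, 0)))
print(scaling_action(8, time(2, 0)))
```

Reserved bandwidth for a group of users could be modelled the same way, with a separate schedule per group.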
Shared Cache
Alluxio to store common datasets
• Single cache for common datasets across contexts
– Avoids replication across contexts
– Cached data safe from executor / context crashes
• Dedicated refresh thread to release / update data
consistently across contexts
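The dedicated refresh thread can be sketched with an in-memory stand-in for the shared cache. This does not use Alluxio's API; the class, its methods and the versioning scheme are assumptions that only illustrate how a single writer can publish updates that every context then reads consistently.

```python
import threading
import time

class SharedDatasetCache:
    """In-memory stand-in for a cache shared by all contexts (Alluxio in the talk)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}      # dataset name -> (version, rows)

    def refresh(self, name, rows):
        """Publish a new version; readers see either the old or the new one, never a mix."""
        with self._lock:
            version = self._data.get(name, (0, None))[0] + 1
            self._data[name] = (version, list(rows))
            return version

    def release(self, name):
        """Drop a dataset from the cache."""
        with self._lock:
            self._data.pop(name, None)

    def read(self, name):
        """Return (version, rows), or None if the dataset is not cached."""
        with self._lock:
            return self._data.get(name)

def refresh_loop(cache, loaders, stop, interval_secs):
    """Body of the dedicated refresh thread: reload every dataset, then wait."""
    while not stop.is_set():
        for name, load in loaders.items():
            cache.refresh(name, load())
        stop.wait(interval_secs)

cache = SharedDatasetCache()
stop = threading.Event()
loaders = {"countries": lambda: ["IE", "IN"]}   # hypothetical dataset loader
worker = threading.Thread(target=refresh_loop, args=(cache, loaders, stop, 0.01))
worker.start()
while cache.read("countries") is None:          # wait for the first refresh
    time.sleep(0.001)
stop.set()
worker.join()
version, rows = cache.read("countries")
print(version >= 1, rows)
```

With one writer and a version per dataset, a context that crashes and restarts reads the same data as its peers instead of rebuilding its own copy.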
Persistent User Defined Views
• Users can define a temp view for a SQL query
• Replicated across all SJS servers + contexts
• Definitions persisted in a DB so that temp views are
re-registered whenever a context restarts.
• Views are loaded on context start to warm them up
• TTL support for expiry
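Persisting view definitions with a TTL can be sketched with a small table and two operations: save a definition, and list the unexpired definitions that a restarted context has to register again. The schema, the function names and the use of SQLite are assumptions for illustration; the talk only says the definitions are kept in a DB.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the table that holds view definitions (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS views ("
               "name TEXT PRIMARY KEY, definition TEXT NOT NULL, expires_at REAL NOT NULL)")
    return db

def save_view(db, name, definition, now, ttl_secs):
    """Persist or overwrite a view definition together with its expiry time."""
    db.execute("INSERT OR REPLACE INTO views VALUES (?, ?, ?)",
               (name, definition, now + ttl_secs))
    db.commit()

def views_to_register(db, now):
    """Drop expired definitions and return the ones a restarted context must re-register."""
    db.execute("DELETE FROM views WHERE expires_at <= ?", (now,))
    db.commit()
    return db.execute("SELECT name, definition FROM views ORDER BY name").fetchall()

db = open_store()
save_view(db, "recent_orders", "select * from orders where amount > 100", now=0, ttl_secs=3600)
save_view(db, "scratch", "select 1", now=0, ttl_secs=60)
for name, definition in views_to_register(db, now=120):
    # Statement a context would run on start to bring the view back.
    print(f"CREATE OR REPLACE TEMPORARY VIEW {name} AS {definition}")
```

Running the same replay against every SJS server and context is one way to obtain the replication listed above.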
Workspaces
• Support for multiple custom catalogs in SQL context for
table resolution
• Custom time range / source / caching
– Global
– Per catalog
– Per table
• Configurable via Zeppelin interpreter
• Decoupled time range from query syntax
– Join a behavior table (restricted to the last 30 days) with a lookup table
(fetched completely)
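The global / per catalog / per table settings suggest a lookup in which the most specific setting wins. That precedence is an assumption, as are the workspace structure, catalog names and values below; the sketch shows only how a behavior table and a lookup table in the same join could resolve to different time ranges.

```python
from typing import Optional

# Hypothetical workspace settings, as they might be entered on a Zeppelin interpreter.
WORKSPACE = {
    "global": {"days": 7},
    "catalogs": {"behavior": {"days": 30}},
    "tables": {"lookup.countries": {"days": None}},   # None = fetch complete data
}

def time_range_days(catalog: str, table: str, workspace: dict) -> Optional[int]:
    """Resolve the time range for a table: per table, then per catalog, then global."""
    key = f"{catalog}.{table}"
    if key in workspace["tables"]:
        return workspace["tables"][key]["days"]
    if catalog in workspace["catalogs"]:
        return workspace["catalogs"][catalog]["days"]
    return workspace["global"]["days"]

print(time_range_days("behavior", "clicks", WORKSPACE))
print(time_range_days("lookup", "countries", WORKSPACE))
print(time_range_days("txn", "orders", WORKSPACE))
```

Since the range comes from the workspace rather than the SQL text, the same query can be rerun over a different window by changing interpreter settings only.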
Automated Pool Management
• Monitoring scripts to track and restart unhealthy /
unresponsive SJS servers / contexts
• APIs on SJS to stop / start / refresh context / SJS
• APIs to refresh cached tables / views
• APIs on Router Service to reconfigure routing / pool size
and resource allocation
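A monitoring pass of this kind can be sketched as a loop that probes each target and restarts it after several consecutive failures. The failure limit, the target naming and the injected `is_healthy` / `restart` callables are assumptions; in practice they would wrap the stop / start APIs listed above.

```python
def sweep(targets, is_healthy, restart, max_failures=3, failures=None):
    """One monitoring pass: count consecutive failed probes, restart at the limit.

    `is_healthy(target)` and `restart(target)` are injected callables.
    Returns the targets restarted in this pass and the updated failure counts.
    """
    failures = {} if failures is None else failures
    restarted = []
    for target in targets:
        if is_healthy(target):
            failures[target] = 0
            continue
        failures[target] = failures.get(target, 0) + 1
        if failures[target] >= max_failures:
            restart(target)
            failures[target] = 0
            restarted.append(target)
    return restarted, failures

down = {"sjs-2/ctx-c"}          # hypothetical unresponsive context
log = []                        # records which targets were restarted
state = None
for _ in range(3):
    restarted, state = sweep(["sjs-1/ctx-a", "sjs-2/ctx-c"],
                             lambda t: t not in down, log.append, failures=state)
print(log)
```

Requiring several failed probes before a restart avoids tearing down a context that is only briefly slow under load.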
Thank You !
Questions & Answers
kapil.ee06@gmail.com
arvind_heda@yahoo.com
References
• Apache Zeppelin: https://zeppelin.apache.org/
• Spark Job-server: https://github.com/spark-jobserver/spark-jobserver
• Alluxio: http://www.alluxio.org/
Scale ….
• Data
– ~ 100 TB
– ~ 1000 Event Types
• 100+ Active concurrent users
• 30+ Automated Agents
• 10000+ Scheduled / 3000+ Ad Hoc Analysis
• Avg data churn per Analysis > 200 GB
