1© Cloudera, Inc. All rights reserved.
Faster Batch Processing with
Hive-on-Spark
Santosh Kumar | Cloudera
Rui Li | Intel
Agenda
• What is Hive-on-Spark?
• Using Hive-on-Spark
• Performance Metrics
• Configuration & Tuning
• What’s Next?
• Q&A
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
• Batch
• Streaming
• Machine Learning
• Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
Spark Takes Advantage of Memory
• Resilient Distributed Datasets (RDD)
• In-memory data-structure partitioned across a set of machines
• Can fall back to disk when data-set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault-tolerance through concept of lineage
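The lineage idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not Spark's API: the class name `ToyDataset` and its methods are hypothetical. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be rebuilt from the source.

```python
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # data in "stable storage" (root datasets only)
        self.parent = parent        # lineage: the dataset this one derives from
        self.transform = transform  # lineage: how to derive it from the parent
        self.cache = None           # in-memory copy, may be lost at any time

    def map(self, fn):
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage only when the in-memory copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

    def evict(self):
        """Simulate losing the in-memory copy of this dataset."""
        self.cache = None


base = ToyDataset(source=[1, 2, 3, 4])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 15)
first = derived.collect()
derived.evict()              # the cached result is lost ...
second = derived.collect()   # ... and rebuilt by replaying the lineage
print(first == second)
```

Real RDDs track lineage per partition across machines; the toy keeps a single list per dataset to stay short.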
Introduction
• Enables Hive to use Spark as underlying execution engine
• Motivations
• Consolidation of Spark as execution engine
• Better performance
• Increased adoption of Hive (e.g. for Spark users)
• Community effort by Cloudera, IBM, Intel, MapR, and others
Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
Batch Processing | BI and SQL Analytics | Procedural Development
[Figure: one engine logo per use case; only "Impala" is legible in the extracted text]
Current State of Hive-on-Spark (HoS)
• Fully supported production release in C5.7
• Functional parity with Hive-on-MapReduce (HoMR)
• Average 3x performance improvement vs HoMR
• Automatic configuration and optimizations via Cloudera Manager
• Strong early user base
• Early commitment for future collaboration from Intel and others
Design Principles
• Minimize impact on existing code path
• Minimizes functional and performance impact
• Minimizes maintenance
• Maximizes support for Hive features – current as well as future
• Spark invoked only at execution layer
• HoS produces a logical operator plan similar to HoMR's
• Logical plan runs on low-level Spark primitives
• Minimizes usage of advanced Spark primitives
Getting Started with Hive-on-Spark
Configuration
• Minimal configurations needed
• Via Cloudera Manager: Set “Spark on YARN Service” (internally sets
spark.master=yarn-cluster)
• Set hive.execution.engine=spark per service or query
• Only yarn-cluster is supported
• Cloudera Manager auto-configures most configurations
• Configuration & Tuning Guide available on Docs
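Switching the engine per query amounts to issuing a `SET` statement before the query. The helper below is a minimal sketch that only builds the statement text; the function name `with_engine` is hypothetical, and the list of accepted engines is limited to the two discussed in this deck.

```python
# Engines discussed in this deck; other values are rejected by this sketch.
VALID_ENGINES = ("mr", "spark")


def with_engine(query, engine="spark"):
    """Prefix a HiveQL query with a per-query hive.execution.engine setting."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    statement = query.strip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{statement};"


print(with_engine("SELECT school, COUNT(*) FROM profiles GROUP BY school"))
```

The resulting text can be pasted into a Hive session; setting the property at the service level in Cloudera Manager makes the prefix unnecessary.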
Performance
Avg. ~3X faster than Hive-on-MapReduce
More Suitable | Less Suitable
Complex workloads w/ multiple MR stages, e.g. filter followed by JOIN followed by GROUP BY | Simple workloads, e.g. select *
Disk-bound w/ multiple disk reads/writes | CPU-bound workloads, e.g. complex UDFs
Workloads requiring mins to hours for completion | Workloads typically requiring <1 min
Query Execution: Background
Input
status_updates(userid int,status string,ds string)
profiles(userid int,school string,gender int)
Output
school_summary(school string,cnt int,ds string)
gender_summary(gender int,cnt int,ds string)
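The slide gives only the schemas. Assuming the query joins status_updates to profiles on userid for one day and then aggregates twice (an assumption, since the query text is not in the slides), the computation can be sketched on in-memory rows as follows. The function name `summarize` and the sample rows are made up for illustration.

```python
from collections import Counter


def summarize(status_updates, profiles, ds):
    """Join status updates to profiles on userid for one day, then count
    updates per school and per gender (two outputs from one join)."""
    # Assumes userid is unique in profiles.
    profile_by_user = {userid: (school, gender) for userid, school, gender in profiles}
    school_counts = Counter()
    gender_counts = Counter()
    for userid, _status, row_ds in status_updates:
        if row_ds != ds or userid not in profile_by_user:
            continue  # filter on the day, inner-join semantics
        school, gender = profile_by_user[userid]
        school_counts[school] += 1
        gender_counts[gender] += 1
    school_summary = sorted((school, cnt, ds) for school, cnt in school_counts.items())
    gender_summary = sorted((gender, cnt, ds) for gender, cnt in gender_counts.items())
    return school_summary, gender_summary


# Hypothetical sample rows matching the slide's schemas.
updates = [
    (1, "hello", "2009-03-20"),
    (2, "lunch", "2009-03-20"),
    (1, "again", "2009-03-20"),
    (3, "stale", "2009-03-19"),
]
people = [(1, "MIT", 0), (2, "CMU", 1), (3, "MIT", 1)]
schools, genders = summarize(updates, people, "2009-03-20")
print(schools)
print(genders)
```

The join result feeds both aggregations, which is why the plan has a shared first part and two branches.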
Query Execution: MapReduce
[Figure: query plan diagram, shown in three builds; panel labels: BEGINS, CONTINUES, CONTINUES, ENDS]
FileSinkOperator (disk write) and TableScanOperator (disk read) are very costly
Query Execution: Hive-on-Spark
Costly Steps Removed
[Figure: query plan diagram, shown in two builds; panel labels: BEGINS, CONTINUES, CONTINUES, ENDS]
Optimization for Resource Management:
Long-Lived Executors (LLE)
• MR: Each query an independent YARN application
• Spark: Each SQL session is a long-lived YARN application
• First query of a session spawns a YARN app
• Subsequent queries re-use same YARN app as well as containers
• Session disconnect shuts down YARN app and releases container resources
Long-Lived Executors Details
• Hive User Session will submit Spark Application to YARN
• Spark YARN Application:
• YARN containers = Spark executors (executors live in YARN containers)
• YARN Application Master = RemoteDriver
• Submits Spark ‘jobs’, aka Hive queries, to Spark executors
• Connects back to HS2 to report job progress from Spark executors
[Figure: User1 and User2 each open a session (Session1, Session2) on HiveServer2; in the YARN cluster, each session has its own AM (RemoteDriver1, RemoteDriver2) with containers (executors)]
Configuration and Tuning
Hive-on-Spark
Spark Configuration
• Size of executors
• Bigger and fewer executors
• Thread contention
• GC pressure
• Smaller and more executors
• Less memory efficient
• Bigger start-up overhead
Spark Configuration
• CPU
• Around 5-7 cores per executor
• Memory
• Leave 10% for OS cache
• Executor memory overhead
• Tune by case
• Can be heavily used by Netty
• Usually 15% - 20%
• Around 3GB per core
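The rules of thumb above can be turned into a small worked calculation. This is a sketch under stated assumptions: the node figures are hypothetical, the whole node is treated as available to YARN, and the overhead percentage is applied to the per-executor total (heap plus overhead). The function name `size_executors` is made up for illustration.

```python
def size_executors(node_cores, node_memory_mb, cores_per_executor=6,
                   os_cache_percent=10, overhead_percent=15):
    """Split one worker node's cores and memory into equally sized executors."""
    if not 1 <= cores_per_executor <= node_cores:
        raise ValueError("cores_per_executor must be between 1 and node_cores")
    executors_per_node = node_cores // cores_per_executor
    # Leave a share of memory for the OS cache.
    usable_mb = node_memory_mb * (100 - os_cache_percent) // 100
    # Per-executor total = heap (spark.executor.memory) + overhead.
    total_per_executor_mb = usable_mb // executors_per_node
    overhead_mb = total_per_executor_mb * overhead_percent // 100
    heap_mb = total_per_executor_mb - overhead_mb
    return {
        "executors_per_node": executors_per_node,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_mb}m",
        "spark.yarn.executor.memoryOverhead": str(overhead_mb),  # value in MB
        "heap_mb_per_core": heap_mb // cores_per_executor,
    }


# Hypothetical node: 24 cores, 96 GB of memory.
for key, value in size_executors(24, 96 * 1024).items():
    print(key, value)
```

With these inputs the heap works out to roughly 3 GB per core, in line with the last bullet; other node shapes will need the percentages adjusted case by case.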
Spark Configuration
• Serialization
• spark.serializer – kryo performs better and is REQUIRED by HoS
• spark.kryo.referenceTracking – disable to avoid a Java performance issue
• Shuffle
• spark.shuffle.compress
• spark.shuffle.spill.compress
• Trade CPU for I/O
• Increase number of reducers
Partitioning
• Number of mappers
• InputFormat
• mapreduce.input.fileinputformat.split.maxsize
• Number of reducers
• hive.exec.reducers.bytes.per.reducer
• mapreduce.job.reduces
• HoS tends to launch more reducers
• Merge small files
• hive.merge.sparkfiles
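How the reducer settings interact can be sketched with the usual Hive estimate: a fixed mapreduce.job.reduces wins when it is positive, otherwise the count is the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped. The cap corresponds to hive.exec.reducers.max, which is not mentioned on the slide. The function name is hypothetical, and the sketch does not model whatever makes HoS launch more reducers than MR.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers, fixed_reduces=-1):
    """Estimate the reducer count from input size (simplified Hive-style rule)."""
    if fixed_reduces > 0:
        # An explicit mapreduce.job.reduces overrides the estimate.
        return fixed_reduces
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Ceiling division without floating point.
    wanted = -(-input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


GIB = 1024 ** 3
MIB = 1024 ** 2
# Hypothetical values: 10 GiB of input, 256 MiB per reducer, cap of 1009.
print(estimate_reducers(10 * GIB, 256 * MIB, 1009))
```

Lowering bytes per reducer raises parallelism, at the price of more and smaller output files, which is where hive.merge.sparkfiles comes in.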
Hive Configuration
• General optimizations
• Enable vectorization
• Enable CBO
• Automatic map join conversion
• Map side aggregation
• Etc.
Hive Configuration
• Map join
• hive.auto.convert.join.noconditionaltask.size
• HoS doesn’t support conditional map join yet
• HoS uses raw data size as small table size – different from MR
• hive.stats.collect.rawdatasize
• Skew join
• Compile time – same as MR
• Runtime - HoS will split the original task at join
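The map join threshold can be read as a simple size check. The sketch below assumes the commonly documented rule for hive.auto.convert.join.noconditionaltask.size: an n-way join is converted when the combined size of the n-1 smaller tables is below the threshold. It takes raw data sizes as input, as the slide says HoS does. The function name and the sample sizes are hypothetical, and the exact boundary comparison is an assumption.

```python
def converts_to_map_join(raw_data_sizes, noconditionaltask_size):
    """Return True if the small side of the join fits under the threshold.

    raw_data_sizes maps table name -> raw data size in bytes.
    """
    if len(raw_data_sizes) < 2:
        raise ValueError("a join needs at least two tables")
    sizes = sorted(raw_data_sizes.values())
    # Every table except the largest would be loaded into memory.
    small_side_total = sum(sizes[:-1])
    return small_side_total < noconditionaltask_size


MIB = 1024 ** 2
tables = {"status_updates": 50 * 1024 * MIB, "profiles": 5 * MIB}
print(converts_to_map_join(tables, 20 * MIB))
```

Because raw data size is usually larger than the compressed on-disk size, a threshold tuned for MR may need to be raised before HoS makes the same conversion.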
Resource Allocation
• Static allocation
• spark.executor.instances
• Won’t release until session is closed
• Recommended for benchmarking
• Dynamic allocation
• spark.dynamicAllocation.enabled
• spark.dynamicAllocation.initialExecutors
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.maxExecutors
• Number of executors per Spark application scales up and down
• Suited for multi-tenancy scenarios (multi-session)
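The two allocation modes can be written down as property sets. This is a sketch, not a recommended configuration: the helper names are hypothetical, the property names follow Spark's configuration documentation (spark.dynamicAllocation.initialExecutors, minExecutors, maxExecutors), and spark.shuffle.service.enabled is added because dynamic allocation on YARN relies on the external shuffle service.

```python
def static_allocation(instances):
    """Fixed executor count, held until the session is closed."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(instances),
    }


def dynamic_allocation(min_executors, initial_executors, max_executors):
    """Executor count that scales between min and max with the workload."""
    if not 0 <= min_executors <= initial_executors <= max_executors:
        raise ValueError("expected 0 <= min <= initial <= max")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Executors can be removed safely only if shuffle files outlive them.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.initialExecutors": str(initial_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


# Hypothetical bounds for a shared cluster.
for key, value in dynamic_allocation(1, 4, 20).items():
    print(f"{key}={value}")
```

Static allocation gives repeatable numbers for benchmarking; dynamic allocation lets idle sessions hand executors back to other tenants.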
Resource Allocation
• Pre-warm containers
• hive.prewarm.enabled
• spark.scheduler.maxRegisteredResourcesWaitingTime
• spark.scheduler.minRegisteredResourcesRatio
• Aims for better parallelism
• Considerable delay for start-up job
• Not recommended for short-lived sessions
Configuration and Tuning Summary
• Number and size of executors are the most important determinants of performance
• Resolve query performance problems and failures by allocating more executors with more CPU and RAM
• spark.executor.instances, spark.executor.cores, spark.executor.memory,
spark.yarn.executor.memoryOverhead
• Cloudera Manager takes care of most of the optimizations
• Most Hive config settings are applicable to HoS, but a few have different semantics
• See Config and Tuning Guide for details
Roadmap
• Additional Optimizations
• Dynamic Partition Pruning
• Vectorization support
• Cost-Based Optimizer
• Others – Caching RDDs across queries, Optimize self join/union etc.
• Supportability Enhancements
• Better support for debugging and logging
• More informative stage description in WebUI
• Others: Improve Hue integration, additional metrics specific to HoS etc.
• Rebase to Spark 2.0 and Parquet 1.8
More Information & Next Steps
Get Started
• Download C5.7: www.cloudera.com/downloads
Release Notes
• www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html
Training Classes
• university.cloudera.com
Questions?

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
