HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA
SQL ANALYTICS PROCESSING IN THE CLOUD
A REAL WORLD CASE STUDY
Jaipaul Agonus
FINRA
Strata Hadoop World
New York, Sep 2015
FINRA - WHAT DO WE DO?
Collect and Create
• Up to 75 billion events per day
• 13 National Exchanges, 5 Reporting Facilities
• Reconstruct the market from trillions of events spanning years
Detect & Investigate
• Identify market manipulation, insider trading, fraud, and compliance violations
Enforce & Discipline
• Ensure rule compliance
• Fine and bar broker-dealers
• Refer matters to the SEC and other authorities
FINRA’S SURVEILLANCE ALGORITHMS
Hundreds of surveillance algorithms run against massive amounts of data in multiple products (Equities, Options, etc.) and across multiple exchanges (NASDAQ, NYSE, CBOE, etc.).
FINRA’S SURVEILLANCE ALGORITHMS
Detecting abusive activity and compliance breaches. Examples: Best Execution, Layering, Compliance.
FINRA’S SURVEILLANCE ALGORITHMS
Dealing with big data before there was “Big Data”.
Over 430 batch analytics in the Surveillance suite.
“Massively Parallel Processing” methodology used to solve big-data
problems in the legacy world.
PRE-HADOOP DATA ARCHITECTURE
Tiered storage design that struggles to balance cost, performance, and flexibility.
PRE-HADOOP PAIN POINTS
Data Silos
Data distributed physically across MPP appliances, NAS, and tape, affecting accessibility and efficiency.
PRE-HADOOP PAIN POINTS
Cost
Expensive, specialized hardware tuned for CPU, storage, and network performance.
Proprietary software that comes with relatively high cost and vendor lock-in.
PRE-HADOOP PAIN POINTS
Non-Elasticity
Can't grow or shrink easily with data volume; bound by the cost of hardware in the appliance and the relatively high cost of the software.
ADDRESSING PAIN POINTS WITH...
Hive – De facto standard for SQL-on-Hadoop.
Amazon EMR – Elastic MapReduce, a managed Hadoop framework.
Amazon S3 – Simple Storage Service, with practically infinite storage.
WHY SQL?
Heavily SQL-based legacy application that works on source data that is already available in structured format.
Hundreds of thousands of lines of legacy SQL code, iterated and tested rigorously through the years and readily available for easy porting to Hive.
Developers, testers, data scientists, and analysts with deep SQL skills and the necessary business acumen are already part of FINRA's culture.
HIVE
Developed at Facebook and now the open-source de facto SQL standard for Hadoop.
Been around for seven years, battle-tested at scale, and widely used across industries.
Powerful abstraction over MapReduce: translates HiveQL into MapReduce jobs that work against a distributed dataset.
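For a sense of the abstraction, a minimal sketch of the kind of HiveQL involved (table and column names are hypothetical); Hive compiles the aggregation below into distributed map and reduce stages automatically:

  -- Hypothetical daily aggregate; Hive turns this into MapReduce stages.
  SELECT trade_date,
         symbol,
         COUNT(*)      AS order_count,
         SUM(quantity) AS total_quantity
  FROM   orders
  GROUP  BY trade_date, symbol;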
HIVE – EXECUTION ENGINES
MapReduce
Mature batch-processing platform for the petabyte scale!
Does not perform well enough for small data or iterative calculations with long data pipelines.
Tez
Aims to translate complex SQL statements into optimized, purpose-built data-processing graphs.
Strikes a balance between performance, throughput, and scalability.
Spark
Fast in-memory computing; leverages available memory by keeping intermediate data in it.
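The engine is a per-session choice, which is what makes swapping frameworks practical. A minimal sketch, assuming the Tez and Spark backends are installed on the cluster:

  -- Select Hive's execution engine for the current session.
  SET hive.execution.engine=mr;     -- classic MapReduce
  SET hive.execution.engine=tez;    -- Tez data-processing graphs
  SET hive.execution.engine=spark;  -- Hive-on-Spark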
AMAZON (AWS) S3
Cost-effective and durable object storage for a variety of content.
Allows separation of storage from compute resources, providing the ability to scale each independently in a pay-as-you-go pricing model.
Meets Hadoop's file system requirements and integrates well with EMR.
EMR – ELASTIC MAPREDUCE
Managed Hadoop framework; easy to deploy and manage Hadoop clusters.
An easy, fast, and cost-effective way to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.
EMR INSTANCE TYPES
A wide selection of virtual Linux servers with varying combinations of CPU, memory, storage, and networking capacity.
Each instance type includes one or more instance sizes, allowing you to scale resources to the target workload.
Available in Spot (at your bid price) or On-Demand models.
[Source – Amazon]
CLOUD-BASED HADOOP ARCHITECTURE
UNDERLYING DESIGN PATTERN
Transient Clusters with S3 as HDFS
[Source – Amazon]
The cluster lives just for the duration of the job and is shut down when the job is done.
Persist input and output data in S3.
Run multiple jobs in multiple Amazon EMR clusters over the same data set (in S3) without overloading your HDFS nodes.
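In this pattern the data outlives any single cluster, and each transient cluster simply points Hive at the same S3 location. A hedged sketch, with a hypothetical bucket and table:

  -- External table over data persisted in S3; any transient cluster can read it.
  CREATE EXTERNAL TABLE orders (
    order_id BIGINT,
    symbol   STRING,
    quantity BIGINT
  )
  PARTITIONED BY (trade_date STRING)
  LOCATION 's3://example-bucket/warehouse/orders/';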
UNDERLYING DESIGN PATTERN – BENEFITS
[Source – Amazon]
Control your cost; pay for what you use.
Minimal maintenance; the cluster goes away when the job is done.
Persisting data in S3 enables easy reprocessing if Spot Instances are reclaimed because you were outbid.
LESSON LEARNED
Design your Hive batch analytics with a focus on enabling direct data access and maximizing resource utilization.
HIVE – DESIGNING FOR PERFORMANCE
Enable Direct Data Access
Partition data along natural query boundaries (e.g. trade date) and process only the data you need.
Improve join performance by bucketing and sorting data ahead of time, reducing I/O scans during the join.
Use broadcast joins when joining small tables.
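A hedged HiveQL sketch of these three techniques together; the table, the bucket count, and the small symbols dimension table are hypothetical:

  -- Partitioned on the natural query boundary, pre-bucketed and pre-sorted.
  CREATE TABLE orders (
    order_id BIGINT,
    symbol   STRING,
    quantity BIGINT
  )
  PARTITIONED BY (trade_date STRING)
  CLUSTERED BY (symbol) SORTED BY (symbol ASC) INTO 256 BUCKETS;

  -- Filtering on the partition column prunes partitions; with
  -- auto-convert enabled, Hive broadcasts the small table to every mapper.
  SET hive.auto.convert.join=true;
  SELECT o.symbol, s.issuer, SUM(o.quantity) AS total_quantity
  FROM   orders o
  JOIN   symbols s ON o.symbol = s.symbol
  WHERE  o.trade_date = '2015-09-01'
  GROUP  BY o.symbol, s.issuer;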
HIVE – DESIGNING FOR PERFORMANCE
Tune Hive Configurations
Increase parallelism by tuning the MapReduce split size to use all the available map slots in the cluster.
Increase the replication factor on dimension tables (not for additional resiliency but for better performance).
Compress intermediate data to reduce data transfer between mappers and reducers.
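Hedged examples of these knobs; the property names are the classic pre-YARN ones and every value below is illustrative, not a recommendation:

  SET mapred.max.split.size=134217728;        -- ~128 MB splits; smaller splits mean more mappers
  SET dfs.replication=10;                     -- applies to dimension tables written in this session
  SET hive.exec.compress.intermediate=true;   -- compress intermediate data between stages
  SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;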
LESSON LEARNED
Measure and profile your clusters from the outset and adjust continuously, keeping pace with changing data sizes, processing flows, and execution frameworks.
CLUSTER PROFILING – SAMPLE
Batch Process:
• Order Audit Trail Report Card Validation
Source Dataset:
• 1 month of market orders data
• Over 100 billion rows
• Around 5 terabytes in size
• Input stored in S3
Instance Choices:
• m1.xlarge (General Purpose) - 4 CPUs, 15 GB RAM, 4 x 420 GB disk storage, high network performance
• c3.8xlarge (Compute Optimized) - 32 CPUs, 60 GB RAM, 2 x 320 GB SSD storage, high network performance
Run Options:
Attribute                 Option-1          Option-2           Option-3           Option-4
Instance Type             m1.xl             c3.8xl             c3.8xl             c3.8xl
Instance Classification   General Purpose   Compute Optimized  Compute Optimized  Compute Optimized
Cluster Size              100               16                 32                 100
CLUSTER PROFILE – COMPARISON RESULTS
100 m1s vs. 16 c3s vs. 32 c3s vs. 100 c3s
Attribute                          Option-1          Option-2           Option-3           Option-4
Instance Type                      m1.xl             c3.8xl             c3.8xl             c3.8xl
Instance Classification            General Purpose   Compute Optimized  Compute Optimized  Compute Optimized
Cluster Size                       100               16                 32                 100
Cost                               $57.33            $111.86            $127.05            $312.00
Time                               3 hrs 12 min      8 hrs              4 hrs 56 min       3 hrs
Approx. Spot Price per Instance    $0.14             $0.78              $0.66              $0.78
Peak Disk Usage                    0.04%             2.10%              1.20%              0.45%
Peak Memory Usage                  56%               43%                45%                41%
Peak CPU Usage                     100%              95%                98%                92%
LESSON LEARNED
Abstract the execution framework (MR, Tez, Spark) from the business logic in your application architecture; this allows switching frameworks and keeping up with the Hive community.
LESSON LEARNED
Hive UDFs (User Defined Functions) can solve complex problems when SQL falls short.
HIVE UDFS
Used when Hive functionality falls short
• e.g. window functions with IGNORE NULLS – not supported in Hive
• e.g. date formatting functions – Hive 1.2 has better support
Used when non-procedural SQL can't accomplish the task
• e.g. de-duping many-to-many time-series pairings such as:
A->B 10AM, A->C 11AM, D->B 11AM, D->C 11AM, E->C 12PM
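UDF logic itself is written in Java; on the HiveQL side the function is simply registered and called. A hedged sketch of that wiring, with a hypothetical jar, class, and date-formatting function (a cross-row de-dupe like the pairing example would be packaged the same way, as a UDF or UDAF):

  -- Register a custom UDF packaged in a jar stored on S3.
  ADD JAR s3://example-bucket/jars/surveillance-udfs.jar;
  CREATE TEMPORARY FUNCTION format_ts
    AS 'org.example.hive.udf.FormatTimestampUDF';

  -- Invoke it like any built-in function.
  SELECT format_ts(event_time, 'yyyyMMdd HH:mm') FROM pairings;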
LESSON LEARNED
Choose an optimal storage format (row or columnar) and compression type (splittable, space-efficient, fast).
FILE STORAGE – FORMAT AND COMPRESSION
[Source – Amazon]
STORAGE FORMATS
Columnar or row-based storage.
RCFILE, ORC, PARQUET – columnar formats that skip loading unwanted columns!
COMPRESSION TYPES
Reduce the number of bytes written to and read from HDFS.
Some are fast but offer less space reduction, some are space-efficient but slower, some are splittable and some are not.
Compression  Extension  Splittable       Encoding/Decoding Speed (scale 1-4)  Space Savings (scale 1-4)
Gzip         gz         no               1                                    4
LZO          lzo        yes, if indexed  2                                    2
bzip2        bz2        yes              3                                    3
Snappy       snappy     no               4                                    1
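Putting the format and codec choices together, a hedged HiveQL sketch (the table is hypothetical; note that ORC compresses internally per stripe, so the file stays splittable even with a codec like Snappy inside):

  -- Columnar ORC storage with Snappy compression inside the file.
  CREATE TABLE orders_orc (
    order_id BIGINT,
    symbol   STRING,
    quantity BIGINT
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY');

  -- For plain-text output files, pick the codec explicitly.
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;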
LESSON LEARNED
Being at the bleeding edge of SQL-on-Hadoop has its risks; you need the right strategy to mitigate the risks and challenges along the way.
RISKS AND MITIGATION STRATEGIES
Hive was released in late 2008, and the Tez and Spark backends were added only recently; traditional platforms like Oracle have been tested and iterated on for four decades.
Discovered issues with Hadoop/Hive during migration related to compression, storage formats, and memory management.
Performance issues with interactive analytics impact data scientists and business analysts.
RISKS AND MITIGATION STRATEGIES
Extensive parallel-run comparison (apples-to-apples) against legacy production data to identify issues before production rollout.
Partnered with Hadoop and cloud vendors to address Hadoop/Hive issues and feature requests quickly.
Push-button automated regression test suites to kick the tires and test functionality on any minor software increment.
Analyzing Presto/Tez to solve the interactive analytics problems.
RISKS AND MITIGATION STRATEGIES
Took advantage of cloud elasticity to complete parallel runs
against production volume at scale to produce results swiftly.
END STATE
COST
Cost savings in infrastructure relative to our legacy environment.
Minimal upfront spending, pay-as-you-go pricing model.
Variety of machine types and sizes to choose from based on cost and performance needs.
END STATE
FLEXIBILITY
Dynamic infrastructure provides great flexibility for faster reprocessing and testing.
Simplified software procurement and license management.
Ease of exploratory runs without affecting production workloads.
END STATE
SCALABILITY
Scale out easily, at will, on high-volume days.
Cloud elasticity enables running multiple days at scale in parallel.
Reprocessing of historical days completes in hours, compared to weeks in our legacy environment.
QUESTIONS?
Jaipaul Agonus
FINRA
Jaipaul.agonus@finra.org
linkedin.com/in/jaipaulagonus