The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem
Alan Gates
Co-Founder
Hortonworks
@alanfgates
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Our Hadoop Journey Begins…
Nodes 1 … N
HDFS
MapReduce
Batch apps
2006
Today
Our Hadoop Journey: Ecosystem Innovation Accelerates
2006 2011
6 Years of Apache Hive and Beyond
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching
2010 2011 2012 2013 2014 2015 2016
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
Hive 2 with LLAP: Architecture Overview
Deep Storage: HDFS, S3 + other HDFS-compatible filesystems
YARN Cluster: four LLAP daemons, each with query executors and an in-memory cache
Query Coordinators: three coordinators
HiveServer2 (query endpoint): receives ODBC / JDBC SQL queries
Hive 2 with LLAP: 25+x Performance Boost
[Chart: query time in seconds (lower is better) for Hive 1 / Tez versus Hive 2 / LLAP, with the speedup (x factor) per query]
Hive 2 with LLAP averages 26x faster than Hive 1
What’s new in Spark 2.0?
• API Improvements
– SparkSession – new entry point
– Unified DataFrame & Dataset API
– Structured Streaming/Continuous Application
• Performance Improvements
– Tungsten Phase 2 – Whole-stage code generation
• ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
• SparkSQL
– SQL:2003 support (new ANSI SQL parser, subquery support)
How to Secure and Govern Access to Your Data?
Classification
Prohibition
Time
Location
Streams
Pipelines
Feeds
Hive
Tables
HDFS
Files
HBase
Tables
Entities
in Data
Lake
Policies
?
Secure and Govern Your Data with Tag-Based Access Policies
Classification
Prohibition
Time
Location
Policies
PDP
Resource
Cache
Ranger
Manage Access Policies
and Audit Logs
Track Metadata
and Lineage
Atlas Client
Subscribers
to Topic
Gets Metadata
Updates
Atlas
Metastore
Tags
Assets
Entities
Streams
Pipelines
Feeds
Hive
Tables
HDFS
Files
HBase
Tables
Entities
in Data
Lake
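The tag-based model on this slide can be illustrated with a small sketch. This is a toy evaluator written for illustration only, not the Apache Ranger or Apache Atlas API: the `Asset` and `TagPolicy` classes, the `is_allowed` function, the default-deny rule, and the reading of the time condition as a maximum data age are all assumptions. It shows the idea that policies attach to tags rather than to individual tables or files, so an asset is covered as soon as it is tagged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set


@dataclass
class Asset:
    """A data-lake entity (table, file, stream) carrying catalog tags. Hypothetical type."""
    name: str
    tags: Set[str]
    created: datetime
    region: str


@dataclass
class TagPolicy:
    """An access policy attached to a tag, not to a specific asset. Hypothetical type."""
    tag: str
    allowed_groups: Set[str]
    max_age: Optional[timedelta] = None         # time-based condition (assumed meaning)
    allowed_regions: Optional[Set[str]] = None  # location-based condition


def is_allowed(user_groups: Set[str], asset: Asset,
               policies: List[TagPolicy], now: datetime) -> bool:
    """Allow only if every policy matching one of the asset's tags permits access."""
    applicable = [p for p in policies if p.tag in asset.tags]
    if not applicable:
        return False  # assumption: untagged assets are denied by default
    for policy in applicable:
        if not (user_groups & policy.allowed_groups):
            return False
        if policy.max_age is not None and now - asset.created > policy.max_age:
            return False
        if policy.allowed_regions is not None and asset.region not in policy.allowed_regions:
            return False
    return True
```

A pipeline that lands a new table only has to tag it (for example with a hypothetical "PII" tag); any policy already defined for that tag then applies without a per-table rule.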
Data In Motion
• Constrained
• High-latency
• Localized context
• Hybrid – cloud/on-premises
• Low-latency
• Global context
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE
Our Hadoop Journey: From the Data Center to the Cloud!
2006 Today
Why Hadoop in the Cloud?
Unlimited Elastic Scale
Ephemeral & Long-Running
IT & Business Agility
No Upfront HW Costs
Key Architectural Considerations for Hadoop in the Cloud
Shared Data & Storage
On-Demand Ephemeral Workloads
Elastic Resource Management
Shared Metadata, Security & Governance
Shared Data and Storage
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR
Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage
Enhance Performance via Caching
Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level
HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache
Workloads
Cloud Storage
LLAP R/W Tables
HDFS Files
Cache
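The caching behaviour listed on this slide (cache only the needed columns, spill to SSD when memory is full, write through to cloud storage) can be sketched in a few lines. This is a simplified model for illustration and not LLAP's implementation: the `ColumnCache` class, its slot-based capacity, the least-recently-used spill order, and the plain dictionaries standing in for SSD and cloud storage are all assumptions.

```python
from collections import OrderedDict


class ColumnCache:
    """Toy column-level cache with an SSD spill tier and write-through (hypothetical)."""

    def __init__(self, store, memory_slots):
        self.store = store                # stands in for cloud storage: {(table, column): values}
        self.memory_slots = memory_slots  # how many columns fit in memory
        self.memory = OrderedDict()       # least recently used first
        self.ssd = {}                     # spill tier: evicted columns are kept, not dropped

    def _admit(self, key, values):
        """Place a column in memory, spilling the coldest columns if over capacity."""
        self.memory[key] = values
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            cold_key, cold_values = self.memory.popitem(last=False)
            self.ssd[cold_key] = cold_values

    def read(self, table, column):
        """Return one column, fetching from SSD or cloud storage only on a miss."""
        key = (table, column)
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            values = self.ssd.pop(key)
        else:
            values = self.store[key]      # only the requested column is loaded
        self._admit(key, values)
        return values

    def write(self, table, column, values):
        """Write-through: update cloud storage first, then refresh the cached copy."""
        key = (table, column)
        self.store[key] = values
        self.ssd.pop(key, None)           # discard any stale spilled copy
        self._admit(key, values)
```

Because writes always reach the backing store before the cache is updated, the cache can be discarded at any time without losing data, which is the property an ephemeral cluster needs.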
Prescriptive On-Demand Ephemeral Workloads
On-Demand Ephemeral Workloads
Data Science: R/W Tables, Compute Fabric
ETL: R/W Tables, Compute Fabric
Warehouse: R/W Tables, Compute Fabric
Search: R/W Tables, Compute Fabric
Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
• Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
– E.g., backed by Cloud RDS (or shared DB)
Classification
Prohibition
Time
Location
Streams
Pipelines
Feeds
Tables
Files Objects
Shared
Metadata
Policies
Elastic Resource Management in Context of Workload
Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data-access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
Elastic
Resource
Management
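As a rough illustration of SLA-driven, pricing-aware resource management, the sketch below sizes a workload against a deadline and picks the cheapest instance option. It is a back-of-the-envelope model, not a YARN or cloud-provider API: the `InstanceType` class, the `plan_for_sla` function, and all prices and throughput figures used with it are hypothetical, and it ignores real-world factors such as spot interruptions and data-transfer costs.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InstanceType:
    """A purchasable compute option. All figures are illustrative assumptions."""
    name: str
    hourly_price: float    # cost per instance-hour
    units_per_hour: float  # work units one instance completes per hour


def plan_for_sla(work_units: float, sla_hours: float,
                 instance_types: List[InstanceType]) -> Tuple[InstanceType, int, float]:
    """Return (instance type, instance count, total cost) for the cheapest plan
    that finishes `work_units` within `sla_hours`."""
    best = None
    for itype in instance_types:
        # Instances needed so the workload completes inside the SLA window.
        count = math.ceil(work_units / (itype.units_per_hour * sla_hours))
        cost = count * itype.hourly_price * sla_hours
        if best is None or cost < best[2]:
            best = (itype, count, cost)
    return best
```

Tightening the SLA in this model raises the instance count rather than the per-unit cost, which is the kind of trade-off a workload-aware scheduler would weigh against a budget.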
Data in
Motion
Data at
Rest
Deep Historical
Analysis
DATA CENTER
Stream Analytics
Edge
Data
Data in
Motion
Machine
Learning
CLOUD
Edge
Data
Edge
Analytics
Data at
Rest
Transformational Applications Require Connected Data
Thank You

Big data spain keynote nov 2016


Editor's Notes

  • #2: Speaker: Alan Gates, Hortonworks Co-Founder Title: 10 Years of Apache Hadoop and Beyond Duration: 40 minutes Abstract: In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
  • #3: Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data. Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop. [NEXT]
  • #4: Fast forward to 2011 and the concept of YARN began to emerge. Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications. The emergence of YARN further accelerated the innovation around Hadoop with the emergence of Spark, Kafka, Storm, and many other projects that started life as Apache Incubator proposals. [NEXT]
  • #5: I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to an enterprise-ready EDW.
  • #8: Spark is the Swiss Army knife of big data: it can do streaming, SQL, ETL, and ML, and is available from multiple languages (Python, Java, Scala).
  • #10: So the enterprise has invested in integrating Hadoop into its data lake architecture, landing petabytes of data from streams, pipelines, and data feeds into HDFS files, Hive and HBase tables, etc. The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them. [NEXT]
  • #11: The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need. The result is a tag-based authorization model driven by the metadata catalog (i.e., Atlas) with access and audit policies applied to those tags (via Ranger). This enables a more flexible way to govern access to data and data sets than traditional role/group-based access policies. For example, as data pipelines land data, they can tag that data as it lands, and the access policies set up for those tags immediately apply. Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that’s older than 90 days or limit access to data from certain geographies. This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications with confidence that their data is secure and well governed. [NEXT]
  • #12: TALK TRACK: People are no longer willing to wait until data is in the store before processing it. Hortonworks DataFlow is a platform for data in motion. It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing. MiNiFi/NiFi creates dynamic, configurable data pipelines. Kafka supports adaptation to differing rates of data creation and delivery. Storm provides real-time stream processing to create immediate insights at massive scale. There are scenarios where NiFi will provide all that you need, especially in situations that only require dataflow management, but you will notice that the orange and blue horizontal triangles provide a continuum of capability from edge to core, which indicates varying degrees of need for the different products.
  • #13: So after 10 years, the Hadoop ecosystem is available everywhere. In the Data Center, within appliances, across public and private clouds. This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases. [NEXT]
  • #15: While there is a range of great choices in the market today, there’s more that we, as a community, can and should do to make Hadoop in the cloud better and first class. I’ll spend the remainder of this talk on the key architectural considerations. Shared Data & Storage: the shared data lake is on cloud storage, not HDFS. Memory and local storage also play a different role, that of caching. An important distinction in the cloud is On-Demand Ephemeral Workloads, which changes a number of things in fundamental ways. Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters. And finally, I’ll touch on Elastic Resource Management: we need to shift our thinking away from cluster resource management and towards SLA-driven workloads. [NEXT]
  • #16: In the cloud, the shared data lake is on cloud storage. It is not the HDFS of a specific Hadoop cluster. Note this is very different from a traditional on-premises cluster, where each cluster has an internal shared store representing its internal data lake. Moreover, it’s desirable to have this shared data be accessible by all apps, not just Hadoop apps: cloud-native and third-party apps too. Good news: geo-distribution is built into the cloud storage, and DR becomes simpler. Cloud storage has two limitations: it is eventually consistent, and its API does not match the filesystem API expected by Hadoop and normal apps. Addressing these two issues is a key area of ongoing investment. I encourage you to attend today’s breakout session by my fellow Hortonworkers that’s focused on this topic. Cloud storage is designed for low cost and scale; unfortunately, performance is not its strong point because it is segregated from compute. Memory and local storage play a different role in the cloud: they act as a cache to enhance performance. [NEXT]
  • #17: With regard to caching, we need to consider both tabular and non-tabular data. For tabular data, LLAP comes to the rescue: it provides a tabular cache that’s shared across jobs, apps, and engines such as Hive and Spark. LLAP caches only the needed columns, so it’s very efficient in its use of memory. Further, data is stored in an internal serialized form to optimize compute. The design center is anti-caching: put it all in memory and spill to disk/SSD when memory is full. LLAP currently provides read caching, but is being extended to support a write-through cache. LLAP also addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control that works across all kinds of engines, whether Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible. From a non-tabular data perspective, HDFS can be used to cache cloud data, as both a read cache and a write-through cache. This essentially evolves HDFS to play a different role: a place to store intermediate data and a finely tuned caching layer between the applications and the cloud storage. [NEXT]
  • #18: Always-on multitenant clusters are important for a range of mission-critical use cases. However, bringing forth an ephemeral cluster to support a specific workload is game changing. The agile nature of the cloud allows us to create prescriptive workload environments. Someone interested in modeling and analyzing data sets simply wants to interact with a pre-tuned environment optimized for the application. The complexities of configuring Spark, Hive, and Hadoop need to be hidden under the hood. Whether it’s data science, data warehouse, ETL, or other common workload types, we should provide pre-configured and pre-tuned compute environments. Further, we need to be able to manage them in an ephemeral fashion. The net: deliver user experiences that are focused on business agility rather than infinite configurability and cluster management. [NEXT]
  • #19: So far I have covered shared data and storage, and how to optimize performance by caching. Shared data fundamentally requires a shared approach to metadata, security and governance. The metadata is not just the classic Hive metadata that describes the tabular data; it is also about storing and tracking the lineage and provenance of data, and about details related to data pipeline processing and job management. Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is. Also, as data is ingested and processed, metadata needs to be created and adjusted. Governance and securing the data remain critical, and their metadata needs to be managed across all workloads. The work done by projects such as Ranger and Atlas needs to be evolved to fit the cloud environment. If we don’t do this, the cloud will not be adopted aggressively for enterprise use. Getting back to the shared metadata: each ephemeral cluster cannot have its own private copy of the metadata. In the cloud world, metadata must be centrally stored so that it can be used across all ephemeral clusters. [NEXT]
  • #20: Final area: resource management. We need to up-level resource management. So far, YARN has focused on optimizing resources in the context of a cluster. The cloud is not about the cluster, it is about the workloads, and further, resources are elastic. The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload’s SLA. It may need to get extra resources from the cloud, and to get the right resources to match the needs of the workload. Sometimes adding compute power is not sufficient to meet an SLA, because latency or bandwidth to the data may be the bottleneck; e.g. spin up more LLAP memory in order to improve caching and hence meet the SLA. The cloud offers another dimension: that of cost and budgets. There are different costs tied to CPU, memory and data access bandwidth, so elasticity and spot pricing tradeoffs should be factored in. Resource management in this new dimension is important if you want to reap the benefit of low-cost cloud computing. To summarize: the better we understand the nature of a workload, the more we are able to take advantage of elasticity and spot pricing. CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud and also to take advantage of unique cloud features such as elasticity. We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week. [NEXT]
  • #21: Today we have talked about evolving Hadoop to run well in the cloud. At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and the data center. This is illustrated on the screen. It stresses two important points. First, the connectedness of the cloud and the on-premises infrastructure and data. Second, the connectedness of data in motion and data at rest. The era of the Internet of Things demands that we manage the entire lifecycle of all data (data in motion and data at rest). It’s about being able to collect and curate data across traditional silos so that the various groups and lines of business have a place where they can assemble a single view of data in order to drive deep historical insights. It’s also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it’s not just about point-to-point delivery; it’s also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data. So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge could represent data from the manufacturing line. Having a connected data architecture that enables you to deal with all of this data unlocks the ability to figure out, for example, which manufacturing line issues may be causing operational issues in cars in the field. In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models. [NEXT SLIDE]
  • #22: THANK YOU AND HAVE A GREAT CONFERENCE!