Curb Your Insecurity with HDP
Tips for a Secure Cluster (with Spark too)
Hadoop Summit – San Jose
June 29th, 2016
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Pardeep Kumar
Sr. Systems Architect, NA Prof. Services
4+ years in Hadoop
Helping Fortune 500 customers succeed in
their Hadoop journey
Set up, implement, migrate and secure some
of the largest clusters in North America
Security & Migration SME, HCC Guru
Loves Hadoop, Cricket and Kerberos ;)
pardeep.kumar@hortonworks.com
@hadooptutor
linkedin.com/in/pardeepkumarmishra
Ancil McBarnett
Sr. Solutions Engineer, NorthEast
Helping organizations design, implement,
operate and consume Hadoop and Big Data
Solutions. Specialize in Security and Hive
Tuning. HCC Guru.
Loves Cricket, and DJ Bravo Champion :D
amcbarnett@hortonworks.com
@mcbkingdom
linkedin.com/in/mcbkingdom
Hadoop Security in 4 Steps
Comprehensive Approach to Security
In order to protect any data system you must implement the following:
• Administration: Central management and consistent security (How do I set policy across the entire cluster?)
• Authentication: Authenticate users and systems (Who am I/prove it?)
• Authorization: Provision access to data (What can I do?)
• Audit: Maintain a record of data access (What did I do?)
• Data Protection: Protect data at rest and in motion (How can I encrypt at rest and over the wire?)
HDP Security: Comprehensive, Complete, Extensible
Perimeter Level Security
• Network Security (e.g. Firewalls)
• Apache Knox (e.g. Gateways)
Authentication
• LDAP/AD - Kerberos
Data Protection
• Encrypts data in motion and data at rest: HDFS TDE with Ranger KMS; refer to partner encryption solutions for broader needs
Authorization & Audit
• Consistent authorization controls across all Apache components within HDP: Apache Ranger
Authentication with Kerberos
Kerberos is a necessary evil, just do it!!
Security Without Kerberos
Configure Kerberos – Ambari Wizard
Security With Kerberos
Apache Ranger
Apache Ranger
HDFS File Security
Hive Database and Table Security
Authorization and Audit
Authorization
Fine-grained access control
• HDFS – Folder, File
• Hive – Database, Table, Column
• HBase – Table, Column Family, Column
• Storm, Knox and more
Audit
Extensive user access auditing in HDFS, Hive and HBase
• IP Address
• Resource type/resource
• Timestamp
• Access granted or denied
Control access into system
Flexibility in defining policies
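To make the authorization and audit fields above concrete, here is a minimal sketch of a policy check that emits one audit record per access attempt. It is an illustration only, not Ranger's API: the `Policy`, `AuditRecord` and `authorize` names, the prefix-matching rule and the sample user are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Policy:
    """Hypothetical allow-rule: who may do what under a resource prefix."""
    resource_type: str    # e.g. "hdfs", "hive", "hbase"
    resource_prefix: str  # e.g. "/data/finance"
    users: frozenset
    actions: frozenset


@dataclass(frozen=True)
class AuditRecord:
    """Mirrors the audit fields listed on the slide."""
    user: str
    ip_address: str
    resource_type: str
    resource: str
    action: str
    timestamp: datetime
    granted: bool


def authorize(policies, user, ip_address, resource_type, resource, action):
    """Return an audit record; access is granted if any policy allows it.

    Plain prefix matching is a simplification chosen for brevity.
    """
    granted = any(
        p.resource_type == resource_type
        and resource.startswith(p.resource_prefix)
        and user in p.users
        and action in p.actions
        for p in policies
    )
    return AuditRecord(user, ip_address, resource_type, resource, action,
                       datetime.now(timezone.utc), granted)


policies = [Policy("hdfs", "/data/finance", frozenset({"jdoe"}), frozenset({"read"}))]
record = authorize(policies, "jdoe", "10.0.0.5", "hdfs", "/data/finance/q1.csv", "read")
print(record.granted, record.timestamp.isoformat())
```

Every attempt produces a record whether it is granted or denied, which is what makes the "access granted or denied" audit field meaningful.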
REST API Security with Apache Knox
Hadoop REST APIs
Useful for connecting to Hadoop from outside the cluster
When more client language flexibility is required
– e.g. when a Java binding is not an option
Challenges
– Client must have knowledge of cluster topology
– Ports must be opened outside the cluster (in some cases, on every host)
Service    API
WebHDFS    Supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
WebHCat    Job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands.
Hive       Hive REST API operations
HBase      HBase REST API operations
Oozie      Job submission and management, and Oozie administration.
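As an illustration of the WebHDFS row, the sketch below builds request URLs in the `/webhdfs/v1/<path>?op=<OPERATION>` form that WebHDFS uses. The `webhdfs_url` helper, the `namenode-host` name and the sample paths are assumptions for the example; the URLs are only constructed, not called, and a secured cluster would additionally need authentication.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, scheme="http", **params):
    """Build a WebHDFS URL: scheme://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Listing a directory is a GET with op=LISTSTATUS.
print(webhdfs_url("namenode-host", "/tmp", "LISTSTATUS"))
# Creating a directory is a PUT with op=MKDIRS and an optional permission.
print(webhdfs_url("namenode-host", "/tmp/new dir", "MKDIRS", permission="755"))
```

Note that the client has to know which host runs the NameNode and which port it listens on, which is the topology challenge listed above.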
Authentication—API Security with Knox
Incubated and led by Hortonworks, Apache Knox extends the reach of Hadoop REST APIs without Kerberos complexities
Single, simple point of access for a cluster
• Kerberos encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support
Integrated with existing systems to simplify identity maintenance
• SSO integration: Siteminder and OAM
• LDAP and AD integration
Central controls ensure consistency across one or more clusters
• Eliminates SSH "edge node"
• Central API management
• Central audit control
• Service level authorization
Hadoop REST API with Knox
Service    Direct URL                             Knox URL
WebHDFS    http://namenode-host:50070/webhdfs     https://knox-host:8443/webhdfs
WebHCat    http://webhcat-host:50111/templeton    https://knox-host:8443/templeton
Oozie      http://ooziehost:11000/oozie           https://knox-host:8443/oozie
HBase      http://hbasehost:60080                 https://knox-host:8443/hbase
Hive       http://hivehost:10001/cliservice       https://knox-host:8443/hive
YARN       http://yarn-host:yarn-port/ws          https://knox-host:8443/resourcemanager
Direct URLs: masters could be on many different hosts
Knox URLs: one host, one port; consistent paths; SSL config at one host
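The "one host, one port, consistent paths" point can be sketched as a small lookup that turns a service name into its Knox URL. The paths are the simplified ones from the table; a real Knox deployment usually adds a gateway and topology prefix to the path, which the slide omits. The `KNOX_PATHS` and `knox_url` names are assumptions for this example.

```python
# Service name -> path segment, taken from the Knox URL column of the table.
KNOX_PATHS = {
    "WebHDFS": "webhdfs",
    "WebHCat": "templeton",
    "Oozie": "oozie",
    "HBase": "hbase",
    "Hive": "hive",
    "YARN": "resourcemanager",
}


def knox_url(service, knox_host="knox-host", port=8443):
    """Return the simplified Knox URL for a service (always HTTPS, one host, one port)."""
    return f"https://{knox_host}:{port}/{KNOX_PATHS[service]}"


for name in KNOX_PATHS:
    print(f"{name:8} {knox_url(name)}")
```

The client only needs the gateway address; which master host actually serves each API stays hidden behind Knox.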
Hadoop REST API Security: Drill-Down
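[Diagram: REST clients reach the Knox Gateway instances (GW) in the DMZ through a firewall and a load balancer (LB) over HTTP. Knox checks identities against the enterprise identity provider (LDAP/AD) over LDAP and forwards HTTP requests through a second firewall to Hadoop Cluster 1 and Hadoop Cluster 2. Each cluster shows masters (RM, NN, WebHCat, Oozie, HS2, HBase) and slaves (DN, NM). An edge node with the Hadoop CLIs uses RPC.]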
Data Protection
Data Protection
HDP allows you to apply data protection policy at different layers across the Hadoop stack
Layer                 What?                               How?
Storage and Access    Encrypt data while it is at rest    HDFS Transparent Data Encryption, partners, HBase encryption, OS-level encryption
Transmission          Encrypt data as it moves            SSL, SASL, RPC
Points of Communication
[Diagram: points of communication between a client and the nodes of a Hadoop cluster, with labeled channels: WebHDFS, DataTransferProtocol, RPC, JDBC/ODBC and M/R Shuffle.]
Data Protection - HDFS Encryption
[Diagram: HDFS clients and the NameNode work with encrypted files inside an HDFS encryption zone. Keys come from a Key Management System (KMS) through the KeyProvider API (partner integration point), with Stateless Key Management shown as a partner option. The surrounding stack shows Data Access, Data Management, Security, Partners and YARN.]
• Leverage native HDFS Transparent Data Encryption or commercial solutions such as Protegrity
• Hortonworks is collaborating with partners to deliver enterprise-scale key management and more choices to customers
• Open source key management: KMS with Ranger
• Or partner with commercial KMS solutions, e.g. Voltage KMS
- Partner joint engineering resources
- Voltage Stateless Key Management integrated with the KeyProvider API
Only HDP offers open source and commercial choices for key management
Demo Transparent Data Encryption
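The demo itself is not captured in this transcript. As a rough sketch of what setting up an encryption zone typically involves, the snippet below assembles the standard `hadoop key` and `hdfs crypto` commands as argument lists without running them. The key name `demo_key`, the path `/secure/demo` and the `tde_commands` helper are assumptions, not values from the talk.

```python
import shlex


def tde_commands(key_name, zone_path):
    """Return the CLI steps for creating an HDFS encryption zone, as argv lists."""
    return [
        # 1. Create the zone key in the KMS.
        ["hadoop", "key", "create", key_name],
        # 2. Create the directory that will become the zone (it must be empty).
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # 3. Turn the directory into an encryption zone backed by that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
        # 4. Verify the zone exists.
        ["hdfs", "crypto", "-listZones"],
    ]


for argv in tde_commands("demo_key", "/secure/demo"):
    print(shlex.join(argv))
```

Building argument lists rather than shell strings keeps the sketch safe to adapt for `subprocess.run` on a cluster where the commands and the required privileges are available.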
Securing Spark Deployments
Spark - Authentication
Spark leverages Kerberos on YARN
[Diagram: John Doe, the KDC, the Spark AM and the NameNode (NN) in a Hadoop cluster, with seven numbered steps. Recoverable step labels, in flow order:]
• kinit
• Get Service Ticket (ST) for Spark
• Use Spark ST, submit Spark job
• Spark gets NameNode (NN) service ticket
• YARN launches Spark executors using John Doe's identity
• Executor reads from HDFS using John Doe's delegation token
Spark – Authorization
[Diagram: John Doe, the KDC, Ranger, a YARN cluster (A, B, C) and HDFS. Recoverable labels, in flow order:]
• Client gets service ticket for Spark
• Use Spark ST, submit Spark job
• Ranger: Can John launch this job?
• Get NameNode (NN) service ticket
• Executors read from HDFS
• Ranger: Can John read this file?
Spark – Channel Encryption - Example
• Control/RPC: spark.authenticate = true. Leverage YARN to distribute keys
• Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
• Shuffle Data: NM > Ex leverages YARN based SSL
• Read/Write Data: Depends on data source. For HDFS, RPC (RC4 | 3DES) or SSL for WebHDFS
• FS – Broadcast, File Download: spark.ssl.enabled = true
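The three Spark properties named on this slide can be collected into a `spark-defaults.conf` fragment. The sketch below only renders the text. The `SPARK_SECURITY_CONF` and `render_spark_defaults` names are assumptions, and enabling SSL in practice also needs keystore and truststore settings that are omitted here.

```python
# Property names and values exactly as given on the slide.
SPARK_SECURITY_CONF = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def render_spark_defaults(conf):
    """Render properties as 'key value' lines, the spark-defaults.conf layout."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(render_spark_defaults(SPARK_SECURITY_CONF))
```

The same key/value pairs could instead be passed per job with repeated `--conf key=value` options on `spark-submit`.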
Gotchas with Spark Security
• Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd hop
– Forces STS to run as Hive user to read all data
– Reduces security
– Use SparkSQL via shell or programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
• SparkSQL – Granular security unavailable
– Ranger integration will solve this problem (Refer to talk in Room 210A for Security in Spark and Hive)
– Brings Row/Column level/Masking features to SparkSQL
• Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (SPARK-6918)
• Spark Stream + Kafka + Kerberos + SSL
– Issues fixed in HDP 2.4.x
• Spark jobs > 72 Hours
– Kerberos token not renewed, fixed in Spark 1.5+
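For the long-running job gotcha, one common approach on YARN is to hand Spark a principal and keytab at submit time so it can log in again as tokens expire. The sketch below only assembles such a `spark-submit` command line. The `spark_submit_argv` helper, the principal, the keytab path and the application name are hypothetical, and whether this is the fix the speakers had in mind is an assumption.

```python
def spark_submit_argv(app, principal, keytab, master="yarn", deploy_mode="cluster"):
    """Build a spark-submit argv that passes a Kerberos principal and keytab."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--principal", principal,  # Kerberos identity the job runs as
        "--keytab", keytab,        # keytab file used to re-authenticate
        app,
    ]


argv = spark_submit_argv(
    "streaming_job.py",                    # hypothetical application
    "jdoe@EXAMPLE.COM",                    # hypothetical principal
    "/etc/security/keytabs/jdoe.keytab",   # hypothetical keytab path
)
print(" ".join(argv))
```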
Questions??

Curb your insecurity with HDP

  • 1. Curb Your Insecurity with HDP Tips for a Secure Cluster (with Spark too) Hadoop Summit – San Jose June 29th, 2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Pardeep Kumar Sr. Systems Architect, NA Prof. Services 4+ years in Hadoop Helping Fortune500 customers succeed in their Hadoop journey Setup, implement, migrate and secure some of the largest clusters in North America Security, & Migration SME, HCC Guru Loves Hadoop, Cricket and Kerberos ;) [email protected] @hadooptutor linkedin.com/in/pardeepkumarmishra Ancil McBarnett Sr. Solutions Engineer, NorthEast Helping organizations design, implement, operate and consume Hadoop and Big Data Solutions. Specialize in Security and Hive Tuning. HCC Guru. Loves Cricket, and DJ Bravo Champion :D [email protected] @mcbkingdom linkedin.com/in/mcbkingdom
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hadoop Security in 4 Steps
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How do I set policy across the entire cluster? Who am I/prove it? What can I do? What did I do? How can I encrypt at rest and over the wire? Comprehensive Approach to Security Data Protection Protect data at rest and in motion In order to protect any data system you must implement the following: Audit Maintain a record of data access Authorization Provision access to data Authentication Authenticate users and systems Administration Central management and consistent security
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDP Security: Comprehensive, Complete, Extensible Perimeter Level Security • Network Security (i.e. Firewalls) • Apache Knox (i.e. Gateways) Authentication • LDAP/ AD - Kerberos Data Protection • Encrypts data in motion and data at rest; refer partner encryption solutions for broader needs: HDFS TDE with Ranger KMS Authorization & Audit • Consistent authorization controls across all Apache components within HDP: Apache Ranger
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Authentication with Kerberos Kerberos is necessary evil, just do it!!
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security Without Kerberos
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Configure Kerberos – Ambari Wizard
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security With Kerberos
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS File Security
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive Database and Table Security
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Authorization and Audit Authorization Fine grain access control • HDFS – Folder, File • Hive – Database, Table, Column • HBase – Table, Column Family, Column • Storm, Knox and more Audit Extensive user access auditing in HDFS, Hive and HBase • IP Address • Resource type/ resource • Timestamp • Access granted or denied Control access into system Flexibility in defining policies
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rest API Security with Apache Knox
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hadoop REST APIs Useful for connecting to Hadoop from the outside the cluster When more client language flexibility is required – i.e. Java binding not an option Challenges – Client must have knowledge of cluster topology – Required to open ports (and in some cases, on every host) outside the cluster Service API WebHDFS Supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming. WebHCat Job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands. Learn more about WebHCat. Hive Hive REST API operations HBase HBase REST API operations Oozie Job submission and management, and Oozie administration.
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Authentication—API Security with Knox • Eliminates SSH “edge node” • Central API management • Central audit control • Service level authorization • SSO Integration—Siteminder and OAM • LDAP and AD integration Incubated and led by Hortonworks, Apache Knox extends the reach of Hadoop REST API without Kerberos complexities Integrated with existing systems to simplify identity maintenance Single, simple point of access for a cluster Central controls ensure consistency across one or more clusters • Kerberos Encapsulation • Single Hadoop access point • REST API hierarchy • Consolidated API calls • Multi-cluster support
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hadoop REST API with Knox Service Direct URL Knox URL WebHDFS http://namenode-host:50070/webhdfs https://knox-host:8443/webhdfs WebHCat http://webhcat-host:50111/templeton https://knox-host:8443/templeton Oozie http://ooziehost:11000/oozie https://knox-host:8443/oozie HBase http://hbasehost:60080 https://knox-host:8443/hbase Hive http://hivehost:10001/cliservice https://knox-host:8443/hive YARN http://yarn-host:yarn-port/ws https://knox-host:8443/resourcemanager Masters could be on many different hosts One hosts, one port Consistent paths SSL config at one host
  • 19. Hadoop REST API Security: Drill-Down
    [Diagram: a REST client reaches the Knox Gateway instances (GW) over HTTP through a firewall and a load balancer (LB) in the DMZ. Knox talks to the enterprise identity provider (LDAP/AD) over LDAP and to the clusters over HTTP through a second firewall. Edge node/Hadoop CLIs use RPC. Hadoop Cluster 1 and Hadoop Cluster 2 each show masters (RM, NN, WebHCat, Oozie, HS2, HBase) and slaves (DN, NM).]
  • 20. Data Protection
  • 21. Data Protection
    HDP allows you to apply data protection policy at different layers across the Hadoop stack.
    Layer | What? | How?
    Storage and Access | Encrypt data while it is at rest | HDFS Transparent Data Encryption, partners, HBase encryption, OS-level encryption
    Transmission | Encrypt data as it moves | SSL, SASL, RPC
  • 22. Points of Communication
    [Diagram: numbered communication channels (1–4) between the client, the nodes and the Hadoop cluster: WebHDFS, DataTransferProtocol, RPC, JDBC/ODBC, and M/R shuffle between nodes.]
  • 23. Data Protection - HDFS Encryption
    – Leverage native HDFS Transparent Data Encryption or commercial options such as Protegrity
    – Hortonworks is collaborating with partners to deliver enterprise-scale key management and more choices to customers
    – Open source KMS with Ranger
    – Or partner with commercial KMS solutions, i.e. Voltage KMS: partner joint engineering resources; Voltage Stateless Key Management integrated with the KeyProvider API
    Only HDP offers open source and commercial choices for key management.
    [Diagram: HDFS clients and the NameNode with an HDFS encryption zone holding encrypted files; the Key Management System (KMS) provides open source key management, and the KeyProvider API is the partner integration point for stateless key management.]
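As a rough sketch of what setting up native HDFS Transparent Data Encryption involves, the helper below assembles the usual command sequence: create a key in the KMS, create an empty directory, and turn it into an encryption zone. The key name `demo_key` and the path `/data/secure` are hypothetical, and the sketch only prints the commands rather than running them. On a real cluster the key is typically created by a key administrator and the zone by an HDFS superuser.

```python
import shlex


def encryption_zone_commands(key_name, zone_path):
    """Return the CLI commands that create a key and an HDFS encryption zone."""
    return [
        # Create the encryption zone key in the configured KMS.
        ["hadoop", "key", "create", key_name],
        # The zone directory must exist and be empty before the zone is created.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
        # List zones to verify the result.
        ["hdfs", "crypto", "-listZones"],
    ]


# Hypothetical key name and path, printed as a dry run.
for cmd in encryption_zone_commands("demo_key", "/data/secure"):
    print(" ".join(shlex.quote(part) for part in cmd))
```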
  • 24. Demo: Transparent Data Encryption
  • 25. Securing Spark Deployments
  • 26. Spark - Authentication
    Spark leverages Kerberos on YARN. Flow for user John Doe:
    – kinit against the KDC
    – Get a service ticket (ST) for Spark
    – Use the Spark ST to submit the Spark job
    – Spark gets a NameNode (NN) service ticket
    – YARN launches the Spark executors using John Doe’s identity
    – The executor reads from HDFS using John Doe’s delegation token
    [Diagram: John Doe, the KDC, and the Hadoop cluster with the Spark AM, NN and executors; steps numbered 1–7.]
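From the user's side, the flow above usually comes down to two commands: obtain a Kerberos ticket, then submit the job to YARN. The sketch below assembles them as argument lists without running anything. The principal, keytab path and application file are hypothetical; `--principal` and `--keytab` are spark-submit options for YARN, shown here as one way to let a long-running job log in again from a keytab.

```python
def kerberized_submit(principal, app, keytab=None):
    """Return (kinit_cmd, spark_submit_cmd) as argument lists."""
    kinit = ["kinit", principal]  # interactive login, prompts for a password
    submit = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if keytab is not None:
        # Non-interactive login from a keytab, and pass it on to Spark as well.
        kinit = ["kinit", "-kt", keytab, principal]
        submit += ["--principal", principal, "--keytab", keytab]
    return kinit, submit + [app]


# Hypothetical principal, keytab and application, printed as a dry run.
for cmd in kerberized_submit("johndoe@EXAMPLE.COM", "app.py",
                             keytab="/etc/security/keytabs/johndoe.keytab"):
    print(" ".join(cmd))
```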
  • 27. Spark - Authorization
    – Client gets a service ticket for Spark from the KDC
    – Use the Spark ST to submit the Spark job
    – Get a NameNode (NN) service ticket
    – Executors read from HDFS
    Ranger answers two questions: Can John launch this job? Can John read this file?
    [Diagram: John Doe, the KDC, Ranger, the YARN cluster and HDFS; steps labeled A, B, C.]
  • 28. Spark - Channel Encryption - Example
    Channel | Setting
    Shuffle data | NM > Ex leverages YARN-based SSL
    Control/RPC | spark.authenticate = true; leverage YARN to distribute keys
    Shuffle BlockTransfer | spark.authenticate.enableSaslEncryption = true
    Read/Write data | Depends on the data source; for HDFS, RPC (RC4 | 3DES) or SSL for WebHDFS
    FS (broadcast, file download) | spark.ssl.enabled = true
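The three Spark properties named on this slide can be collected in one place and rendered either as `spark-defaults.conf` lines or as `--conf` arguments for spark-submit. This is a minimal sketch that covers only the properties shown on the slide; the helper names are hypothetical, and a working SSL setup also needs keystore and truststore settings that the slide does not list.

```python
# Properties taken from the slide; values are strings, as in spark-defaults.conf.
CHANNEL_ENCRYPTION = {
    "spark.authenticate": "true",                       # control/RPC authentication
    "spark.authenticate.enableSaslEncryption": "true",  # shuffle block transfer
    "spark.ssl.enabled": "true",                        # broadcast and file download
}


def as_spark_defaults(settings):
    """Render settings as spark-defaults.conf lines (key, whitespace, value)."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


def as_submit_args(settings):
    """Render settings as repeated --conf key=value arguments for spark-submit."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(as_spark_defaults(CHANNEL_ENCRYPTION))
print(as_submit_args(CHANNEL_ENCRYPTION))
```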
  • 29. Gotchas with Spark Security
    Client -> Spark Thrift Server (STS) -> Spark Executors
    – No identity propagation on the 2nd hop
    – Forces STS to run as the Hive user to read all data, which reduces security
    – Use SparkSQL via the shell or the programmatic API instead
    – https://issues.apache.org/jira/browse/SPARK-5159
    SparkSQL
    – Granular security unavailable
    – Ranger integration will solve this problem (refer to the talk in Room 210A on security in Spark and Hive)
    – Brings row/column-level and masking features to SparkSQL
    Spark + HBase with Kerberos
    – Issue fixed in Spark 1.4 (SPARK-6918)
    Spark Streaming + Kafka + Kerberos + SSL
    – Issues fixed in HDP 2.4.x
    Spark jobs > 72 hours
    – Kerberos token not renewed; fixed in Spark 1.5+
  • 30. Questions?