Securing the Hadoop Ecosystem
Patrick Angeles
Big Data Warehouse Meetup
Feb 10, 2014
Why is Security Important?
About Me
• Hadooping for 5+ years
• Responsible for several secure Hadoop deployments
• Did e-commerce and consumer analytics (PCI, PII, etc.)
• Crypto and PKI in a previous life
Why Secure Hadoop?
• Multi-tenancy
  • You want your cluster to store data and run workloads from multiple users and groups
• Compliance
  • You have policies on which personnel can view what data
Agenda
• Hadoop Ecosystem Interactions
• Security Concepts
• Security in Practice
  • IT Infrastructure Integration
  • Deployment Recommendations
Hadoop on its Own
[Diagram: end users' WebHdfs, HDFS, and MR clients talk to the Hadoop services (NN, SNN, JT, HttpFS, and three DN/TT nodes running Map and Reduce tasks), which run as the hdfs, httpfs & mapred users. Protocols: RPC/data transfer/HTTP.]
Hadoop and Friends
[Diagram: clients run by end users and services run by service users. Components shown: HBase, Zookeeper, Oozie, WebHdfs, Pig, Hue, Crunch, Cascading, browser, MapRed, Hadoop, Flume, Sqoop, Impala, Hive, Hive Metastore. Protocols: RPCs/data/HTTP/Thrift/Avro-RPC.]
Security Concepts
• Authentication
• Authorization
• Confidentiality
  • Encryption
• Auditing
  • Traceability
Authentication
• End Users to Services, as a user
  • CLI & libraries: Kerberos (kinit or keytab)
  • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
  • MR tasks use delegation tokens
• Services to Services, as a service
  • Credentials: Kerberos (keytab)
  • Client SSL certificates (for shuffle encryption)
• Services to Services, on behalf of a user
  • Proxy-user (after Kerberos for service)
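
The proxy-user case can be sketched as a small check. This is a simplified model under stated assumptions, not Hadoop's implementation: the property names `hadoop.proxyuser.<service>.hosts` and `hadoop.proxyuser.<service>.groups` are Hadoop's, while the function `may_impersonate`, the sample values, and the group table are hypothetical.

```python
"""Simplified model of a Hadoop proxy-user (impersonation) check."""

# Hypothetical configuration: the oozie service may impersonate members of
# 'analysts' when the request arrives from the listed host.
CONF = {
    "hadoop.proxyuser.oozie.hosts": "oozie1.example.com",
    "hadoop.proxyuser.oozie.groups": "analysts",
}

# Hypothetical group membership; a real cluster resolves this through its
# configured group mapping.
GROUPS = {"patrick": {"analysts", "eng"}, "guest": {"visitors"}}


def _allowed(value):
    """Parse a comma-separated list; return None when '*' allows anything."""
    items = {item.strip() for item in value.split(",") if item.strip()}
    return None if "*" in items else items


def may_impersonate(service, end_user, remote_host, conf=CONF, groups=GROUPS):
    """Return True if `service` may act on behalf of `end_user`."""
    hosts = conf.get(f"hadoop.proxyuser.{service}.hosts")
    grps = conf.get(f"hadoop.proxyuser.{service}.groups")
    if hosts is None or grps is None:
        return False  # the service is not configured as a proxy user
    allowed_hosts = _allowed(hosts)
    allowed_groups = _allowed(grps)
    host_ok = allowed_hosts is None or remote_host in allowed_hosts
    group_ok = allowed_groups is None or bool(
        groups.get(end_user, set()) & allowed_groups
    )
    return host_ok and group_ok
```

The service still authenticates as itself with Kerberos first; the check above only decides whether it may then act for the end user.
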
Authorization
• HDFS Data
  • File System permissions (Unix-like user/group permissions)
• HBase Data
  • Read/Write Access Control Lists (ACLs) at table level
• Hive Server 2 and Impala
  • Fine-grained authorization through Apache Sentry (Incubating)
• Jobs (Hadoop, Oozie)
  • Job ACLs for Hadoop Scheduler Queues, manage & view jobs
• Zookeeper
  • ACLs at znodes, authenticated & read/write
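
The Unix-like permission model for HDFS data can be sketched as follows. This is an illustration only: `can_access` and the sample owner, group, and mode are hypothetical, and the sketch ignores the HDFS superuser, who bypasses these checks.

```python
"""Sketch of a Unix-like owner/group/other permission check."""


def can_access(mode, owner, group, user, user_groups, want):
    """Return True if `user` has the `want` permission ('r', 'w' or 'x').

    Exactly one class applies: owner first, then group, then other.
    """
    bit = {"r": 0o4, "w": 0o2, "x": 0o1}[want]
    if user == owner:
        bits = (mode >> 6) & 0o7  # owner bits
    elif group in user_groups:
        bits = (mode >> 3) & 0o7  # group bits
    else:
        bits = mode & 0o7         # other bits
    return bool(bits & bit)
```

For a file with mode 640 owned by `hdfs:analysts`, a member of `analysts` can read but not write, and everyone else gets nothing.
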
Confidentiality
• Data in transit
  • RPC: using SASL
  • HDFS data: using SASL
  • HTTP: using SSL (web UIs, shuffle). Requires SSL certs
• Data at rest
  • Nothing out of the box
  • Doable by: custom ‘compression’ codec or local file system encryption
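
For RPC, the level of SASL protection is selected with the `hadoop.rpc.protection` property. The sketch below maps its three values to the SASL quality-of-protection names; the helper functions are hypothetical, and the sketch assumes a single value rather than a list.

```python
"""Map hadoop.rpc.protection values to SASL quality of protection (QOP)."""

SASL_QOP = {
    "authentication": "auth",   # authenticate only
    "integrity": "auth-int",    # plus integrity checks on each message
    "privacy": "auth-conf",     # plus encryption of the payload
}


def sasl_qop(rpc_protection):
    """Return the SASL QOP for a hadoop.rpc.protection value."""
    key = rpc_protection.strip().lower()
    if key not in SASL_QOP:
        raise ValueError(f"unknown hadoop.rpc.protection value: {rpc_protection!r}")
    return SASL_QOP[key]


def encrypts_payload(rpc_protection):
    """Only 'privacy' gives confidentiality for data in transit."""
    return sasl_qop(rpc_protection) == "auth-conf"
```
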
Auditing
• Who accessed (read/write) FS data
  • NN audit log contains all file opens, creates
  • NN audit log contains all metadata ops, e.g. rename, listdir
• Who submitted, managed, or viewed a Job or a Query
  • JT, RM, and Job History Server logs contain history of all jobs run on a cluster
• Who submitted, managed, or viewed a workflow
  • Oozie audit logs contain history of all user requests
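
Answering "who accessed what" from the NN audit log mostly means parsing its key=value records. A minimal sketch, assuming the common tab-separated layout; the sample line is made up for illustration, and the exact fields can vary by Hadoop version.

```python
"""Parse one NameNode audit log record into a dict."""

# Illustrative line only; not taken from a real cluster.
SAMPLE = (
    "2014-02-10 09:15:02,123 INFO FSNamesystem.audit: "
    "allowed=true\tugi=patrick (auth:KERBEROS)\tip=/10.0.0.12\t"
    "cmd=open\tsrc=/data/sales/part-00000\tdst=null\tperm=null"
)


def parse_audit_line(line):
    """Return the record's fields, or None if this is not an audit line."""
    _, marker, payload = line.partition("FSNamesystem.audit:")
    if not marker:
        return None
    record = {}
    for field in payload.strip().split("\t"):
        key, eq, value = field.partition("=")
        if eq:
            record[key.strip()] = value.strip()
    # 'ugi' may carry an auth suffix such as '(auth:KERBEROS)'; keep the user.
    record["user"] = record.get("ugi", "").split(" ", 1)[0]
    return record
```
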
Auditing Gaps
• Not all projects have explicit audit logs
  • Audit-like information can be extracted by processing logs
  • Eg: Impala query logs are distributed across all nodes
• It is difficult to correlate jobs & data access
  • Eg: Map-Reduce jobs launched by Pig job
  • Eg: HDFS data accessed by a Map-Reduce job
  • Tools written on top of Hadoop can do this well
Security in Practice
Integration: Kerberos
• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision thousands of service principals
• Solution: local KDC + one-way trust
  • Run a KDC (usually MIT Kerberos) in the cluster
    • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
    • Normal user credentials can be used to access Hadoop
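
The naming behind the one-way trust can be sketched in a few lines. In Kerberos, the trust is expressed as a cross-realm `krbtgt` principal that exists in both KDCs. The helper names and the realm names in the usage note are assumptions for illustration.

```python
"""Kerberos principal naming for a local KDC with one-way trust."""


def cross_realm_tgt(service_realm, user_realm):
    """Principal that lets users from `user_realm` reach `service_realm`."""
    return f"krbtgt/{service_realm}@{user_realm}"


def split_principal(principal):
    """Split 'primary/instance@REALM' into (primary, instance, realm)."""
    if "@" not in principal:
        raise ValueError(f"no realm in principal: {principal!r}")
    name, _, realm = principal.rpartition("@")
    primary, _, instance = name.partition("/")
    return primary, instance or None, realm
```

With a corporate realm `EXAMPLE.COM` and a cluster realm `HADOOP.EXAMPLE.COM`, `cross_realm_tgt("HADOOP.EXAMPLE.COM", "EXAMPLE.COM")` gives `krbtgt/HADOOP.EXAMPLE.COM@EXAMPLE.COM`.
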
Integration: Groups
• Much of Hadoop authorization uses “groups”
  • User ‘patrick’ might belong to groups ‘analysts’, ‘eng’, etc.
• Users’ groups are not stored in Hadoop anywhere
  • Refers to external system to determine group membership
  • NN/JT/Oozie/Hive servers all must perform group mapping
• Default plugins for user/group mapping:
  • ShellBasedUnixGroupsMapping: forks/runs `/bin/id`
  • JniBasedUnixGroupsMapping: makes a system call
  • LdapGroupsMapping: talks directly to an LDAP server
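
The shell-based approach amounts to forking `id` and splitting its output. The sketch below mirrors that idea; it is not Hadoop's code. `lookup_groups` is a hypothetical name, and the `run` parameter exists only so the command can be replaced in tests.

```python
"""Resolve a user's groups by forking `id`, as a shell-based mapping would."""

import subprocess


def lookup_groups(user, run=subprocess.run):
    """Return the group names reported by `id -Gn <user>`, or [] on failure."""
    result = run(["id", "-Gn", user], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []  # unknown user or lookup failure
    # Group names are whitespace-separated; names containing spaces would
    # need different handling.
    return result.stdout.split()
```

Each server that authorizes requests performs this lookup itself, so the answer has to be consistent across the hosts running them.
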
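Integration: Kerberos + LDAP
[Diagram: a Central Active Directory holds user principals (me@EXAMPLE.COM, …) and provides LDAP group mapping to the Hadoop Cluster (NN, JT). A Local KDC in the cluster holds the service principals (hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …). The two are linked by cross-realm trust.]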
Integration: Web Interfaces
• Most web interfaces authenticate using SPNEGO
  • Standard HTTP authentication protocol
  • Used internally by services which communicate over HTTP
  • Most browsers support Kerberos SPNEGO authentication
• Hadoop components which use servlets for web interfaces can plug in custom filter
  • Integrate with intranet SSO HTTP solution
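
On the wire, SPNEGO is an HTTP header exchange: the server answers 401 with `WWW-Authenticate: Negotiate`, and the client retries with `Authorization: Negotiate <base64 token>`. A minimal sketch of the server side, with hypothetical function names. Validating the token requires a GSS-API/Kerberos library and is deliberately left out.

```python
"""Header handling for the HTTP Negotiate (SPNEGO) exchange."""

import base64


def challenge():
    """Status and headers for a request that carried no credentials."""
    return 401, {"WWW-Authenticate": "Negotiate"}


def extract_negotiate_token(authorization_header):
    """Return the raw token bytes from an Authorization header, else None."""
    if not authorization_header:
        return None
    scheme, _, payload = authorization_header.strip().partition(" ")
    payload = payload.strip()
    if scheme.lower() != "negotiate" or not payload:
        return None
    try:
        return base64.b64decode(payload, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return None
```

A custom servlet filter for intranet SSO would sit at the same point in the request path, replacing this exchange with the SSO system's own check.
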
Recommendations
• Security configuration is a PITA
  • Do only what you really need
• Enable cluster security (Kerberos) only if un-trusted groups of users are sharing the cluster
  • Otherwise use edge-security to keep outsiders out
• Only enable wire encryption if required
• Only enable web interface authentication if required
Security Enablement
• Secure Hadoop enablement order
  1. HDFS RPC (including SNN check-pointing)
  2. JobTracker RPC
  3. TaskTrackers RPC & LinuxTaskController
  4. Hadoop web UI
  5. Configure monitoring to work with security
  6. Other services (HBase, Oozie, Hive Metastore, etc)
  7. Continue with authorization and network encryption if needed
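
The first steps start with switching Hadoop from simple to Kerberos authentication in `core-site.xml`. The sketch below renders the two core properties involved. The property names are Hadoop's; `render_configuration` is a hypothetical helper, and a real rollout also needs per-service principal and keytab settings not shown here.

```python
"""Render the core-site.xml properties that turn on Kerberos security."""

import xml.etree.ElementTree as ET

SECURITY_PROPS = {
    "hadoop.security.authentication": "kerberos",  # default is 'simple'
    "hadoop.security.authorization": "true",       # service-level authorization
}


def render_configuration(props):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```
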
Administration
• Use an admin/management tool
  • Several inter-related configuration knobs
  • To manage principals/keytabs creation and distribution
  • Automatically configures monitoring for security
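
To see why a tool helps, consider how many principals and keytabs a cluster needs: one per service per host. The sketch below generates the MIT Kerberos `kadmin` commands for that. The function name, keytab directory, and file naming scheme are assumptions for illustration, not a recommended layout.

```python
"""Generate kadmin commands for per-host service principals and keytabs."""


def kadmin_commands(services, hosts, realm, keytab_dir="/etc/security/keytabs"):
    """Return kadmin commands creating one principal and keytab per pair."""
    commands = []
    for host in hosts:
        for service in services:
            principal = f"{service}/{host}@{realm}"
            # Create the principal with a random key (no password).
            commands.append(f"addprinc -randkey {principal}")
            # Export its key into a keytab file for distribution to the host.
            commands.append(
                f"xst -k {keytab_dir}/{service}.{host}.keytab {principal}"
            )
    return commands
```

Two services on fifty hosts already means a hundred principals and keytabs to create, distribute, and protect.
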
Q&A


Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera


Editor's Notes

  • #10: Proxy-user setup: the relying party is configured to recognize super-users who are allowed to impersonate other users.