Making Apache Hadoop Secure
Devaraj Das
ddas@yahoo-inc.com
Yahoo’s Hadoop Team
Apache Hadoop India Summit 2011

Who am I
  • Principal Engineer at Yahoo! Sunnyvale
      • Working on Hadoop and related projects
      • Apache Hadoop Committer/PMC member
  • Before Yahoo!, Sunnyvale – Yahoo! Bangalore
  • Before Yahoo! – HP, Bangalore

What is Hadoop?
  • HDFS – Distributed File System
      • Combines cluster’s local storage into a single namespace.
      • All data is replicated to multiple machines.
      • Provides locality information to clients
  • MapReduce
      • Batch computation framework
      • Jobs divided into tasks. Tasks re-executed on failure
      • Optimizes for data locality of input

Problem
  • Different yahoos need different data.
      • PII versus financial
      • Need assurance that only the right people can see data.
      • Need to log who looked at the data.
  • Yahoo! has more yahoos than clusters.
      • Requires isolation or trust.
  • Security improves ability to share clusters between groups

Why is Security Hard?
  • Hadoop is distributed
      • Runs on a cluster of computers.
  • Can’t determine the user on client computer.
      • OS doesn’t tell you, must be done by application
  • Client needs to authenticate to each computer
  • Client needs to protect against fake servers

Need Delegation
  • Not just client-server: the servers access other services on behalf of others.
  • MapReduce needs to have the user’s permissions
      • Even if the user logs out
  • MapReduce jobs need to:
      • Get and keep the necessary credentials
      • Renew them while the job is running
      • Destroy them when the job finishes

Solution
  • Prevent unauthorized HDFS access
      • All HDFS clients must be authenticated.
      • Including tasks running as part of MapReduce jobs
      • And jobs submitted through Oozie.
  • Users must also authenticate servers
      • Otherwise fraudulent servers could steal credentials
  • Integrate Hadoop with Kerberos
      • Provides well tested open source distributed authentication system.

Requirements
  • Security must be optional.
      • Not all clusters are shared between users.
  • Hadoop must not prompt for passwords
      • Prompting makes it easy to make trojan horse versions.
      • Must have single sign on.
  • Must handle the launch of a MapReduce job on 4,000 nodes
  • Performance / reliability must not be compromised

Security Definitions
  • Authentication – Determining the user
      • Hadoop 0.20 completely trusted the user
          • Sent user and groups over wire
      • We need it on both RPC and Web UI.
  • Authorization – What can that user do?
      • HDFS had owners and permissions since 0.16.
  • Auditing – Who did that?

Authentication
  • Changes low-level transport
  • RPC authentication using SASL
      • Kerberos (GSSAPI)
      • Token
      • Simple
  • Browser HTTP secured via plugin
  • Configurable translation from Kerberos principals to user names

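The last bullet, translating a Kerberos principal into a local user name, can be illustrated with a small sketch. This is a simplified model only: it handles the two common principal shapes (`user@REALM` and `service/host@REALM`) and accepts a single realm. The function name `short_name` and the realm `EXAMPLE.COM` are assumptions made for illustration; Hadoop's real translation is driven by configurable rules that this sketch does not reproduce.

```python
import re

# Hypothetical, simplified mapping (not Hadoop's configurable rule syntax).
# primary[/instance]@REALM
PRINCIPAL_RE = re.compile(
    r"^(?P<primary>[^/@]+)(?:/(?P<instance>[^@]+))?@(?P<realm>.+)$"
)


def short_name(principal: str, default_realm: str) -> str:
    """Map 'user@REALM' or 'service/host@REALM' to a local user name."""
    match = PRINCIPAL_RE.match(principal)
    if match is None:
        raise ValueError(f"not a Kerberos principal: {principal!r}")
    if match.group("realm") != default_realm:
        # A real deployment would consult per-realm rules here.
        raise ValueError(f"no mapping for realm {match.group('realm')!r}")
    # Drop the instance (host) and realm, keep only the primary component.
    return match.group("primary")


print(short_name("ddas@EXAMPLE.COM", "EXAMPLE.COM"))                  # ddas
print(short_name("hdfs/nn1.example.com@EXAMPLE.COM", "EXAMPLE.COM"))  # hdfs
```

Rejecting unknown realms by default is a deliberate choice in this sketch: silently mapping a foreign principal to a local name would let another realm's user act as a local one.
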
Authorization
  • HDFS
      • Command line and semantics unchanged
  • MapReduce added Access Control Lists
      • Lists of users and groups that have access.
      • mapreduce.job.acl-view-job – view job
      • mapreduce.job.acl-modify-job – kill or modify job
  • Code for determining group membership is pluggable.
      • Checked on the masters.

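A job ACL check of the kind described above might look roughly like the following. This is a sketch under stated assumptions: the ACL string layout (comma-separated users, one space, comma-separated groups, with `*` meaning everyone) and the rule that the job owner is always allowed are assumptions not spelled out on the slide, and `parse_acl` and `is_allowed` are hypothetical helper names, not Hadoop APIs.

```python
def parse_acl(acl: str) -> tuple[set[str], set[str], bool]:
    """Parse 'user1,user2 group1,group2'; a lone '*' admits everyone.

    The string layout is an assumption made for this sketch.
    """
    if acl.strip() == "*":
        return set(), set(), True
    # A leading space means "no users, only groups", so do not strip first.
    users_part, _, groups_part = acl.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.split(",") if g}
    return users, groups, False


def is_allowed(user: str, user_groups: list[str], owner: str, acl: str) -> bool:
    """Return True if `user` may perform the operation guarded by `acl`."""
    if user == owner:  # assumption: the job owner always has access
        return True
    users, groups, everyone = parse_acl(acl)
    return everyone or user in users or bool(groups & set(user_groups))


view_acl = "bob,carol analysts"  # hypothetical value for mapreduce.job.acl-view-job
print(is_allowed("dave", ["analysts"], "alice", view_acl))  # True
print(is_allowed("eve", ["eng"], "alice", view_acl))        # False
```

The group list is passed in rather than looked up, which mirrors the slide's point that group membership resolution is pluggable and separate from the check itself.
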
Auditing
  • HDFS can track access to files
  • MapReduce can track who ran each job
  • Provides fine-grained logs of who did what
  • With strong authentication, logs provide audit trails

Delegation Tokens
  • To prevent authentication flood at the start of a job, NameNode creates delegation tokens.
  • Allows user to authenticate once and pass credentials to all tasks of a job.
  • JobTracker automatically renews tokens while job is running.
      • Max lifetime of delegation tokens is 7 days.
  • Cancels tokens when job finishes.

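The issue, renew, and cancel lifecycle above can be modelled in a few lines. This is a toy model, not Hadoop's implementation: `TokenIssuer` stands in for the NameNode's bookkeeping, the one-day renewal interval is an assumption (the talk only states the 7-day maximum), and signing the identifier with HMAC-SHA256 under a server-side key is an illustrative choice.

```python
import hashlib
import hmac
import itertools
import os
from dataclasses import dataclass

DAY = 24 * 60 * 60
MAX_LIFETIME = 7 * DAY    # stated in the talk
RENEW_INTERVAL = 1 * DAY  # assumption: not stated in the talk


@dataclass
class Token:
    owner: str
    renewer: str
    issued_at: int
    expires_at: int
    password: bytes  # HMAC over the identifier; acts as the shared secret


class TokenIssuer:
    """Toy stand-in for the NameNode's delegation token bookkeeping."""

    def __init__(self) -> None:
        self._master_key = os.urandom(32)
        self._sequence = itertools.count(1)
        self._live: dict[bytes, Token] = {}

    def issue(self, owner: str, renewer: str, now: int) -> Token:
        ident = f"{owner}|{renewer}|{now}|{next(self._sequence)}".encode()
        password = hmac.new(self._master_key, ident, hashlib.sha256).digest()
        token = Token(owner, renewer, now, now + RENEW_INTERVAL, password)
        self._live[password] = token
        return token

    def verify(self, token: Token, now: int) -> bool:
        live = self._live.get(token.password)
        return live is not None and now <= live.expires_at

    def renew(self, token: Token, caller: str, now: int) -> int:
        live = self._live.get(token.password)
        if live is None or now > live.expires_at:
            raise PermissionError("token cancelled or expired")
        if caller != live.renewer:
            raise PermissionError("only the designated renewer may renew")
        # Extend by one interval, but never past the maximum lifetime.
        live.expires_at = min(now + RENEW_INTERVAL, live.issued_at + MAX_LIFETIME)
        return live.expires_at

    def cancel(self, token: Token) -> None:
        self._live.pop(token.password, None)


issuer = TokenIssuer()
token = issuer.issue("ddas", "jobtracker", now=0)  # user authenticates once
issuer.renew(token, "jobtracker", now=DAY)         # renewed while the job runs
issuer.cancel(token)                               # cancelled when the job finishes
print(issuer.verify(token, now=DAY))               # False
```

Tasks present the token instead of contacting Kerberos, which is how a job with thousands of tasks avoids the authentication flood mentioned above.
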
Primary Communication Paths (diagram)

Kerberos and Single Sign-on
  • Kerberos allows user to sign in once
      • Obtains Ticket Granting Ticket (TGT)
  • kinit – get a new Kerberos ticket
  • klist – list your Kerberos tickets
  • kdestroy – destroy your Kerberos ticket
  • TGTs last for 10 hours, renewable for 7 days by default
  • Once you have a TGT, Hadoop commands just work
      • hadoop fs -ls /
      • hadoop jar wordcount.jar in-dir out-dir


Task Isolation
  • Tasks now run as the user.
      • Via a small setuid program
      • Can’t signal other users’ tasks or the TaskTracker
      • Can’t read other tasks’ jobconf, files, outputs, or logs
  • Distributed cache
      • Public files shared between jobs and users
      • Private files shared between jobs

Web UIs
  • Hadoop relies on the Web UIs.
      • These need to be authenticated also…
  • Web UI authentication is pluggable.
      • Yahoo uses an internal package
      • We have written a very simple static auth plug-in
      • SPNEGO plugin being developed
  • All servlets enforce permissions.

Proxy-Users
  • Oozie (and other trusted services) run as headless user on behalf of other users
  • Configure HDFS and MapReduce with the oozie user as a proxy:
      • Group of users that the proxy can impersonate
      • Which hosts they can impersonate from

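The two proxy settings (which groups may be impersonated, and from which hosts) combine into a single check, sketched below. The names `ProxyRule` and `may_impersonate`, the group `analysts`, and the host `oozie1.example.com` are hypothetical. In Hadoop these rules are, to my understanding, set through properties of the form `hadoop.proxyuser.oozie.groups` and `hadoop.proxyuser.oozie.hosts`; the slide does not name them, so treat that as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    groups: frozenset[str]  # groups whose members may be impersonated
    hosts: frozenset[str]   # hosts the proxy may connect from


def may_impersonate(
    rules: dict[str, ProxyRule],
    proxy_user: str,
    target_groups: list[str],
    remote_host: str,
) -> bool:
    """Allow impersonation only if both the host and a group match."""
    rule = rules.get(proxy_user)
    if rule is None:
        return False  # not a configured proxy user at all
    return remote_host in rule.hosts and bool(rule.groups & set(target_groups))


rules = {
    "oozie": ProxyRule(
        groups=frozenset({"analysts"}),
        hosts=frozenset({"oozie1.example.com"}),
    )
}
print(may_impersonate(rules, "oozie", ["analysts"], "oozie1.example.com"))  # True
print(may_impersonate(rules, "oozie", ["analysts"], "evil.example.com"))    # False
```

Requiring both conditions means a stolen oozie credential is only useful from the designated hosts, and only against users in the listed groups.
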
Questions?
  • Questions should be sent to:
      • common/hdfs/[email protected]
  • Security holes should be sent to:
      • [email protected]
  • Available from (in production at Yahoo!)
      • Hadoop Common
      • http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security/
  • Thanks!