How to Protect Big Data in a
Containerized Environment
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
Outline
 Securing a Big Data Environment
 Data Protection
 Transparent Data Encryption
 Transparent Data Encryption in a Containerized Environment
 Takeaways
In the Beginning …
 Hadoop was used to process public web data
- No compelling need for security
• No user or service authentication
• No data security
Then Hadoop Became Popular
Security is important.
Layers of Security in Hadoop
 Access
 Authentication
 Authorization
 Data Protection
 Auditing
 Policy (protect from human error)
Hadoop Security: Data Protection
Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html
Focus on Data Security
 Confidentiality
- Confidentiality is lost when data is accessed by someone not
authorized to do so
 Integrity
- Integrity is lost when data is modified in unexpected ways
 Availability
- Availability is lost when data is erased or becomes inaccessible
Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
Hadoop Distributed File System (HDFS)
 Data Security Features
- Access Control
- Data Encryption
- Data Replication
Access Control
 Simple
- Identity determined by host operating system
 Kerberos
- Identity determined by Kerberos credentials
- One realm for both compute and storage
- Required for HDFS Transparent Data Encryption
Data Encryption
 Transforming cleartext into ciphertext that only a key holder can reverse
Data Replication
 3-way replication
- Can survive the loss of any 2 of the 3 replicas
 Erasure Coding
- Can survive more than 2 failures, depending on the parity configuration
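The trade-off between the two schemes is simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 6 data + 3 parity layout is an example Reed-Solomon configuration picked for the comparison, not something the slides specify.

```python
from fractions import Fraction


def replication(copies: int) -> dict:
    """N-way replication: every block is stored `copies` times."""
    return {
        "tolerated_failures": copies - 1,            # one surviving copy is enough
        "storage_overhead": Fraction(copies, 1),     # raw bytes per logical byte
    }


def erasure_coding(data_units: int, parity_units: int) -> dict:
    """Reed-Solomon style striping: any `parity_units` units may be lost."""
    return {
        "tolerated_failures": parity_units,
        "storage_overhead": Fraction(data_units + parity_units, data_units),
    }


three_way = replication(3)        # assumed example: classic 3 replicas
rs_6_3 = erasure_coding(6, 3)     # assumed example: 6 data + 3 parity units

print("3-way replication:", three_way)
print("RS(6,3) erasure coding:", rs_6_3)
```

Under these assumptions erasure coding tolerates one more failure than 3-way replication while storing half as many raw bytes per logical byte.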
HDFS with End-to-End Encryption
 Confidentiality
- Data Access
 Integrity
- Data Access + Data Encryption
 Availability
- Data Access + Data Replication
Data Encryption
 How to transform the data?
10101110001001000101110
00101000111010101010101
00011101010101110
Cleartext
XXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXX
XXX
Ciphertext
Data Encryption – At Rest
 Data is encrypted while on persistent media (disk)
Data Encryption – In Transit
 Data is encrypted while traveling over the network
The Whole Process
Ciphertext
HDFS Transparent Data Encryption (TDE)
 End-to-end encryption
- Data is encrypted/decrypted at the client
• Data is protected at rest and in transit
 Transparent
- No application-level code changes required
HDFS TDE – Design
 Goals:
- Only an authorized client/user can access cleartext
- HDFS never stores cleartext or unencrypted data encryption keys
HDFS TDE – Terminology
 Encryption Zone
- A directory whose file contents will be encrypted upon write and
decrypted upon read
- An EZKEY is generated for each zone
HDFS TDE – Terminology
 EZKEY – encryption zone key
 DEK – data encryption key
 EDEK – encrypted data encryption key
HDFS TDE - Data Encryption
 The same key is used to encrypt and decrypt data
 The size of the ciphertext is exactly the same as the size of the original
cleartext
- EZKEY + DEK => EDEK
- EDEK + EZKEY => DEK
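The two key relationships and the length-preserving property can be shown with a toy symmetric stream cipher. This is a sketch under loud assumptions: `keystream_xor` is a hypothetical stand-in built from SHA-256 and is not secure. HDFS TDE itself uses AES in CTR mode, which shares the two properties shown here (one key for both directions, ciphertext the same size as cleartext).

```python
import hashlib
import secrets


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256 counter keystream.

    Illustration only, NOT secure. Applying it twice with the same
    key and nonce returns the original bytes.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


ezkey = secrets.token_bytes(16)   # encryption zone key (held by the KMS)
dek = secrets.token_bytes(16)     # data encryption key (one per file)
wrap_nonce = secrets.token_bytes(16)

edek = keystream_xor(ezkey, wrap_nonce, dek)        # EZKEY + DEK  => EDEK
recovered = keystream_xor(ezkey, wrap_nonce, edek)  # EDEK + EZKEY => DEK

cleartext = b"10101110001001000101110"
file_nonce = secrets.token_bytes(16)
ciphertext = keystream_xor(dek, file_nonce, cleartext)

print("DEK recovered from EDEK:", recovered == dek)
print("sizes (cleartext, ciphertext):", len(cleartext), len(ciphertext))
```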
HDFS TDE - Services
 HDFS NameNode (NN)
 Kerberos Key Distribution Center (KDC)
 Hadoop Key Management Server (KMS)
- Key Trustee Server
HDFS TDE – Security Concepts
 Division of Labor
- KMS creates the EZKEY & DEK
- KMS encrypts/decrypts the DEK/EDEK using the EZKEY
- HDFS NN communicates with the KMS to create EZKEYs &
EDEKs to store in the extended attributes in the encryption zone
- HDFS client communicates with the KMS to get the DEK using
the EZKEY and EDEK.
HDFS TDE – Security Concepts
 The name of the EZKEY is stored in the HDFS extended
attributes of the directory associated with the encryption zone
 The EDEK is stored in the HDFS extended attributes of the file in
the encryption zone
$ hadoop key …
$ hdfs crypto …
HDFS Examples
 Simplified for the sake of clarity:
- Kerberos actions not shown
- NameNode EDEK cache not shown
HDFS – Create Encryption Zone
/encrypted_dir
xattr: EZKEYNAME EZKEYNAME = KEY
3. Create EZKEY
HDFS – Create Encrypted File
1. Create file
2. Create EDEK (NameNode asks the KMS)
3. Create EDEK (KMS generates it)
/encrypted_dir/file
xattr: EDEK
4. Store EDEK
5. Return Success
/encrypted_dir/file
encrypted data
HDFS TDE – File Write Work Flow
/encrypted_dir/file
xattr: EDEK
3. Request DEK from EDEK & EZKEYNAME
4. Decrypt DEK from EDEK
5. Return DEK
read unencrypted data
write encrypted data
/encrypted_dir/file
HDFS TDE – File Read Work Flow
/encrypted_dir/file
xattr: EDEK
3. Request DEK from EDEK & EZKEYNAME
4. Decrypt DEK from EDEK
5. Return DEK
read encrypted data
write unencrypted data
/encrypted_dir/file
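The zone-creation, file-creation, write, and read flows above can be tied together in a small in-memory simulation. Everything here is hypothetical scaffolding: `ToyKMS`, `ToyHDFS`, `ToyClient`, and `toy_cipher` are invented names, the cipher is an insecure stand-in, and `ToyHDFS` lumps NameNode metadata and block storage into one object. As in the slides, Kerberos and the NameNode EDEK cache are left out.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Insecure XOR stream cipher used only to make the flow runnable."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Creates EZKEYs and DEKs; EZKEY material never leaves this object."""

    def __init__(self):
        self._ezkeys = {}

    def create_ezkey(self, name):
        self._ezkeys[name] = secrets.token_bytes(16)

    def generate_edek(self, ezkey_name):
        dek, nonce = secrets.token_bytes(16), secrets.token_bytes(16)
        return nonce, toy_cipher(self._ezkeys[ezkey_name], nonce, dek)

    def decrypt_edek(self, ezkey_name, nonce, edek):
        return toy_cipher(self._ezkeys[ezkey_name], nonce, edek)


class ToyHDFS:
    """Stores only key names, EDEKs, and ciphertext; never a DEK or cleartext."""

    def __init__(self, kms):
        self.kms = kms
        self.xattrs = {}   # path -> extended attributes
        self.blocks = {}   # path -> stored (encrypted) bytes

    def create_zone(self, directory, ezkey_name):
        self.xattrs[directory] = {"EZKEYNAME": ezkey_name}

    def create_file(self, directory, name):
        ezkey_name = self.xattrs[directory]["EZKEYNAME"]
        nonce, edek = self.kms.generate_edek(ezkey_name)
        path = f"{directory}/{name}"
        self.xattrs[path] = {"EZKEYNAME": ezkey_name, "NONCE": nonce,
                             "EDEK": edek, "IV": secrets.token_bytes(16)}
        return path


class ToyClient:
    """Encrypts and decrypts at the client, so storage only sees ciphertext."""

    def __init__(self, hdfs, kms):
        self.hdfs, self.kms = hdfs, kms

    def _dek(self, path):
        meta = self.hdfs.xattrs[path]
        return self.kms.decrypt_edek(meta["EZKEYNAME"], meta["NONCE"], meta["EDEK"])

    def write(self, path, cleartext):
        iv = self.hdfs.xattrs[path]["IV"]
        self.hdfs.blocks[path] = toy_cipher(self._dek(path), iv, cleartext)

    def read(self, path):
        iv = self.hdfs.xattrs[path]["IV"]
        return toy_cipher(self._dek(path), iv, self.hdfs.blocks[path])


kms = ToyKMS()
hdfs = ToyHDFS(kms)
kms.create_ezkey("KEY")
hdfs.create_zone("/encrypted_dir", "KEY")
path = hdfs.create_file("/encrypted_dir", "file")

client = ToyClient(hdfs, kms)
client.write(path, b"unencrypted data")
print(client.read(path))
```

The point of the split is visible in the object boundaries: the storage side holds the EDEK but cannot unwrap it, and the KMS holds the EZKEY but never sees file data.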
Bring in the Containers (e.g. Docker)
 Issues with containers are the same as for any virtualization platform
- Multiple compute clusters
- Multiple HDFS file systems
- Multiple Kerberos realms
- Cross-realm trust configuration
Containers as Virtual Machines
 Note – this is not about using containers to run Big Data tasks:
Containers as Virtual Machines
 This is about running Hadoop / Big Data clusters in containers:
cluster
Containers as Virtual Machines
 A true containerized Big Data environment:
KDC Cross-Realm Trust
 Different KDC realms for corporate, data, and compute
 Must interact correctly in order for the Big Data cluster to function
CORP.ENTERPRISE.COM
End Users
COMPUTE.ENTERPRISE.COM
Hadoop/Spark Service Principals
DATALAKE.ENTERPRISE.COM
HDFS Service Principals
KDC Cross-Realm Trust
 Different KDC realms for corporate, data, and compute
- One-way trust
• Compute realm trusts the corporate realm
• Data realm trusts the corporate realm
• Data realm trusts the compute realm
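The three one-way trusts can be written down as a small directed relation and queried. This is a sketch, not a Kerberos client: the helper names are hypothetical, only direct (non-transitive) trust is modeled, and `hdfs@DATALAKE.ENTERPRISE.COM` is an invented principal used for the negative case. The `krbtgt/<trusting realm>@<trusted realm>` form is the usual naming of the cross-realm principal that both KDCs must share.

```python
CORP = "CORP.ENTERPRISE.COM"
COMPUTE = "COMPUTE.ENTERPRISE.COM"
DATALAKE = "DATALAKE.ENTERPRISE.COM"

# Each pair reads: (trusting realm, trusted realm).
ONE_WAY_TRUSTS = {
    (COMPUTE, CORP),      # compute realm trusts the corporate realm
    (DATALAKE, CORP),     # data realm trusts the corporate realm
    (DATALAKE, COMPUTE),  # data realm trusts the compute realm
}


def realm_of(principal: str) -> str:
    """Return the realm part of a principal such as user@REALM."""
    return principal.rsplit("@", 1)[1]


def can_authenticate(principal: str, service_realm: str) -> bool:
    """True if services in `service_realm` accept tickets for `principal`."""
    user_realm = realm_of(principal)
    return user_realm == service_realm or (service_realm, user_realm) in ONE_WAY_TRUSTS


def cross_realm_principal(trusting: str, trusted: str) -> str:
    """Name of the shared principal that implements one one-way trust."""
    return f"krbtgt/{trusting}@{trusted}"


print(can_authenticate("user@CORP.ENTERPRISE.COM", DATALAKE))
print(can_authenticate("rm@COMPUTE.ENTERPRISE.COM", DATALAKE))
print(can_authenticate("hdfs@DATALAKE.ENTERPRISE.COM", CORP))
for trusting, trusted in sorted(ONE_WAY_TRUSTS):
    print(cross_realm_principal(trusting, trusted))
```

Because every trust is one-way, end users and compute services can reach the data lake, but data lake principals gain no access to corporate or compute services.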
KDC Cross-Realm Trust
CORP.ENTERPRISE.COM Realm
KDC: CORP.ENTERPRISE.COM
user@CORP.ENTERPRISE.COM
COMPUTE.ENTERPRISE.COM Realm
KDC: COMPUTE.ENTERPRISE.COM
Hadoop Cluster
rm@COMPUTE.ENTERPRISE.COM
DATALAKE.ENTERPRISE.COM Realm
KDC: DATALAKE.ENTERPRISE.COM
HDFS: hdfs://remotedata/
Hadoop Key Management Service
Key Management Service
 Must be enterprise quality
- Key Trustee Server
• Java KeyStore KMS
• Cloudera Navigator Key Trustee Server
Containers as Virtual Machines
 A true containerized Big Data environment:
[Figure: three DataLake instances, each labeled with the same three realms]
CORP.ENTERPRISE.COM
End Users
COMPUTE.ENTERPRISE.COM
Hadoop/Spark Service Principals
DATALAKE.ENTERPRISE.COM
HDFS Service Principals
Key Takeaways
 Hadoop has many security layers
- HDFS Transparent Data Encryption (TDE) is best of breed
- Security is hard (complex)
- Virtualization / containerization can make it harder
- Separating compute and storage under virtualization / containerization can make it harder still
Key Takeaways
 Be careful with a build vs. buy decision for containerized Big Data
- Recommendation: buy one already built
- There are turnkey solutions (e.g. BlueData EPIC)
Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know
www.bluedata.com
BlueData Booth #1508
in Strata Expo Hall
@tapbluedata
