PROTECT YOUR PRIVATE DATA WITH ORC COLUMN ENCRYPTION
Owen O’Malley
owen@cloudera.com
September 2019
@owen_omalley
© 2019 Cloudera, Inc. All rights reserved. 2
WHO AM I?
• First committer added to Hadoop
  • MapReduce
  • Scaling
  • Security
• Hive
  • ACID transactions
• ORC
  • Creator
SECURITY AND DATA PROTECTION IN HADOOP
EXAMPLE DATA LAKE SCENARIO
[Figure: personal data from many sources (marketing, demographics, electronic medical records, CRM, POS, social, weblogs & feeds, transactional, mobile, IoT) lands in structured and unstructured stores across on-premise data lakes (Cluster 1: Dublin, Cluster 2: San Francisco, Cluster 3: Prague) and cloud data lakes.]
DIFFERENCES IN THE BIG DATA CONTEXT
• Breaking down silos: fantastic for analytics, but it creates security challenges
  • A centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained authorization
• Data democratization and the Data Scientist role (often a data super user with elevated privileges)
• Data is maintained over a long duration
• Cloud and hybrid architectures spanning the data center and (multiple) public clouds further broaden the attack surface and present novel authentication and authorization challenges
• Along with adherence to security fundamentals and defense in depth, a data-centric approach to security becomes critical
Securing your data lake
[Figure: castle analogy. Defenses shown: watch towers, limited entry points, moat, high hard walls, identity checks, inner walls. Tools shown: Kerberos; firewall; LDAP/AD; encryption (TLS, Key Trustee, Navigator Encrypt, Ranger KMS); Apache Knox (AuthN, API gateway, proxy, SSO); Apache Ranger (ABAC AuthZ, audits, anonymization); Apache Metron (detection).]
DATA PROTECTION IN HADOOP
• Must be applied at three different levels
  • Storage – encrypt data at rest
    • HDFS encryption zones
    • Volume encryption
  • Transmission – encrypt data in motion
    • SSL
  • Upon access – apply restrictions when accessed
    • Ranger dynamic masking and row filtering
Dynamic Row Filtering & Column Masking With Apache Ranger & Apache Hive
Source table ww_customers:
Country | National ID | CC No            | DOB       | MRN        | Name          | Policy ID
US      | 232323233   | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe      | nj23j424
US      | 333287465   | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe      | cadsd984
Germany | T22000129   | 4532786256545550 | 3/5/1963  | 876452830A | Ernie Schwarz | KK-2345909
Ranger policy enforcement: the query is rewritten based on dynamic Ranger policies, filtering rows by region and applying the relevant column masking.
User 1: Joe (Location: US, Group: Analyst)
Original query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Users from the US Analyst group see data for US persons only, with CC and National ID (SSN) as masked values and MRN nullified.
Country | National ID | CC No               | MRN  | Name
US      | xxxxx3233   | 4539 xxxx xxxx xxxx | null | John Doe
US      | xxxxx7465   | 5391 xxxx xxxx xxxx | null | Jane Doe
User 2: Ivanna (Location: EU, Group: HR)
Original query: SELECT country, nationalid, name, mrn FROM ww_customers
EU HR policy admins can see unmasked values but are restricted by row filtering policies to data for EU persons only.
Country | National ID | Name          | MRN
Germany | T22000129   | Ernie Schwarz | 876452830A
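The effect of the two policies on this slide can be imitated in a few lines of Python. This is only a sketch of the observable behavior, not how Ranger or Hive implement it: the group names `us_analyst` and `eu_hr`, the helper functions, and the one-country EU set are assumptions made for illustration, and the rows are the sample rows from the slide.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Sample rows copied from the slide's ww_customers table (DOB and policy ID omitted).
WW_CUSTOMERS: List[Row] = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

# Assumption: only the one EU country that appears in the example.
EU_COUNTRIES = {"Germany"}


def mask_show_last4(value: str) -> str:
    """Replace everything except the last four characters with 'x'."""
    return "x" * (len(value) - 4) + value[-4:]


def mask_show_first4(value: str) -> str:
    """Keep the first four digits of a 16-digit card number, mask the rest."""
    return value[:4] + " xxxx xxxx xxxx"


def query(group: str, columns: List[str]) -> List[Row]:
    """Apply a row filter and column masks for a (hypothetical) group name."""
    result: List[Row] = []
    for row in WW_CUSTOMERS:
        if group == "us_analyst":
            if row["country"] != "US":
                continue  # row filter: US persons only
            visible = dict(row)
            visible["nationalid"] = mask_show_last4(row["nationalid"])
            visible["ccnumber"] = mask_show_first4(row["ccnumber"])
            visible["mrn"] = None  # nullified
        elif group == "eu_hr":
            if row["country"] not in EU_COUNTRIES:
                continue  # row filter: EU persons only
            visible = dict(row)  # no masking for this group
        else:
            continue  # no matching policy: return nothing
        result.append({c: visible[c] for c in columns})
    return result


joe = query("us_analyst", ["country", "nationalid", "ccnumber", "mrn", "name"])
ivanna = query("eu_hr", ["country", "nationalid", "name", "mrn"])
```

Both users query the same table; only the policy attached to their group changes which rows and values come back.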
WHAT ARE WE FIXING?
FRAMING THE PROBLEM…
• Related data, different security requirements
• Authorization – who can see it
• Audit – track who read it
• Encrypt on disk – regulatory
• File-level (or blob) granularity isn’t enough
• File systems don’t understand columns
REQUIREMENTS
• Readers should transparently decrypt data
  • If and only if the user has access to the key
  • The data must be decrypted locally
  • Old readers must not break!
• Columns are only decrypted as necessary
• Master keys must be managed securely
  • Support for Key Management Server & hardware
  • Support for key rolling
EARLIER WORKAROUNDS
PARTIAL SOLUTION – HDFS ENCRYPTION ZONES
• Transparent HDFS Encryption
• Encryption zones
• HDFS directory trees
• Unique master key for each zone
• Client decrypts data
• Key Management via KeyProvider API
HDFS ENCRYPTION ZONE LIMITATIONS
• Very coarse protection
• Only entire directory subtrees
• No ability to protect columns
• A lot of users need access to keys
• Moving data between zones is painful
• When writing with Hive, data is moved multiple times per query
PARTIAL SOLUTION – HIVE SERVER 2
• Limit access to warehouse data to Hive
• Only “hive” user has HDFS access
• Breaks Hadoop’s multi-paradigm data access
• Many customers use both Hive & Spark
• JDBC isn’t a distributed protocol
• Funneling large data through a small pipe
• Spark Data Warehouse Connector to LLAP fixes this
PARTIAL SOLUTION – SEPARATE TABLES
• Split private information out of tables
• Separate directories in HDFS
• HDFS and/or HS2 authorization
• Enables HDFS encryption
• Limitations
• Need to join with other tables
• Higher operational overhead
PARTIAL SOLUTION – ENCRYPTION UDF
• Hive has user defined functions
• aes_encrypt and aes_decrypt
• Limitations
• Key management is problematic
• Encryption is not seeded
• Size of value leaks information
THE WINNER IS …
COLUMNAR ENCRYPTION
• Columnar file formats, such as ORC and Parquet
• Write data in columns
• Column projection
• Better compression
• Encryption works really well
• Only encrypt bytes for column
• Can store multiple variants of data
ORC FILE FORMAT
USER EXPERIENCE
• Set table properties for encryption
  • orc.encrypt = "pii:ssn,email;credit:card_info"
  • orc.mask = "sha256:card_info"
• Define where to get the encryption keys
  • Configuration defines the key provider via URI
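To make the property syntax concrete, here is a small Python sketch that reads the two example values the way the slide suggests: `orc.encrypt` as a semicolon-separated list of `key:columns` and `orc.mask` as `mask:columns`. The parser and its function names are hypothetical and cover only the forms shown on this slide; the real ORC writer does its own parsing and may accept more options.

```python
from typing import Dict, List


def parse_encrypt(spec: str) -> Dict[str, List[str]]:
    """Parse 'key:col,col;key:col' into {master key name: [columns]}."""
    result: Dict[str, List[str]] = {}
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue  # tolerate a trailing ';'
        key, _, columns = clause.partition(":")
        names = [c.strip() for c in columns.split(",") if c.strip()]
        result.setdefault(key.strip(), []).extend(names)
    return result


def parse_mask(spec: str) -> Dict[str, str]:
    """Parse 'mask:col,col;mask:col' into {column: mask name}."""
    result: Dict[str, str] = {}
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        mask, _, columns = clause.partition(":")
        for column in columns.split(","):
            if column.strip():
                result[column.strip()] = mask.strip()
    return result


encrypt = parse_encrypt("pii:ssn,email;credit:card_info")
mask = parse_mask("sha256:card_info")
```

Read this way, the example encrypts `ssn` and `email` with the `pii` master key and `card_info` with the `credit` key, while readers without the `credit` key see a SHA256 mask of `card_info`.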
KEY MANAGEMENT
• Create a master key for each use case
• “pii”, “pci”, or “hipaa”
• Each column in each file uses unique local key
• Allows audit of which users read which files
• Ranger policies limit access to keys
• Who, What, When, Where
KEY PROVIDER API
• Provides limited access to encryption keys
• Encrypts or decrypts local keys
• Users are never given master keys
• Key versions and key rolling of master keys
• Allows 3rd party plugins
• Supports Cloud, Hadoop or Ranger KMS
ENCRYPTION DATA FLOW
ENCRYPTION FLOW
• Local key
• Random for each encrypted column in file
• Encrypted w/ master key by KMS
• Encrypted local key is stored in file metadata
• IV is generated to be unique
• Column, kind, stripe, & counter
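One way to picture a unique IV built from column, kind, stripe, and counter is to pack the four fields into a single 16-byte AES block. The Python sketch below does that and also generates a random local key. The field widths (3 + 2 + 3 + 8 bytes) and the function names are assumptions chosen for illustration; the slide names the fields but not their layout, and wrapping the local key with the master key is left to the KMS.

```python
import os

IV_LENGTH = 16  # one AES block


def new_local_key(bits: int = 128) -> bytes:
    """Random local key for one encrypted column in one file."""
    return os.urandom(bits // 8)


def make_iv(column: int, kind: int, stripe: int, counter: int = 0) -> bytes:
    """Pack the four fields into a 16-byte IV.

    Assumed layout: 3 bytes column id, 2 bytes stream kind,
    3 bytes stripe id, 8 bytes block counter. int.to_bytes raises
    OverflowError if a field does not fit its width.
    """
    iv = (
        column.to_bytes(3, "big")
        + kind.to_bytes(2, "big")
        + stripe.to_bytes(3, "big")
        + counter.to_bytes(8, "big")
    )
    assert len(iv) == IV_LENGTH
    return iv


local_key = new_local_key()
iv_a = make_iv(column=1, kind=2, stripe=3)
iv_b = make_iv(column=1, kind=2, stripe=4)
```

Because every stream in a file differs in at least one of column, kind, or stripe, two streams encrypted with the same local key never start from the same IV.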
STATIC DATA MASKING
• What happens without key access?
• Define static masks
• Nullify – all values become null
• Redact – mask values ‘Xxxxx Xxxxx!’
• Can define ranges to unmask
• SHA256 – replace with SHA256
• Custom - user defined
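The three built-in masks are simple enough to sketch in Python. This is an illustration of the idea rather than ORC's implementation: the replacement characters (X for uppercase, x for lowercase, 9 for digits, everything else unchanged) are assumptions inferred from the 'Xxxxx Xxxxx!' example, and the `unmask` range parameter is a hypothetical stand-in for "ranges to unmask".

```python
import hashlib
from typing import Optional


def nullify(value: Optional[str]) -> None:
    """Nullify: every value becomes null."""
    return None


def redact(value: str, unmask: Optional[range] = None) -> str:
    """Redact: replace letters and digits, keeping positions in `unmask`."""
    out = []
    for index, ch in enumerate(value):
        if unmask is not None and index in unmask:
            out.append(ch)  # position explicitly left readable
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # spaces and punctuation pass through
    return "".join(out)


def sha256_mask(value: str) -> str:
    """SHA256: replace the value with the hex digest of its UTF-8 bytes."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


redacted = redact("Hello World!")
card = redact("4539067047629850", unmask=range(12, 16))
```

A hash mask keeps equal values equal, so joins and group-bys still work on the masked column, which is also why it deserves the scrutiny called for under DATA ANONYMIZATION.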
DATA ANONYMIZATION
• Anonymization is hard!
• AOL search logs
• Netflix prize datasets
• NYC taxi dataset
• Always evaluate security tradeoffs
• Tokenization is a useful technique
• Assigns arbitrary replacements
KEY DISPOSAL
• Often need to keep data for 90 days
• Currently the data is written twice
• With column encryption:
• Roll keys daily
• Delete master key after 90 days
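The retention rule can be sketched as a small Python helper that decides which daily key versions are due for deletion. It assumes one key version per calendar day, named by its ISO date; that naming scheme and the function names are hypothetical, and a real deployment would call the KMS to roll and delete keys.

```python
from datetime import date, timedelta
from typing import List

RETENTION_DAYS = 90


def key_version_for(day: date) -> str:
    """Hypothetical naming scheme: one key version per day, e.g. '2019-09-01'."""
    return day.isoformat()


def expired_versions(versions: List[str], today: date) -> List[str]:
    """Return the key versions older than the retention window.

    A version created exactly RETENTION_DAYS ago is still kept.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [v for v in versions if date.fromisoformat(v) < cutoff]


start = date(2019, 1, 1)
versions = [key_version_for(start + timedelta(days=i)) for i in range(100)]
to_delete = expired_versions(versions, today=start + timedelta(days=99))
```

Once a day's master key version is deleted, the local keys wrapped with it can no longer be decrypted, so that day's encrypted columns become unreadable without rewriting any files.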
ORC ENCRYPTION DESIGN
• Write both variants of streams
• Masked unencrypted
• Unmasked encrypted
• Encrypt both data and statistics
• Maintain compatibility for old readers
• Read unencrypted variant
• Preserve ability to seek in file
ORC WRITE PIPELINE
• Streams go through pipeline
• Run length encoding
• Compression (zlib, snappy, lzo, lz4, zstd, or none)
• Encryption
• Encryption is AES/CTR
• Allows seek
• No padding
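The following Python sketch walks a byte stream through the same three stages and shows why counter mode allows seeking without padding. It is a toy under stated assumptions: the run-length encoder is a simple (count, value) scheme rather than ORC's encodings, and the keystream block is derived with SHA-256 as a stand-in for AES so the example needs only the standard library. It must not be used as real encryption; all function names are hypothetical.

```python
import hashlib
import zlib

BLOCK = 16  # AES block size in bytes


def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoding: (run length, byte value) pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Stand-in for AES(key, nonce || counter). NOT real AES."""
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()[:BLOCK]


def ctr_xor(key: bytes, nonce: bytes, data: bytes, offset: int = 0) -> bytes:
    """XOR data with the keystream; `offset` is where data starts in the stream.

    Encryption and decryption are the same operation in counter mode.
    """
    out = bytearray()
    counter, skip = divmod(offset, BLOCK)  # seek: jump straight to the right block
    pos = 0
    while pos < len(data):
        block = keystream_block(key, nonce, counter)[skip:]
        chunk = data[pos:pos + len(block)]
        out.extend(a ^ b for a, b in zip(chunk, block))
        pos += len(chunk)
        counter += 1
        skip = 0
    return bytes(out)


def write_stream(raw: bytes, key: bytes, nonce: bytes) -> bytes:
    """Run length encoding, then compression, then encryption."""
    return ctr_xor(key, nonce, zlib.compress(rle_encode(raw)))


def read_stream(stored: bytes, key: bytes, nonce: bytes) -> bytes:
    """Reverse the pipeline: decrypt, decompress, decode."""
    return rle_decode(zlib.decompress(ctr_xor(key, nonce, stored)))
```

Decrypting from byte offset 37 only needs block counter 2 and a 5-byte skip into that block, and the ciphertext is exactly as long as the compressed input, which is what "allows seek" and "no padding" refer to.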
CONCLUSIONS
CONCLUSIONS
• ORC column encryption provides
  • Transparent encryption
  • Multi-paradigm column security
  • Compatibility with old readers
  • Static masking
  • Audit logging (via KMS logging)
• Supports file merging
• Released in ORC 1.6
INTEGRATION WITH OTHER TOOLS
• Hive & Spark
  • No change other than defining table properties
• Apache Hive’s LLAP
  • Cache and fast processing of SQL queries
  • Column encryption changes internal interfaces
  • Cache both encrypted and unencrypted variants
  • Ensure audit log reflects end-user and what they accessed
LIMITATIONS
• Need encryption policy for write
  • Current Atlas & Ranger tags lag data
  • Auto-discovery requires pre-access
• Changes to masking policy
  • Need to re-write files
• Need additional data masks
  • Credit card, addresses, etc.
• Decrypted local keys could be saved
THANK YOU
Owen O’Malley
owen@cloudera.com
@owen_omalley
