SECURING DATA IN HYBRID ENVIRONMENTS
USING APACHE RANGER
Madhan Neethiraj, Hortonworks
Apache Ranger PMC, Apache Atlas PMC
Don Bosco Durai, Privacera
Apache Ranger PMC
Disclaimer
‣ This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be developed.
‣ Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from
inception to release through Apache; however, technical feasibility, market demand, user feedback
and the overarching Apache Software Foundation community development process can all affect
timing and final delivery.
‣ This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in any
generally available product.
‣ Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
‣ Since this document contains an outline of general product development plans, customers should
not rely upon it when making purchasing decisions.
Agenda
• Hybrid Environment Use Cases
• Security Requirements for Cloud
• Implementation Challenges
• Hybrid Environment Security Implementation Flow
• Demo
• Roadmap
• Q & A
HYBRID ENVIRONMENT
- Predictable workload
- ETL and data wrangling
- BI and DW use cases
- Multiple Hadoop-enabled services and tools
- On-demand processing units
- Analytical and other services from the cloud provider
- Share data with 3rd-party vendors
- Micro-Services
On Premise
DATA
PREFERRED SECURITY REQUIREMENTS
FOR CLOUD
1. Access permissions should be consistent in all
environments
2. Protect personal and sensitive data by
anonymizing or tokenizing
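A minimal sketch of the second requirement, assuming a keyed hash is an acceptable tokenization scheme. The deck does not say which scheme the demo tooling uses, so the key, the function name `tokenize`, and the token length are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load it from a key management system.
SECRET_KEY = b"demo-only-key"


def tokenize(value: str, length: int = 10) -> str:
    """Return a deterministic, non-reversible token for a sensitive value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability; shorter tokens raise the collision risk.
    return digest[:length].upper()


if __name__ == "__main__":
    print(tokenize("John Doe"))
```

Because the token is deterministic, the same name maps to the same token in every environment, so joins on the tokenized column still work after export.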
IMPLEMENTATION CHALLENGES
● The permission models of Hadoop and the cloud are not the same
● Users/groups in the two systems might not be the same
● Data constantly moves within the cluster, and permissions
might change along with it
Hybrid Environment: overview
[Diagram components: Ranger, Atlas, Discovery (Privacera, Hortonworks DSS), Hive, S3, ACL Sync; users Sally and Mark]
Steps shown in the diagram:
1. Ingest
2. Scan
3. Classify
4. ETL
5. Metadata, Lineage
6. Scan
7. Export
Raw data:
Name        Street        City    State
John Doe    345 First St  Eureka  CA
Jane Smith  876 Main St   Newark  NJ
Anonymized data (shown before and after step 7, Export):
Name        Street        City    State
BXPHDE      YNiIkjoiTH    Eureka  CA
HNEQON      WNDUHNd       Newark  NJ
DISCOVERY AND CLASSIFICATION
● Centrally store classifications in Apache Atlas
● Classify resources via
○ Apache Atlas UI
○ Apache Atlas API
● Auto classify using discovery tools like Privacera,
Hortonworks Data Steward Studio
● Different types of tags
○ Security (SSN, CC, EMAIL, NAME, ADDRESS, etc.)
○ Business (SALES, HR, FINANCE, MARKETING, etc.)
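To illustrate what automatic classification might involve, here is a small sketch that assigns security tags to a column from sampled values. It is not how Privacera or Data Steward Studio work internally; the patterns, `TAG_PATTERNS`, and `classify_column` are hypothetical and cover only two of the tags listed above.

```python
import re

# Illustrative patterns only; real discovery tools use far richer detection.
TAG_PATTERNS = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def classify_column(samples: list[str]) -> set[str]:
    """Return every tag whose pattern matches all sampled values."""
    if not samples:
        return set()
    return {
        tag
        for tag, pattern in TAG_PATTERNS.items()
        if all(pattern.match(value) for value in samples)
    }


if __name__ == "__main__":
    print(classify_column(["jd@y.com", "js@m.com"]))
```

The resulting tag names would then be attached to the column entity in Apache Atlas, through the UI or the API, so that tag-based policies can act on them.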
RANGER TAG BASED ACCESS POLICY
RANGER TAG BASED DYNAMIC MASKING
DATA UPLOAD TO S3
● Use Hive Export feature
● Dynamic Anonymization based on ETL User (e.g. s3_etl)
● Classification Based Access Authorization Policies to
block copying highly restricted data
● Row level filter
● Ranger policy to restrict access to S3 from Hive
ANONYMIZED DATA IN S3
ACL SYNC TO S3
● S3 permission model
○ Bucket Policies (ACL Sync sets bucket policies)
○ User Policies
○ Object ACLs
ACL SYNC TO S3
[Diagram components: Ranger, S3 ACL Sync, S3]
1. Kafka Notification
   entity_type    aws_s3_pseudo_dir
   qualifiedName  s3a://dws2018-demo/sales/sales_may_2018
   tags           SALES
2. Retrieve tag ACLs
3. Set ACLs in S3
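The three steps above can be sketched as follows, assuming the sync component turns the ACLs retrieved for a tag into a standard S3 bucket policy document. The notification fields come from the slide; `build_bucket_policy`, the shape of `sales_acls`, and the IAM principal ARN are hypothetical, and nothing here calls AWS or Ranger.

```python
import json
from urllib.parse import urlparse


def build_bucket_policy(qualified_name: str, tag_acls: dict[str, list[str]]) -> dict:
    """Translate per-principal S3 actions into a bucket policy for one prefix.

    tag_acls maps an IAM principal ARN to the S3 actions it is allowed.
    """
    parsed = urlparse(qualified_name)  # e.g. s3a://bucket/prefix
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    statements = [
        {
            "Effect": "Allow",
            "Principal": {"AWS": principal},
            "Action": sorted(actions),
            "Resource": resource,
        }
        for principal, actions in sorted(tag_acls.items())
    ]
    return {"Version": "2012-10-17", "Statement": statements}


# Step 1: the notification fields shown on the slide.
notification = {
    "entity_type": "aws_s3_pseudo_dir",
    "qualifiedName": "s3a://dws2018-demo/sales/sales_may_2018",
    "tags": ["SALES"],
}
# Step 2 (hypothetical result): ACLs retrieved for the SALES tag.
sales_acls = {"arn:aws:iam::123456789012:user/sally": ["s3:GetObject"]}

# Step 3: the document that would be applied to the bucket.
policy = build_bucket_policy(notification["qualifiedName"], sales_acls)

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

Building the policy as plain data keeps the translation step testable on its own, separate from the call that writes it to S3.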
POLICIES ON S3
DEMO - SALES DATA
User              Access  View
Sally (Sales)     Y       Clear/Raw
Mark (Marketing)  Y       Anonymized
Henry (HR)        X       X
Item Id  Amount  Email     Name        Street        City    State
435      439.34  jd@y.com  John Doe    345 First St  Eureka  CA
894      592.02  js@m.com  Jane Smith  876 Main St   Newark  NJ
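The access table above can be read as a tag-based policy with masking. The sketch below is a simplified stand-in, not Ranger's policy engine; `SALES_POLICY`, `SENSITIVE_COLUMNS`, and `view_row` are hypothetical, and the fixed mask string replaces whatever anonymization the demo applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    allowed: bool
    masked: bool = False


# Hypothetical policy for resources tagged SALES, keyed by user group.
SALES_POLICY = {
    "sales": Decision(allowed=True, masked=False),     # Sally: clear/raw
    "marketing": Decision(allowed=True, masked=True),  # Mark: anonymized
}

# Columns assumed to carry security tags such as EMAIL, NAME, ADDRESS.
SENSITIVE_COLUMNS = {"Email", "Name", "Street"}


def view_row(group: str, row: dict) -> Optional[dict]:
    """Return the row as the group may see it, or None when access is denied."""
    decision = SALES_POLICY.get(group, Decision(allowed=False))
    if not decision.allowed:
        return None
    if not decision.masked:
        return dict(row)
    return {
        column: "****" if column in SENSITIVE_COLUMNS else value
        for column, value in row.items()
    }


row = {
    "Item Id": 435, "Amount": 439.34, "Email": "jd@y.com",
    "Name": "John Doe", "Street": "345 First St", "City": "Eureka", "State": "CA",
}

if __name__ == "__main__":
    for group in ("sales", "marketing", "hr"):
        print(group, view_row(group, row))
```

Groups without an entry, such as HR here, fall through to a deny decision, which matches the X/X row for Henry.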
DEMO FLOW
● Copy Sales Data into HDFS
● Create Hive Table from Sales Data
  1. Atlas will create table metadata and lineage
  2. Privacera will scan the table and send tags to Atlas
● Export Hive table to S3
  1. Privacera will anonymize data
  2. Atlas will create lineage to S3
  3. Ranger-S3 Policy Sync will set permissions on S3 for the resource
TOOLS USED IN DEMO
Metadata, Lineage, Classification   Apache Atlas
Auto Discovery                      Privacera
Tag Based Policies                  Apache Ranger
Data Transfer                       Apache Hive
Anonymization/Tokenization          Privacera
Ranger to S3 Policy Sync            Privacera
OTHER DATA MOVEMENT TOOLS
● Kafka Connect - Transformer Plugin
● Apache NiFi - Processor Plugin
● Apache Spark SQL - UDF
● Java API integrated with Ranger & Atlas
PENDING TASKS/OPEN ISSUES
● Support for mapping Ranger permissions for S3 resources
● Support advanced Ranger policies, such as wildcards
● Monitor permission changes on the cloud side and take
actions (e.g. disallow them)
● Mapping of Ranger group-level permissions to S3 ACLs
● AWS S3 bucket policy has a maximum size of 20 KB
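The size limit in the last item matters because every synced resource adds statements to the same bucket policy. A rough pre-check might look like this, assuming the policy is held as a Python dict; the helper names are hypothetical, and the compact JSON length is only an approximation of how AWS measures the document.

```python
import json

# AWS documents a 20 KB size limit for S3 bucket policies.
MAX_BUCKET_POLICY_BYTES = 20 * 1024


def policy_size_bytes(policy: dict) -> int:
    """Size of the policy when serialized as compact JSON."""
    return len(json.dumps(policy, separators=(",", ":")).encode("utf-8"))


def fits_in_bucket_policy(policy: dict) -> bool:
    """True when the serialized policy is within the bucket policy limit."""
    return policy_size_bytes(policy) <= MAX_BUCKET_POLICY_BYTES


if __name__ == "__main__":
    empty = {"Version": "2012-10-17", "Statement": []}
    print(policy_size_bytes(empty), fits_in_bucket_policy(empty))
```

A sync tool could run a check like this before applying a policy and fall back to another mechanism, or report an error, when the limit would be exceeded.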
APACHE JIRAS
https://issues.apache.org/jira/browse/RANGER-1974 - Ranger Authorizer and
Audits for AWS S3
https://issues.apache.org/jira/browse/ATLAS-2708 - AWS S3 data lake typedefs
for Atlas (BARBARA ECKMAN)
https://issues.apache.org/jira/browse/ATLAS-2760 - Atlas Hive hook updates to
create lineage between Hive table and S3 entities
QUESTIONS & ANSWERS

Editor's Notes

  • #4: We have a lot to cover, want to apologize in advance
  • #8: We have a lot to cover, want to apologize in advance