1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamic Row Filtering and Column Masking
with Apache Ranger
Srikanth Venkat
Senior Director, Product Management
Disclaimer
• This document may contain product features and technology directions that are under development, may be
under development in the future or may ultimately not be developed.
• Project capabilities are based on information that is publicly available within the Apache Software
Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception
to release through Apache; however, technical feasibility, market demand, user feedback and the
overarching Apache Software Foundation community development process can all affect timing and final
delivery.
• This document’s description of these features and technology directions does not represent a contractual
commitment, promise or obligation from Hortonworks to deliver these features in any generally available
product.
• Product features and technology directions are subject to change, and must not be included in contracts,
purchase orders, or sales agreements of any kind.
• Since this document contains an outline of general product development plans, customers should not rely
upon it when making purchasing decisions.
Agenda
Background
Dynamic Column Masking and Row Filtering
Spark SQL Security via Hive LLAP/Ranger
Demo
Security Challenges of Today’s Data Platforms
• Central repository of critical and sensitive data
– Grey Data
• Data maintained over long duration
– Forever
• External ecosystem is in flux
– The Zoo
• Users can access and analyze data in new and different ways
– Democratization
Apache Ranger
Authorization
• Centralized platform to define, administer and manage security policies consistently
across Hadoop components
– HDFS, Hive, HBase, YARN, Kafka, Solr, Storm, Knox, NiFi, Atlas
• Extensible Architecture
– Custom policy conditions, user context enrichers
– Easy to add new component types for authorization
Auditing
• Central audit location for all access requests
• Support multiple destination sources (HDFS, Solr, etc.)
• Real-time visual query interface
Ranger KMS
• Store and manage encryption keys
• Support HDFS Transparent Data Encryption
• Integration with HSM
– Safenet LUNA
Ranger Architecture
[Architecture diagram. Ranger components: Ranger Administration Portal, Ranger Policy Server, Ranger Audit Server, Integration API. Hadoop components, each with a Ranger Plugin: HDFS, HBase, HiveServer2, Knox, NiFi, Solr, Kafka, YARN, Storm, Atlas. Other labels: Enterprise Users; Legacy Tools and Data Governance; HDFS; Solr.]
Apache Ranger - Intuitive and Granular Policy Management
⬢ Simple Intuitive UI for Policy Editing and Setup
⬢ Fine-grained specificity by resource type, user context, tags, and operation
⬢ Supports Access, Tag Based, Dynamic Data Masking, and Row Filtering Policy Types
Apache Ranger Audits - Data Access
⬢ Comprehensive scalable audit logging
⬢ Audits for:
– Resource Access Events with user context
– Policy Edits/Creation/Deletion
– User session information
– Component plugin policy sync operations
Row Filtering in Hive
Control Access to Rows in Hive Tables based on Context!
Goal: Improve reliability and robustness of HDP by providing Row
Level Security to Hive tables and reducing surface area of security
system
⬢ Capabilities
– Restrict data row access based on
– user characteristics (e.g. group membership) AND
– runtime context
⬢ Access restriction logic at Hive layer => No changes to apps!
– Hive applies the access restrictions every time that data access is
attempted
– Seamless behind the scenes enforcement of row level segmentation
without having to add this logic to the predicate of the query
– No need for multiple views to filter rows for different groups and
users!
⬢ Core Technologies: Ranger, Hive
Row Filtering in Hive
Control Access to Rows in Hive Tables based on Context!
⬢ Use Cases: Cross-industry application for data protection:
Healthcare
• A hospital can create a security policy that allows doctors
to view data rows only for their own patients
• Insurance claims administrators can view only specific
rows for their specific site.
Financial Services
• A bank can create a policy to restrict access to rows of
financial data based on the employee’s business division,
locale, or role
– Employees in the finance department are allowed to
see customer invoices, payments, and accrual data
– European HR employees can see European
employee data
Information Technology
• A multi-tenant application can create logical separation of
each tenant’s data so that each tenant can see only their
own data rows.
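The healthcare and HR cases both reduce to a predicate evaluated against the requesting user's context. A minimal Python sketch of that idea follows; it is not Ranger or Hive code. The `UserContext` type, the `ROW_FILTERS` list, the group names, and the column names (`doctor`, `region`) are all hypothetical, and returning no rows when no filter matches is an assumption of this sketch only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Dict[str, str]


@dataclass(frozen=True)
class UserContext:
    """Hypothetical stand-in for the user attributes a policy can inspect."""
    name: str
    groups: FrozenSet[str]


# One predicate per group; the first group the user belongs to wins.
ROW_FILTERS: List[Tuple[str, Callable[[UserContext, Row], bool]]] = [
    # Doctors see only rows for their own patients (depends on user identity).
    ("doctors", lambda user, row: row["doctor"] == user.name),
    # European HR sees only European rows (depends on group membership only).
    ("eu_hr", lambda user, row: row["region"] == "EU"),
]


def visible_rows(user: UserContext, rows: List[Row]) -> List[Row]:
    """Return only the rows admitted by the user's first matching filter."""
    for group, predicate in ROW_FILTERS:
        if group in user.groups:
            return [row for row in rows if predicate(user, row)]
    # Assumption of this sketch: no matching filter means no rows.
    return []


patients = [
    {"patient": "p1", "doctor": "dr_lee", "region": "US"},
    {"patient": "p2", "doctor": "dr_kim", "region": "EU"},
    {"patient": "p3", "doctor": "dr_lee", "region": "EU"},
]

lee = UserContext("dr_lee", frozenset({"doctors"}))
print([row["patient"] for row in visible_rows(lee, patients)])
```

The caller never states who they are in the query itself; the filter is chosen from the context, which is why applications need no changes.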
Dynamic Data Masking of Hive Columns
Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
Goal: Mask or anonymize sensitive columns of data
(e.g. PII, PCI, PHI) from Hive query output
⬢ Benefits
– Does not physically alter the data, or make a copy of it
– Original sensitive data does not leave the data
store; it is obfuscated when presented to the user
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate
versions of datasets
– Masking policies are simple and easy to set up
⬢ Core Technologies: Ranger, Hive
Dynamic Masking and Row Level Filtering
Original data:
Country  National ID  CC No             Name           DOB        MRN         Policy ID
US       232323233    4539067047629850  John Doe       9/12/1969  8233054331  nj23j424
US       333287465    5391304868205600  Jane Doe       8/13/1979  3736885376  cadsd984
Germany  T22000129    4532786256545550  Ernie Schwarz  3/5/1963   876452830A  KK-2345909

Ranger Policy Enforcement

Users from US customer support group see row filtered data for US persons with CC and National ID (SSN) as masked values and MRN is nullified:
Country  National ID  CC No                MRN   Name
US       xxxxx3233    4539 xxxx xxxx xxxx  null  John Doe
US       xxxxx7465    5391 xxxx xxxx xxxx  null  Jane Doe

EU Health Policy Admins view relevant columns of data unmasked but are restricted by row filtering policies to see data for EU persons only:
Country  National ID  Name           MRN
Germany  T22000129    Ernie Schwarz  876452830A
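The two result sets on this slide can be reproduced with a small Python sketch that applies a row filter, a column list, and per-column masks for each group. This only illustrates the combined effect and is not how Ranger or Hive implement it. The `GroupPolicy` type, the group keys, the column names, and the one-entry `EU_COUNTRIES` set are hypothetical, and the masked credit card value is not grouped with spaces as it is on the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
MaskFn = Callable[[str], Optional[str]]

CUSTOMERS: List[Row] = [
    {"country": "US", "national_id": "232323233", "cc_no": "4539067047629850",
     "name": "John Doe", "mrn": "8233054331"},
    {"country": "US", "national_id": "333287465", "cc_no": "5391304868205600",
     "name": "Jane Doe", "mrn": "3736885376"},
    {"country": "Germany", "national_id": "T22000129", "cc_no": "4532786256545550",
     "name": "Ernie Schwarz", "mrn": "876452830A"},
]

EU_COUNTRIES = {"Germany"}  # abbreviated, hypothetical lookup set


def show_last_4(value: str) -> str:
    """Replace all but the last four characters with 'x'."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first_4(value: str) -> str:
    """Keep the first four characters and replace the rest with 'x'."""
    return value[:4] + "x" * (len(value) - 4)


def nullify(_value: str) -> None:
    """Hide the value entirely."""
    return None


@dataclass
class GroupPolicy:
    row_filter: Callable[[Row], bool]
    columns: List[str]
    masks: Dict[str, MaskFn] = field(default_factory=dict)


POLICIES: Dict[str, GroupPolicy] = {
    "us_support": GroupPolicy(
        row_filter=lambda row: row["country"] == "US",
        columns=["country", "national_id", "cc_no", "mrn", "name"],
        masks={"national_id": show_last_4, "cc_no": show_first_4, "mrn": nullify},
    ),
    "eu_health_admin": GroupPolicy(
        row_filter=lambda row: row["country"] in EU_COUNTRIES,
        columns=["country", "national_id", "name", "mrn"],
    ),
}


def query_as(group: str) -> List[Dict[str, Optional[str]]]:
    """Filter rows first, then project and mask columns; stored rows are untouched."""
    policy = POLICIES[group]
    result = []
    for row in CUSTOMERS:
        if not policy.row_filter(row):
            continue
        out: Dict[str, Optional[str]] = {}
        for col in policy.columns:
            mask = policy.masks.get(col)
            out[col] = mask(row[col]) if mask else row[col]
        result.append(out)
    return result


for group in POLICIES:
    print(group, query_as(group))
```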
SparkSQL Security via Hive LLAP
Spark SQL Security: Row Filtering and Column Masking
• Spark SQL + Hive enables users to explore very large data sets using SQL
• Enterprises want to enable Spark SQL for ad-hoc analysis using BI tools with
fine-grained security
• Spark provides strong authentication via Kerberos and wire encryption via
SSL, but as a general-purpose compute engine it has no built-in authorization subsystem
• Spark also does not have any way to define a pluggable module that contains
policies for fine-grained authorization
– With structured data (columns and rows) in Hive, fine-grained security becomes a challenge
• Co-mingled data in the same table may belong to two different groups, each
with their own regulatory requirements.
• Data may have regional restrictions, time-based availability restrictions,
departmental restrictions, etc.
Hive 2 with LLAP: Open Interfaces
Key Features: Spark Column Security with LLAP
• Fine-Grained Column Level Access Control for SparkSQL.
• Fully dynamic policies per user. Doesn’t require views.
• Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
[Diagram: Spark Client; HiveServer2 (Authorization); Hive Metastore (Data Locations, View Definitions); LLAP (Data Read, Filter Pushdown); Ranger Server (Dynamic Policies). Arrows are numbered 1 to 4.]
Example: Per-User Row Filtering by Region in SparkSQL
Original Query:
SELECT * from CUSTOMERS
WHERE total_spend > 10000

LLAP Data Access
User ID  Region  Total Spend
1        East    5,131
2        East    27,828
3        West    55,493
4        West    7,193
5        East    18,193

Query Rewrites based on Dynamic Ranger Policies
Spark User 1 (West Region), Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'West'
Spark User 2 (East Region), Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'East'
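The rewrite on this slide can be tried end to end with Python's built-in `sqlite3` module standing in for Hive/LLAP. This is a toy: the `ROW_FILTERS` mapping and user names are hypothetical, and appending `AND (...)` to the query string only works because the original query ends in a simple WHERE clause. A real engine would rewrite the query plan, not the text.

```python
import sqlite3

# Hypothetical per-user filter expressions, standing in for Ranger row-filter policies.
ROW_FILTERS = {
    "spark_user_1": "region = 'West'",
    "spark_user_2": "region = 'East'",
}

ORIGINAL_QUERY = "SELECT user_id FROM customers WHERE total_spend > 10000"


def rewrite(query: str, user: str) -> str:
    """Append the user's row filter to a query that already has a WHERE clause."""
    return f"{query} AND ({ROW_FILTERS[user]})"


def run_as(conn: sqlite3.Connection, user: str) -> list:
    """Execute the rewritten query and return the visible user IDs in order."""
    sql = rewrite(ORIGINAL_QUERY, user) + " ORDER BY user_id"
    return [row[0] for row in conn.execute(sql).fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "East", 5131), (2, "East", 27828), (3, "West", 55493),
     (4, "West", 7193), (5, "East", 18193)],
)

print(run_as(conn, "spark_user_1"))  # West region user
print(run_as(conn, "spark_user_2"))  # East region user
```

Both users submit the same query text; only the appended predicate differs, so the West user gets customer 3 and the East user gets customers 2 and 5.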
Agenda
Demo
Demo Setup
• Hortonia – mid-size financial services company expanding from US to
international markets
• Employees in EU and US
• Multiple business units need access to customer data: Analysts, HR
• Customer data is co-mingled as well as isolated
• Needs to have rational security policies to provide the right level of access
control to customer data across geographies, business functions, and to
comply with external regulations (PII, HIPAA, EU Privacy etc.)
Demo Data
• Customer data in hortoniabank DB
  • 2 Customer Tables: 50K customer records each with 38 fields (PII, PHI, PCI & non-sensitive data)
    – us_customers: USA person data only
    – ww_customers: multi-language, multi-country, localized person data across the world
  • 1 Reference table: eu_countries (reference table for looking up EU
    country codes to country mappings, with Brexit etc.)
all user passwords: hadoop
Ranger Policies Setup for Demo
• Only US employees can see data in us_customers table and only from locations within the US
(access_us_customers)
• Only US employees can see data rows of US persons in ww_customers table (filter_ww_customers_table
+ access_ww_customers)
• Only EU employees can see rows with EU person data in ww_customers table (filter_ww_customers_table
+ access_ww_customers)
• US HR team members can see all original unmasked data (PCI, PII, …)
• Analysts can view masked versions of sensitive data from WW customers table but are prohibited from
viewing PII data in US tables (All masking policies under Masking Tab of Resource based policies)
• No combination of zip code, MRN, and blood group data is permitted to be joined in any query
(prohibition policy)
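As a rough idea of what the filter_ww_customers_table policy might look like as data, the Python sketch below builds a row-filter policy document and prints it as JSON. The field names (`policyType`, `rowFilterPolicyItems`, `rowFilterInfo`, `filterExpr`) follow the Apache Ranger policy model as commonly documented, but treat them as an assumption and check them against your Ranger version. The service name, the `country` column, and both filter expressions are hypothetical; the deck does not show them.

```python
import json

# Assumption: policyType 2 denotes a row-filter policy in the Ranger policy model.
# The service name and filter expressions below are hypothetical placeholders.
select_access = [{"type": "select", "isAllowed": True}]

policy = {
    "service": "hortonia_hive",           # hypothetical Ranger service name
    "name": "filter_ww_customers_table",  # policy name taken from the slide
    "policyType": 2,
    "resources": {
        "database": {"values": ["hortoniabank"]},
        "table": {"values": ["ww_customers"]},
    },
    "rowFilterPolicyItems": [
        {
            # US employees see only rows for US persons.
            "groups": ["us_employees"],
            "accesses": select_access,
            "rowFilterInfo": {"filterExpr": "country = 'US'"},
        },
        {
            # EU employees see only rows for EU persons (abbreviated country list).
            "groups": ["eu_employees"],
            "accesses": select_access,
            "rowFilterInfo": {"filterExpr": "country IN ('Germany', 'France')"},
        },
    ],
}

print(json.dumps(policy, indent=2))
```

One policy carries one filter per group, which is what lets a single table serve both the US and EU personas without separate views.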
Personas Setup for Demo
User         Group                  Access Privileges
joe-analyst  us_employees, analyst  US Data Only, non-sensitive data only, rest masked or forbidden depending on sensitivity
kate-hr      us_employees, hr       US Data Only, All sensitive data (PCI, PII, PHI)
ivana-eu-hr  eu_employees, hr       EU Data Only, All sensitive data
Data Masking Policies setup for us_customers data for analyst group
Data Column    Description         Masking Type  Sample Output                                                Ranger Masking Policy
password       Password            Hash          237672b21819462ff39fcea7d990c3e5                             mask_password_hash
nationalid     National ID         Show Last 4   xx-xx-9324                                                   mask_nationalid_last4
ccnumber       Credit Card Number  Show First 4  4532xxxxxxxxxxxx                                             mask_ccnumber_first4
streetaddress  Street Address      Redact        nnn Xxxxxx Xxxxx                                             mask_streetaddress_redact
MRN            MRN                 Nullify       null                                                         mask_mrn_nullify
age            Age                 CUSTOM        (Adds a random number below 20 to actual age)                mask_age_custom
birthday       Date of Birth       CUSTOM        01-01-1987 (Keep year of birth and make date & month 01-01)  mask_dob_custom
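To make the less obvious mask types concrete, here is a hedged Python sketch of the Hash, Redact, Nullify, and two CUSTOM rows. These functions only imitate the sample outputs; they are not the Hive or Ranger masking functions. The function names, the choice of SHA-256 (the slide's sample hash is 32 hex characters, so the real algorithm may differ), and the `MM-DD-YYYY` input format for birthdays are assumptions.

```python
import hashlib
import random
from datetime import datetime
from typing import Optional


def mask_hash(value: str) -> str:
    """Hash: replace the value with a hex digest (SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_redact(value: str) -> str:
    """Redact: digits become 'n', uppercase 'X', lowercase 'x'; other characters stay."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def mask_nullify(_value: str) -> Optional[str]:
    """Nullify: hide the value entirely."""
    return None


def mask_age(age: int, rng: Optional[random.Random] = None) -> int:
    """CUSTOM: add a random number below 20 (0 to 19) to the actual age."""
    rng = rng or random.Random()
    return age + rng.randrange(20)


def mask_birthday(value: str) -> str:
    """CUSTOM: keep the year of birth, set month and day to 01-01.

    Assumes the input is formatted as MM-DD-YYYY.
    """
    parsed = datetime.strptime(value, "%m-%d-%Y")
    return parsed.replace(month=1, day=1).strftime("%m-%d-%Y")


print(mask_redact("123 Maple Drive"))
print(mask_birthday("09-12-1987"))
print(mask_nullify("8233054331"))
```

The custom masks keep the data useful for coarse analysis (approximate age, birth year) while removing the exact value.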
Backup

Dynamic Column Masking and Row-Level Filtering in HDP

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Row Filtering and Column Masking with Apache Ranger Srikanth Venkat Senior Director, Product Management
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Disclaimer  This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.  Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache, however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all effect timing and final delivery.  This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.  Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.  Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Background Dynamic Column Masking and Row Filtering Spark SQL Security via Hive LLAP/Ranger Demo
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security Challenges of Today’s Data Platforms  Central repository of critical and sensitive data – Grey Data  Data maintained over long duration – Forever  External ecosystem is in flux – The Zoo  Users can access and analyze data in new and different ways – Democratization
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger • Central audit location for all access requests • Support multiple destination sources (HDFS, Solr, etc.) • Real-time visual query interface AuditingAuthorization • Store and manage encryption keys • Support HDFS Transparent Data Encryption • Integration with HSM • Safenet LUNA Ranger KMS • Centralized platform to define, administer and manage security policies consistently across Hadoop components • HDFS, Hive, HBase, YARN, Kafka, Solr, Storm, Knox, NiFi, Atlas • Extensible Architecture • Custom policy conditions, user context enrichers • Easy to add new component types for authorization
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ranger Architecture HDFS Ranger Administration Portal HBase Hive Server2 Ranger Audit Server Ranger Plugin HadoopComponentsEnterprise Users Ranger Plugin Ranger Plugin Legacy Tools and Data Governance HDFS Knox NifI Ranger Plugin Ranger Plugin SolrRanger Plugin Ranger Policy Server Integration API KafkaRanger Plugin YARNRanger Plugin Ranger PluginStorm Ranger Plugin Atlas Solr
  • 7. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ⬢ Simple Intuitive UI for Policy Editing and Setup ⬢ Fine-grained specificity by resource type, user context, tags, and operation ⬢ Supports Access, Tag Based, Dynamic Data Masking, and Row Filtering Policy Types Apache Ranger - Intuitive and Granular Policy Management
  • 8. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Ranger Audits - Data Access ⬢ Comprehensive scalable audit logging ⬢ Audits for: ⬢ Resource Access Events with user context ⬢ Policy Edits/Creation/Deletion ⬢ User session information ⬢ Component plugin policy sync operations
  • 9. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Row Filtering in Hive R A N G E R Control Access to Rows in Hive Tables based on Context! Goal: Improve reliability and robustness of HDP by providing Row Level Security to Hive tables and reducing surface area of security system ⬢ Capabilities – Restrict data row access based on – user characteristics (e.g. group membership) AND – runtime context ⬢ Access restriction logic at Hive layer => No changes to apps! – Hive applies the access restrictions every time that data access is attempted – Seamless behind the scenes enforcement of row level segmentation without having to add this logic to the predicate of the query – No need for multiple views to filter rows for different groups and users! ⬢ Core Technologies: Ranger, Hive AT L A S H I V E
  • 10. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Row Filtering in Hive R A N G E R Control Access to Rows in Hive Tables based on Context! ⬢ Use Cases: Cross-industry application for data protection: AT L A S H I V E Healthcare • A hospital can create a security policy that allows doctors to view data rows only for their own patients • Insurance claims administrators can view only specific rows for their specific site. Financial Services • A bank can create a policy to restrict access to rows of financial data based on the employee’s business division, locale, or based on the employee’s role • Employees in the finance department are allowed to see customer invoices, payments, and accrual data • European HR employees can see European employee data). Information Technology A multi-tenant application can create logical separation of each tenant’s data so that each tenant can see only their own data rows.
  • 11. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Data Masking of Hive Columns Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation! Goal: Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output ⬢ Benefits – Does not physically alter the data, or make a copy of it – Original sensitive data does not leave the data store; it is obfuscated when presented to the user – No changes are required at the application or Hive layer – No need to produce additional protected duplicate versions of datasets – Masking policies are simple and easy to set up ⬢ Core Technologies: Ranger, Hive
  • 12. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Masking and Row Level Filtering
    Source table:
    Country | National ID | CC No | Name | DOB | MRN | Policy ID
    US | 232323233 | 4539067047629850 | John Doe | 9/12/1969 | 8233054331 | nj23j424
    US | 333287465 | 5391304868205600 | Jane Doe | 8/13/1979 | 3736885376 | cadsd984
    Germany | T22000129 | 4532786256545550 | Ernie Schwarz | 3/5/1963 | 876452830A | KK-2345909
    Ranger Policy Enforcement
    Users from US customer support group see row filtered data for US persons with CC and National ID (SSN) as masked values and MRN is nullified:
    Country | National ID | CC No | MRN | Name
    US | xxxxx3233 | 4539 xxxx xxxx xxxx | null | John Doe
    US | xxxxx7465 | 5391 xxxx xxxx xxxx | null | Jane Doe
    EU Health Policy Admins view relevant columns of data unmasked but are restricted by row filtering policies to see data for EU persons only:
    Country | National ID | Name | MRN
    Germany | T22000129 | Ernie Schwarz | 876452830A
  • 13. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SparkSQL Security via Hive LLAP
  • 14. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL Security: Row Filtering and Column Masking ⬢ Spark SQL + Hive enables users to explore very large data sets using SQL ⬢ Enterprises want to enable Spark SQL for ad-hoc analysis using BI tools with fine-grained security ⬢ Spark provides strong authentication via Kerberos and wire encryption via SSL but, as a general-purpose compute engine, has no built-in authorization subsystem ⬢ Spark also does not have any way to define a pluggable module that contains policies for fine-grained authorization – With structured data organized in columns and rows in Hive, fine-grained security becomes a challenge ⬢ Co-mingled data in the same table may belong to two different groups, each with their own regulatory requirements. ⬢ Data may have regional restrictions, time-based availability restrictions, departmental restrictions, etc. all user passwords: hadoop
  • 15. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2 with LLAP: Open Interfaces
  • 16. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Key Features: Spark Column Security with LLAP ⬢ Fine-Grained Column Level Access Control for SparkSQL. ⬢ Fully dynamic policies per user. Doesn’t require views. ⬢ Use standard Ranger policies and tools to control access and masking policies. Flow: 1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query. 2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied. 3. Spark gets a modified query plan based on dynamic security policy. 4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server. [Diagram: Spark Client; HiveServer2 (Authorization); Hive Metastore (Data Locations, View Definitions); Ranger Server (Dynamic Policies); LLAP (Data Read, Filter Pushdown); connections numbered 1 to 4]
  • 17. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Example: Per-User Row Filtering by Region in SparkSQL
    Original Query: SELECT * from CUSTOMERS WHERE total_spend > 10000
    Query Rewrites based on Dynamic Ranger Policies:
    Spark User 2 (East Region), Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'East'
    Spark User 1 (West Region), Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'West'
    LLAP Data Access:
    User ID | Region | Total Spend
    1 | East | 5,131
    2 | East | 27,828
    3 | West | 55,493
    4 | West | 7,193
    5 | East | 18,193
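    The per-user rewrite on this slide can be tried out with a small sketch. It is only an illustration of the idea, not how HiveServer2, Ranger, or LLAP implement it: SQLite stands in for Hive, the user names and the `ROW_FILTERS` mapping are hypothetical, and the filter values assume the region strings match the table ('East', 'West').

```python
import sqlite3

# Sample rows copied from the slide's CUSTOMERS table.
ROWS = [
    (1, "East", 5131),
    (2, "East", 27828),
    (3, "West", 55493),
    (4, "West", 7193),
    (5, "East", 18193),
]

# Hypothetical per-user row-filter expressions standing in for Ranger policies.
ROW_FILTERS = {
    "spark_user_1": "region = 'West'",
    "spark_user_2": "region = 'East'",
}

ORIGINAL = "SELECT * FROM customers WHERE total_spend > 10000"


def rewrite(query, user):
    """Append the user's row filter to a query that already has a WHERE clause.

    Users without a row-filter entry get the query back unchanged; whether
    they may read the table at all is a separate access-policy question.
    """
    row_filter = ROW_FILTERS.get(user)
    if row_filter is None:
        return query
    return f"{query} AND ({row_filter})"


def run_as(user, query):
    """Run the rewritten query against an in-memory copy of the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", ROWS)
    result = conn.execute(rewrite(query, user) + " ORDER BY user_id").fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for user in ("spark_user_1", "spark_user_2"):
        print(user, run_as(user, ORIGINAL))
```

    Both users submit the same SQL text; only the appended predicate differs, which is why no per-group views are needed.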
  • 18. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Demo
  • 19. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo Setup ⬢ Hortonia – mid-size financial services company expanding from US to international markets ⬢ Employees in EU and US ⬢ Multiple business units need access to customer data: Analysts, HR ⬢ Customer data is co-mingled as well as isolated ⬢ Needs to have rational security policies to provide the right level of access control to customer data across geographies, business functions, and to comply with external regulations (PII, HIPAA, EU Privacy etc.)
  • 20. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demo Data ⬢ Customer data in hortoniabank DB • 2 Customer Tables: 50K customer records each with 38 fields (PII, PHI, PCI & non-sensitive data) – us_customers: USA person data only – ww_customers: multi-language, multi-country, localized person data across the world • 1 Reference table: eu_countries (reference table for looking up EU country codes to country mappings – with Brexit etc.) all user passwords: hadoop
  • 21. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Ranger Policies Setup for Demo ⬢ Only US employees can see data in us_customers table and only from locations within the US (access_us_customers) ⬢ Only US employees can see data rows of US persons in ww_customers table (filter_ww_customers_table + access_ww_customers) ⬢ Only EU employees can see rows with EU person data in ww_customers table (filter_ww_customers_table + access_ww_customers) ⬢ US HR team members can see all original unmasked data (PCI, PII, …) ⬢ Analysts can view masked versions of sensitive data from WW customers table but are prohibited from viewing PII data in US tables (All masking policies under Masking Tab of Resource based policies) ⬢ No combination of zip code, MRN, and blood group data is permitted to be joined in any query (prohibition policy)
  • 22. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Personas Setup for Demo
    User | Group | Access Privileges
    joe-analyst | us_employees, analyst | US Data Only, non-sensitive data only, rest masked or forbidden depending on sensitivity
    kate-hr | us_employees, hr | US Data Only, All sensitive data (PCI, PII, PHI)
    ivana-eu-hr | eu_employees, hr | EU Data Only, All sensitive data
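    As a rough sketch of how these personas could map onto a row-filter policy for ww_customers, the snippet below resolves a user's groups to a row predicate. Everything beyond the user and group names on the slide is an assumption made for illustration: the `country` column, the country codes, the tiny `EU_COUNTRIES` stand-in for the eu_countries table, the sample rows, and the rules that the first matching policy item wins and that a user matching no item sees no rows.

```python
# Personas from the slide: user -> groups.
USERS = {
    "joe-analyst": {"us_employees", "analyst"},
    "kate-hr": {"us_employees", "hr"},
    "ivana-eu-hr": {"eu_employees", "hr"},
}

# Hypothetical stand-in for the eu_countries reference table.
EU_COUNTRIES = {"DE", "FR", "IT"}

# Hypothetical policy items for filter_ww_customers_table, checked top to bottom.
FILTER_WW_CUSTOMERS = [
    ({"us_employees"}, lambda row: row["country"] == "US"),
    ({"eu_employees"}, lambda row: row["country"] in EU_COUNTRIES),
]


def visible_rows(user, rows):
    """Return the rows of ww_customers that the user's row filter lets through."""
    groups = USERS.get(user, set())
    for policy_groups, predicate in FILTER_WW_CUSTOMERS:
        if groups & policy_groups:
            # Assumption of this sketch: the first matching item decides.
            return [row for row in rows if predicate(row)]
    # Assumption of this sketch: no matching item means no rows are shown.
    return []


# Made-up sample rows, not taken from the demo data set.
SAMPLE_ROWS = [
    {"name": "John Doe", "country": "US"},
    {"name": "Ernie Schwarz", "country": "DE"},
    {"name": "Aiko Tanaka", "country": "JP"},
]

if __name__ == "__main__":
    for user in USERS:
        print(user, [row["name"] for row in visible_rows(user, SAMPLE_ROWS)])
```

    Row filtering only decides which rows are returned; what joe-analyst and kate-hr see inside those rows differs because of the separate masking policies.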
  • 23. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Masking Policies setup for us_customers data for analyst group
    Data Column | Data Column Description | Masking Type | Sample Output | Ranger Masking Policy
    password | Password | Hash | 237672b21819462ff39fcea7d990c3e5 | mask_password_hash
    nationalid | National ID | Show Last 4 | xx-xx-9324 | mask_nationalid_last4
    ccnumber | Credit Card Number | Show First 4 | 4532xxxxxxxxxxxx | mask_ccnumber_first4
    streetaddress | Street Address | Redact | nnn Xxxxxx Xxxxx | mask_streetaddress_redact
    MRN | MRN | Nullify | null | mask_mrn_nullify
    age | Age | CUSTOM | (Adds a random number below 20 to actual age) | mask_age_custom
    birthday | Date of Birth | CUSTOM | 01-01-1987 (Keep year of birth and make date & month 01-01) | mask_dob_custom
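    The masking types in this table can be approximated with a few small functions. This is a minimal sketch of the transformations as the slide describes them, not Ranger's or Hive's implementation: the function names and the `ANALYST_MASKS` mapping are hypothetical, SHA-256 is used as a stand-in hash (so the digest is longer than the slide's 32-character sample), the birthday is assumed to arrive as M/D/YYYY, and the street address in the sample record is made up.

```python
import hashlib
import random
import re
from datetime import datetime


def mask_hash(value):
    """Replace the value with a hex digest (SHA-256 here, as a stand-in)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_show_last_4(value):
    """Replace every letter or digit with 'x' except in the last four characters."""
    return re.sub(r"[A-Za-z0-9]", "x", value[:-4]) + value[-4:]


def mask_show_first_4(value):
    """Replace every letter or digit with 'x' except in the first four characters."""
    return value[:4] + re.sub(r"[A-Za-z0-9]", "x", value[4:])


def mask_redact(value):
    """Digits become 'n', uppercase letters 'X', lowercase letters 'x'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)


def mask_nullify(value):
    """Hide the value entirely."""
    return None


def mask_age_custom(age, rng=random):
    """Add a random number below 20 to the actual age."""
    return age + rng.randrange(20)


def mask_dob_custom(value):
    """Keep the year of birth and set day and month to 01-01."""
    year = datetime.strptime(value, "%m/%d/%Y").year
    return f"01-01-{year:04d}"


# Hypothetical column-to-mask mapping for the analyst group.
ANALYST_MASKS = {
    "password": mask_hash,
    "nationalid": mask_show_last_4,
    "ccnumber": mask_show_first_4,
    "streetaddress": mask_redact,
    "MRN": mask_nullify,
    "age": mask_age_custom,
    "birthday": mask_dob_custom,
}


def apply_masks(record, masks=ANALYST_MASKS):
    """Return a masked copy of the record; unlisted columns pass through."""
    return {col: masks[col](val) if col in masks else val for col, val in record.items()}


if __name__ == "__main__":
    sample = {
        "name": "John Doe",
        "nationalid": "232323233",
        "ccnumber": "4539067047629850",
        "streetaddress": "123 Main Street",  # made-up address
        "MRN": "8233054331",
        "birthday": "9/12/1969",
    }
    print(apply_masks(sample))
```

    The functions return new values and never touch the stored record, which mirrors the point made earlier in the deck that masking does not alter or copy the underlying data.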
  • 24. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Backup

Editor's Notes

  • #7: The Ranger Admin portal is the central interface for security administration. Users can create and update policies, which are then stored in a policy database. Plugins within each component poll these policies at regular intervals. The portal also includes an audit server that sends audit data collected from the plugins for storage in HDFS or in a relational database. Ranger plugins: Plugins are lightweight Java programs that are embedded within the processes of each cluster component. For example, the Apache Ranger plugin for Apache Hive is embedded within HiveServer2. These plugins pull in policies from a central server and store them locally in a file. When a user request comes through the component, these plugins intercept the request and evaluate it against the security policy. Plugins also collect data from the user request and use a separate thread to send this data back to the audit server. User group sync: Apache Ranger provides a user synchronization utility to pull users and groups from Unix, LDAP, or Active Directory. The user and group information is stored within the Ranger portal and used for policy definition.
  • #11: Row level filtering brings convenience to apps running on Hive. By moving the access restriction logic down into the Hive layer, Hive applies the access restrictions every time that data access is attempted, helping simplify authoring of the query and bringing in seamless behind the scenes enforcement of row level segmentation without having to add this logic to the predicate of the query
  • #13: Dynamic data masking via Apache Ranger enables security administrators to ensure that only authorized users can see the data they are permitted to see, while for other users or groups the same data is masked or anonymized to protect sensitive content.
  • #17: Interactive query: low latency interactive query, persistent servers ready to process SQL. Intelligent in-memory caching. Builds on Hive engine + SQL capabilities. Long running processes. Ability to read from HDFS/S3, cache and serve it out. Open interfaces/composable interfaces to read data. Extensible interfaces to have Spark read data out of LLAP and process it. Rely on LLAP that delivers trusted security. Client side mechanisms can be circumvented, so we have focused on server side enforcement of security.
  • #18: Spark has its own exec engine and SQL dialect, so it needs to be able to deal with data in a raw manner. Delegate all runtime and execution to Spark itself. Spark plugin called LLAP context (aware of the LLAP daemon, how to read data from the LLAP daemon, and aware of Ranger query transformations). Spark SQL issues the query, which is routed to HiveServer2 and into Ranger; split locations are returned. Data is read in based on split locations in parallel with the assigned plan; Ranger applies query transformation to provide column masking and row filtering. Then Spark is free to … LLAP is the trusted daemon.