Scaling Privacy with
Apache Spark
Aaron Colcord
Sr. Director Engineering, Northwestern Mutual
Don Durai Bosco
CTO and Co-Founder, Privacera
Agenda
▪ Our background
▪ Why privacy, security,
compliance?
▪ Approaches
▪ Ideal problem solve
▪ Real life meets ideal life
Backgrounds
• Northwestern Mutual
▪ Building an Enterprise Scale Unified Framework
▪ Very Long, Respected History ~ 160 Years
▪ Compliance is extremely important to us
▪ Agile Data vs Compliant Data
• Privacera
▪ Founded in 2016 by the creators of Apache Ranger & Apache Atlas
▪ Extends Ranger's capabilities beyond traditional Big Data environments to cloud (Databricks, AWS, Azure, GCP, and more)
▪ Specializes in democratizing data for analytics, while ensuring compliance with privacy regulations (GDPR, CCPA, LGPD, HIPAA, & more)
Why do we suddenly care about privacy?
• You care if you are regulated in any form
• Simple: you need to show you can pass an audit
• You care if you store any information about your users
• Simple: governments have woken up with GDPR and CCPA
• You care if you want to democratize your data
• Simple: the use of your data can be scrutinized
We always did, but technology got ahead of privacy. Privacy is often an assumed competency, and technology showed how important it really is.
Have you ever...
Gone to a website and read their privacy policy, clicked accept cookies, accepted terms of service, or a EULA?
• Clicked ‘accept all’ on a website, used a digital assistant...
• Collecting information about your customers can
• Improve the experience
• Allow the company to understand its business better
• At the core, privacy is a policy and legal obligation
• You have the data; it used to be your business to just secure it.
• Do you want your information monetized? Sold? Traded?
• Most companies don’t do this. But the privacy policy is there for you.
And it’s only going to pick up speed.
• More regulations are arriving around privacy
• Increasing your ability to execute against data means respecting your users’ rights
• A part of maturity is being able to manage governance
More importantly, why do we care so much?
• Technology like Apache Spark opens the capability to
democratize your data.
• Almost every company wants the marketplace to enrich
and share their data.
• Who inside that company can view it? Do we have the
controls to protect your information? Can we verify
that the information is used for the right purposes?
What is the difference between these?
• Security
▪ Preventing unauthorized usage of systems
▪ Ensuring users don’t see the incorrect information
▪ Creating boundaries to enforce right action of the system
• Compliance
▪ The process of making sure your company and employees follow all laws, regulations, standards, and ethical practices that apply to your organization
• Privacy
▪ “Data privacy may be defined as the authorized, fair, and legitimate processing of personal information”
▪ Consent rights
▪ Do not share
▪ Slippery space
Examine strategies to scale agile data w/privacy
• Build a metadata layer that defines PII in its schema
• Users and developers can and will change where PII is stored
• You can literally chase people to do the ‘right thing’ forever
• You could build views with permissions for certain users
• Not very scalable
• Plus you always need to show who accessed the data and why
• Are these security scenarios?
Challenges to that strategy
• Is the metadata layer flexible enough or should we think in policies?
• Privacy is inherently your organization’s position, which may evolve based on regulation
• Can your development keep up with views?
• When you discover the extra 10,000 fields, can you keep up?
• Implement a framework that scales
• Security is not Privacy.
• Security has a different domain and set of principles.
• Remember we are protecting the usage of your data.
How can we solve it?
Ideal scalable system
• User Rights
▪ Revocation of Consent
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
• Classification
▪ Should align with a Data Governance program
▪ Should adapt to changing data
▪ Proactive
▪ Reclassification
• Audit/Governance
▪ How was it used?
▪ How was it accessed?
▪ How was it protected?
▪ Did it cross borders?
• Access
▪ Authorization of User may change
▪ Supports Agile Access
▪ Business Use is preserved
▪ Automated Systems obey Privacy
User Rights at Scale
▪ Revocation of Consent / Right To Be Forgotten
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
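As a rough illustration of consent revocation and of rights following data reuse, the sketch below keeps consent per data subject and per purpose and filters records before each use. The names ConsentRegistry, usable_records, and the subject_id field are hypothetical and not from the talk; a real system would persist consent and enforce it inside the query engine.

```python
class ConsentRegistry:
    """In-memory map of data subject -> purposes they have consented to."""

    def __init__(self):
        self._consents = {}  # subject_id -> set of purpose names

    def grant(self, subject_id, purpose):
        self._consents.setdefault(subject_id, set()).add(purpose)

    def revoke(self, subject_id, purpose=None):
        """Revoke one purpose, or every purpose when none is given."""
        if purpose is None:
            self._consents.pop(subject_id, None)
        else:
            self._consents.get(subject_id, set()).discard(purpose)

    def permits(self, subject_id, purpose):
        return purpose in self._consents.get(subject_id, set())


def usable_records(records, registry, purpose):
    """Keep only records whose subject currently consents to this purpose."""
    return [r for r in records if registry.permits(r["subject_id"], purpose)]
```

Because the check runs at read time, a revocation takes effect on the next use of the data, including reuse by downstream jobs that call the same filter.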
Privacy Challenges in Open Data Ecosystem
[Diagram]
• Storage: S3, ADLS, Redshift, Snowflake, Synapse
• SQL Engines: Athena, Databricks, HDInsight, EMR
• Data Virtualization: Dremio, Trino, PrestoDB
• BI Tools: PowerBI, Tableau
• Users: Marketing, Data Analyst, Data Scientist/Architect
Governance blind spot
Tools & Technology
● Automated Data Discovery
● Centralized Access Control
● Audit Collection and Reporting
Automated Data Discovery
● Automatically detect and catalog sensitive
data
● Detailed classification, e.g. EMAIL, SSN,
GENDER, CC, PHONE_NUMBER, etc.
● Eliminate manual processes
● Catalog data as it is ingested
● Track data movement and propagate tags
● Catalog data across multiple cloud
services
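A minimal sketch of what automated discovery can look like, assuming simple regular expressions over a sample of column values. The tag names mirror the slide's examples; the patterns, the 0.8 threshold, and the function names are illustrative assumptions, not Privacera's detection logic.

```python
import re

# Deliberately simple, US-centric patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "PHONE_NUMBER": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
}


def classify_column(values, threshold=0.8):
    """Return tags whose pattern fully matches at least `threshold` of the non-empty values."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return set()
    tags = set()
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            tags.add(tag)
    return tags


def classify_table(columns):
    """Map column name -> sorted tags, keeping only columns with at least one tag."""
    catalog = {}
    for name, values in columns.items():
        tags = classify_column(values)
        if tags:
            catalog[name] = sorted(tags)
    return catalog
```

Running classify_table on each batch as it is ingested gives a catalog keyed by column name, so the tags stay with the data even when developers move PII to a new field.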
Centralized Access Control
● Global Tag/Classification-based policies
● Purpose and Persona based policies
● Dynamic row filters vs. views
● Dynamic masking or decryption
● Approval workflows with time and
purpose constraints
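The sketch below illustrates tag-based, purpose-aware policies with dynamic masking and a row filter applied at read time instead of through per-user views. All names (TagPolicy, COLUMN_TAGS, POLICIES, apply_policies) and the sample purposes are hypothetical; in practice this enforcement lives in the engine or a policy plugin rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable


def mask_all(value: str) -> str:
    return "****"


def mask_last4(value: str) -> str:
    """Hide everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


@dataclass(frozen=True)
class TagPolicy:
    tag: str
    allowed_purposes: frozenset  # purposes that may see the raw value
    mask: Callable[[str], str]   # applied for every other purpose


# Assumed output of a discovery step: column name -> classification tag.
COLUMN_TAGS = {"email": "EMAIL", "ssn": "SSN"}

# One policy per tag, so new columns with a known tag are covered automatically.
POLICIES = {
    "EMAIL": TagPolicy("EMAIL", frozenset({"support"}), mask_all),
    "SSN": TagPolicy("SSN", frozenset({"fraud_review"}), mask_last4),
}


def apply_policies(rows, purpose, row_filter=None):
    """Drop rows rejected by the filter, then mask tagged columns for this purpose."""
    out = []
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue
        visible = {}
        for col, value in row.items():
            policy = POLICIES.get(COLUMN_TAGS.get(col))
            if policy is not None and purpose not in policy.allowed_purposes:
                visible[col] = policy.mask(value)
            else:
                visible[col] = value
        out.append(visible)
    return out
```

Policies attach to tags rather than to tables, so discovering another 10,000 fields means tagging them, not writing another 10,000 views.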
Centralized Auditing and Reporting
● Centralize auditing
● Monitoring data access by classification
● Track usage by Purpose
● Generate attestation reports
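To make the auditing bullets concrete, here is a small sketch of a central audit log that records each access decision and summarizes access by classification or by purpose. AccessEvent, AuditLog, and their fields are hypothetical names chosen for illustration; a real deployment would collect these events from every engine into durable storage.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    user: str
    resource: str
    classification: str
    purpose: str
    allowed: bool
    at: datetime


class AuditLog:
    """Append-only, in-memory collection of access events."""

    def __init__(self):
        self._events = []

    def record(self, user, resource, classification, purpose, allowed):
        event = AccessEvent(
            user, resource, classification, purpose, allowed,
            datetime.now(timezone.utc),
        )
        self._events.append(event)
        return event

    def count_by(self, field):
        """Count events grouped by an AccessEvent field, e.g. 'purpose'."""
        return Counter(getattr(e, field) for e in self._events)

    def denied(self):
        return [e for e in self._events if not e.allowed]
```

Calling count_by("classification") or count_by("purpose") gives the kind of summary an attestation report would start from, and denied() lists the attempts that policies blocked.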
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.