© 2018 Bloomberg Finance L.P. All rights reserved.
Data Gloveboxes: A Philosophy of Data Science Data Security
DataWorks Summit - Barcelona
March 21, 2019
Clay Baenziger
Hadoop Infrastructure
Bloomberg By the Numbers
• Founded in 1981
• 325,000 subscribers in 170 countries
• Over 19,000 employees in 192 locations
— Over 5,000 software engineers
— 100+ machine learning data scientists and engineers
• More news reporters than The New York Times + Washington Post + Chicago Tribune
— News content from 125K+ sources
— >1.5M news stories ingested / published each day (that's 500 news stories ingested/second)
• One of the largest private networks in the world
• 100B+ tick messages per day, with a peak of more than 10 million messages/second
• More than a billion messages (E-Mails and IB chats) processed each day
Nuclear Materials Manufacturing
Image: Office of Legacy Management, U.S. D.O.E., Rocky Flats Plant History & Information Used to Process EEOICPA Claim
Requests. 16 April, 2014
Former U.S. Department of Energy Rocky Flats Plant - South of Boulder, CO
Plutonium Dropbox? Isolation Glovebox?
Dropbox: [n] a container where one can deposit something to be retrieved later
Glovebox: [n] a sealed protective container in which one may safely manipulate a
dangerous substance using gloves attached to holes
Images:
(Top) Office of Legacy Management, U.S. D.O.E., CO-83-M-2 -
Interior view of X-Y retriever. 29 Nov, 1988
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-K-15 -
View of safe geometry station from the inside of an input-output
station. 3 Dec, 1988
Data Dropbox? Data Glovebox?
Dropbox: [n] a data-system where one can deposit a file for later reading and
processing dependent on client (network) location; ideally providing a positive
verification of file contents
Glovebox: [n] a sealed compute environment in which one may safely
manipulate data using restricted access - with strong exfiltration controls
MySQL
Plutonium Enclave
Image: Office of Legacy Management, U.S. D.O.E., CO-83-M-14 - Downdraft Table, 20 Aug. 2014
Data Enclave
Centralized Models:
• Curator Model: Restrict access, operations and results
— “The curator must remain present throughout the lifetime of the database”
(Dwork, Cynthia. “Differential Privacy: A Survey of Results”, 1, Apr. 2008)
— Statistical Disclosure Control
• Data Enclave:
(Lane and Shipp. “Using a Remote Access Data Enclave for Data Dissemination”. Intl. Journal of Digital Curation. 1.2 (2007))
— Allow for Direct and Exact Access
— Allow Arbitrary Computation
— Prevent Business Automation
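As a rough illustration of the curator model's "restrict results" idea, the sketch below applies one common statistical disclosure control, small-cell suppression, to a count query. The function name `curated_counts` and the threshold of 5 are assumptions made for this example; the slides do not name a specific technique or threshold.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical threshold: groups smaller than this are withheld.
MIN_CELL_SIZE = 5


def curated_counts(
    values: Iterable[str], min_cell_size: int = MIN_CELL_SIZE
) -> Dict[str, Optional[int]]:
    """Return per-group counts, replacing small groups with None.

    The caller never sees raw rows, only counts, and counts below the
    threshold are suppressed so that rare individuals are harder to single out.
    """
    counts = Counter(values)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }
```

A data enclave, by contrast, would hand the analyst the exact rows inside a controlled environment rather than a filtered answer.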
Image: Denver Public Library, Rocky Mountain News Photographic Archives, “Rocky Flats
employee handles a robotic arm assembly.” 11 Nov. 1987
Plutonium Glovebox
Data Glovebox
• Leaded Pane of Glass: Remote Desktop Without Download
• Glove Ports: Arbitrary Code Execution (Run code to manipulate)
• Robotics: Workflow Management
— Deployment
— Routine operations
• Pass-throughs:
— Firewalls are insufficient
— Protocol aware deep packet inspection
— Databases
• Firewalls: Ensure user and workload isolation
— Distributed file-systems
— Local file-systems
— Processing
Glovebox Architecture
Copyrights: Git Logo, Jason Long; Python “Two Snakes” Logo, Python Software Foundation
Hadoop Architecture
• Client Nodes
• Master Node: HBase Master, YARN Resource Manager, HDFS Namenode
• Cluster Nodes: HBase Region Server, HDFS Datanode, YARN Nodemanager
• Workloads on cluster nodes: Map/Reduce, Spark, Novel Application
<HTTP REST API>
YARN Job:
• Submission
• Status
• Logs
• App. WebUIs
<HTTP/2 REST API>
WebHDFS:
• GET Methods:
— Open (read)
— GetFileChecksum
• PUT Methods:
— Create
• POST Methods:
— Append
<HBaseProtobuf RPC>
• Get
• Put
• Scan
Handling Material
Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 13
Properties of Radioactive Materials
Radioactive Materials:
• Can be harmful to people in small quantities
• Can have a very long hazard life if released
• Should be isolated to prevent their spread
• Should be cataloged and characterized to assess harm
• Can still be machined and worked with proper technique
— Robotics
— Personal Protective Equipment
Properties of Material Data
Material Data:
• Can be harmful to people in small quantities
• Can have a very long hazard life if released
• Should be isolated to prevent their spread
• Should be cataloged and characterized to assess harm
• Can still be used and analyzed with proper technique
— Continuous/Automated Deployment
— Workflow Automation and Gloveboxes
Nuclear Material (non)Proliferation
Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 59
Data Proliferation
• Terrorists? (Well, certainly hackers...)
• Accidental loss (USB sticks, laptops, etc.)
• No Price-Anderson Act for Data Incidents
— Quite the opposite with GDPR!
— GDPR limits untraceable mixing of data
• Data Sovereignty
— Requires data to remain geographically stationary
— Must move computation to the data
• Data:
— Swamps
— Lineage (Visibility e.g. via Apache Atlas)
— Masking (Curator Model e.g. via Apache Ranger)
Lock Everything Down
Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 21
Lock Data Down
Quality Attributes of a Dropbox
• Perimeter Controls (Network Firewall)
• Encryption:
— At rest
— On the wire
• Authentication (Kerberos)
• Client Location Controls Usage (Directionality)
— Data goes in from insecure networks
— Data cannot come back out to an insecure network
— Allow validation of transmission from anywhere
— “Normal” usage from trusted networks
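The directionality rule above can be sketched as a small policy function: untrusted networks may write and verify, trusted networks may do everything. This is only an illustration of the idea; the network range, the operation groupings, and the function names are assumptions for the example, not the configuration of any real deployment.

```python
import ipaddress
from typing import Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical trusted range; a real site would list its own networks.
TRUSTED_NETWORKS: Sequence[Network] = (ipaddress.ip_network("10.0.0.0/8"),)

# WebHDFS operations grouped by direction of data flow (assumed grouping).
WRITE_OPS = {"CREATE", "APPEND"}      # data goes in
VERIFY_OPS = {"GETFILECHECKSUM"}      # validate a transmission
READ_OPS = {"OPEN"}                   # data comes out


def is_trusted(client_ip: str, trusted: Sequence[Network] = TRUSTED_NETWORKS) -> bool:
    """True when the client address falls inside any trusted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in trusted)


def dropbox_allows(
    client_ip: str, op: str, trusted: Sequence[Network] = TRUSTED_NETWORKS
) -> bool:
    """Apply the dropbox rule: reads are only served to trusted networks."""
    op = op.upper()
    if is_trusted(client_ip, trusted):
        return True  # "normal" usage from trusted networks
    return op in WRITE_OPS or op in VERIFY_OPS
```

For example, `dropbox_allows("203.0.113.7", "OPEN")` is refused while the same client may still `CREATE` a file and fetch its checksum.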
DropboxFilter for HDFS (WebHDFS API)
• Upload:
$ curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
HTTP/1.1 307 TEMPORARY_REDIRECT
$ curl -i -X PUT -T <File> "http://<DN>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
HTTP/1.1 201 Created
• Download:
$ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
• Checksum:
$ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
{"FileChecksum": {
  "algorithm": "MD5-of-0MD5-of-512CRC32C",
  "bytes": "[...]00eb745ad2f5bd1dccab359b12f7f9411b00000000",
  "length": 28
}}
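The same three calls can be prepared from Python. The sketch below only builds the requests and sends nothing, so it can be read without a cluster; the helper names, host, and port are placeholders invented for the example, while the `/webhdfs/v1/<PATH>?op=...` URL shape follows the curl commands above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1/<path>?op=OP."""
    query = urlencode({"op": op, **params})
    # quote() keeps "/" unescaped by default, so the HDFS path stays readable.
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def create_request(host: str, port: int, path: str) -> Request:
    """Step 1 of an upload: PUT to the NameNode, which answers with a redirect."""
    return Request(webhdfs_url(host, port, path, "CREATE"), method="PUT")


def open_request(host: str, port: int, path: str) -> Request:
    """Download: GET with op=OPEN."""
    return Request(webhdfs_url(host, port, path, "OPEN"), method="GET")


def checksum_request(host: str, port: int, path: str) -> Request:
    """Verification: GET with op=GETFILECHECKSUM."""
    return Request(webhdfs_url(host, port, path, "GETFILECHECKSUM"), method="GET")
```

Under a dropbox policy, a client on an untrusted network would be expected to succeed with the create and checksum requests and be refused on the open request.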
DropboxFilter for HDFS (Architecture)
• HDFS Protocols
— Protobuf RPC
— RESTful API over HTTPS/2
• Servlet Based Web Server Design
— Filter
— Request Handler
[Diagram: Client (curl), HTTP Server (Netty), and Servlet Container (Jetty) hosting AuthenticationFilter(request…) and DropboxFilter(request…) in front of the WebHDFS Handler, with two “Provides User Info” labels.]
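The filter-then-handler design can be sketched as a chain of wrapped functions. The real WebHDFS stack is Java servlet code, so this Python version is only a model of the pattern: the `WebRequest` type, the per-user write-only set, and the list of operations a write-only user keeps are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set, Tuple

Response = Tuple[int, str]  # (HTTP status, message)


@dataclass
class WebRequest:
    path: str
    op: str
    user: Optional[str] = None  # None models a request without credentials


Handler = Callable[[WebRequest], Response]

# Operations a write-only user may still perform (assumed set).
WRITE_ONLY_OPS = {"CREATE", "APPEND", "GETFILECHECKSUM"}


def webhdfs_handler(req: WebRequest) -> Response:
    """Stand-in for the real request handler at the end of the chain."""
    return (200, f"{req.op} {req.path}")


def authentication_filter(next_handler: Handler) -> Handler:
    """Reject unauthenticated requests; otherwise pass user info down the chain."""
    def handle(req: WebRequest) -> Response:
        if req.user is None:
            return (401, "Authentication required")
        return next_handler(req)
    return handle


def dropbox_filter(next_handler: Handler, write_only_users: Set[str]) -> Handler:
    """Refuse read operations for users configured as write-only."""
    def handle(req: WebRequest) -> Response:
        if req.user in write_only_users and req.op.upper() not in WRITE_ONLY_OPS:
            return (403, f"WebHDFS is configured write-only for {req.user}")
        return next_handler(req)
    return handle


def build_chain(write_only_users: Set[str]) -> Handler:
    """Authentication runs first so the dropbox filter can rely on req.user."""
    return authentication_filter(dropbox_filter(webhdfs_handler, write_only_users))
```

Ordering matters here: because the dropbox decision depends on who the caller is, it has to sit behind the authentication step.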
DropboxFilter for HDFS (Examples)
<head><title>Error 401 Authentication required</title></head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /webhdfs/v1/user/ubuntu/foo. Reason:
<pre>Authentication required</pre></p>
<hr /><i><small>Powered by Jetty://</small></i><br/>
<head><title>Error 403 WebHDFS is configured write-only for clay</title></head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /webhdfs/v1/user/clay/foo. Reason:
<pre> WebHDFS is configured write-only for clay</pre></p>
<hr/><i><small>Powered by Jetty://</small></i><br/>
Download from:
https://issues.apache.org/jira/browse/HDFS-14234
Isolation Glovebox
Images:
(Left) Library of Congress, U.S. D.O.E., View Of A Worker Holding A Plutonium 'Button.' 19 Sep. 1973
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-8 - View of foundry induction furnaces. N.D.
Glovebox Production Line
Images:
(Left) Library of Congress, U.S. D.O.E., View Of A Glovebox Line Used In Plutonium Operations. 5 May. 1970
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-3 - View of Chainveyor. 25 Jan. 1993
Data Glovebox (Leaded Pane of Glass)
Avoid Overexposure to Raw Data
• Remote Desktop:
— Limited RDP
• Key Attributes:
— No copy out
— No file shares
— Isolation per user
• Useful to Have Tools:
— Web browser for Jupyter/Zeppelin
— SSH client for command-line access
Data Glovebox (Glove Ports & Robotics)
Manipulate Your Data - With Code
• Run on a compute cloud using Apache YARN; submit:
— SQL to Apache Hive
— Python or Scala to Apache Spark
— An arbitrary application
• Automation to ensure consistency (e.g. Apache Oozie)
— A workflow manager for Hive and Spark jobs
— Data transformations for expected reports: known processes generating “decontaminated” results
— Can run as a non-human service account to drop data in a directory for data exfiltration
— Can provide repeatable deployment of code using Git
Data Glovebox (Pass-Through)
Negative pressure (one-way) network; exfiltrate only “decontaminated” data
• Provide a process for data hand-off through an environment
• Firewalls:
— Mostly a transport OSI Layer 4 device (TCP/IP)
— Can do “deep packet inspection” - but need to MITM traffic
— Policy rules for which users can manipulate which data become extensive
— Prohibitively expensive
• Technology Specific:
— DropboxFilter for WebHDFS
— Database RPCs are complex but:
— GRANT INSERT ON DATABASE.* TO write_only@'%';
— GRANT SELECT ON DATABASE.* TO read_only@'%';
— HBase today has no built-in client location filtering
Firewalls (Workload and User Isolation)
Don’t let your data spontaneously combust; clean up “chips”
File Systems Leak
• Permission on data sets
• User collaboration locations
• Temporary/failed job data
• Temporary data locations
— Distributed file systems
— Hive Warehouse
— /tmp
— Local file systems
— /tmp, /var/tmp, /dev/shm
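One way to find leftover "chips" on a local file system is to walk the temporary locations listed above and report files older than some age. The sketch below only lists candidates and deletes nothing; the function name and the age cutoff are assumptions for the example, and a real cleanup would also need to consider ownership and files still held open by running jobs.

```python
import os
import time
from typing import Iterable, List, Optional


def stale_files(
    roots: Iterable[str], max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """List regular files under the given roots not modified within max_age_seconds."""
    now = time.time() if now is None else now
    found: List[str] = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # lstat avoids following symlinks out of the temp area.
                    mtime = os.lstat(path).st_mtime
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if now - mtime > max_age_seconds:
                    found.append(path)
    return sorted(found)
```

A call such as `stale_files(["/tmp", "/var/tmp", "/dev/shm"], 24 * 3600)` would report day-old files in the local locations named on the slide.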
Take out the Trash
Image: State of Idaho Oversight Monitor. Nov. 2006. Pg. 10
Private Temporary Directories
• To provide isolation, one can use pam_namespace
• To set up directories and clean up, one can use pam_exec
See also: Our integration of the work in https://github.com/bloomberg/chef-bach/pull/1278
Initial Mount Namespace
/ (inode 2)
    tmp (inode 100)
    polyinst (inode 101)
        tmp_clay (inode 201)
        tmp_foo (inode 201)
    home (inode 300)
User clay’s Mount Namespace
/ (inode 2)
    tmp (inode 201)
    polyinst (inode 101)
        tmp_clay (inode 201)
        tmp_foo (inode 201)
    home (inode 300)
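The effect of a private temporary directory can be modeled as a per-user path translation: the user still asks for /tmp, but lands in an instance directory of their own. The sketch below is a pure model of that mapping, with a directory layout assumed from the diagram; pam_namespace itself achieves this with mounts inside a per-session mount namespace rather than by rewriting paths.

```python
from typing import Dict

# Assumed layout modeled on the diagram: /tmp maps to /polyinst/tmp_<user>.
POLYINSTANTIATED: Dict[str, str] = {"/tmp": "/polyinst/tmp_{user}"}


def resolve(path: str, user: str, mapping: Dict[str, str] = POLYINSTANTIATED) -> str:
    """Return the backing path a user actually reaches for a public path."""
    for public, template in mapping.items():
        # Match the directory itself or anything below it, but not "/tmpfoo".
        if path == public or path.startswith(public + "/"):
            return template.format(user=user) + path[len(public):]
    return path  # paths outside polyinstantiated directories are shared
```

Two users writing `/tmp/job.out` therefore never see each other's file, while `/home` resolves identically for both.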
Keeping the Pipes Flowing
Image: Office of Legacy Management, U.S. D.O.E., CO-83-AF-1 - View of Building 215A. N.D.
YARN Network Isolation (Example)
YARN-7468 - Provide means for container network policy control
[Diagram: YARN nodes, each with an HDFS Data Node and a YARN Nodemanager using iptables. User A's Novel Application runs in Network Class 1 and User B's Spark job runs in Network Class 2; the external services shown are Database A, Database B and WebService A.]
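The idea of per-container network policy can be sketched as a lookup from user to network class, and from network class to permitted destinations. Which class may reach which service is not recoverable from the slide, so the assignments below are invented purely for illustration, as are all of the names.

```python
from typing import Dict, FrozenSet

# Invented assignments: the slide does not say which class reaches which service.
NETWORK_CLASSES: Dict[str, FrozenSet[str]] = {
    "class-1": frozenset({"database-a"}),
    "class-2": frozenset({"database-b", "webservice-a"}),
}

# Each user's containers are placed in exactly one network class.
USER_CLASS: Dict[str, str] = {"user-a": "class-1", "user-b": "class-2"}


def may_connect(user: str, destination: str) -> bool:
    """Default deny: allow only destinations listed for the user's network class."""
    network_class = USER_CLASS.get(user)
    if network_class is None:
        return False  # unknown users get no outbound access
    return destination in NETWORK_CLASSES.get(network_class, frozenset())
```

In an actual deployment the equivalent decision would be enforced on each node by packet filtering rules rather than by application code.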
Firewalls Are Important
Images:
(Left) Office of Legacy Management, U.S. D.O.E., CO-83-N-3 - Damaged Filter Plenums. 16 Sept. 1957
(Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-5 - View of a glove box firewall detail. 8 May. 1970
Data Glovebox
• Leaded Pane of Glass: Remote Desktop Without Copy
• Glove Ports: Manipulate your Data at An Arm’s Length
• Robotics: Workflow Management
• Pass-throughs: Negative Pressure to Keep the Bits Flowing
• Firewalls: Ensure User and Workload Isolation
Cleanup Is Messy
Image: CO Dept. of Pub. Health, “Citizen Summary Rocky Flats Historical Public Exposures Studies 1969 Fire”
Thank You
Connect with Hadoop Team: hadoop@bloomberg.net

More Related Content

What's hot (20)

PPTX
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
DataWorks Summit/Hadoop Summit
 
PPTX
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
PDF
What's new in SQL on Hadoop and Beyond
DataWorks Summit/Hadoop Summit
 
PPTX
Practice of large Hadoop cluster in China Mobile
DataWorks Summit
 
PDF
Hadoop: The Unintended Benefits
DataWorks Summit
 
PDF
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 
PPTX
Presto query optimizer: pursuit of performance
DataWorks Summit
 
PDF
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
DataStax
 
PPTX
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
PPTX
Tools and approaches for migrating big datasets to the cloud
DataWorks Summit
 
PPTX
Shaping a Digital Vision
DataWorks Summit/Hadoop Summit
 
PPTX
Lightning Fast Analytics with Hive LLAP and Druid
DataWorks Summit
 
PPTX
Benefits of an Agile Data Fabric for Business Intelligence
DataWorks Summit/Hadoop Summit
 
PPTX
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
DataWorks Summit
 
PPTX
Sharing metadata across the data lake and streams
DataWorks Summit
 
PPTX
Exploiting machine learning to keep Hadoop clusters healthy
DataWorks Summit
 
PDF
Fast SQL on Hadoop, Really?
DataWorks Summit
 
PPTX
Log I am your father
DataWorks Summit/Hadoop Summit
 
PDF
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 
PDF
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
DataWorks Summit
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
What's new in SQL on Hadoop and Beyond
DataWorks Summit/Hadoop Summit
 
Practice of large Hadoop cluster in China Mobile
DataWorks Summit
 
Hadoop: The Unintended Benefits
DataWorks Summit
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 
Presto query optimizer: pursuit of performance
DataWorks Summit
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
DataStax
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
Tools and approaches for migrating big datasets to the cloud
DataWorks Summit
 
Shaping a Digital Vision
DataWorks Summit/Hadoop Summit
 
Lightning Fast Analytics with Hive LLAP and Druid
DataWorks Summit
 
Benefits of an Agile Data Fabric for Business Intelligence
DataWorks Summit/Hadoop Summit
 
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform
DataWorks Summit
 
Sharing metadata across the data lake and streams
DataWorks Summit
 
Exploiting machine learning to keep Hadoop clusters healthy
DataWorks Summit
 
Fast SQL on Hadoop, Really?
DataWorks Summit
 
Log I am your father
DataWorks Summit/Hadoop Summit
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
DataWorks Summit
 

Similar to Data Gloveboxes: A Philosophy of Data Science Data Security (20)

PDF
First in Class: Optimizing the Data Lake for Tighter Integration
Inside Analysis
 
PPTX
Nyc web perf-final-july-23
Dan Boutin
 
PDF
Filtering From the Firehose: Real Time Social Media Streaming
Cloud Elements
 
PDF
Big Data and OSS at IBM
Boulder Java User's Group
 
PDF
Unified Data API for Distributed Cloud Analytics and AI
Alluxio, Inc.
 
PPTX
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics
 
PPTX
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
PDF
Mesoscon 2015
Skand Gupta
 
PDF
Instrumenting and Scaling Databases with Envoy
Daniel Hochman
 
PPTX
巨量資料入門 The evolution of data architecture
Wei-Chiu Chuang
 
PDF
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DATAVERSITY
 
PPTX
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
PPTX
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
actifio
 
PDF
Big Data to SMART Data : Process Scenario
CHAKER ALLAOUI
 
PPTX
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Software
 
PPTX
Stream processing for the practitioner: Blueprints for common stream processi...
Aljoscha Krettek
 
PDF
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
PDF
Container and Kubernetes without limits
Antje Barth
 
PPTX
Automating Big Data with the Automic Hadoop Agent
CA | Automic Software
 
PPTX
Building a system for machine and event-oriented data with Rocana
Treasure Data, Inc.
 
First in Class: Optimizing the Data Lake for Tighter Integration
Inside Analysis
 
Nyc web perf-final-july-23
Dan Boutin
 
Filtering From the Firehose: Real Time Social Media Streaming
Cloud Elements
 
Big Data and OSS at IBM
Boulder Java User's Group
 
Unified Data API for Distributed Cloud Analytics and AI
Alluxio, Inc.
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics
 
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
Mesoscon 2015
Skand Gupta
 
Instrumenting and Scaling Databases with Envoy
Daniel Hochman
 
巨量資料入門 The evolution of data architecture
Wei-Chiu Chuang
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DATAVERSITY
 
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
actifio
 
Big Data to SMART Data : Process Scenario
CHAKER ALLAOUI
 
IDERA Live | The Ever Growing Science of Database Migrations
IDERA Software
 
Stream processing for the practitioner: Blueprints for common stream processi...
Aljoscha Krettek
 
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
Container and Kubernetes without limits
Antje Barth
 
Automating Big Data with the Automic Hadoop Agent
CA | Automic Software
 
Building a system for machine and event-oriented data with Rocana
Treasure Data, Inc.
 
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
DataWorks Summit
 
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
PPTX
Managing the Dewey Decimal System
DataWorks Summit
 
PPTX
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
PPTX
Security Framework for Multitenant Architecture
DataWorks Summit
 
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
PPTX
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
PDF
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Recently uploaded (20)

PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PPTX
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PPTX
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
PPTX
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
Biography of Daniel Podor.pdf
Daniel Podor
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Biography of Daniel Podor.pdf
Daniel Podor
 

Data Gloveboxes: A Philosophy of Data Science Data Security

  • 1. © 2018 Bloomberg Finance L.P. All rights reserved. Data Gloveboxes: A Philosophy of Data Science Data Security DataWorks Summit - Barcelona March 21, 2019 Clay Baenziger Hadoop Infrastructure
  • 2. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Bloomberg by the Numbers
  • 3. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Bloomberg By the Numbers • Founded in 1981 • 325,000 subscribers in 170 countries • Over 19,000 employees in 192 locations — Over 5,000 software engineers — 100+ machine learning data scientists and engineers • More News reporters than The New York Times + Washington Post + Chicago Tribune — News content from 125K+ sources — >1.5M news stories ingested / published each day (that's 500 news stories ingested/second) • One of the largest private networks in the world • 100B+ tick messages per day, with a peak of more than 10 million messages/second • More than a billion messages (E-Mails and IB chats) processed each day
  • 4. Nuclear Materials Manufacturing Image: Office of Legacy Management, U.S. D.O.E., Rocky Flats Plant History & Information Used to Process EEOICPA Claim Requests. 16 April, 2014 Former U.S. Department of Energy Rocky Flats Plant - South of Boulder, CO
  • 5. Plutonium Dropbox? Isolation Glovebox? Dropbox: [n] a container where one can deposit something to be retrieved later Glovebox: [n] a sealed protective container in which one may safely manipulate a dangerous substance using gloves attached to holes Images: (Top) Office of Legacy Management, U.S. D.O.E., CO-83-M-2 - Interior view of X-Y retriever. 29 Nov, 1988 (Right) Office of Legacy Management, U.S. D.O.E., CO-83-K-15 - View of safe geometry station from the inside of an input-output station. 3 Dec, 1988
  • 6. Data Dropbox? Data Glovebox? Dropbox: [n] a data-system where one can deposit a file for later reading and processing dependent on client (network) location; ideally providing a positive verification of file contents Glovebox: [n] a sealed compute environment in which one may safely manipulate data using restricted access - with strong exfiltration controls MySQL
  • 7. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Plutonium Enclave Image: Office of Legacy Management, U.S. D.O.E.,CO-83-M-14 - Downdraft Table, 20 Aug. 2014
  • 8. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Data Enclave Centralized Models: • Curator Model: Restrict access, operations and results — “The curator must remain present throughout the lifetime of the database” (Dwork, Cynthia. “Differential Privacy: A Survey of Results”, 1, Apr. 2008) — Statistical Disclosure Control • Data Enclave: (Lane and Shipp. “Using a Remote Access Data Enclave for Data Dissemination”. Intl. Journal of Digital Curation. 1.2 (2007)) — Allow for Direct and Exact Access — Allow Arbitrary Computation — Prevent Business Automation
  • 9. Image: Denver Public Library, Rocky Mountain News Photographic Archives, “Rocky Flats employee handles a robotic arm assembly.” 11 Nov. 1987 Plutonium Glovebox
  • 10. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Data Glovebox • Leaded Pane of Glass: Remote Desktop Without Download • Glove Ports: Arbitrary Code Execution (Run code to manipulate) • Robotics: Workflow Management — Deployment — Routine operations • Pass-throughs: — Firewalls are insufficient — Protocol aware deep packet inspection — Databases • Firewalls: Ensure user and workload isolation — Distributed file-systems — Local file-systems — Processing
  • 11. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Glovebox Architecture Copyrights: Git Logo, Jason Long; Python “Two Snakes” Logo, Python Software Foundation
  • 12. © 2018 Bloomberg Finance L.P. All rights reserved.© 2018 Bloomberg Finance L.P. All rights reserved. Client Nodes Client Nodes Master Node HBase Master YARN Resource Manager HDFS Namenode Region Server HBase HDFS Map/ReduceNovel Application YARN Spark Region Server HBase HDFS Map/ReduceNovel Application YARN Spark Region Server HBase HDFS Map/ReduceNovel Application YARN Spark Region Server HBase HDFS Map/ReduceNovel Application YARN Spark Hadoop Architecture Client Nodes Cluster Nodes HBase Region Server HDFS Datanode Map/ReduceNovel Application YARN Nodemanager Spark Master Node HBase Master YARN Resource Manager HDFS Namenode <HTTP REST API> YARN Job: • Submission • Status • Logs • App. WebUIs <HTTP/2 REST API> WebHDFS: • GET Methods: — Open (read) — GetFileChecksum • PUT Methods: — Create • POST Methods: — Append <HBaseProtobuf RPC> • Get • Put • Scan
  • 13. Handling Material. Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 13
  • 14. Properties of Radioactive Materials
    Radioactive Materials:
    • Can be harmful to people in small quantities
    • Can have a very long hazard life if released
    • Should be isolated to prevent their spread
    • Should be cataloged and characterized to assess harm
    • Can still be machined and worked with proper technique
      — Robotics
      — Personal Protective Equipment
  • 15. Properties of Material Data
    Material Data:
    • Can be harmful to people in small quantities
    • Can have a very long hazard life if released
    • Should be isolated to prevent their spread
    • Should be cataloged and characterized to assess harm
    • Can still be used and analyzed with proper technique
      — Continuous/Automated Deployment
      — Workflow Automation and Gloveboxes
  • 16. Nuclear Material (non)Proliferation. Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 59
  • 17. Data Proliferation
    • Terrorists? (Well, certainly hackers...)
    • Accidental loss (USB sticks, laptops, etc.)
    • No Price-Anderson Act for Data Incidents
      — Quite the opposite with GDPR!
      — GDPR limits untraceable mixing of data
    • Data Sovereignty
      — Requires data to remain geographically stationary
      — Must move computation to the data
    • Data:
      — Swamps
      — Lineage (Visibility e.g. via Apache Atlas)
      — Masking (Curator Model e.g. via Apache Ranger)
  • 18. Lock Everything Down. Image: Office of Legacy Management, U.S. D.O.E., “Rocky Flats Overview”, 20 Aug. 2014. Pg. 21
  • 19. Lock Data Down
    Quality Attributes of a Dropbox
    • Perimeter Controls (Network Firewall)
    • Encryption:
      — At rest
      — On the wire
    • Authentication (Kerberos)
    • Client Location Controls Usage (Directionality)
      — Data goes in from insecure networks
      — Data cannot come back out to an insecure network
      — Allow validation of transmission from anywhere
      — “Normal” usage from trusted networks
  • 20. DropboxFilter for HDFS (WebHDFS API)
    • Upload:
        $ curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
        HTTP/1.1 307 TEMPORARY_REDIRECT
        $ curl -i -X PUT -T <File> "http://<DN>:<PORT>/webhdfs/v1/<PATH>?op=CREATE..."
        HTTP/1.1 201 Created
    • Download:
        $ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
    • Checksum:
        $ curl -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
        {"FileChecksum": {
          "algorithm": "MD5-of-0MD5-of-512CRC32C",
          "bytes": "[...]00eb745ad2f5bd1dccab359b12f7f9411b00000000",
          "length": 28 }}
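The two-step upload on this slide can also be sketched in Python. This is a minimal illustration under stated assumptions, not the talk's tooling: the helper names `webhdfs_url` and `upload` are hypothetical, the cluster is assumed to serve plain HTTP, and authentication (Kerberos/SPNEGO in the deck) is left out entirely.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def _put(url, body=None):
    """Send one PUT without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port)
    try:
        conn.request("PUT", f"{parts.path}?{parts.query}", body=body)
        reply = conn.getresponse()
        reply.read()  # drain the response so the connection closes cleanly
        return reply.status, reply.getheader("Location")
    finally:
        conn.close()


def upload(host, port, path, data):
    """Two-step CREATE: the first PUT returns a 307, the second PUT carries the bytes."""
    status, location = _put(webhdfs_url(host, port, path, "CREATE"))
    if status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got {status}")
    status, _ = _put(location, body=data)
    return status  # 201 on success, as in the curl transcript above
```

Only the first request goes to the `<HOST>` address; the second goes to whatever address the redirect names, which is why a dropbox filter has to cover both hops.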
  • 21. DropboxFilter for HDFS (Architecture)
    • HDFS Protocols
      — Protobuf RPC
      — RESTful API over HTTPS/2
    • Servlet Based Web Server Design
      — Filter
      — Request Handler
    • Diagram: Client (curl), HTTP Server (Netty), Servlet Container (Jetty), AuthenticationFilter(request…), DropboxFilter(request…), WebHDFS Handler; two arrows labeled “Provides User Info”
  • 22. DropboxFilter for HDFS (Examples)
    • 401 response:
        <head><title>Error 401 Authentication required</title></head>
        <body><h2>HTTP ERROR 401</h2>
        <p>Problem accessing /webhdfs/v1/user/ubuntu/foo. Reason:
        <pre>Authentication required</pre></p>
        <hr /><i><small>Powered by Jetty://</small></i><br/>
    • 403 response:
        <head><title>Error 403 WebHDFS is configured write-only for clay</title></head>
        <body><h2>HTTP ERROR 403</h2>
        <p>Problem accessing /webhdfs/v1/user/clay/foo. Reason:
        <pre> WebHDFS is configured write-only for clay</pre></p>
        <hr/><i><small>Powered by Jetty://</small></i><br/>
    • Diagram: same components as slide 21
    Download from: https://issues.apache.org/jira/browse/HDFS-14234
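The directional policy from slides 19 to 22 can be sketched as a single decision function. This is an illustration only and not the HDFS-14234 code, which is a Java servlet filter: the function name, the grouping of operations, and the choice to key the rule on client network are assumptions made for this example.

```python
import ipaddress

# Hypothetical grouping of WebHDFS operations by the direction data moves.
WRITE_OPS = {"CREATE", "APPEND"}      # data goes in
VERIFY_OPS = {"GETFILECHECKSUM"}      # validates a transmission without returning data


def dropbox_decision(op, client_ip, trusted_networks):
    """Return (status, reason) for one request under a write-only dropbox policy."""
    address = ipaddress.ip_address(client_ip)
    if any(address in ipaddress.ip_network(net) for net in trusted_networks):
        return 200, "trusted network: normal usage"
    if op.upper() in WRITE_OPS | VERIFY_OPS:
        return 200, "untrusted network: data may go in and be verified"
    return 403, "WebHDFS is configured write-only for this client"
```

For example, `dropbox_decision("OPEN", "203.0.113.7", ["10.0.0.0/8"])` yields a 403, while the same client may still `CREATE` a file and fetch its checksum.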
  • 23. Isolation Glovebox. Images: (Left) Library of Congress, U.S. D.O.E., View Of A Worker Holding A Plutonium 'Button.' 19 Sep. 1973 (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-8 - View of foundry induction furnaces. N.D.
  • 24. Glovebox Production Line. Images: (Left) Library of Congress, U.S. D.O.E., View Of A Glovebox Line Used In Plutonium Operations. 5 May 1970 (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-3 - View of Chainveyor. 25 Jan. 1993
  • 25. Data Glovebox (Leaded Pane of Glass)
    Avoid Overexposure to Raw Data
    • Remote Desktop:
      — Limited RDP
    • Key Attributes:
      — No copy out
      — No file shares
      — Isolation per user
    • Useful to Have Tools:
      — Web browser for Jupyter/Zeppelin
      — SSH client for command-line access
  • 26. Data Glovebox (Glove Ports & Robotics)
    Manipulate Your Data - With Code
    • Run on a compute cloud using Apache YARN; submit:
      — SQL to Apache Hive
      — Python or Scala to Apache Spark
      — An arbitrary application
    • Automation to ensure consistency (e.g. Apache Oozie)
      — A workflow manager for Hive and Spark jobs
      — Data transformations for expected reports: known processes generating “decontaminated” results
      — Can run as a non-human service account to drop data in a directory for data exfiltration
      — Can provide repeatable deployment of code using Git
  • 27. Data Glovebox (Pass-Through)
    Negative pressure (one-way) network; exfiltrate only “decontaminated” data
    • Provide a process for data hand-off through an environment
    • Firewalls:
      — Mostly a transport OSI Layer 4 device (TCP/IP)
      — Can do “deep packet inspection” - but need to MITM traffic
      — Policy rules for which users can manipulate which data become extensive
      — Prohibitively expensive
    • Technology Specific:
      — DropboxFilter for WebHDFS
      — Database RPCs are complex but:
          GRANT INSERT ON DATABASE.* TO write_only@'%';
          GRANT SELECT ON DATABASE.* TO read_only@'%';
      — HBase today has no built-in client location filtering
  • 28. Firewalls (Workload and User Isolation)
    Don’t let your data spontaneously combust; clean up “chips”
    File Systems Leak
    • Permissions on data sets
    • User collaboration locations
    • Temporary/failed job data
    • Temporary data locations
      — Distributed file systems: Hive Warehouse, /tmp
      — Local file systems: /tmp, /var/tmp, /dev/shm
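One way to look for leftover “chips” on a local file system is to walk the temporary locations named on this slide and flag files that other accounts can read. The sketch below is an assumption about how such a check could look, not something from the talk; `exposed_files` and `TEMP_DIRS` are hypothetical names, and it covers local POSIX paths only, not HDFS or the Hive warehouse.

```python
import os
import stat

# Local temporary locations listed on the slide.
TEMP_DIRS = ["/tmp", "/var/tmp", "/dev/shm"]


def exposed_files(directories):
    """Yield regular files under the given directories that group or other users can read."""
    for directory in directories:
        # Unreadable subdirectories are skipped silently via the no-op error handler.
        for root, _dirs, files in os.walk(directory, onerror=lambda err: None):
            for name in files:
                path = os.path.join(root, name)
                try:
                    mode = os.lstat(path).st_mode
                except OSError:
                    continue  # the file vanished or cannot be inspected
                if stat.S_ISREG(mode) and mode & (stat.S_IRGRP | stat.S_IROTH):
                    yield path
```

Running `list(exposed_files(TEMP_DIRS))` from a scheduled job gives a starting list for cleanup; it reports permissions only and says nothing about what the files contain.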
  • 29. Take out the Trash. Image: State of Idaho Oversight Monitor. Nov. 2006. Pg. 10
  • 30. Private Temporary Directories
    • To provide isolation, one can use pam_namespace
    • To set up directories and clean up, one can use pam_exec
    See also: Our integration of the work in https://github.com/bloomberg/chef-bach/pull/1278
    Diagram:
    • Initial Mount Namespace: / (inode 2); tmp (inode 100); polyinst (inode 101); tmp_clay (inode 201); tmp_foo (inode 201); home (inode 300)
    • User clay’s Mount Namespace: / (inode 2); tmp (inode 201); polyinst (inode 101); tmp_clay (inode 201); tmp_foo (inode 201); home (inode 300)
  • 31. Keeping the Pipes Flowing. Image: Office of Legacy Management, U.S. D.O.E., CO-83-AF-1 - View of Building 215A. N.D.
  • 32. YARN Network Isolation (Example)
    YARN-7468 - Provide means for container network policy control
    Diagram labels: YARN Nodes (YARN Nodemanager, HDFS Data Node); Network Class 1 (User A: Novel Application); Network Class 2 (User B: Spark); iptables; Database A, WebService A, Database B
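The diagram pairs per-user network classes with iptables. As a rough sketch of that idea, the function below renders iptables commands that match traffic by net_cls class ID using the iptables `cgroup` match. Everything specific here is an assumption for illustration: the function name, the policy shape, the class IDs and the addresses are hypothetical, and the mechanism actually implemented under YARN-7468 may differ.

```python
def egress_rules(policy):
    """Render iptables commands limiting each network class to its allowed destinations.

    policy maps a net_cls class ID (hex string) to a list of (ip, port) pairs.
    The commands are returned as strings; nothing is executed.
    """
    rules = []
    for classid, destinations in policy.items():
        for ip, port in destinations:
            rules.append(
                f"iptables -A OUTPUT -m cgroup --cgroup {classid} "
                f"-p tcp -d {ip} --dport {port} -j ACCEPT"
            )
        # Every other packet from this class is dropped.
        rules.append(f"iptables -A OUTPUT -m cgroup --cgroup {classid} -j DROP")
    return rules
```

A real policy would also have to allow the cluster's own traffic (HDFS, YARN, shuffle) for each class before the final DROP, which this sketch omits.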
  • 34. Firewalls Are Important. Images: (Left) Office of Legacy Management, U.S. D.O.E., CO-83-N-3 - Damaged Filter Plenums. 16 Sept. 1957 (Right) Office of Legacy Management, U.S. D.O.E., CO-83-M-5 - View of a glove box firewall detail. 8 May 1970
  • 35. Data Glovebox
    • Leaded Pane of Glass: Remote Desktop Without Copy
    • Glove Ports: Manipulate Your Data at Arm’s Length
    • Robotics: Workflow Management
    • Pass-throughs: Negative Pressure to Keep the Bits Flowing
    • Firewalls: Ensure User and Workload Isolation
  • 36. Cleanup Is Messy. Image: CO Dept. of Pub. Health, “Citizen Summary Rocky Flats Historical Public Exposures Studies 1969 Fire”
  • 37. Thank You. Connect with Hadoop Team: [email protected]