Data Privacy with Apache Spark:
Defensive and Offensive
Approaches
Serge Smertin
Resident Solutions Architect
Databricks
About me
▪ Worked in all stages of the data lifecycle for the past 13 years
▪ Built data science platforms from scratch
▪ Tracked cyber criminals through massively scaled data forensics
▪ Built anti-PII analysis measures for the payments industry
▪ Now bringing Databricks strategic customers to the next level as a full-time job
About you
▪ Most likely very hands-on with Apache Spark™
▪ Background in data engineering, information security and a bit of cloud infrastructure
▪ Want (or are asked) to limit the data to maintain privacy and comply with regulations
▪ Probably familiar with GDPR or CCPA
▪ Genuinely curious how to do that with the fewest different tools
[Diagram: the small "ML Code" box sits among Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure and Monitoring; "Data Privacy (this talk)" is an addition and is not based on the Google paper.]
"Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex." - "Hidden Technical Debt in Machine Learning Systems," Google NIPS 2015
Open-source intelligence (OSINT)
... is a multi-methods methodology for collecting, analyzing and making
decisions about data accessible in publicly available sources to be used in an
intelligence context. In the intelligence community, the term "open" refers to
overt, publicly available sources.
https://inteltechniques.com/JE/OSINT_Packet_2019.pdf
Offensive techniques
▪ Linkage attacks
▪ Sequence attacks
▪ Homogeneity attacks
A.k.a. day-to-day data science
Technique comparison dimensions
We'll compare each of the mentioned techniques across a few common dimensions to help you pick the best one for your use case.
▪ Usefulness
▪ How useful would the data still be for applied data science?
▪ Difficulty to implement
▪ How much effort might it take to implement and support the solution in the short and long term?
▪ Schema preservation
▪ Do we need to make special schema considerations?
▪ Format preservation
▪ Would original and anonymized data points look the same to the people analysing them?
▪ Performance impact
▪ How would it affect the performance of the entire data pipeline?
▪ Will it affect reads or writes?
▪ Is it going to involve shuffle?
▪ Re-identification
▪ What kinds of data forensic attacks could be performed to de-anonymise individuals?
Pseudonymization
Protects datasets at the record level, for Machine Learning.
Replaces the original data point with a pseudonym for later re-identification, inaccessible to unauthorized users.
A pseudonym is still considered personal data according to the GDPR.
Anonymization
Protects entire tables, databases or whole data catalogues, mostly for Business Intelligence.
Personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly.
Real-world scenarios usually combine more than one technique.
kpkrdiTAnqvfxuyE *********************
1AYrGFCTTRYOwO *********************
72AjraZ8sU9EsNw *********************
A.k.a. Machine Learning engineers asking you when that particular dataset is finally going to be pseudonymized, so that they can .fit_and_predict() their models
Pseudonymization
Encryption
▪ Usefulness: high
▪ Difficulty: medium
▪ Schema: same
▪ Format: different
▪ Performance: more data, slow to
encrypt & decrypt
▪ Re-identification: encryption key
leak will allow re-identifying
When someone thinks that AWS S3 or Azure ADLS
encryption is not enough.
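The slide shows no code for this; here is a minimal sketch of column-level encryption, assuming Spark 3.3+ with its built-in aes_encrypt / aes_decrypt expressions (older versions would need a UDF around javax.crypto). The table, column and secret names are illustrative.

import org.apache.spark.sql.functions._

// Assumption: the key lives in the same kind of secret scope as the salt example later
val key = dbutils.secrets.get("dais2020-data-privacy-talk", "encryption-key")

// NB: interpolating the key into SQL text leaks it into query plans and logs;
// a hardened version would use a UDF or pass the key some other way.
val encrypted = spark.table("resellers")
  .withColumn("email_enc", base64(expr(s"aes_encrypt(email, '$key', 'GCM')")))
  .drop("email")

// Authorized re-identification
val decrypted = encrypted
  .withColumn("email", expr(s"cast(aes_decrypt(unbase64(email_enc), '$key', 'GCM') as string)"))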
Hashing
▪ Usefulness: high
▪ Difficulty: easy
▪ Schema: same
▪ Format: different
▪ Performance: group by is slower
▪ Re-identification: hashcat,
dictionary and/or combinator
attacks
Just SHA512() the sensitive data
Making hash cracking a bit more difficult
resource "random_password" "salt" {
special = true
upper = true
length = 32
}
resource "databricks_secret_scope" "data_privacy" {
name = "dais2020-data-privacy-talk"
}
resource "databricks_secret" "salt" {
key = "salt"
string_value = random_password.salt.result
scope = databricks_secret_scope.data_privacy.name
}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val secretSalt = dbutils.secrets.get(
  "dais2020-data-privacy-talk", "salt")
def obscureHash(x: Column) = translate(
base64( // perform base64 encoding
unhex( // instead of standard HEX one
substring( // to confuse bad guys
sha2(x, 512) // that it is just SHA-512
, 0, 32) // but truncated to first 16 bytes
)), "=", "") // and some base64 characters removed
def saltedObscureHash(x: Column) = obscureHash(
concat(lit(secretSalt), x))
val df = spark.table("resellers")
.select('email)
.withColumn("hash", obscureHash('email))
.withColumn("salt", lit(secretSalt))
.withColumn("salted_hash", saltedObscureHash('email))
Terraform configuration / Hashing dataset
If you think you can invent a new salting technique, it has probably already been cracked by GPUs running hashcat
Free-form data
❏ Fully depends on your specific dataset; a generic solution is hardly possible
❏ Use an ensemble of different techniques to remove sensitive pieces from free-form text
❏ %pip install names-dataset can get you 160k+ different names that you can use as a filter. Enhance it with more region- and business-specific data.
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

secretSalt = dbutils.secrets.get("dais2020-data-privacy-talk", "salt")
def purePythonSaltedObscureHash(x):
import hashlib, base64
sha512 = hashlib.sha512()
sha512.update((secretSalt + x).encode('utf-8'))
first_bytes = bytes.fromhex(sha512.hexdigest()[0:32])
return base64.b64encode(first_bytes).decode('utf-8').replace('=', '')
@pandas_udf('string', PandasUDFType.SCALAR)
def free_form_cleanup(series):
def inner(text):
from pkg_resources import resource_filename
all_names = {'first': set(), 'last': set()}
for t in all_names.keys():
with open(resource_filename('names_dataset', f'{t}_names.all.txt'), 'r') as x:
all_names[t] = set(x.read().strip().split('\n'))
new_text = []
for word in text.split(" "):
# Only simplest techniques are shown. Recommended ensemble of:
# 1) Regex rules for IP/Emails/ZIP codes
# 2) Named Entity recognition
# 3) Everything is very specific to data you have
if word.lower() in all_names['first']:
word = purePythonSaltedObscureHash(word)
if word.lower() in all_names['last']:
word = purePythonSaltedObscureHash(word)
new_text.append(word)
return " ".join(new_text)
return series.apply(inner)
(spark.table('free_form')
.select('id', 'subject')
.withColumn('safer_subject', free_form_cleanup(col('subject'))))
Combinator attacks
❏ Adapted from the combinator attack in the Hashcat manual.
❏ Can be combined with permutations of name n-grams to cover for typos, e.g. trigrams of “Serge”: ser, erg, rge (see the sketch below).
❏ A simple but still elegant addition might involve fitting a Markov chain to generate random names per region.
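A small sketch of the n-gram idea, assuming a hypothetical first_names table with a name column (for instance loaded from the names-dataset files mentioned earlier):

import org.apache.spark.sql.functions._

// Build a dictionary of lower-cased name trigrams that a combinator-style
// wordlist (or a defender testing their own hashes) could be fed with.
def trigrams(s: String): Seq[String] = s.toLowerCase.sliding(3).toSeq
val trigramsUdf = udf(trigrams _)

val nameTrigrams = spark.table("first_names")
  .select(explode(trigramsUdf(col("name"))) as "trigram")
  .distinct()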
Credit card numbers
● 2.8 billion credit cards in use
worldwide.
● Around 6k Bank Identification
Number (BIN) ranges.
● Requires PCI DSS compliant
storage infrastructure.
● Rules are very convoluted and
sometimes contradictory.
● It’s best to use Tokenization
instead of hashing.
Tokenization
▪ Usefulness: very high
▪ Difficulty: high
▪ Schema: almost the same
▪ strings become longs, which improves performance
▪ Format: different
▪ Performance: slower to write,
faster to read
▪ Re-identification: depends
Non-inferrable source-to-destination mapping
def replaceTokensFromVault(columns: String*)(df: DataFrame) = {
  val vault = df.sparkSession.table("token_vault")
  columns.foldLeft(df)((df, c) =>
    df.withColumnRenamed(c, s"${c}_token")
      .join(vault.where('k === c)
          .withColumnRenamed("v", c)
          .withColumnRenamed("token", s"${c}_token"),
        Seq(s"${c}_token"), "left"))
    .select(df.columns.map(col(_)): _*)
}
import spark.implicits._
val campaign = Seq(
(399, 103),
(327, 290),
(353, 217))
.toDF
.withColumnRenamed("_1", "email")
.withColumnRenamed("_2", "name")
.transform(replaceTokensFromVault("email",
"name"))
display(campaign)
import org.apache.spark.sql.expressions.Window

val normalize = (x: Column) => lower(x)
def fanoutOriginals(columns: Seq[String], df: DataFrame) =
df.withColumn("wrapper", array(
columns.map(c => struct(lit(c) as "k", col(c) as "v")): _*))
.select(explode('wrapper))
.selectExpr("col.*")
.withColumn("v", normalize('v))
.dropDuplicates("k", "v")
def replaceOriginalsWithTokens(columns: Seq[String],
df:DataFrame, tokens: DataFrame) =
columns.foldLeft(df)((df, c) =>
df.withColumn(c, normalize(col(c)))
.withColumnRenamed(c, s"${c}_normalized")
.join(tokens.where('k === c)
.withColumnRenamed("v", s"${c}_normalized")
.withColumnRenamed("token", c),
Seq(s"${c}_normalized"), "left"))
.select(df.columns.map(col(_)): _*)
def tokenizeSnapshot(columns: String*)(df: DataFrame) = {
val newToken = row_number() over Window.orderBy(rand())
val tokens = fanoutOriginals(columns, df)
.withColumn("token", newToken)
replaceOriginalsWithTokens(columns, df, tokens)
}
val generic = spark.table("resellers").transform(
tokenizeSnapshot("email", "name", "joindate", "city", "industry"))
display(generic)
● protects from sequence attacks by
randomizing token allocation within append
microbatch
● good for a quick demonstration of concepts
● doesn't do the most important thing - persist
token <-> value relationships in a vault
● generically applies tokenization to specified
columns
import java.lang.Thread
import java.util.ConcurrentModificationException
import org.apache.spark.sql.expressions.Window
def allocateNewTokens(columns: Seq[String],
df: DataFrame): DataFrame = {
val vault = df.sparkSession.table("token_vault")
for (i <- 1 to 10) try {
val startingToken = vault.agg(
coalesce(max('token), lit(0))).first()(0)
val newToken = lit(startingToken) + (
row_number() over Window.orderBy(rand()))
fanoutOriginals(columns, df)
.join(vault, Seq("k", "v"), "left")
.where('token.isNull)
.withColumn("token", newToken)
.write.mode("append").format("delta")
.saveAsTable("token_vault")
return vault
} catch { case e: ConcurrentModificationException if i < 10 =>
print("Retrying token allocation")
}
return vault
}
def tokenizeWithVault(columns: String*)(df: DataFrame) =
replaceOriginalsWithTokens(columns, df,
allocateNewTokens(columns, df))
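A hypothetical usage example, mirroring the snapshot version shown earlier:

val tokenized = spark.table("resellers")
  .transform(tokenizeWithVault("email", "name"))
display(tokenized)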
● This is the busiest place of data ingestion, so consider running OPTIMIZE token_vault ZORDER BY (v) to help with writes (see below).
● Randomization is done at the microbatch level, so we get a shuffle only for new tokens, keeping the implementation simpler.
● The token is assigned as a long, populated with row_number() over a random window. monotonically_increasing_id() is not used because it gives inaccurate results, as it's computed within a Spark partition. UUID is not used because of consistency-related performance reasons.
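The OPTIMIZE hint from the first bullet, as a one-liner (assuming the vault is a Delta table on Databricks):

spark.sql("OPTIMIZE token_vault ZORDER BY (v)")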
Token Vault with
Databricks Delta
GDPR with
Vault
GDPR without
any vault
A.k.a. Removing outliers from data
Anonymization
Synthetic data
▪ Usefulness: lower
▪ Difficulty: look at the Synthetic Minority Oversampling Technique (SMOTE)
▪ Schema: same
▪ Performance: not very usable for
streaming. Requires ML fitting
▪ Re-identification: harder on the
snapshot, though may drift for
appendable data
A.k.a. adding random records that do not look random
Column suppression
▪ Accuracy: medium
▪ Difficulty: easy
▪ Schema: fewer columns
▪ Performance: less data
▪ Re-identification: be aware of
outliers
Not always enough, either for data scientists or for those who care about protecting the data, because a few outliers may lead to identification when joined with external sources
CREATE OR REPLACE VIEW resellers_suppressed AS SELECT
-- NOT INCLUDING COLUMNS: id, email, name,
-- joindate, commission
city, industry, leads_eur_90days, sales_eur_90days
FROM resellers
Row suppression
Recommended for low-frequency data.
Can still be attacked by joining with other available datasets.
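A rough sketch of row suppression as a frequency threshold; the grouping columns and the k value are illustrative, not from the talk:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep only rows whose (city, industry) combination occurs at least k times,
// suppressing rare combinations that could identify individuals.
val k = 5
val groupSize = count(lit(1)).over(Window.partitionBy("city", "industry"))

val suppressed = spark.table("resellers")
  .withColumn("group_size", groupSize)
  .where(col("group_size") >= k)
  .drop("group_size")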
Generalisation
▪ Accuracy: medium to high
▪ Difficulty: trivial with rules. Fun
with Federated Learning.
▪ Schema: same
▪ Format: same
▪ Performance: minimal impact
▪ Re-identification: difficult, but
possible
Remove precision from data
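A minimal illustration, assuming the resellers table from earlier: generalise the join date down to month precision.

import org.apache.spark.sql.functions._

val generalised = spark.table("resellers")
  .withColumn("join_month", date_trunc("month", col("joindate")))
  .drop("joindate")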
Binning
APPROX_PERCENTILE is a performant and generic way to add contextual binning to any existing Spark DataFrame (sketch below)
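A sketch of contextual binning into quartiles, assuming the resellers table and its sales_eur_90days column:

import org.apache.spark.sql.functions._

// Compute approximate quartile boundaries once...
val quartiles = spark.table("resellers")
  .select(expr("approx_percentile(sales_eur_90days, array(0.25, 0.5, 0.75))"))
  .first().getSeq[Double](0)
val (q1, q2, q3) = (quartiles(0), quartiles(1), quartiles(2))

// ...then replace the precise value with its bin.
val binned = spark.table("resellers")
  .withColumn("sales_bin",
    when(col("sales_eur_90days") <= q1, "Q1")
      .when(col("sales_eur_90days") <= q2, "Q2")
      .when(col("sales_eur_90days") <= q3, "Q3")
      .otherwise("Q4"))
  .drop("sales_eur_90days")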
Truncating: IP addresses
➔ Rounding an IP address to its /24 CIDR block is considered anonymous enough, if other properties of the dataset allow (sketch below).
➔ IP-geolocation databases (MaxMind, IP2Location) would generally represent it at city or neighbourhood level, whereas the real IP would represent a street.
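A one-line version of the idea, assuming a hypothetical events table with an IPv4 ip column: zero out the last octet.

import org.apache.spark.sql.functions._

val truncated = spark.table("events")
  .withColumn("ip_24", regexp_replace(col("ip"), "\\.\\d{1,3}$", ".0"))
  .drop("ip")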
Rounding
Example rounding rules:
⟳ All numbers are rounded to the
nearest multiple of 15
⟳ Any number lower than 7.5 is suppressed
⟳ Halves are always rounded
upwards (e.g. 2.5 is rounded to 5)
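A sketch of those rules, assuming a hypothetical payments table with a numeric amount column:

import org.apache.spark.sql.functions._

val rounded = spark.table("payments")
  .withColumn("amount_rounded",
    when(col("amount") < 7.5, lit(null))                  // suppress small values
      .otherwise(floor(col("amount") / 15 + 0.5) * 15))   // nearest multiple of 15, halves rounded up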
Column, Row and Table level security on Databricks
Watermarking
▪ Accuracy: high on snapshot,
makes no sense globally
▪ Difficulty: extreme
▪ Schema: almost same
▪ Format: different
▪ Performance: extreme overhead
▪ Re-identification: easy to identify
who leaked the data
Same as tokenization, but with token vaults on every select. Makes the life of an external data scientist hard.
[Diagram labels: Data Lake, per-snapshot token vault, snapshot, physical plan rewrite, leak prevention data, data theft monitoring, data science, live scoring]
A.k.a. make the life of the data scientist harder
External controls
Auditing
▪ Difficulty: significant
infrastructure investment
▪ Performance: a data scientist is required to track data scientists
▪ Track filters & usual access
patterns
Track every action of every data scientist, everywhere
Remote desktop
▪ Accuracy: bulletproof
▪ Difficulty: make sure that data is
accessible only through RDP
▪ Schema: n/a
▪ Performance: data scientists can
see all the data they need
▪ Re-identification: depends
And prevent copy-paste
Screenshot prevention
▪ Not COVID-19-friendly
▪ Physical desktop in the office
connecting to remote desktop
▪ Motion sensors to detect phones being lifted to take a photo of the screen. It’s actually a real thing.
▪ Or simply prevent data scientists from bringing phones, pen & paper to their workstation
A.k.a. remote desktop, but even next level
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.