Knowledge Discovery & Data Mining
22nd ACM SIGKDD 2016
André Karpištšenko
~80 sessions for 2,700 participants
• Business Applications and Frameworks at Scale
• Data Streams Mining
• DashOpt features
• Outlier Detection
• Bayesian Optimization
• Deep Learning
• Investing into AI and Data
• Bonus keywords
88 countries, 35% YoY, 15-20% acceptance
Business Application Examples
• Consumer Internet focus: Content Ranking,
Recommendation, User Intent and Context Prediction
• Industrial Internet focus: Autonomy, Predictive
Maintenance, Operational Intelligence, Production Planning
• B2B focus: Targeting, Lead Generation, Sales Development,
Opportunity Management, Account Management
• Web content analytics: Image, Video, Text Classification for
Relevance, Products Categorization, Sentiments
• Other: Cyber Security, Fraud/Spam Detection, NLP, Speech
Recognition, Image/Video Recognition
Predictive Modeling Flow
[Figure: flow diagram, also labeled "DashOpt". Stages as labeled: Raw Data; Feature Engineering; Raw Features; Labels; Feature Integration; Features with Labels; Data Partitioning into Training, Validation and Testing Data; Model Training; Evaluate for model selection; Compute offline evaluation metrics; Best model; Offline scoring and indexing; Online/offline systems; Online A/B test; Log data; Label preparation; Scoring features; Model Performance; Test Results]
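The data partitioning stage of the flow can be sketched as follows. This is a minimal illustration, not the pipeline from the talk: the function name `partition`, the 70/15/15 split and the fixed seed are all assumptions made for the example.

```python
import random

def partition(rows, train=0.7, validation=0.15, seed=0):
    """Shuffle rows and split them into training, validation and testing sets."""
    if train <= 0 or validation < 0 or train + validation >= 1:
        raise ValueError("fractions must leave room for a testing set")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical (feature, label) rows standing in for "Features with Labels".
rows = [(i, i % 2) for i in range(100)]
train_set, val_set, test_set = partition(rows)
```

The training set fits candidate models, the validation set drives model selection, and the testing set is held back for the final offline evaluation metrics.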
Data Technologies
• Most common: HDFS, MapReduce, Spark, Hive
• Decision spectrum: build, assemble, buy
• Factors: network, open-source, maturity, needs
[Figure: axis labeled Exploratory and Production]
Classic ML Platform at Scale
[Figure: architecture diagram. Labeled components: Application 1, Application 2, … Application N; Drivers; Intelligence Engine; Feature Engineering Libraries; Machine Learning Libraries; Pig, Hive, Python, Scala, Shell script, …; Workflows; Workflow Scheduler & Manager; Metadata Store; MapReduce / Yarn; HDFS; Feature Mart; Ground Truth; Models; Scores]
http://github.com/szilard/benchm-ml
Baseline Performance Benchmark
https://sites.google.com/site/iotminingtutorial/
IoT Data Streams Mining
• 10 billion devices in 2015 -> 34 billion devices by 2020
• Continuous data, dynamic models, distributed, few seconds
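Continuous data with dynamic models means statistics have to be updated one reading at a time instead of recomputed over a stored batch. A minimal sketch of that idea, using Welford's online algorithm for mean and variance (the class name `RunningStats` and the sample readings are assumptions for illustration, not material from the tutorial):

```python
class RunningStats:
    """Incrementally tracks mean and sample variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # constant memory, one pass over the stream
```

Each update costs constant time and memory, which is what makes this kind of model usable on device streams.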
Streams Mining: Actors Model
Data processing pipeline Distributed processing
Kappa Architecture
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Outlier Detection
• Single point anomaly detection: likelihood over distribution
• Finding anomalous groups: divergence estimation
• Methods: percentage change, T-test, Chi-square test, Generalized ESD (Extreme
Studentized Deviate) test, Seasonal Hybrid ESD, etc.
• Goal: move from detection to automated response
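One simple reading of "likelihood over distribution" for single-point detection is to fit a distribution to past values and flag a new point whose tail probability under it is very small. The sketch below assumes a normal distribution and a threshold `alpha` of 0.001; both are illustrative choices, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def tail_probability(history, x):
    """Two-sided probability of a value at least as extreme as x,
    under a normal distribution fitted to history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        raise ValueError("history has no variation to fit")
    z = abs(x - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))

def is_outlier(history, x, alpha=0.001):
    """Flag x when it is less likely than alpha under the fitted distribution."""
    return tail_probability(history, x) < alpha

# Hypothetical past measurements: mean 10.0, sample standard deviation 0.2.
history = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
spike_flagged = is_outlier(history, 12.0)   # ten standard deviations away
normal_flagged = is_outlier(history, 10.1)  # half a standard deviation away
```

The tests listed on the slide (T-test, Generalized ESD and so on) refine this idea, for example by accounting for several outliers or for seasonality.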
Outlier Detection in Practice
• Too many detections of too little value
• Use methods for thresholds
• Breakout detection and Concept Drift
• For changing distributions move baselines over time
• Risk of overfitting to known anomalies, not finding unknown anomalies
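Moving the baseline over time can be sketched with a rolling window: each point is compared against the mean and standard deviation of the points just before it, so the detector adapts after a level shift. The window size, the threshold `k` and the sample series are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev

def rolling_outliers(values, window=5, k=3.0):
    """Return (index, value) pairs lying more than k standard deviations
    from a baseline computed over the previous `window` points."""
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(x - mu) > k * sigma:
                flagged.append((i, x))
        # The new point joins the baseline, so the baseline follows the data.
        recent.append(x)
    return flagged

# Hypothetical series with a level shift from about 11 to about 21 at index 5.
series = [10, 11, 10, 12, 11, 20, 21, 20, 22, 21, 20, 21]
flagged = rolling_outliers(series)
```

Only the onset of the shift is reported. A fixed baseline built from the first five points would flag every later value, which is the "too many detections of too little value" problem.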
Bayesian aka Active Optimization
• Examples: Design of Experiments, hyper-parameters of
supervised learning, algorithms tested with simulations
f is an unknown, expensive black-box function; the goal is to
approximately optimize f with as few experiments as possible
• No free lunch theorem
• Other bio-inspired algorithms for optimization (exploitation and
exploration): neural networks, genetic algorithms, swarm intelligence,
ant colony optimisation, etc.
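The loop of optimizing an expensive black-box f with few experiments can be sketched as below. The slides do not name a surrogate model or acquisition function; this example assumes a Gaussian process with a squared-exponential kernel and the expected-improvement criterion, a common combination, on a one-dimensional search space. It also assumes NumPy is available. All function names and parameter values are hypothetical.

```python
import math
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf(x_obs, x_new)
    solved = np.linalg.solve(k, k_s)           # K^-1 K_s
    mu = solved.T @ y_obs
    var = 1.0 - np.sum(k_s * solved, axis=0)   # prior variance is 1 for this kernel
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """Expected gain over the best observed value (maximization)."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def bayes_opt(f, n_init=3, n_iter=10, seed=0):
    """Maximize f on [0, 1] using n_init random and n_iter guided experiments."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)          # candidate experiments
    x_obs = rng.uniform(0.0, 1.0, n_init)
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iter):
        y_centered = y_obs - y_obs.mean()      # match the zero-mean GP prior
        mu, sigma = gp_posterior(x_obs, y_centered, grid)
        ei = expected_improvement(mu, sigma, y_centered.max())
        x_next = grid[np.argmax(ei)]           # most promising next experiment
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])

# Hypothetical objective standing in for an expensive experiment.
x_best, y_best = bayes_opt(lambda x: -(x - 0.3) ** 2)
```

Expected improvement trades off exploitation (high predicted mean) against exploration (high predicted uncertainty), which is the balance the slide refers to.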
Bayesian Optimization in Practice
• SigOpt experience: 20 dimensions, above human capacity.
• Uber ATC experience: scaling active optimization to high
dimensions, as the default works reliably for 5-7 dimensions.
• Variables are added during optimization.
• Choose fidelity using heuristics.
Deep Learning
• Compute power, GPU, learning architectures and a lot of
labeled data are what drive DL
• Applied to Vision and Speech: matches human performance
• Not possible where experiments are costly: biotech
• Kaggle winners are not DL models: tree ensembles, SVMs
• Common technologies: TensorFlow, Caffe, Theano, Keras
• Thousands of pieces of software: modules and layers
• Explainability and interpretability are the next big things
• EU regulation. Tradeoff: accuracy vs explainability.
Deep Learning Trends
• Vision nets are deeper and structured (Larsson 2016)
• Language nets also have dynamics, memory and attention
(Rocktäschel 2016, Miller 2016)
• Probabilistic programming (Lake, Tenenbaum)
• Programs as networks (Riedel)
• Neural Programmer-Interpreters for learning programs
(Reed et al 2016)
• Computation graphs interacting with memory
• Loop for reasoning for nested questions (Miller 2016)
• Generative adversarial networks (Reed 2016). Models capable of
imagining images, videos and text.
Investing into AI and Data
• Data acquisition, real-time detection and visualization not solved yet.
• Empower more people to do data science. Automate routines.
• Unsolved problems are learning from unlabeled data, planning,
reasoning, problem solving, concept formulation, 1/10k compute.
• Key decisions: Timing, accuracy in what is hard, find verticals and
focus, identify differentiation & size of the prize & people & partners
Business of outliers: 1% capital returns 526x, 48% returns 0
[Chart "VC assessment": quarterly series, Q2'15 to Q2'16, vertical axis 0 to 900, annotated "Peak Data" and "Peak ML"]
Bonus Keywords
• Lifelong Machine Learning: systems approach, transfer learning,
never-ending learners. Useful for building knowledge.
• Graphons: graph convergence and limits as the number of vertices goes
to infinity. Useful for privacy-preserving mining.
• Computational Social Science: how individuals interact to produce
collective behaviour. Individuals exert more effort by themselves than
groups.
• Information Security: trusted key management is most sensitive.
Secret must be changed frequently. Confidentiality easier to violate
than authenticity. Integrity. Offence more lucrative than defence.
• Enterprise Data: in reality “random data salad” prone to constant
change due to M&A, politics, dynamic schema DBs (e.g. Mongo), legacy
burden, restructuring, leadership changes, data hoarding. Machine
driven, human guided processes required.
Thinking about Value from Data Science

