Big Data Science
Hype?
Levente Török
Blinkbox Music Ltd ... GE Hungary
Disclaimer
All statements appearing in the slides or in the presentation represent my personal
opinion. They are not connected to any company or person I have had, or have, a
connection with.
I make these statements at the risk of error.
Summary
- Big data? Data Science? Hype?
- Continuous improvement of Online Systems
- A/B testing
Data Science, hype?
Harvard Business Review in 2012
Data Science, hype?
Forbes in 2015
Whether or not employers know what data scientists do, they have been using the
term “data scientist” in job descriptions in rapidly growing numbers over the
past two years, as Indeed.com’s data demonstrates.
Developers, developers ...
“Data Science” in media
Yahoo Finance:
“If you take a cue from the Harvard Business Review, the title goes to data
scientists. That’s right. Data scientist, as in, the type of person who can write
computer code and harness the power of big data to come up with innovative
ways to use it for companies like Google (GOOG), Amazon (AMZN), and
Yahoo! (YHOO).”
“Data Science” in media
Nature Jobs:
Data Science, what is this?
Wikipedia
“Data Science is the extraction of knowledge from data, which is a continuation
of the fields of data mining and predictive analytics”
Data? Science... ?
1) Big Data Engineer
- Hive, Yarn, Spark, Impala
2) Data Miner
- SAS, Knime, Rapid Miner, Weka,
IBM Clementine
3) Big Data & Data Miner
- Apache - Mahout
- Spark - MLlib, Spark - GraphX
- Apache - Giraph
- GraphLab ?
Data Scientist?
Big data - big failure:
If an algorithm doesn’t work on small data, it won’t work on big data.
4) Data Scientist is a real scientist:
Follows scientific principles in data modeling:
- conjectures hypotheses about the statistical structure of the data
- validates them offline and online
- improves the model iteratively
Tools: R / Python / C++
http://bit.ly/1B3bSS1
Tools: verdict
other -> R -> python = 0.44 * 0.26 = 0.11
other -> python -> R = 0.23 * 0.18 = 0.04
Is this correct?
However ... what?
Improving Online Systems
Examples
Recommender systems (i.e., RecSys):
- What to listen to next?
- What ad to display?
Anomaly detection:
- Is this user/system behaviour “normal”?
- Is this system going to fail soon?
Data Flow in Online Sys
Online sys -> log -> daily aggregation -> long-term storage -> batch model building
Online sys -> log -> queue -> async online model updates -> near-optimal online data model
The major difficulty
(Diagram: data source -> daily aggregates -> batch model training;
 data source -> queue -> online model training)
1. Batch model training starts at 4:00 and finishes at 4:30.
2. The next round of online model updates starts at 4:30 and would finish at 5:10
with all the events from 0:00 to 4:30, but new events have arrived in the meantime.
.... -> streaming architectures
Offline data modelling
(Diagram: Train data -> Model (controlled by Parameters) -> Prediction,
 evaluated against Test data)
Offline modeling
1. Data splits for train / test / quiz
- time-based: e.g., 2 weeks / 1 day / 1 day
- entity-based: sets of users
- session-based: sets of user sessions
Test data preparation:
- positive/negative sample points manually labelled, or injected
2. Train by batch training
Given a data set, we try to fit the model to it, controlled by the model
parameters.
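A minimal sketch of the time-based split above; the event log is synthetic and the window lengths mirror the slide's 2 weeks / 1 day / 1 day example:

```python
from datetime import datetime, timedelta

def time_based_split(events, train_days=14, test_days=1, quiz_days=1):
    """Split a time-ordered event log into train / test / quiz windows.

    `events` is a list of (timestamp, payload) pairs, already sorted by time.
    """
    start = events[0][0]
    train_end = start + timedelta(days=train_days)
    test_end = train_end + timedelta(days=test_days)
    quiz_end = test_end + timedelta(days=quiz_days)

    train = [e for e in events if e[0] < train_end]
    test = [e for e in events if train_end <= e[0] < test_end]
    quiz = [e for e in events if test_end <= e[0] < quiz_end]
    return train, test, quiz

# Toy log: one event per day for 16 days.
t0 = datetime(2015, 1, 1)
events = [(t0 + timedelta(days=d), {"user": d % 5}) for d in range(16)]
train, test, quiz = time_based_split(events)
print(len(train), len(test), len(quiz))  # -> 14 1 1
```

An entity- or session-based split would group by user id or session id instead of timestamp, but the shape of the code is the same.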
Offline data modelling
3. Prediction phase: given a model
- for each user seen in training, we give predictions
- for each event in the test set, we predict its likelihood
4. Evaluation phase: the similarity between predictions and test data is measured
- RecSys: NDCG, Recall, Precision, AUC, ... some 20 different metrics
- artificially labelled data sets for anomaly detection: C2B (AUC),
weighted AUC, ...
- Sanity check! -> Q/A team
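A sketch of two of the listed RecSys metrics; the relevance values and item lists below are made up for illustration:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items in predicted order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG: DCG of the predicted order, normalised by the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one user's recommendation list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

# Graded relevance of the recommender's top-5 list, in predicted order:
rels = [3, 2, 0, 1, 2]
print(round(ndcg_at_k(rels, 5), 3))  # close to 1: near-ideal ordering

p, r = precision_recall_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=4)
print(p, r)  # -> 0.5 0.6666666666666666
```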
Offline data modelling
4. Parameter search in parallel
The output of the search is the parameter vector (+ model id) that yields the
offline-optimal solution, according to our belief.
NB: we are usually unsure which offline measure will best reflect online
results, so we keep a number of optimal parameter vectors, one per offline
measure.
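A toy sketch of the parallel parameter search; `offline_score` is a stand-in for a full train-and-evaluate run, and the parameter names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def offline_score(params):
    """Stand-in for training a model with `params` and scoring it offline."""
    rank, reg = params
    # Made-up metric with a single peak at rank=40, reg=0.1.
    return -(rank - 40) ** 2 - 100 * (reg - 0.1) ** 2

grid = list(product([10, 20, 40, 80], [0.01, 0.1, 1.0]))

# Evaluate every parameter vector in parallel, keep the offline-optimal one.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(offline_score, grid))

best_score, best_params = max(zip(scores, grid))
print(best_params)  # -> (40, 0.1)
```

In practice you would run this once per offline measure, keeping one optimal parameter vector for each, as the NB above suggests.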
A/B testing
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
??
Online performance tuning
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
(Performance feeds back into the training Parameters)
Online traffic split adj.
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
Offline-Online matching
Model | NDCG | AUC | ... | Avg Sess Len
  A   |  1   |  1  |     |  1
  B   |  2   |  3  |     |  3
  C   |  3   |  2  |     |  2
(rank of each model per measure; offline measures left, online measure right)
Compare with Pearson's correlation coefficient.
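The rank comparison can be computed directly; a pure-Python Pearson correlation over the rank columns of the table above (note that Pearson on ranks is exactly Spearman's rho):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Rank of each model (A, B, C) under each measure, from the table:
ndcg_rank = [1, 2, 3]
auc_rank = [1, 3, 2]
online_rank = [1, 3, 2]  # Avg Sess Len

print(pearson(ndcg_rank, online_rank))  # -> 0.5
print(pearson(auc_rank, online_rank))   # -> 1.0
```

Here AUC's ranking matches the online measure perfectly, so AUC would be the offline metric to trust for this (toy) set of models.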
On-line testing
5. A/B testing
- control model
- treatment model (a model with an offline-optimal parameter set)
6. Evaluation of online results:
Measures:
- Session length, station length
- Return rate, CLTV
Filter and compare models -> wow!
On-line testing
7. Run many models one by one, according to phase 4.
8. Figure out the best offline metrics:
Compare the order statistics of the offline and online model rankings
(e.g., with Pearson's correlation) to figure out which offline metrics matter
most for online performance.
Model comparisons
Problems:
1. On day 1, A is better; on day 2, B is better.
2. The version with the longest session length != the version with the highest
full-play ratio of tracks.
3. Outliers dominate the session-length average:
- a number of users listen to the service “forever”
- bouncing users pollute the session-length average with high noise
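A toy illustration of point 3, with made-up session lengths: a single "forever" listener and a few bouncers wreck the mean, while the median stays representative.

```python
from statistics import mean, median

# Hypothetical session lengths (minutes): mostly short sessions, plus
# three bouncers and one user who leaves the stream on "forever".
sessions = [3, 5, 4, 6, 2, 4, 5, 3, 0.1, 0.1, 0.2, 600]

print(round(mean(sessions), 1))  # -> 52.7  (dominated by the 600)
print(median(sessions))          # -> 3.5   (robust to the outliers)
```

This is why robust statistics (median, trimmed means) or outlier filtering are worth considering before comparing variants on session length.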
A/B testing
1. Version A: Control group
2. Version B: Treatment group
With n_A and n_B users, we observe k_A and k_B successes.
Is it enough to compare k_A / n_A with k_B / n_B?
A/B testing?
Questions:
- What if one day A wins and the next day B wins?
- How many users should I use for testing?
- How long should I run the test?
- What if we have A, B, C, ... versions we want to test?
Classical Statistics
Hypothesis testing:
- Does treatment B have any effect?
- up to probability (1 - alpha)
- given: a sample size of N
Even the best-known A/B testing platforms can lead you to illusory results.
Command: “Sample size estimator”
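A sketch of what a sample size estimator computes, using the common two-proportion approximation; the z-values are hard-coded for alpha = 0.05 and power = 0.8, and the baseline/lift numbers are illustrative:

```python
import math

def sample_size_per_arm(p_base, min_lift):
    """Rough per-arm sample size for a two-proportion z-test.

    Uses the textbook approximation
        n = 2 * (z_{alpha/2} + z_beta)^2 * p(1 - p) / delta^2
    with z-values fixed for alpha = 0.05 (two-sided) and power = 0.8.
    """
    z_alpha, z_beta = 1.96, 0.8416
    p_bar = p_base + min_lift / 2   # pooled proportion under the alternative
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar)
                     / min_lift ** 2)

# Detecting a 2-point lift over a 10% baseline needs thousands of users
# per arm, far more than the n ~ 150 in the example on the next slides.
print(sample_size_per_arm(0.10, 0.02))
```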
Binomial ?
Note that the two are conjugate:
Binomial distribution: P(k | n, p) = C(n, k) * p^k * (1 - p)^(n - k)
Beta distribution: f(p; alpha, beta) = p^(alpha - 1) * (1 - p)^(beta - 1) / B(alpha, beta),
where B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
New statistics
n_A = 150, k_A = 18
n_B = 145, k_B = 14
The major question: is A really better than B?
New statistics
n_A = 150, k_A = 18
n_B = 145, k_B = 14
The major question:
Chance2beat: the probability that one variant’s true success rate exceeds the
other’s, e.g. P(p_A > p_B | data)
(Plot: posterior densities f_A(x; ...) and f_B(x; ...) over the success rate x)
Chance 2 beat
- This is a probability we want to drive up by testing. For example:
- The underlying distributions can be:
- Gaussians,
- distributions with priors,
- empirical distributions, or
- small-sample data sets used directly
- Sometimes this is not enough: use bootstrapping!
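A minimal Monte-Carlo sketch of chance-to-beat, assuming independent Beta posteriors with uniform priors (my assumption; the slides do not fix a prior), using the counts from the previous slides:

```python
import random

def chance_to_beat(k_a, n_a, k_b, n_b, draws=200_000, seed=42):
    """Monte-Carlo estimate of P(p_B > p_A) under independent
    Beta(1 + k, 1 + n - k) posteriors (uniform priors on both rates)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + k_a, 1 + n_a - k_a)
        p_b = rng.betavariate(1 + k_b, 1 + n_b - k_b)
        wins += p_b > p_a
    return wins / draws

# Numbers from the slides: A converts 18/150, B converts 14/145.
print(round(chance_to_beat(18, 150, 14, 145), 2))
```

With these counts the estimate comes out around a quarter: B probably does not beat A, but the evidence is nowhere near decisive, which is exactly why comparing raw ratios k/n is not enough.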
Thanks
