Me, A Data Scientist?
Fabricio Quintanilla, MSc, PhD
fabricio.quintanilla@gmail.com
@fabrixq
/fquintanilla
http://www.inteligenciadenegocios.net
MCP, MCPD, MCTS
Organized by
5/21/2016 Me, A Data Scientist?2 |
SQL Saturday Sponsors
Agenda
Not Rocket Science….
Just Data Science…
Man on the Moon – 1969
Man on the Moon – Small Data
Computer Program
Date: 1969
64Kb, 2Kb RAM,
Fortran
Must Work 1st time
Apollo XI
Speed: 3,500 Km/h
Weight: 13,500 Kg
Lots of complex data
Man on the Moon
Distance: 356,500 Km
Never been there before
Must return to Earth
Skydive Stratos, 2012
Tens of Gigabytes!!!
Think about it ... We live in crazy times…
What is Big Data? Mumbo-jumbo
§ A fashionable term typically used by some IT
vendors to remarket old-fashioned software
and hardware
Big Data is not about Data Volume
No way!!!! Water Cooler Chat
§ We need to parallelize data operations but it’s too costly & complex…
§ The business can’t get access to all the relevant data, we need external data
§ We can’t match customer master data to live customer interactions…
§ We can’t just force everything into a star-schema…
§ These BI reports and charts don’t tell us anything we didn’t know…
§ We are missing the ETL window, the data we needed didn’t arrive on time…
§ We can’t predict with confidence if we can’t explore data & develop our own
models
What is big data?
“Big Data is any thing which is crash Excel.”
“Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.”
Or, in other words, Big Data is data
in volumes too great to process by
traditional methods.
https://twitter.com/devops_borat
What is Big Data? Force of Change
§ Big Data forces you to change the way you
collect, store, manage, analyze and visualize
data.
Big Data = “Crude Oil” [not useful oil]
§ Think of data as ‘Crude Oil’
§ Big data is about extracting the ‘Crude Oil’,
transporting it in ‘mega-tankers’, siphoning it
through ‘pipelines’ and storing it in massive
‘silos’…
§ All ‘this’ is about IT Big Data… fine and well…
§ BUT………..
You need to refine the ‘Crude Oil’
Enter Data Science
The Science [and Art] of…
§ Discovering what we don’t know from data
§ Obtaining predictive, actionable insight from data
§ Creating Data Products that have business impact now
§ Communicating relevant business stories from data
§ Building confidence in decisions that drive business value
What is a data scientist?
Class DataScientist {
Is skeptical, curious. Has inquisitive mind
Knows Machine Learning, Statistics, Probability
Applies Scientific Method. Runs Experiments
Is good at Coding & Hacking
Able to deal with IT Data Engineering
Knows how to build data products
Able to find answers to known unknowns
Tells relevant business stories from data
Has Domain Knowledge
}
What does a Data Scientist Do?
10 Things [most] Data Scientists Do
§ Ask Good Questions, What is What
§ …we don’t know?
§ …we’d like to know?
§ Define and Test a Hypothesis, Run experiments
§ Scoop, Scrape, Sink & Sample Business Relevant Data
§ Purge and Wrestle Data, Tame Data
§ Explore Data, Discover Data Playfully. Discover
Unknowns.
§ Model Data. Model Algorithms
§ Understand Data Relationships
§ Tell the Machine How to Learn from Data
§ Create Data Products that Deliver Actionable insight
§ Tell Relevant Business Stories from Data
[Sort of a] Data Scientist Toolkit
§ Java, R, Python… (bonus: Clojure, Haskell, Scala)
§ Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)
§ HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog)
§ ETL, Webscrapers, Flume, Sqoop… (bonus: Hume)
§ SQL, RDBMS, DW, OLAP…
§ Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn,
pandas)
§ D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
§ SPSS, Matlab, SAS… (the Enterprise man)
§ NoSQL, MongoDB, Couchbase, Cassandra…
§ And Yes!!! … MS-Excel: the most used, most underrated
DS tool…
Types of algorithms
§ Clustering
§ Association learning
§ Parameter estimation
§ Recommendation engines
§ Classification
§ Similarity matching
§ Neural networks
§ Bayesian networks
§ Genetic algorithms
Basically, it’s all maths...
§ Linear algebra
§ Calculus
§ Probability theory
§ Graph theory
§ ...
“Only 10% in devops know how to work with Big Data. Only 1% are realize they need 2 Big Data for fault tolerance”
https://twitter.com/devops_borat
Big data skills gap
§ Hardly anyone knows this stuff
§ It’s a big field, with lots and lots of theory
§ And it’s all maths, so it’s tricky to learn
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
Two orthogonal aspects
§ Analytics / machine learning
§ learning insights from data
§ Big data
§ handling massive data volumes
§ Can be combined, or used separately
Data science?
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
How to process Big Data?
§ If relational databases are not enough,
what is?
“Mining of Big Data is problem solved in 2013 with zgrep”
https://twitter.com/devops_borat
MapReduce
§ A framework for writing massively
parallel code
§ Simple, straightforward model
§ Based on “map” and “reduce” functions
from functional programming (LISP)
NoSQL and Big Data
§ Not really that relevant
§ Traditional databases handle big data
sets, too
§ NoSQL databases have poor analytics
§ MapReduce often works from text files
§ can obviously work from SQL and NoSQL,
too
§ NoSQL is more for high throughput
§ basically, AP from the CAP theorem, instead
of CP
§ In practice, really Big Data is likely to be a
mix
§ text files, NoSQL, and SQL
The 4th V: Veracity
“The greatest enemy of knowledge is not
ignorance, it is the illusion of knowledge.”
Daniel Boorstin, in The Discoverers (1983)
“95% of time, when is clean Big Data is get Little Data”
https://twitter.com/devops_borat
Data quality
§ A huge problem in practice
§ any manually entered data is suspect
§ most data sets are in practice deeply
problematic
§ Even automatically gathered data can be
a problem
§ systematic problems with sensors
§ errors causing data loss
§ incorrect metadata about the sensor
§ Never, never, never trust the data without
checking it!
§ garbage in, garbage out, etc
http://www.slideshare.net/Hadoop_Summit/scaling-big-data-mining-infrastructure-twitter-experience/12
Conclusion
§ Vast potential
§ to both big data and machine learning
§ Very difficult to realize that potential
§ requires mathematics, which nobody
knows
§ We need to wake up!
Theory
Two kinds of learning
§ Supervised
§ we have training data with correct
answers
§ use training data to prepare the algorithm
§ then apply it to data without a correct
answer
§ Unsupervised
§ no training data
§ throw data into the algorithm, hope it
makes some kind of sense out of the data
Some types of algorithms
§ Prediction
§ predicting a variable from data
§ Classification
§ assigning records to predefined groups
§ Clustering
§ splitting records into groups based on similarity
§ Association learning
§ seeing what often appears together with what
Issues
§ Data is usually noisy in some way
§ imprecise input values
§ hidden/latent input values
§ Inductive bias
§ basically, the shape of the algorithm we
choose
§ may not fit the data at all
§ may induce underfitting or overfitting
§ Machine learning without inductive bias
is not possible
Testing
§ When doing this for real, testing is crucial
§ Testing means splitting your data set
§ training data (used as input to algorithm)
§ test data (used for evaluation only)
§ Need to compute some measure of performance
§ precision/recall
§ root mean square error
§ A huge field of theory here
§ will not go into it in this course
§ very important in practice
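The split-and-measure idea above can be sketched in plain Python. The toy labels and predictions, the 25% test fraction, and the helper names are assumptions made for illustration, not part of the original slides.

```python
import math
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold back a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def rmse(actual, predicted):
    """Root mean square error for numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical data: 20 record ids, six classified items, three numeric predictions.
train, test = train_test_split(range(20))
precision, recall = precision_recall([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

The test records are only used for scoring; a model would be fitted on `train` alone.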
Missing values
§ Usually, there are missing values in the data set
§ that is, some records have some NULL values
§ These cause problems for many machine
learning algorithms
§ Need to solve somehow
§ remove all records with NULLs
§ use a default value
§ estimate a replacement value
§ ...
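The three strategies listed above might look like this in plain Python. The records, field names, and the choice of the mean as the estimated replacement are assumptions for the sake of the example.

```python
# Hypothetical records; None plays the role of NULL.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def drop_incomplete(rows):
    """Strategy 1: remove every record that has a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_default(rows, field, default):
    """Strategy 2: replace missing values in one field with a fixed default."""
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]

def fill_mean(rows, field):
    """Strategy 3: estimate a replacement, here the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    return fill_default(rows, field, sum(known) / len(known))

complete = drop_incomplete(records)
with_default = fill_default(records, "income", 0)
with_mean = fill_mean(records, "age")
```

Each helper returns new records and leaves the input untouched, which makes it easy to compare the strategies side by side.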
Terminology
§ Vector
§ one-dimensional array
§ Matrix
§ two-dimensional array
§ Linear algebra
§ algebra with vectors and matrices
§ addition, multiplication, transposition, ...
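As a small illustration of these terms, vectors can be written as Python lists and matrices as lists of rows. The helper names below are made up for this sketch; in practice a library such as NumPy would normally be used.

```python
def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(column) for column in zip(*m)]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matmul(a, b):
    """Matrix product: each entry is a row of a dotted with a column of b."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

def matvec(m, v):
    """Matrix times vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

A = [[1, 2], [3, 4]]   # two-dimensional array: a matrix
B = [[0, 1], [1, 0]]
v = [1, 1]             # one-dimensional array: a vector
```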
Top 10 algorithms
Top 10 machine learning algorithms
 1. C4.5                                  No
 2. k-means clustering                    Yes
 3. Support vector machines               No
 4. the Apriori algorithm                 No
 5. the EM algorithm                      No
 6. PageRank                              No
 7. AdaBoost                              No
 8. k-nearest neighbours classification   Kind of
 9. Naïve Bayes                           Yes
10. CART                                  No
From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006. “Top
10 algorithms in data mining”, by X. Wu et al.
C4.5
§ Algorithm for building decision trees
§ basically trees of boolean expressions
§ each node splits the data set in two
§ leaves assign items to classes
§ Decision trees are useful not just for classification
§ they can also teach you something about the classes
§ C4.5 is a bit involved to learn
§ the ID3 algorithm is much simpler
§ CART (#10) is another algorithm for learning
decision trees
Support Vector Machines
§ A way to do binary classification on
matrices
§ Support vectors are the data points
nearest to the hyperplane that divides the
classes
§ SVMs maximize the distance between
SVs and the boundary
§ Particularly valuable because of “the
kernel trick”
§ using a transformation to a higher dimension
to handle more complex class boundaries
§ A bit of work to learn, but manageable
Apriori
§ An algorithm for “frequent itemsets”
§ basically, working out which items frequently
appear together
§ for example, what goods are often bought
together in the supermarket?
§ used for Amazon’s “customers who bought
this...”
§ Can also be used to find association rules
§ that is, “people who buy X often buy Y” or
similar
§ Apriori is slow
§ a faster, further development is FP-growth
http://www.dssresources.com/newsletters/66.php
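A compact sketch of the level-wise Apriori idea, assuming small in-memory baskets. The basket contents and the support threshold of 3 are invented for illustration, and the code favours clarity over speed.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support baskets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent sets into candidates that are one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its smaller subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent = apriori(baskets, min_support=3)

# Association rule "beer -> diapers": share of beer baskets that also hold diapers.
confidence = frequent[frozenset({"diapers", "beer"})] / frequent[frozenset({"beer"})]
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if every subset of it is already known to be frequent.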
Expectation Maximization
§ A deeply interesting algorithm I’ve seen
used in a number of contexts
§ very hard to understand what it does
§ very heavy on the maths
§ Essentially an iterative algorithm
§ skips between “expectation” step and
“maximization” step
§ tries to optimize the output of a function
§ Can be used for
§ clustering
§ a number of more specialized examples, too
PageRank
§ Basically a graph analysis algorithm
§ identifies the most prominent nodes
§ used for weighting search results on Google
§ Can be applied to any graph
§ for example an RDF data set
§ Basically works by simulating random walk
§ estimating the likelihood that a walker would be
on a given node at a given time
§ actual implementation is linear algebra
§ The basic algorithm has some issues
§ “spider traps”
§ graph must be connected
§ straightforward solutions to these exist
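The random-walk idea can be sketched with a simple iteration. This version uses the commonly described damping ("teleport") factor as one of the straightforward fixes for spider traps and disconnected graphs; the value 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport share: with probability (1 - damping) the walker jumps anywhere.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread this node's rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: nothing links to "d", several pages link to "c".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each pass is equivalent to one matrix-vector multiplication, which is why real implementations are written as linear algebra.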
AdaBoost
§ Algorithm for “ensemble learning”
§ That is, for combining several algorithms
§ and training them on the same data
§ Combining more algorithms can be very
effective
§ usually better than a single algorithm
§ AdaBoost basically weights training
samples
§ giving the most weight to those which are
classified the worst
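One re-weighting round of AdaBoost can be sketched as follows, assuming labels and predictions coded as -1 and +1 and a weak learner whose weighted error is strictly between 0 and 1. The five samples are invented for illustration.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost re-weighting step for labels and predictions in {-1, +1}."""
    # Weighted error of the weak learner: total weight of the samples it got wrong.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Vote of this learner in the final ensemble; larger when the error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink weights of correct samples, grow weights of misclassified ones.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]          # the weak learner misses sample 1
weights = [1.0 / len(labels)] * len(labels)  # start with uniform weights
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.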
Recommendations
Collaborative filtering
§ Basically, you’ve got some set of items
§ these can be movies, books, beers, whatever
§ You’ve also got ratings from users
§ on a scale of 1-5, 1-10, whatever
§ Can you use this to recommend items to a
user, based on their ratings?
§ if you use the connection between their
ratings and other people’s ratings, it’s called
collaborative filtering
§ other approaches are possible
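A minimal user-based collaborative filtering sketch. The users, films, and ratings are invented, and cosine similarity over co-rated items is only one of several possible similarity measures.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana":   {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben":   {"Alien": 4, "Brazil": 5, "Dune": 5},
    "carla": {"Alien": 1, "Casablanca": 5, "Dune": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            similarity = cosine_similarity(ratings[user], their)
            weighted += similarity * their[item]
            total += similarity
    return weighted / total if total else None

prediction = predict("ana", "Dune")
```

Because ana's tastes resemble ben's far more than carla's, the predicted rating lands closer to ben's 5 than to carla's 2.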
Feature-based recommendation
§ Use user’s ratings of items
§ run an algorithm to learn what features of
items the user likes
§ Can be difficult to apply because
§ requires detailed information about items
§ key features may not be present in data
§ Recommending music may be difficult,
for example
Naïve Bayes
Bayes’s Theorem
§ Basically a theorem for combining
probabilities
§ I’ve observed A, which indicates H is true with
probability 70%
§ I’ve also observed B, which indicates H is
true with probability 85%
§ what should I conclude?
§ Naïve Bayes is basically using this
theorem
§ with the assumption that A and B are
independent
§ this assumption is nearly always false, hence
“naïve”
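One way to answer the "what should I conclude?" question is to work in odds. The sketch below assumes a prior of 50% for H before any observation and assumes A and B are independent both when H is true and when it is false; both assumptions go beyond what the slide states.

```python
def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1.0 - p)

def combine(p_h_given_a, p_h_given_b, prior=0.5):
    """Combine two pieces of evidence for H under the naive independence assumption."""
    # Each observation multiplies the prior odds by its own likelihood ratio,
    # which equals odds(P(H | observation)) / odds(prior).
    posterior_odds = odds(p_h_given_a) * odds(p_h_given_b) / odds(prior)
    return posterior_odds / (1.0 + posterior_odds)

combined = combine(0.70, 0.85)
```

Under these assumptions the two observations reinforce each other, and the combined probability comes out higher than either 70% or 85% alone.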
Simple example
§ Is the coin fair or not?
§ we throw it 10 times, get 9 heads and one tail
§ we try again, get 8 heads and two tails
§ What do we know now?
§ can combine data and recompute
§ or just use Bayes’s Theorem directly
http://www.bbc.co.uk/news/magazine-22310186
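The two routes mentioned above (recompute on the combined data, or update with Bayes's theorem batch by batch) can be compared with a Beta prior over the coin's bias. The uniform Beta(1, 1) prior is an assumption made for this sketch.

```python
def update(prior, heads, tails):
    """A Beta(a, b) prior plus observed flips gives a Beta(a + heads, b + tails) posterior."""
    a, b = prior
    return (a + heads, b + tails)

def mean(beta):
    """Expected probability of heads under a Beta(a, b) distribution."""
    a, b = beta
    return a / (a + b)

uniform = (1, 1)   # assumed prior: every bias is equally likely

# Route 1: apply Bayes's theorem after each batch of throws.
step_by_step = update(update(uniform, 9, 1), 8, 2)
# Route 2: pool the data (17 heads, 3 tails) and update once.
all_at_once = update(uniform, 17, 3)

estimate = mean(step_by_step)
```

Both routes give the same posterior, and the estimated chance of heads is well above the 0.5 expected from a fair coin.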
MapReduce
University pre-lecture, 1991
§ My first meeting with university was Open
University Day, in 1991
§ Professor Bjørn Kirkerud gave the computer
science talk
§ His subject
§ some day processors will stop becoming faster
§ we’re already building machines with many
processors
§ what we need is a way to parallelize software
§ preferably automatically, by feeding in normal
source code and getting it parallelized back
§ MapReduce is basically the state of the art
on that today
MapReduce
§ A framework for writing massively
parallel code
§ Simple, straightforward model
§ Based on “map” and “reduce” functions
from functional programming (LISP)
http://research.google.com/archive/mapreduce.html
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design
and Implementation,
San Francisco, CA, December, 2004.
map and reduce
>>> "1 2 3 4 5 6 7 8".split()
['1', '2', '3', '4', '5', '6', '7', '8']
>>> l = list(map(int, "1 2 3 4 5 6 7 8".split()))
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
>>> import operator
>>> from functools import reduce
>>> reduce(operator.add, l)
36
MapReduce
1. Split data into fragments
2. Create a Map task for each fragment
§ the task outputs a set of (key, value) pairs
3. Group the pairs by key
4. Call Reduce once for each key
§ all pairs with same key passed in together
§ reduce outputs new (key, value) pairs
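The four steps above can be imitated on a single machine with a word count, the usual first MapReduce example. This is only a sketch of the programming model: the fragments and function names are made up, and a real framework such as Hadoop would run the map and reduce tasks on different nodes.

```python
from collections import defaultdict
from functools import reduce

def map_task(fragment):
    """Step 2: emit a (word, 1) pair for every word in one fragment."""
    return [(word.lower(), 1) for word in fragment.split()]

def group_by_key(pairs):
    """Step 3: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Step 4: called once per key with all of its values; emits a new pair."""
    return (key, reduce(lambda a, b: a + b, values))

# Step 1: the input is already split into fragments.
fragments = ["big data is big", "data science is not rocket science"]
pairs = [pair for fragment in fragments for pair in map_task(fragment)]
counts = dict(reduce_task(key, values) for key, values in group_by_key(pairs).items())
```

Since each map task sees only its own fragment and each reduce call sees only one key, both phases can be run in parallel without coordination.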
Communications
§ HDFS
§ Hadoop Distributed File System
§ input data, temporary results, and results are stored
as files here
§ Hadoop takes care of making files available to nodes
§ Hadoop RPC
§ how Hadoop communicates between nodes
§ used for scheduling tasks, heartbeat etc
§ Most of this is in practice hidden from the
developer
The Hadoop ecosystem
§ Pig
§ dataflow language for setting up MR jobs
§ HBase
§ NoSQL database to store MR input in
§ Hive
§ SQL-like query language on top of Hadoop
§ Mahout
§ machine learning library on top of Hadoop
§ Hadoop Streaming
§ utility for writing mappers and reducers as
command-line tools in other languages
Applications of MapReduce
§ Linear algebra operations
§ easily mapreducible
§ SQL queries over heterogeneous data
§ basically requires only a mapping to tables
§ relational algebra easy to do in MapReduce
§ PageRank
§ basically one big set of matrix multiplications
§ the original application of MapReduce
§ Recommendation engines
§ the SON algorithm
§ ...
Apache Mahout
§ Has three main application areas
§ others are welcome, but this is mainly what’s there now
§ Recommendation engines
§ several different similarity measures
§ collaborative filtering
§ Slope-one algorithm
§ Clustering
§ k-means and fuzzy k-means
§ Latent Dirichlet Allocation
§ Classification
§ stochastic gradient descent
§ Support Vector Machines
§ Naïve Bayes
Lots of SQL-on-MapReduce tools
§ Tenzing (Google)
§ Hive (Apache Hadoop)
§ YSmart (Ohio State)
§ SQL-MR (AsterData)
§ HadoopDB (Hadapt)
§ Polybase (Microsoft)
§ RainStor (RainStor Inc.)
§ ParAccel (ParAccel Inc.)
§ Impala (Cloudera)
§ ...
Conclusion
Big data & machine learning
§ This is a huge field, growing very fast
§ Many algorithms and techniques
§ can be seen as a giant toolbox with wide-ranging
applications
§ Ranging from the very simple to the extremely
sophisticated
§ Difficult to see the big picture
§ Huge range of applications
§ Math skills are crucial
Take a look around Data Scientists’ Tools
Using SQL Server!!!
Fabricio Quintanilla
fabricio.quintanilla@gmail.com
inteligenciadenegocios.net
@fabrixq
QUESTIONS AND ANSWERS

SQL Saturday El Salvador 2016 - Me, A Data Scientist?

  • 1. Me, A Data Scientist? Fabricio Quintanilla, MSc, PhD [email protected] @fabrixq /fquintanilla http://www.inteligenciadenegocios.net MCP, MCPD, MCTS
  • 2. Organiza 5/21/2016 Me, A Data Scientist?2 |
  • 3. Patrocinadores del SQL Saturday 5/21/2016 Me, A Data Scientist?3 |
  • 4. Agenda Not Rocket Science…. Just Data Science… 5/21/2016 Me, A Data Scientist?4 |
  • 5. Man on the Moon – 1969 5/21/2016 Me, A Data Scientist?5 |
  • 6. Man on the Moon – Small Data Computer Program Date: 1969 64Kb, 2Kb RAM, Fortran Must Work 1st time 5/21/2016 Me, A Data Scientist?6 | Apollo XI Speed: 3,500 Km/h Weight: 13,500 Kg Lots of complex data Man on the Moon Distance: 356,500 Km Never been there before Must return to Earth
  • 7. Skydive Stratos, 2012 5/21/2016 Me, A Data Scientist?7 | Tens of Gigabytes!!! Think about it ... We live in crazy times…
  • 8. What is Big Data? mumbo-jumbo § A fashionable term typically used by some IT vendors to remarket old fashioned software and hardware 5/21/2016 Me, A Data Scientist?8 |
  • 9. Big Data is not about Data Volume 5/21/2016 Me, A Data Scientist?9 |
  • 10. No way!!!! Water Coller Chat § We need to parallelize data operations but it’s too costly & complex… § The business can’t get access to all the relevant data, we need external data § We can’t match customer master data to live customer interactions… § We can’t just force everything into a star-schema… § These BI reports and chart don’t tell us anything we didn’t know… § We are missing the ETL window, the data we needed didn’t arrive on time… § We can’t predict with confidence if we can’t explore data & develop our own models 5/21/2016 Me, A Data Scientist?10 |
  • 11. What is big data? 11 Big Data is any thing which is crash Excel. Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM. Or, in other words, Big Data is data in volumes too great to process by traditional methods. https://twitter.com/devops_borat
  • 12. What is Big Data? Force of Change § Big Data forces you to change the way you collect, store, manage, analyze and visualize data. 5/21/2016 Me, A Data Scientist?12 |
  • 13. Big Data = “Crude Oil” [not useful oil] § Think data as ‘Crude Oil’ § Big data is about extracting the ‘Crude Oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’and storing it in massive ‘silos’… § All ‘this’ is about IT Big Data… fine and well… § BUT……….. 5/21/2016 Me, A Data Scientist?13 |
  • 14. You need to refine the ‘Crude Oil’ Enter Data Science 5/21/2016 Me, A Data Scientist?14 |
  • 15. The Science [and Art] of… § Discovering what we don’t know from data § Obtaining predictive, actionable insight from data § Creating Data Products that have business impact now § Communicating relevant business stories from data § Building confidence in decisions that drive business value 5/21/2016 Me, A Data Scientist?15 |
  • 16. What is a data scientist? 5/21/2016 Me, A Data Scientist?16 |
  • 17. Class DataScientist { Is skeptical, curious. Has inquisitive mind Knows Machine Learning, Statistics, Probability Applies Scientific Method. Runs Experiment Is good at Coding & Hacking Able to deal IT Data Engineering Knows how to build data products Able to find answers to known unknowns Tells relevant business stories from data Has Domain Knowledge } 5/21/2016 Me, A Data Scientist?17 |
  • 18. What does a Data Scientist Do?
  • 19. 10 Things [most] Data Scientists Do § Ask Good Questions, What is What § …we don’t know? § …we’d like to know? § Define and Test a Hypothesis, Run Experiments § Scoop, Scrape, Sink & Sample Business Relevant Data § Purge and Wrestle Data, Tame Data § Explore Data, Discover Data Playfully. Discover Unknowns. § Model Data. Model Algorithms § Understand Data Relationships § Tell the Machine How to Learn from Data § Create Data Products that Deliver Actionable Insight § Tell Relevant Business Stories from Data
  • 20. [Sort of a] Data Scientist Toolkit § Java, R, Python… (bonus: Clojure, Haskell, Scala) § Hadoop, HDFS & MapReduce… (bonus: Spark, Storm) § HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog) § ETL, Webscrapers, Flume, Sqoop… (bonus: Hume) § SQL, RDBMS, DW, OLAP… § KNIME, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas) § D3.js, Gephi, ggplot2, Tableau, Flare, Shiny… § SPSS, Matlab, SAS… (the Enterprise man) § NoSQL, MongoDB, Couchbase, Cassandra… § And Yes!!! … MS-Excel: the most used, most underrated DS tool…
  • 21. Types of algorithms § Clustering § Association learning § Parameter estimation § Recommendation engines § Classification § Similarity matching § Neural networks § Bayesian networks § Genetic algorithms
  • 22. Basically, it’s all maths... § Linear algebra § Calculus § Probability theory § Graph theory § ... https://twitter.com/devops_borat Only 10% in devops know how to work with Big Data. Only 1% are realize they need 2 Big Data for fault tolerance
  • 23. Big data skills gap § Hardly anyone knows this stuff § It’s a big field, with lots and lots of theory § And it’s all maths, so it’s tricky to learn http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
  • 24. Two orthogonal aspects § Analytics / machine learning § learning insights from data § Big data § handling massive data volumes § Can be combined, or used separately
  • 26. How to process Big Data? § If relational databases are not enough, what is? https://twitter.com/devops_borat Mining of Big Data is problem solved in 2013 with zgrep
  • 27. MapReduce § A framework for writing massively parallel code § Simple, straightforward model § Based on “map” and “reduce” functions from functional programming (LISP)
  • 28. NoSQL and Big Data § Not really that relevant § Traditional databases handle big data sets, too § NoSQL databases have poor analytics § MapReduce often works from text files § can obviously work from SQL and NoSQL, too § NoSQL is more for high throughput § basically, AP from the CAP theorem, instead of CP § In practice, really Big Data is likely to be a mix § text files, NoSQL, and SQL
  • 29. The 4th V: Veracity “The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” Daniel Boorstin, in The Discoverers (1983) https://twitter.com/devops_borat 95% of time, when is clean Big Data is get Little Data
  • 30. Data quality § A huge problem in practice § any manually entered data is suspect § most data sets are in practice deeply problematic § Even automatically gathered data can be a problem § systematic problems with sensors § errors causing data loss § incorrect metadata about the sensor § Never, never, never trust the data without checking it! § garbage in, garbage out, etc.
  • 32. Conclusion § Vast potential § to both big data and machine learning § Very difficult to realize that potential § requires mathematics, which nobody knows § We need to wake up!
  • 34. Two kinds of learning § Supervised § we have training data with correct answers § use training data to prepare the algorithm § then apply it to data without a correct answer § Unsupervised § no training data § throw data into the algorithm, hope it makes some kind of sense out of the data
  • 35. Some types of algorithms § Prediction § predicting a variable from data § Classification § assigning records to predefined groups § Clustering § splitting records into groups based on similarity § Association learning § seeing what often appears together with what
  • 36. Issues § Data is usually noisy in some way § imprecise input values § hidden/latent input values § Inductive bias § basically, the shape of the algorithm we choose § may not fit the data at all § may induce underfitting or overfitting § Machine learning without inductive bias is not possible
  • 37. Testing § When doing this for real, testing is crucial § Testing means splitting your data set § training data (used as input to algorithm) § test data (used for evaluation only) § Need to compute some measure of performance § precision/recall § root mean square error § A huge field of theory here § will not go into it in this course § very important in practice
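The split and the two measures named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not a full evaluation framework: the function names `train_test_split`, `precision_recall` and `rmse`, the 25% test fraction and the fixed seed are all assumptions made for the example.

```python
import math
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(actual, predicted, positive=1):
    """Precision and recall for one class, given parallel lists of labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rmse(actual, predicted):
    """Root mean square error between two parallel lists of numbers."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


train, test = train_test_split(range(20))
```

Precision/recall suits classification output, while root mean square error suits numeric prediction; in both cases the measure should be computed on the test data only.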
  • 38. Missing values § Usually, there are missing values in the data set § that is, some records have some NULL values § These cause problems for many machine learning algorithms § Need to solve somehow § remove all records with NULLs § use a default value § estimate a replacement value § ...
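The three options listed here can be sketched as follows, assuming records are Python dicts and a NULL is represented as `None`. The helper names and the sample records are hypothetical, and the mean is only one of many possible estimates for a replacement value.

```python
from statistics import mean


def drop_incomplete(records):
    """Option 1: remove every record that contains a NULL (None)."""
    return [r for r in records if None not in r.values()]


def fill_default(records, field, default):
    """Option 2: replace NULLs in one field with a fixed default value."""
    return [
        {**r, field: default if r[field] is None else r[field]}
        for r in records
    ]


def fill_mean(records, field):
    """Option 3: estimate a replacement, here the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    return fill_default(records, field, mean(known))


people = [
    {"age": 30, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"age": 40, "city": None},
]
complete_only = drop_incomplete(people)
with_mean_age = fill_mean(people, "age")
```

Dropping records is the simplest choice but throws data away; filling in values keeps the records at the cost of introducing made-up numbers, so which option is acceptable depends on the data set.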
  • 39. Terminology § Vector § one-dimensional array § Matrix § two-dimensional array § Linear algebra § algebra with vectors and matrices § addition, multiplication, transposition, ...
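To make the terms concrete, here is a small sketch that represents a vector as a list and a matrix as a list of rows, with the three operations named above. The function names are chosen for the example; in practice a library such as NumPy would normally be used instead.

```python
def vector_add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]


def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Matrix product: entry (i, j) is the dot product of row i of a and column j of b."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


v = vector_add([1, 2, 3], [10, 20, 30])
t = transpose([[1, 2, 3], [4, 5, 6]])
p = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```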
  • 41. Top 10 machine learning algs 1. C4.5 (No); 2. k-means clustering (Yes); 3. Support vector machines (No); 4. the Apriori algorithm (No); 5. the EM algorithm (No); 6. PageRank (No); 7. AdaBoost (No); 8. k-nearest neighbours classification (Kind of); 9. Naïve Bayes (Yes); 10. CART (No). From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006. “Top 10 algorithms in data mining”, by X. Wu et al.
  • 42. C4.5 § Algorithm for building decision trees § basically trees of boolean expressions § each node splits the data set in two § leaves assign items to classes § Decision trees are useful not just for classification § they can also teach you something about the classes § C4.5 is a bit involved to learn § the ID3 algorithm is much simpler § CART (#10) is another algorithm for learning decision trees
  • 43. Support Vector Machines § A way to do binary classification on matrices § Support vectors are the data points nearest to the hyperplane that divides the classes § SVMs maximize the distance between SVs and the boundary § Particularly valuable because of “the kernel trick” § using a transformation to a higher dimension to handle more complex class boundaries § A bit of work to learn, but manageable
  • 44. Apriori § An algorithm for “frequent itemsets” § basically, working out which items frequently appear together § for example, what goods are often bought together in the supermarket? § used for Amazon’s “customers who bought this...” § Can also be used to find association rules § that is, “people who buy X often buy Y” or similar § Apriori is slow § a faster, further development is FP-growth http://www.dssresources.com/newsletters/66.php
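A compact sketch of the level-wise idea behind Apriori: count itemsets of size k, keep the frequent ones, and only build size k+1 candidates whose subsets are all frequent. The function name, the absolute `min_support` threshold and the shopping baskets are assumptions for illustration; this version rescans every transaction at each level, which is part of why the algorithm is slow.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Count in how many transactions each candidate is fully contained.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        # Join step: merge frequent sets into candidates one item larger.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: a candidate survives only if all its smaller subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "beer", "diapers"},
    {"milk", "beer", "diapers"},
    {"bread", "milk", "beer", "diapers"},
    {"bread", "milk", "diapers"},
]
found = frequent_itemsets(baskets, min_support=3)
```

Association rules such as “people who buy X often buy Y” can then be derived by comparing the count of an itemset with the counts of its subsets.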
  • 45. Expectation Maximization § A deeply interesting algorithm I’ve seen used in a number of contexts § very hard to understand what it does § very heavy on the maths § Essentially an iterative algorithm § skips between “expectation” step and “maximization” step § tries to optimize the output of a function § Can be used for § clustering § a number of more specialized examples, too
  • 46. PageRank § Basically a graph analysis algorithm § identifies the most prominent nodes § used for weighting search results on Google § Can be applied to any graph § for example an RDF data set § Basically works by simulating a random walk § estimating the likelihood that a walker would be on a given node at a given time § actual implementation is linear algebra § The basic algorithm has some issues § “spider traps” § graph must be connected § straightforward solutions to these exist
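The random-walk idea can be sketched as a simple iteration over a dict of outgoing links. This is an illustrative sketch rather than Google's implementation: the `pagerank` function, the damping factor of 0.85, the fixed 50 iterations and the four-node graph are assumptions. The damping factor (the walker occasionally jumps to a random node) is the usual remedy for spider traps and disconnected graphs, and nodes without outgoing links spread their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate node prominence; links maps each node to the nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass the damped rank on, split evenly over outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread the damped rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each pass is equivalent to one matrix-vector multiplication, which is why real implementations are written as linear algebra.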
  • 47. AdaBoost § Algorithm for “ensemble learning” § That is, for combining several algorithms § and training them on the same data § Combining more algorithms can be very effective § usually better than a single algorithm § AdaBoost basically weights training samples § giving the most weight to those which are classified the worst
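The sample-weighting step can be sketched on its own, without the surrounding training loop. The sketch below shows one round of the standard AdaBoost update for binary classification; the function name `reweight` and the four-sample example are assumptions, and the weights are expected to sum to 1 with a weighted error strictly between 0 and 1.

```python
import math


def reweight(weights, correct):
    """One AdaBoost round: raise weights of misclassified samples, lower the rest.

    weights: current sample weights, summing to 1.
    correct: parallel list of booleans, True where the weak learner was right.
    Returns (new_weights, alpha), where alpha is the weak learner's vote.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


new_weights, alpha = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After the update the misclassified samples carry half of the total weight, so the next weak learner is pushed to concentrate on them.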
  • 49. Collaborative filtering § Basically, you’ve got some set of items § these can be movies, books, beers, whatever § You’ve also got ratings from users § on a scale of 1-5, 1-10, whatever § Can you use this to recommend items to a user, based on their ratings? § if you use the connection between their ratings and other people’s ratings, it’s called collaborative filtering § other approaches are possible
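One simple way to use “the connection between their ratings and other people’s ratings” is user-based filtering: predict a rating as the average of other users’ ratings, weighted by how similar those users are. The sketch below assumes cosine similarity, a dict-of-dicts rating store and made-up users and movies; many other similarity measures and weighting schemes exist.

```python
import math


def cosine(r1, r2):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        numerator += weight * theirs[item]
        denominator += weight
    return numerator / denominator if denominator else None


ratings = {
    "ann": {"alien": 5, "brazil": 4},
    "bob": {"alien": 5, "brazil": 4, "casablanca": 2},
    "cat": {"alien": 1, "brazil": 2, "casablanca": 5},
}
guess = predict(ratings, "ann", "casablanca")
```

Here ann rates like bob, so the prediction leans towards bob's low rating rather than cat's high one.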
  • 50. Feature-based recommendation § Use user’s ratings of items § run an algorithm to learn what features of items the user likes § Can be difficult to apply because § requires detailed information about items § key features may not be present in data § Recommending music may be difficult, for example
  • 52. Bayes’s Theorem § Basically a theorem for combining probabilities § I’ve observed A, which indicates H is true with probability 70% § I’ve also observed B, which indicates H is true with probability 85% § what should I conclude? § Naïve Bayes is basically using this theorem § with the assumption that A and B are independent § this assumption is nearly always false, hence “naïve”
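The question on this slide can be worked through under two explicit assumptions that the slide does not state: A and B are independent given H (and given not-H), and before seeing any evidence H has a prior probability of 50%. Under those assumptions each observation multiplies the odds of H by its own odds relative to the prior. The function name `combine` is made up for the example.

```python
def combine(probabilities, prior=0.5):
    """Combine P(H | evidence_i) values, assuming independent pieces of evidence."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probabilities:
        # Each observation shifts the odds by its odds relative to the prior.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


both = combine([0.70, 0.85])
```

With a 50% prior this reduces to 0.7 × 0.85 / (0.7 × 0.85 + 0.3 × 0.15), roughly 0.93: two agreeing observations give more confidence than either one alone.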
  • 53. Simple example § Is the coin fair or not? § we throw it 10 times, get 9 heads and one tail § we try again, get 8 heads and two tails § What do we know now? § can combine data and recompute § or just use Bayes’s Theorem directly http://www.bbc.co.uk/news/magazine-22310186
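Both routes mentioned here give the same answer, which a small sketch can show. The slide does not say what “not fair” means, so the example assumes a single alternative hypothesis, a coin that lands heads 80% of the time, and a 50/50 prior between the two; those numbers and the function names are assumptions.

```python
from math import comb


def likelihood(p_heads, heads, tails):
    """Binomial probability of seeing this many heads and tails."""
    return comb(heads + tails, heads) * p_heads ** heads * (1.0 - p_heads) ** tails


def update(prior_fair, heads, tails, biased_p=0.8):
    """Posterior probability that the coin is fair, via Bayes's Theorem."""
    fair = prior_fair * likelihood(0.5, heads, tails)
    biased = (1.0 - prior_fair) * likelihood(biased_p, heads, tails)
    return fair / (fair + biased)


# Route 1: apply Bayes's Theorem after each experiment in turn.
after_first = update(0.5, heads=9, tails=1)
after_second = update(after_first, heads=8, tails=2)

# Route 2: combine the data (17 heads, 3 tails) and recompute once.
pooled = update(0.5, heads=17, tails=3)
```

Under these assumptions the probability that the coin is fair drops below 1% after both experiments, whichever route is taken.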
  • 55. University pre-lecture, 1991 § My first meeting with university was Open University Day, in 1991 § Professor Bjørn Kirkerud gave the computer science talk § His subject § some day processors will stop becoming faster § we’re already building machines with many processors § what we need is a way to parallelize software § preferably automatically, by feeding in normal source code and getting it parallelized back § MapReduce is basically the state of the art on that today
  • 56. MapReduce § A framework for writing massively parallel code § Simple, straightforward model § Based on “map” and “reduce” functions from functional programming (LISP)
  • 57. http://research.google.com/archive/mapreduce.html Appeared in: OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  • 58. map and reduce >>> "1 2 3 4 5 6 7 8".split() ['1', '2', '3', '4', '5', '6', '7', '8'] >>> l = list(map(int, "1 2 3 4 5 6 7 8".split())) >>> l [1, 2, 3, 4, 5, 6, 7, 8] >>> import operator >>> from functools import reduce >>> reduce(operator.add, l) 36
  • 59. MapReduce 1. Split data into fragments 2. Create a Map task for each fragment § the task outputs a set of (key, value) pairs 3. Group the pairs by key 4. Call Reduce once for each key § all pairs with same key passed in together § reduce outputs new (key, value) pairs
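The four steps above can be simulated in a single process with the classic word-count job. This is only a sketch of the programming model, not of Hadoop: the names `map_task`, `reduce_task` and `map_reduce` are invented for the example, and a real framework would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict


def map_task(fragment):
    """Step 2: emit one (word, 1) pair for every word in the fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def reduce_task(key, values):
    """Step 4: receive all values for one key and emit new (key, value) pairs."""
    return [(key, sum(values))]


def map_reduce(fragments, mapper, reducer):
    """Run the mapper on each fragment, group pairs by key, then reduce per key."""
    groups = defaultdict(list)
    for fragment in fragments:          # steps 1-2: one map task per fragment
        for key, value in mapper(fragment):
            groups[key].append(value)   # step 3: group the pairs by key
    output = []
    for key in sorted(groups):          # step 4: one reduce call per key
        output.extend(reducer(key, groups[key]))
    return output


word_counts = map_reduce(["big data is big", "data is data"], map_task, reduce_task)
```

Because each map call sees only its own fragment and each reduce call sees only one key, both phases can be distributed without the tasks needing to talk to each other.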
  • 60. Communications § HDFS § Hadoop Distributed File System § input data, temporary results, and results are stored as files here § Hadoop takes care of making files available to nodes § Hadoop RPC § how Hadoop communicates between nodes § used for scheduling tasks, heartbeat etc. § Most of this is in practice hidden from the developer
  • 61. The Hadoop ecosystem § Pig § dataflow language for setting up MR jobs § HBase § NoSQL database to store MR input in § Hive § SQL-like query language on top of Hadoop § Mahout § machine learning library on top of Hadoop § Hadoop Streaming § utility for writing mappers and reducers as command-line tools in other languages
  • 62. Applications of MapReduce § Linear algebra operations § easily mapreducible § SQL queries over heterogeneous data § basically requires only a mapping to tables § relational algebra easy to do in MapReduce § PageRank § basically one big set of matrix multiplications § the original application of MapReduce § Recommendation engines § the SON algorithm § ...
  • 63. Apache Mahout § Has three main application areas § others are welcome, but this is mainly what’s there now § Recommendation engines § several different similarity measures § collaborative filtering § Slope-one algorithm § Clustering § k-means and fuzzy k-means § Latent Dirichlet Allocation § Classification § stochastic gradient descent § Support Vector Machines § Naïve Bayes
  • 64. Lots of SQL-on-MapReduce tools § Tenzing (Google) § Hive (Apache Hadoop) § YSmart (Ohio State) § SQL-MR (AsterData) § HadoopDB (Hadapt) § Polybase (Microsoft) § RainStor (RainStor Inc.) § ParAccel (ParAccel Inc.) § Impala (Cloudera) § ...
  • 66. Big data & machine learning § This is a huge field, growing very fast § Many algorithms and techniques § can be seen as a giant toolbox with wide-ranging applications § Ranging from the very simple to the extremely sophisticated § Difficult to see the big picture § Huge range of applications § Math skills are crucial
  • 67. Take a look around Data Scientists’ Tools Using SQL Server!!!