Sql saturday el salvador 2016 - Me, A Data Scientist?

Me, A Data Scientist?
Fabricio Quintanilla, MSc, PhD
fabricio.quintanilla@gmail.com
@fabrixq
/fquintanilla
http://www.inteligenciadenegocios.net
MCP, MCPD, MCTS

Organized by
5/21/2016 Me, A Data Scientist?

SQL Saturday Sponsors

Agenda
Not Rocket Science….
Just Data Science…

Man on the Moon – 1969

Man on the Moon – Small Data
Computer Program
Date: 1969
64Kb, 2Kb RAM,
Fortran
Must Work 1st time
Apollo XI
Speed: 3,500 Km/h
Weight: 13,500 Kg
Lots of complex data
Man on the Moon
Distance: 356,500 Km
Never been there before
Must return to Earth

Skydive Stratos, 2012
Tens of Gigabytes!!!
Think about it ... We live in crazy times…

What is Big Data? mumbo-jumbo
§ A fashionable term typically used by some IT
vendors to remarket old fashioned software
and hardware

Big Data is not about Data Volume

No way!!!! Water Cooler Chat
§ We need to parallelize data operations but it’s too costly & complex…
§ The business can’t get access to all the relevant data, we need external data
§ We can’t match customer master data to live customer interactions…
§ We can’t just force everything into a star-schema…
§ These BI reports and charts don’t tell us anything we didn’t know…
§ We are missing the ETL window, the data we needed didn’t arrive on time…
§ We can’t predict with confidence if we can’t explore data & develop our own
models

What is big data?
“Big Data is any thing which is crash Excel.”
“Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.”
Or, in other words, Big Data is data
in volumes too great to process by
traditional methods.
https://twitter.com/devops_borat

What is Big Data? Force of Change
§ Big Data forces you to change the way you
collect, store, manage, analyze and visualize
data.

Big Data = “Crude Oil” [not useful oil]
§ Think of data as ‘Crude Oil’
§ Big data is about extracting the ‘Crude Oil’,
transporting it in ‘mega-tankers’, siphoning it
through ‘pipelines’ and storing it in massive
‘silos’…
§ All ‘this’ is about IT Big Data… fine and well…
§ BUT………..

You need to refine the ‘Crude Oil’
Enter Data Science

The Science [and Art] of…
§ Discovering what we don’t know from data
§ Obtaining predictive, actionable insight from data
§ Creating Data Products that have business impact now
§ Communicating relevant business stories from data
§ Building confidence in decisions that drive business value

What is a data scientist?

Class DataScientist {
Is skeptical, curious. Has inquisitive mind
Knows Machine Learning, Statistics, Probability
Applies Scientific Method. Runs Experiment
Is good at Coding & Hacking
Able to deal with IT Data Engineering
Knows how to build data products
Able to find answers to known unknowns
Tells relevant business stories from data
Has Domain Knowledge
}

What does a Data Scientist Do?

10 Things [most] Data Scientists Do
§ Ask Good Questions, What is What
§ …we don’t know?
§ …we’d like to know?
§ Define and Test a Hypothesis, Run experiments
§ Scoop, Scrape, Sink & Sample Business Relevant Data
§ Purge and Wrestle Data, Tame Data
§ Explore Data, Discover Data Playfully. Discover
Unknowns.
§ Model Data. Model Algorithms
§ Understand Data Relationships
§ Tell the Machine How to Learn from Data
§ Create Data Products that Deliver Actionable insight
§ Tell Relevant Business Stories from Data

[Sort of a] Data Scientist Toolkit
§ Java, R, Python… (bonus: Clojure, Haskell, Scala)
§ Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)
§ Hbase, Pig & Hive… (bonus: Shark, Impala, Cascalog)
§ ETL, Webscrapers, Flume, Sqoop… (bonus: Hume)
§ SQL, RDBMS, DW, OLAP…
§ Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas)
§ D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
§ SPSS, Matlab, SAS… (the Enterprise man)
§ NoSQL, MongoDB, Couchbase, Cassandra…
§ And Yes!!! … MS-Excel: the most used, most underrated
DS tool…

Types of algorithms
§ Clustering
§ Association learning
§ Parameter estimation
§ Recommendation engines
§ Classification
§ Similarity matching
§ Neural networks
§ Bayesian networks
§ Genetic algorithms

Basically, it’s all maths...
§ Linear algebra
§ Calculus
§ Probability theory
§ Graph theory
§ ...
“Only 10% in devops know how to work with Big Data. Only 1% are realize they need 2 Big Data for fault tolerance.”

Big data skills gap
§ Hardly anyone knows this stuff
§ It’s a big field, with lots and lots of theory
§ And it’s all maths, so it’s tricky to learn
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap

Two orthogonal aspects
§ Analytics / machine learning
§ learning insights from data
§ Big data
§ handling massive data volumes
§ Can be combined, or used separately

Data science?
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

How to process Big Data?
§ If relational databases are not enough,
what is?
“Mining of Big Data is problem solved in 2013 with zgrep.”

MapReduce
§ A framework for writing massively
parallel code
§ Simple, straightforward model
§ Based on “map” and “reduce” functions
from functional programming (LISP)

NoSQL and Big Data
§ Not really that relevant
§ Traditional databases handle big data
sets, too
§ NoSQL databases have poor analytics
§ MapReduce often works from text files
§ can obviously work from SQL and NoSQL,
too
§ NoSQL is more for high throughput
§ basically, AP from the CAP theorem, instead
of CP
§ In practice, really Big Data is likely to be a
mix
§ text files, NoSQL, and SQL

The 4th V: Veracity
“The greatest enemy of knowledge is not
ignorance, it is the illusion of knowledge.”
Daniel Boorstin, in The Discoverers (1983)
“95% of time, when is clean Big Data is get Little Data.”

Data quality
§ A huge problem in practice
§ any manually entered data is suspect
§ most data sets are in practice deeply
problematic
§ Even automatically gathered data can be
a problem
§ systematic problems with sensors
§ errors causing data loss
§ incorrect metadata about the sensor
§ Never, never, never trust the data without
checking it!
§ garbage in, garbage out, etc

http://www.slideshare.net/Hadoop_Summit/scaling-big-data-mining-infrastructure-twitter-experience/12

Conclusion
§ Vast potential
§ to both big data and machine learning
§ Very difficult to realize that potential
§ requires mathematics, which nobody
knows
§ We need to wake up!

Two kinds of learning
§ Supervised
§ we have training data with correct
answers
§ use training data to prepare the algorithm
§ then apply it to data without a correct
answer
§ Unsupervised
§ no training data
§ throw data into the algorithm, hope it
makes some kind of sense out of the data

Some types of algorithms
§ Prediction
§ predicting a variable from data
§ Classification
§ assigning records to predefined groups
§ Clustering
§ splitting records into groups based on similarity
§ Association learning
§ seeing what often appears together with what

Issues
§ Data is usually noisy in some way
§ imprecise input values
§ hidden/latent input values
§ Inductive bias
§ basically, the shape of the algorithm we
choose
§ may not fit the data at all
§ may induce underfitting or overfitting
§ Machine learning without inductive bias
is not possible

Testing
§ When doing this for real, testing is crucial
§ Testing means splitting your data set
§ training data (used as input to algorithm)
§ test data (used for evaluation only)
§ Need to compute some measure of performance
§ precision/recall
§ root mean square error
§ A huge field of theory here
§ will not go into it in this course
§ very important in practice
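
The split and the two measures named above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full evaluation framework: the function names, the 25% test fraction and the sample values are all assumptions made for the example.

```python
import math
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def rmse(actual, predicted):
    """Root mean square error, for numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up data, only to exercise the functions.
train, test = train_test_split(list(range(20)))
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

The test records are never shown to the algorithm during training; they are used only to compute the performance measure.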

Missing values
§ Usually, there are missing values in the data set
§ that is, some records have some NULL values
§ These cause problems for many machine
learning algorithms
§ Need to solve somehow
§ remove all records with NULLs
§ use a default value
§ estimate a replacement value
§ ...
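
The three options listed above might look like this in plain Python, with None standing in for NULL. The record layout and helper names are hypothetical, and using the mean as the replacement estimate is just one simple choice.

```python
def drop_nulls(records):
    """Option 1: remove every record that contains a NULL (None)."""
    return [r for r in records if None not in r.values()]

def fill_default(records, field, default):
    """Option 2: replace a missing value with a fixed default."""
    return [{**r, field: default if r[field] is None else r[field]}
            for r in records]

def fill_mean(records, field):
    """Option 3: estimate a replacement, here the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    return fill_default(records, field, sum(known) / len(known))

# Hypothetical records, for illustration only.
records = [
    {"age": 30, "city": "San Salvador"},
    {"age": None, "city": "Santa Ana"},
    {"age": 40, "city": "San Miguel"},
]
complete = drop_nulls(records)
defaulted = fill_default(records, "age", 0)
estimated = fill_mean(records, "age")
```

Each option has a cost: dropping records shrinks the data set, while defaults and estimates inject values that were never observed.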

Terminology
§ Vector
§ one-dimensional array
§ Matrix
§ two-dimensional array
§ Linear algebra
§ algebra with vectors and matrices
§ addition, multiplication, transposition, ...
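
As a small sketch of these terms, vectors can be written as Python lists and matrices as lists of rows. The helper names are made up for the example; real work would normally use a library such as NumPy.

```python
def add(u, v):
    """Vector addition, element by element."""
    return [a + b for a, b in zip(u, v)]

def transpose(m):
    """Matrix transposition: rows become columns."""
    return [list(column) for column in zip(*m)]

def matmul(a, b):
    """Matrix multiplication: each cell is a row-by-column dot product."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, column)) for column in columns]
            for row in a]

v = [1, 2, 3]            # a vector: one-dimensional array
w = [4, 5, 6]
m = [[1, 2], [3, 4]]     # a matrix: two-dimensional array

vector_sum = add(v, w)
m_transposed = transpose(m)
m_squared = matmul(m, m)
```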

Top 10 machine learning algs
1. C4.5 No
2. k-means clustering Yes
3. Support vector machines No
4. the Apriori algorithm No
5. the EM algorithm No
6. PageRank No
7. AdaBoost No
8. k-nearest neighbours classification Kind of
9. Naïve Bayes Yes
10. CART No
From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006. “Top 10 algorithms in data mining”, by X. Wu et al.

C4.5
§ Algorithm for building decision trees
§ basically trees of boolean expressions
§ each node splits the data set in two
§ leaves assign items to classes
§ Decision trees are useful not just for classification
§ they can also teach you something about the classes
§ C4.5 is a bit involved to learn
§ the ID3 algorithm is much simpler
§ CART (#10) is another algorithm for learning
decision trees
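
To give a feel for how a tree builder picks the boolean test for a node, here is a sketch of the information gain measure used by the simpler ID3 algorithm. It is not C4.5 itself, which uses a normalized variant (gain ratio) and adds pruning and handling of continuous attributes. The toy rows and attribute names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, test):
    """Entropy reduction when a boolean test splits the data set in two."""
    yes = [label for row, label in zip(rows, labels) if test(row)]
    no = [label for row, label in zip(rows, labels) if not test(row)]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (yes, no) if part)
    return entropy(labels) - weighted

# Toy data set (hypothetical): the class depends only on the outlook.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["play", "play", "stay", "stay"]

gain_outlook = information_gain(rows, labels, lambda r: r["outlook"] == "sunny")
gain_windy = information_gain(rows, labels, lambda r: r["windy"])
```

The test with the highest gain becomes the node; here the outlook test separates the classes perfectly, while the windy test tells us nothing.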

Support Vector Machines
§ A way to do binary classification on
matrices
§ Support vectors are the data points
nearest to the hyperplane that divides the
classes
§ SVMs maximize the distance between
SVs and the boundary
§ Particularly valuable because of “the
kernel trick”
§ using a transformation to a higher dimension
to handle more complex class boundaries
§ A bit of work to learn, but manageable
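
The idea behind the kernel trick can be illustrated without training an SVM. In the sketch below, one-dimensional points cannot be separated by a single threshold on x, but after the transformation phi(x) = (x, x²) a straight line separates them. The mapping, the kernel and the hand-picked threshold are all assumptions for the example; a real SVM would learn the maximum-margin boundary.

```python
def phi(x):
    """Explicit transformation from 1 dimension to 2: x -> (x, x^2)."""
    return (x, x * x)

def kernel(x, y):
    """Dot product of phi(x) and phi(y), computed without building phi."""
    return x * y + (x * y) ** 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Class +1 lies on both sides of class -1, so no single cut on x works.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def classify(x, threshold=1.0):
    """In the transformed space the boundary is the line x^2 = threshold."""
    return 1 if phi(x)[1] > threshold else -1

predictions = [classify(x) for x in points]
same_value = kernel(2.0, 3.0) == dot(phi(2.0), phi(3.0))
```

The kernel function returns the same number as the dot product in the higher dimension, which is why an SVM can work with complex boundaries without ever computing the transformed points.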

Apriori
§ An algorithm for “frequent itemsets”
§ basically, working out which items frequently
appear together
§ for example, what goods are often bought
together in the supermarket?
§ used for Amazon’s “customers who bought
this...”
§ Can also be used to find association rules
§ that is, “people who buy X often buy Y” or
similar
§ Apriori is slow
§ a faster, further development is FP-growth
http://www.dssresources.com/newsletters/66.php
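
A stripped-down sketch of the Apriori idea for itemsets of size one and two: count single items first, then only count pairs whose members are both frequent, since a pair cannot be frequent if one of its items is not. The baskets and the support threshold are made up for the example, and the full algorithm continues level by level to larger itemsets.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Return the frequent single items and the frequent pairs."""
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items()
                      if n >= min_support}
    pair_counts = Counter()
    for basket in baskets:
        # Apriori pruning: drop infrequent items before forming pairs.
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {pair: n for pair, n in pair_counts.items()
                      if n >= min_support}
    return frequent_items, frequent_pairs

# Hypothetical supermarket baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
items, pairs = frequent_itemsets(baskets, min_support=3)
```

An association rule such as “people who buy beer often buy diapers” can then be read off by comparing a pair's count with the count of its first item.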

Expectation Maximization
§ A deeply interesting algorithm I’ve seen
used in a number of contexts
§ very hard to understand what it does
§ very heavy on the maths
§ Essentially an iterative algorithm
§ skips between “expectation” step and
“maximization” step
§ tries to optimize the output of a function
§ Can be used for
§ clustering
§ a number of more specialized examples, too

PageRank
§ Basically a graph analysis algorithm
§ identifies the most prominent nodes
§ used for weighting search results on Google
§ Can be applied to any graph
§ for example an RDF data set
§ Basically works by simulating a random walk
§ estimating the likelihood that a walker would be
on a given node at a given time
§ actual implementation is linear algebra
§ The basic algorithm has some issues
§ “spider traps”
§ graph must be connected
§ straightforward solutions to these exist
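
A minimal sketch of the random-walk view, written as a loop over a dictionary instead of as linear algebra. It uses one common fix for the issues above: with some probability the walker jumps to a random node (the damping factor, set here to the commonly quoted 0.85, is an assumption), and a node with no outgoing links spreads its rank evenly. The example graph is invented.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Estimate where a random walker spends its time.

    graph maps every node to the list of nodes it links to.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: every node receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: hand the rank to all nodes equally.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: most links point at C.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
most_prominent = max(ranks, key=ranks.get)
```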

AdaBoost
§ Algorithm for “ensemble learning”
§ That is, for combining several algorithms
§ and training them on the same data
§ Combining more algorithms can be very
effective
§ usually better than a single algorithm
§ AdaBoost basically weights training
samples
§ giving the most weight to those which are
classified the worst
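
The reweighting step can be sketched as a single boosting round, using the standard AdaBoost update for labels of +1 and -1. The labels and the weak learner's predictions are invented; a full implementation would train a new weak learner on the updated weights each round and combine all of them using the alpha values.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost round: score the weak learner, then reweight samples.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)   # the learner's vote
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]   # the last sample is misclassified
weights = [0.2] * 5                # start with equal weights

alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After the round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.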

Collaborative filtering
§ Basically, you’ve got some set of items
§ these can be movies, books, beers, whatever
§ You’ve also got ratings from users
§ on a scale of 1-5, 1-10, whatever
§ Can you use this to recommend items to a
user, based on their ratings?
§ if you use the connection between their
ratings and other people’s ratings, it’s called
collaborative filtering
§ other approaches are possible
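
One simple form of collaborative filtering, sketched under assumptions: users are compared with cosine similarity over the items they have both rated, and a missing rating is predicted as the similarity-weighted average of other users' ratings. The users, movies and ratings are invented, and many other similarity measures and weighting schemes exist.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana":   {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "beto":  {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "carla": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}

def similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = similarity(ratings[user], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += sim
    return numerator / denominator if denominator else None

predicted = predict("ana", "Dune")
```

Ana's taste is closer to Beto's than to Carla's, so Beto's high rating for the unseen movie counts for more in the prediction.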

Feature-based recommendation
§ Use user’s ratings of items
§ run an algorithm to learn what features of
items the user likes
§ Can be difficult to apply because
§ requires detailed information about items
§ key features may not be present in data
§ Recommending music may be difficult,
for example

Bayes’s Theorem
§ Basically a theorem for combining
probabilities
§ I’ve observed A, which indicates H is true with
probability 70%
§ I’ve also observed B, which indicates H is
true with probability 85%
§ what should I conclude?
§ Naïve Bayes is basically using this
theorem
§ with the assumption that A and B are
independent
§ this assumption is nearly always false, hence
“naïve”
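
The question on this slide can be worked through numerically. The sketch assumes that A and B are independent given H (the naïve assumption) and that, before seeing either observation, H was as likely true as false (a prior of 0.5, an assumption not stated on the slide).

```python
def combine(p_a, p_b, prior=0.5):
    """Combine P(H|A) and P(H|B) into P(H|A,B) under naive independence."""
    yes = p_a * p_b / prior
    no = (1 - p_a) * (1 - p_b) / (1 - prior)
    return yes / (yes + no)

combined = combine(0.70, 0.85)
```

With these numbers the combined probability comes out at roughly 93%, higher than either observation alone, because two independent pieces of evidence pointing the same way reinforce each other.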

Simple example
§ Is the coin fair or not?
§ we throw it 10 times, get 9 heads and one tail
§ we try again, get 8 heads and two tails
§ What do we know now?
§ can combine data and recompute
§ or just use Bayes’s Theorem directly
http://www.bbc.co.uk/news/magazine-22310186
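
Both routes mentioned above give the same answer, which the sketch below checks. It compares just two hypotheses: the coin is fair, or it lands heads 80% of the time. The 80% figure and the 50/50 starting belief are assumptions chosen for illustration.

```python
def likelihood(p_heads, heads, tails):
    """Probability of the observed sequence for a given heads probability."""
    return p_heads ** heads * (1 - p_heads) ** tails

def update(prior_fair, heads, tails, biased_p=0.8):
    """Bayes's Theorem: belief that the coin is fair after new throws."""
    fair = prior_fair * likelihood(0.5, heads, tails)
    biased = (1 - prior_fair) * likelihood(biased_p, heads, tails)
    return fair / (fair + biased)

# Route 1: update after each experiment in turn.
after_first = update(0.5, heads=9, tails=1)
after_second = update(after_first, heads=8, tails=2)

# Route 2: combine the data and recompute from scratch.
combined = update(0.5, heads=17, tails=3)
```

Yesterday's conclusion becomes today's starting belief, and the order in which the evidence arrives does not change the result.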

University pre-lecture, 1991
§ My first meeting with university was Open
University Day, in 1991
§ Professor Bjørn Kirkerud gave the computer
science talk
§ His subject
§ some day processors will stop becoming faster
§ we’re already building machines with many
processors
§ what we need is a way to parallelize software
§ preferably automatically, by feeding in normal
source code and getting it parallelized back
§ MapReduce is basically the state of the art
on that today

http://research.google.com/archive/mapreduce.html
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design
and Implementation,
San Francisco, CA, December, 2004.

map and reduce
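>>> "1 2 3 4 5 6 7 8".split()
['1', '2', '3', '4', '5', '6', '7', '8']
>>> l = list(map(int, "1 2 3 4 5 6 7 8".split()))  # list() needed in Python 3, where map() is lazy
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
>>> import operator
>>> from functools import reduce  # reduce() is no longer a builtin in Python 3
>>> reduce(operator.add, l)
36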

MapReduce
1. Split data into fragments
2. Create a Map task for each fragment
§ the task outputs a set of (key, value) pairs
3. Group the pairs by key
4. Call Reduce once for each key
§ all pairs with same key passed in together
§ reduce outputs new (key, value) pairs
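
The four steps can be simulated on one machine with the classic word-count example. This is only a sketch of the programming model, with made-up function names and input; it is not the Hadoop API, and a real framework would run the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict

def map_task(fragment):
    """Map: emit one (word, 1) pair for each word in the fragment."""
    return [(word.lower(), 1) for word in fragment.split()]

def group_by_key(pairs):
    """Group: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: called once per key, with all of that key's values."""
    return (key, sum(values))

def map_reduce(fragments):
    pairs = [pair for fragment in fragments for pair in map_task(fragment)]
    groups = group_by_key(pairs)
    return dict(reduce_task(key, values) for key, values in groups.items())

# Step 1 is assumed done: the data is already split into fragments.
fragments = ["big data is big", "data science is not rocket science"]
counts = map_reduce(fragments)
```

Because each map task sees only its own fragment and each reduce call sees only one key, both phases can be spread across many machines without the tasks talking to each other.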

Communications
§ HDFS
§ Hadoop Distributed File System
§ input data, temporary results, and results are stored
as files here
§ Hadoop takes care of making files available to nodes
§ Hadoop RPC
§ how Hadoop communicates between nodes
§ used for scheduling tasks, heartbeat etc
§ Most of this is in practice hidden from the
developer

The Hadoop ecosystem
§ Pig
§ dataflow language for setting up MR jobs
§ HBase
§ NoSQL database to store MR input in
§ Hive
§ SQL-like query language on top of Hadoop
§ Mahout
§ machine learning library on top of Hadoop
§ Hadoop Streaming
§ utility for writing mappers and reducers as
command-line tools in other languages

Applications of MapReduce
§ Linear algebra operations
§ easily mapreducible
§ SQL queries over heterogeneous data
§ basically requires only a mapping to tables
§ relational algebra easy to do in MapReduce
§ PageRank
§ basically one big set of matrix multiplications
§ the original application of MapReduce
§ the SON algorithm
§ ...

Apache Mahout
§ Has three main application areas
§ others are welcome, but this is mainly what’s there now
§ Recommendation
§ several different similarity measures
§ collaborative filtering
§ Slope-one algorithm
§ Clustering
§ k-means and fuzzy k-means
§ Latent Dirichlet Allocation
§ Classification
§ stochastic gradient descent
§ Support Vector Machines
§ Naïve Bayes

Lots of SQL-on-MapReduce tools
§ Tenzing (Google)
§ Hive (Apache Hadoop)
§ YSmart (Ohio State)
§ SQL-MR (AsterData)
§ HadoopDB (Hadapt)
§ Polybase (Microsoft)
§ RainStor (RainStor Inc.)
§ ParAccel (ParAccel Inc.)
§ Impala (Cloudera)
§ ...

Big data & machine learning
§ This is a huge field, growing very fast
§ Many algorithms and techniques
§ can be seen as a giant toolbox with wide-ranging
applications
§ Ranging from the very simple to the extremely
sophisticated
§ Difficult to see the big picture
§ Huge range of applications
§ Math skills are crucial

Take a look around Data Scientists’ Tools
Using SQL Server!!!

Fabricio Quintanilla
fabricio.quintanilla@gmail.com
inteligenciadenegocios.net
@fabrixq
QUESTIONS AND ANSWERS
