Hadoop for (Young) Data Scientist
Komes Chandavimol and Team
Data Science Lab, Thailand
komes@datascienceth.com
Agenda
• Big Data, Analytics and Data Science
• Hadoop + Spark Workshops
• Sharing Experience: Hadoop (Real) Use Cases
• Hadoop + Spark Trends
Big Data, Analytics and Data Science
Big Data
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
Internet of Things
http://topmanagement.com.mx/innovacion-social-y-empresarial-objetivo-de-hitachi/
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
The Growth of Data
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
https://www.domo.com/learn/data-never-sleeps-3-0
What is Big Data?
http://blogs.forrester.com/category/hadoop
http://solutions.forrester.com/Global/FileLib/webinars/Big_Data_-_Gold_Rush_or_Illusion.pdf
The Big Data Tools
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
Traditional Data Management Architecture
http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
New Data Management Architecture
http://www.kdnuggets.com/2014/05/big-data-landscape-v30-analyzed.html
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Data Lake
How does the Data Lake work?
http://www.clearpeaks.com/blog/category/tableau
Traditional Enterprise Data warehouse
What do you consume from the Data Lake?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Volume? Variety? Velocity?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Value
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Big Data + Analytics = Value
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Big Data Analytics
http://hortonworks.com/blog/big-data-refinery-fuels-next-generation-data-architecture/
Big Data Analytics
http://dataofthings.blogspot.com/2014/04/the-bbbt-sessions-hortonworks-big-data.html
Big Data Analytics
http://www.gartner.com/it-glossary/predictive-analytics
How to do Big Data Analytics?
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
Data Science Experience Sharing, Big Data Challenge #2, Bangkok, Thailand
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
What is Data Science?
The Rise of the Data Scientist
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
2009
https://hbr.org/
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
2014
The Rise of the Data Scientist
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
2014
The Data Science
The Solution, Data Science Team
Data Science Team
Doing Data Science by O'Neil et al. (2013)
Data Science Team
Analyzing the Analyzers, Harris (2013)
Data Science Team
Data Scientist & Data Engineer
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
Data Science Team
Data Scientist & Data Engineer
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
https://www.facebook.com/DataScienceTh/posts/931828353527079:0
Data Science Professionals
http://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html
How to build a DS Team?
∗ Build an in-house team
• Train existing employees
• Train existing employees and hire experts
• Hire experts
∗ Outsource requirements to private DS consultants
• Outsource comprehensive DS strategy development
• Outsource DS solutions to specific problems
∗ Leverage cloud-based platform solutions
Data Science for Dummies, Pierson (2015)
Machine Learning
“Improving performance in some task with experience.” Tom Mitchell (1998)
“The field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel (1959)
Wikipedia, Data Visualization for Dummies (2014)
Data Points: Visualization That Means Something (2013)
Machine Learning deals with systems
that can learn from data.
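As a toy illustration of "learning from data" (not taken from the slides), the sketch below fits a straight line to a handful of observations and then predicts an unseen point. The function names `fit_line` and `predict` and the sample data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from example pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply the learned (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Toy "experience": four observations that happen to follow y = 2x + 1.
model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Nothing in the program states the rule y = 2x + 1 explicitly; the parameters are recovered from the examples, which is the sense of Samuel's "without being explicitly programmed".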
Machine Learning  Discovery
• Class Discovery
• Correlation Discovery
• Novelty (Surprise) Discovery
• Association (or Link) Discovery
KirkBorne-workshop-ODSC2016.pdf
The XYZ of Data Science
Smart X :
• Smart Cities
• Smart Highways
• Smart Supply Chain
Precision Y :
• Precision Medicine
• Precision Farming
• Precision Pricing
Personalized Z :
• Personalized Health
• Personalized Learning
• Personalized Shopping Experience
KirkBorne-Workshop-ODSC2016.pdf
Intelligence at the edge of the network… at the point of data collection
DataInquest – Predictive Analytics and Data Science Bootcamp
Data Science is a Team Sport
http://www.ibmbigdatahub.com/blog/why-data-science-team-sport
How to Start?
Hadoop + Spark Workshops
Workshop #1 Installing HDFS and YARN
Workshop #2 WordCount
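The workshop material itself is not reproduced in this deck, so the following is only a minimal in-memory sketch of the map, shuffle, and reduce steps behind WordCount. It does not use Hadoop; the function names are chosen for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

On a cluster the mapper and reducer run on different machines and the shuffle moves data across the network; here all three steps run in one process so the data flow is easy to follow.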
Workshop #3 WordCount (Streaming)
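Hadoop Streaming lets the mapper and reducer be any programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer receiving its input sorted by key. The sketch below shows that contract as two Python generators. It is an assumption about how the workshop could be done, not the workshop's actual scripts.

```python
from itertools import groupby


def mapper(lines):
    """Yield one 'word<TAB>1' record per word, the format Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word; records must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate the job on one machine: map, sort (the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In a real job each function would live in its own script that iterates over `sys.stdin` and prints each yielded record; `run_locally` only mimics the sort that the framework performs between the two stages.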
Workshop #4 WordCount (Frequency Sort)
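One common way to sort WordCount output by frequency is a second pass that swaps each (word, count) pair so the count becomes the key, then sorts on it. The sketch below shows that idea in memory; the function names are hypothetical and the tie-breaking rule (alphabetical) is an assumption.

```python
def swap_mapper(word, count):
    """Second-pass mapper: make the count the key so records sort by frequency."""
    return count, word


def frequency_sort(counts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    swapped = [swap_mapper(word, count) for word, count in counts.items()]
    swapped.sort(key=lambda kv: (-kv[0], kv[1]))
    return [(word, count) for count, word in swapped]
```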
Workshop #5 Set Up Cloudera QuickStart
Workshop #6 Exploring HBase Data in Hue
Workshop #7 Design a Schema for Quick Twitter Relationship Lookup
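HBase keeps rows sorted by row key, so a schema for relationship lookup usually comes down to choosing the key. The sketch below assumes a composite key of the form `follower:followed` (an illustrative choice, not necessarily the workshop's answer) and uses a small sorted list as a stand-in for a table. `SortedTable` is a hypothetical helper, not an HBase client API.

```python
from bisect import bisect_left


class SortedTable:
    """Tiny stand-in for an HBase table: row keys kept in lexicographic order."""

    def __init__(self):
        self._keys = []

    def put(self, row_key):
        i = bisect_left(self._keys, row_key)
        if i == len(self._keys) or self._keys[i] != row_key:
            self._keys.insert(i, row_key)

    def exists(self, row_key):
        """Point lookup, comparable to a Get on one row key."""
        i = bisect_left(self._keys, row_key)
        return i < len(self._keys) and self._keys[i] == row_key

    def scan_prefix(self, prefix):
        """Range read of all keys starting with prefix, comparable to a prefix scan."""
        i = bisect_left(self._keys, prefix)
        rows = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            rows.append(self._keys[i])
            i += 1
        return rows


def follow_key(follower, followed):
    """Composite row key; assumes user names never contain ':'."""
    return f"{follower}:{followed}"


def follows(table, follower, followed):
    return table.exists(follow_key(follower, followed))


def following(table, follower):
    return [key.split(":", 1)[1] for key in table.scan_prefix(f"{follower}:")]
```

With this key, "does A follow B?" is a single point lookup and "whom does A follow?" is one contiguous scan. Answering "who follows A?" efficiently would need a second table or key with the two parts reversed.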
Workshop #8 Design a Schema for IoT Log (Smart Meter)
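For time-series logs such as smart meter readings, a widely used row key pattern is the device ID followed by a reversed timestamp, so the newest reading for each meter sorts first. The sketch below illustrates that pattern. The key layout, the `#` separator, and the `MAX_TS` bound are assumptions for this example, not the workshop's schema.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for 13-digit millisecond timestamps


def meter_row_key(meter_id, ts_millis):
    """Build '<meter_id>#<reversed timestamp>' so newer readings sort first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{meter_id}#{reversed_ts:013d}"


def timestamp_from_key(row_key):
    """Recover the original timestamp from a row key."""
    return MAX_TS - int(row_key.split("#", 1)[1])


def latest_first(row_keys, meter_id):
    """Timestamps for one meter in the order a sorted scan would return them."""
    prefix = f"{meter_id}#"
    return [timestamp_from_key(key) for key in sorted(row_keys) if key.startswith(prefix)]
```

Zero-padding the reversed timestamp keeps lexicographic order aligned with numeric order. The sketch also assumes meter IDs have a fixed width, so one ID is never a prefix of another.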
Workshop #9 Create an HBase Table for Smart Meter Data
Workshop #10 Bank Customer Snapshot
Workshop #10.1 - 10.4
10.1 Create Hive Tables
10.2 Create External Hive Tables
10.3 Create External Hive Tables
10.4 Partition
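A partitioned Hive table stores each partition in its own subdirectory named `column=value`, and the partition column is read from the directory name rather than from the data files. The sketch below imitates that layout in memory to show why filtering on the partition column touches only one directory. The table path and column names are hypothetical.

```python
from collections import defaultdict


def partition_dir(table_dir, partition_col, value):
    """Directory Hive-style layouts use for one partition value."""
    return f"{table_dir}/{partition_col}={value}"


def write_partitioned(rows, table_dir, partition_col):
    """Group rows by partition directory; the partition column is dropped from the row."""
    layout = defaultdict(list)
    for row in rows:
        data = dict(row)
        value = data.pop(partition_col)
        layout[partition_dir(table_dir, partition_col, value)].append(data)
    return dict(layout)


def read_partition(layout, table_dir, partition_col, value):
    """Partition pruning: read only the directory that matches the filter."""
    return layout.get(partition_dir(table_dir, partition_col, value), [])
```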
Workshop #11 Sqoop
Workshop
spk1 WordCount
spk2 WordCount
spk3 WordCount
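The deck does not say how the three Spark WordCount variants differ, so the sketch below shows only one plausible version using the RDD API in PySpark. It assumes PySpark is installed and runs in local mode. The application name and the `tokenize` helper are made up for this example, and the import is deferred so the helper can be tried without Spark.

```python
from operator import add


def tokenize(line):
    """Split one line into lowercase words."""
    return line.lower().split()


def spark_word_count(lines):
    """Count words with Spark's RDD API; requires PySpark (assumed installed)."""
    from pyspark.sql import SparkSession  # deferred import, only needed here

    spark = (SparkSession.builder
             .master("local[*]")              # assumption: local mode
             .appName("spk-wordcount-sketch")  # hypothetical app name
             .getOrCreate())
    try:
        counts = (spark.sparkContext.parallelize(lines)
                  .flatMap(tokenize)             # one record per word
                  .map(lambda word: (word, 1))   # pair each word with a count of 1
                  .reduceByKey(add)              # sum the counts per word
                  .collect())
        return dict(counts)
    finally:
        spark.stop()
```

Compared with the MapReduce version, the whole pipeline is a chain of transformations in one program, and nothing is computed until `collect()` is called.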
Workshop
spk4 SparkSQL + ML
Sharing Experience: Hadoop (Real) Use Cases
Source: Analytics: The New Path to Value, a joint MIT Sloan Management Review and IBM Institute for Business Value study.
Copyright © Massachusetts Institute of Technology 2010.
Top Performers Use Analytics 5 Times More Than Lower Performers
Revenue - Cost = Profit
Monitoring and Maintenance
Data sources: IoT Sensors in factory
Data products: predictive maintenance models
http://www.electrex.it/en/news/600-automated-energy-management-system-a-enms-for-cement-production-plants.ht
Customer Engagement + Location
Data sources: Mobile App, Loyalty Program, GIS
Data products: Buying behavior analysis, coupon-response model, location visualization
http://www.fastcompany.com/3020859/most-creative-people/how-chinas-one-child-policy-forced-starbucks-to-rethink-its-beijing-sto
Fuel Saving
Data sources: Telematics (sensor), GPS
Data products: Prescriptive analytics – route optimization, predictive maintenance (parts/malfunction)
http://www.cnet.com/news/ups-turns-data-analysis-into-big-savings/
Fraud Detection
Data sources: historical pattern of transaction data
Data products: predictive models – fraud/non-fraud
https://bluefishway.com/2013/09/13/panic-oh-no-not-again/
HR Analytics – Google Hiring
Data sources: Historical hiring attributes
Data products: Predictive model – recruiting high performers
Behavioral Test
Situational Test
GPA
Brain Teaser
Good School
Average ROI of Analytics/Data Science
Hadoop + Spark Trends