Big Data and Data Intensive Computing: Use Cases

LG
Woo-Myon-Dong, Korea
Sept 12th 2013
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles

High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
Emerging Big Data Technology
Big Data Use Cases
Training in Big Data
Big Data Supporters
Hadoop 2.0

Me
Name: Jongwook Woo
Position:
Professor (rank: Associate Professor), California State University, Los Angeles
– Capital City of Entertainment
Career:
Professor since 2002: Computer Information Systems Dept, College of Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies in and around Hollywood since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Interested in Hadoop and Big Data since around 2009

Me
Career (continued):
Advising IglooSecurity as of summer 2013:
– Training on Hadoop and its ecosystem
– R&D on a system that quickly searches security log files generated at 30GB – 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– 3-day training on Hadoop and its ecosystem scheduled
– Introducing Cloudera material to Samsung, Korea

Experience in Big Data
Grants
– Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
– Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
– Received Academic Education Partnership with Cloudera since June 2012
– Linked with Hortonworks since May 2013
• Positive about providing a partnership

Certificate
– Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8, 2012)
– Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24, 2012)
Blog and GitHub for Hadoop and its ecosystems
– http://dal-cloudcomputing.blogspot.com/
• Hadoop, AWS, Cloudera
– https://github.com/hipic
• Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
– https://github.com/dalgual

Several publications regarding Hadoop and NoSQL
– “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, in Proceedings of the International Joint Conference on Neural Networks, 2013
– “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
– “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
– “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
Collaboration with universities and companies
– USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ
– Cloudera, Hortonworks, VanillaBreeze, IglooSecurity

What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing

Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”

Emerging Big Data Technology
Giraph
Spark and Shark
Flume
Use Cases experienced

New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data

Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data, bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel computing with multiple inexpensive computers
– Your own supercomputer

Hadoop 1.0
Hadoop
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing

Giraph
BSP (Bulk Synchronous Parallel)
Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph

Spark and Shark
High-speed in-memory analytics over Hadoop and Hive data
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Fast data sharing
– Iterative graph algorithms
– Interactive queries

Flume
Real-time data migration to Hadoop
Cloudera material

Use Cases experienced
Log Analysis at IglooSecurity Inc
– Log files from IPS and IDS
• 1.5GB per day for each system
– Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
Customer Behavior Analysis
– Market Basket Analysis Algorithm
Machine Learning for Image Processing with Texas A&M
– Hadoop Streaming API
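
The slide names Market Basket Analysis without showing how it works. As a rough illustration only, one common formulation counts the item pairs that appear together in a basket, with a map step that emits pairs and a reduce step that sums them. The sketch below runs in a single Python process rather than on Hadoop, and it is not the algorithm from the author's papers; the function names, the sample baskets, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_transaction(items):
    """Map step: emit ((item_a, item_b), 1) for every item pair in one basket."""
    unique_items = sorted(set(items))  # sorting makes (a, b) and (b, a) one key
    return [(pair, 1) for pair in combinations(unique_items, 2)]


def reduce_pairs(mapped):
    """Reduce step: sum the counts emitted for each pair."""
    totals = Counter()
    for pair, count in mapped:
        totals[pair] += count
    return totals


def frequent_pairs(transactions, min_support):
    """Keep only the pairs that occur in at least min_support baskets."""
    mapped = [kv for basket in transactions for kv in map_transaction(basket)]
    totals = reduce_pairs(mapped)
    return {pair: n for pair, n in totals.items() if n >= min_support}


# Made-up sample data, for illustration only.
baskets = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
]
print(frequent_pairs(baskets, min_support=2))
```

On a cluster, the map step would run per transaction on many nodes and the framework would group the pairs before the reduce step; the in-process list here only stands in for that shuffle.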

Use Cases in Korea
SK Telecom
Seoul
Credit Cards
Hyundai Motors

SK Telecom
T Map
Collect GPS traffic data from taxis, buses, and rental cars
– Traffic data from 50,000 cars every 5 minutes
Tell the quickest route to the destination

Seoul
Night Bus
Collect GPS traffic data from taxis
Find where traffic is most frequent
– Build night bus lines

Credit Cards
Apps to find popular restaurants
– Collect customer behavior from card usage at restaurants
– Logic: frequency of visits to the same restaurant within 3 months
– Show the popular restaurants
Credit cards for gas station discounts
– Detect customers using a card at a gas station that does not provide discounts
– Sell them a new card that gives a discount at any station
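
The restaurant logic on this slide (how often a customer visits the same restaurant in 3 months) can be sketched in a few lines of Python. This is only one guess at how such a ranking could work, not the card companies' actual method: the 90-day window, the `min_visits` threshold, the scoring by number of repeat customers, and the sample transactions are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def popular_restaurants(transactions, today, window_days=90, min_visits=2, top_n=3):
    """Rank restaurants by how many customers returned within the window."""
    cutoff = today - timedelta(days=window_days)
    # Count visits per (customer, restaurant) pair inside the window.
    visits = Counter(
        (customer, restaurant)
        for customer, restaurant, day in transactions
        if cutoff <= day <= today
    )
    # A restaurant scores one point per customer who came back often enough.
    repeaters = Counter(
        restaurant for (_, restaurant), n in visits.items() if n >= min_visits
    )
    return repeaters.most_common(top_n)


# Made-up card transactions: (customer id, restaurant, date of payment).
transactions = [
    ("c1", "Noodle House", date(2013, 8, 1)),
    ("c1", "Noodle House", date(2013, 8, 20)),
    ("c2", "Noodle House", date(2013, 7, 5)),
    ("c2", "Noodle House", date(2013, 9, 1)),
    ("c2", "BBQ Town", date(2013, 8, 15)),
    ("c3", "BBQ Town", date(2013, 1, 10)),  # falls outside the 90-day window
    ("c3", "BBQ Town", date(2013, 9, 2)),
]
print(popular_restaurants(transactions, today=date(2013, 9, 12)))
```

At card-company scale the two counting passes would map naturally onto MapReduce jobs keyed first by (customer, restaurant) and then by restaurant.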

Hyundai Motors
Improve the present and future models
Collect drivers’ behavior and the status of the cars
Collect any errors in the car

Use Cases
Presidential Election
Amazon AWS
HuffPost | AOL
Netflix

Presidential Election
People Behavior Analysis
Collect people’s data: credit card usage, car models, newspapers read, Facebook, Twitter
For example, target a pro-environmental campaign at
– Moms
• who send their kids to public school,
• who tweet about organic foods

HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
– Evaluate all new HuffPost user comments every day
• Identify abusive / aggressive comments
• Auto delete / publish ~25% of comments every day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …

HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is already available.
– Mallet, libsvm, Weka, … support the necessary algorithms.
Bad news:
– Mahout doesn’t support the necessary algorithms yet.
– The other tools do not run natively on Hadoop.
Build a flexible ML platform running on Hadoop
– Pig for the Hadoop implementation

Netflix
Largest video streaming company
Dominates the movie video industry
Using Amazon AWS
Customer Behavior Analysis
Recommendation Systems
Competition to find the fastest customer recommendation MR (MapReduce) algorithm

Others
amazon.com
Recommends books to people
Google
Detects influenza much earlier
– by analyzing the areas affected by influenza
Translator
– by analyzing data from many people
Siri of Apple
Natural language processing from the data of many people

Training Hadoop and Ecosystems
Self-study
Are you sure you know the details?
– Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
Training program
Cloudera, Hortonworks
– $2,500, hands-on exercises
– About Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
Educational Partnership with Cloudera
– Training people at Samsung using Cloudera’s material
Educational Partnership with Hortonworks
– Invited to train people at the Big Data center of Gyung-gi Province using Hortonworks’ material

Hadoop 2.0: YARN
Data processing applications and services
Online serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other commercial platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Enabled by allowing the use of paradigm-specific application masters
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]

Big Data Supporters
Amazon AWS
Facebook
Twitter
Craigslist

Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB

Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS

Facebook [7]
Using Apache HBase
For Titan and Puma
– Message services
– ETL
HBase for FB
– Provides excellent write performance and good reads
– Nice features
• Scalable
• Fault tolerance
• MapReduce

Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase

Puma: Facebook
ETL
– Extract, Transform, Load
• Integrating data from many data sources into a data warehouse
Data analytics
– Domain owners’ web analytics for ads and apps
• clicks, likes, shares, comments, etc.
ETL before Puma
– 8 – 24 hours
• Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
– Puma
• Real-time MapReduce framework
– 2 – 30 secs
• Procedures: Scribe, HDFS, Puma, HBase

Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at Facebook
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time

Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly

Hadoop Streaming
Hadoop MapReduce for non-Java code: Python, Ruby
Requirements
– A running Hadoop installation
– The Hadoop Streaming API
• hadoop-streaming.jar
– Mapper and Reducer code
• Simple conversion from sequential code
STDIN > mapper > reducer > STDOUT

Hadoop Streaming
MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming
Syntax
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
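
The slides do not show what mapper.py and reducer.py contain. Assuming the job is a word count over the Shakespeare input (a guess based on the paths, not something the slides state), the two stages could look roughly like the functions below. Everything here is a hypothetical sketch: the function names, the single-file layout, and the `map` / `reduce` command-line switch are illustrative, and in practice each stage would normally live in its own script that reads STDIN and writes STDOUT.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Hadoop Streaming sorts mapper output by key before the reducer reads it,
    so records for the same word arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate 'cat input | mapper | sort | reducer' inside one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    # Hypothetical wiring: run this file with 'map' or 'reduce' as its argument.
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

`run_locally(["to be or not to be"])` gives a quick check of the logic on a laptop before a job is submitted to a cluster.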

Conclusion
Era of Big Data
Need to store and compute Big Data
– Many solutions, but mainly Hadoop
– Storage: NoSQL DB
– Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns …
Emerging Technology
– Hadoop 2.0
Training is important

Questions?
