Big Data and Data Intensive Computing:
Use Cases
LG
Woo-Myon-Dong, Korea
Sept 12th 2013
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
 Emerging Big Data Technology
 Big Data Use Cases
 Training in Big Data
 Big Data Supporters
 Hadoop 2.0
Me
Name: Jongwook Woo
Occupation:
Professor (rank: Associate Professor), California State University, Los Angeles
– Capital City of Entertainment
Career:
Professor since 2002: Computer Information Systems Dept, College of Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies in Hollywood and elsewhere since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
Interested in Hadoop and Big Data since around 2009
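Me
Career (continued):
Consulting for IglooSecurity as of summer 2013:
– Training on Hadoop and its ecosystem
– R&D on a system for fast search over security log files generated at 30GB to 100GB per day
• Using Hadoop, Solr, Java, Cloudera
Mid-September 2013: Samsung Advanced Institute of Technology
– Three-day training on Hadoop and its ecosystem scheduled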
– Introducing Cloudera material to Samsung, Korea
Experience in Big Data
Grants
Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Received Academic Education Partnership with Cloudera since June 2012
Linked with Hortonworks since May 2013
– Hortonworks is positive about providing a partnership
Experience in Big Data
Certificate
Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8, 2012)
Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24, 2012)
Blog and GitHub for Hadoop and its ecosystem
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
https://github.com/dalgual
Experience in Big Data
Several publications regarding Hadoop and NoSQL
“Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, in Proceedings of the International Joint Conference on Neural Networks, 2013
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
Collaboration with universities and companies
USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ
Cloudera, Hortonworks, VanillaBreeze, IglooSecurity
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
Emerging Big Data Technology
Giraph
Spark and Shark
Flume
Use Cases experienced
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
Data Issues
Large-scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
– Sensor data, bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
NoSQL DB
How to compute Big Data
Parallel computing with multiple inexpensive computers
– Own your own supercomputer
Hadoop 1.0
Hadoop
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph
Giraph
BSP (Bulk Synchronous Parallel)
Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
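To make the BSP model behind Giraph more concrete, here is a minimal single-process sketch in Python. It is not Giraph's API: the function name `bsp_max_value`, the adjacency-dict graph format, and the choice of max-value propagation as the example algorithm are all assumptions made for illustration. Each pass of the loop plays the role of one superstep: vertices read the messages sent in the previous superstep, update their state, and send new messages, with a barrier in between.

```python
from collections import defaultdict


def bsp_max_value(graph, values):
    """Propagate the largest vertex value through a graph, BSP style.

    graph:  {vertex: [neighbour, ...]}; every vertex must appear as a key.
    values: {vertex: initial numeric value}.
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Later supersteps: a vertex only acts if it received messages, and
    # only sends again if its value changed. No messages means halt.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        inbox = outbox  # barrier: messages become visible next superstep
        supersteps += 1

    return values, supersteps


if __name__ == "__main__":
    g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max_value(g, {"a": 3, "b": 6, "c": 2}))
```

The loop runs for as many supersteps as the algorithm needs, which is the kind of iteration that a single Hadoop 1.0 MapReduce job does not express directly.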
Spark and Shark
High-speed in-memory analytics over Hadoop and Hive data
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Fast data sharing
– Iterative graph algorithms
– Interactive queries
Flume
Flume
 Real-time data migration to Hadoop
 Cloudera material
Use Cases experienced
Log Analysis at IglooSecurity Inc
Log files from IPS and IDS
– 1.5GB per day for each system
Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm
Machine Learning for Image Processing with Texas A&M
Hadoop Streaming API
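As an illustration of the Market Basket Analysis use case, the following Python sketch counts how often pairs of items are bought together, split into a map step and a reduce step. It runs in one process and is only an assumed outline of the idea, not the algorithm from the cited papers; the function names and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_pairs(transaction):
    """Map step: emit ((item_a, item_b), 1) for every pair in one basket."""
    items = sorted(set(transaction))  # sort so (a, b) and (b, a) are one key
    for pair in combinations(items, 2):
        yield pair, 1


def reduce_counts(mapped):
    """Reduce step: sum the counts for each pair key."""
    totals = Counter()
    for key, count in mapped:
        totals[key] += count
    return totals


def market_basket(transactions):
    """Run the map step over all baskets, then reduce the emitted pairs."""
    mapped = (kv for t in transactions for kv in map_pairs(t))
    return reduce_counts(mapped)


if __name__ == "__main__":
    baskets = [["milk", "bread", "eggs"], ["bread", "milk"], ["eggs", "milk"]]
    for pair, count in market_basket(baskets).most_common():
        print(pair, count)
```

On a cluster, the map step would run per input split and the framework would group pairs by key before the reduce step; here a `Counter` stands in for that grouping.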
Use Cases in Korea
SK Telecom
Seoul
Credit Cards
Hyundai Motors
SK Telecom
T Map
Collect GPS traffic data from taxis, buses, and rental cars
– Traffic data from 50,000 cars every 5 minutes
Give the quickest route to the destination
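One way to picture the T Map pipeline is as two steps: aggregate the GPS reports into an average speed per road segment, then search for the route with the lowest travel time. The Python sketch below does this with Dijkstra's algorithm. It is an assumption about how such a service could work, not a description of SK Telecom's system; the data layout, function names, and sample road network are hypothetical.

```python
import heapq
from collections import defaultdict


def average_speeds(reports):
    """reports: iterable of ((from_node, to_node), speed_kmh).
    Returns the mean reported speed per road segment."""
    sums = defaultdict(lambda: [0.0, 0])
    for segment, speed in reports:
        sums[segment][0] += speed
        sums[segment][1] += 1
    return {seg: total / n for seg, (total, n) in sums.items()}


def quickest_route(lengths_km, speeds, start, goal):
    """Dijkstra over travel time in hours.
    lengths_km: {(from_node, to_node): kilometres}.
    Returns (path, hours), or (None, inf) if the goal is unreachable."""
    adj = defaultdict(list)
    for (u, v), km in lengths_km.items():
        speed = speeds.get((u, v), 0)
        if speed > 0:  # skip segments with no usable speed data
            adj[u].append((v, km / speed))

    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == goal:
            break
        if t > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj[u]:
            nt = t + cost
            if nt < best.get(v, float("inf")):
                best[v] = nt
                prev[v] = u
                heapq.heappush(heap, (nt, v))

    if goal not in best:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], best[goal]


if __name__ == "__main__":
    roads = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 15}
    gps = [(("A", "B"), 60), (("A", "B"), 40), (("B", "C"), 50),
           (("A", "C"), 10), (("A", "C"), 20)]
    print(quickest_route(roads, average_speeds(gps), "A", "C"))
```

In the sample data the direct road A to C is shorter but congested, so the route through B wins on time. With reports arriving every 5 minutes, the speed table would simply be recomputed per interval.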
Seoul
Night Bus
Collect GPS traffic data from taxis
Find the most frequently traveled routes
– Build night bus lines along them
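The Night Bus analysis can be sketched as a simple frequency count over taxi trips made during night hours. The Python below is a minimal illustration under assumed inputs: the trip tuple layout, the midnight to 5 a.m. window, and the zone names are hypothetical and not taken from the Seoul project.

```python
from collections import Counter


def frequent_night_routes(trips, top_n=3, night_start=0, night_end=5):
    """trips: iterable of (pickup_hour, origin_zone, destination_zone).
    Returns the top_n (origin, destination) pairs among trips whose
    pickup hour falls in [night_start, night_end)."""
    counts = Counter(
        (origin, dest)
        for hour, origin, dest in trips
        if night_start <= hour < night_end
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = [
        (1, "Gangnam", "Hongdae"),
        (2, "Gangnam", "Hongdae"),
        (3, "Jongno", "Gangnam"),
        (14, "Gangnam", "Hongdae"),  # daytime trip, ignored
    ]
    print(frequent_night_routes(sample, top_n=1))
```

The most frequent origin and destination pairs are then candidates for bus lines; at city scale the same count maps naturally onto a MapReduce job keyed by the zone pair.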
Credit Cards
Apps to find popular restaurants
Collect customer behavior from card usage at restaurants
Logic: frequency of visits to the same restaurant within 3 months
Show the popular restaurants
Credit cards for gas station discounts
Detect customers using a card at a gas station that does not provide discounts
Sell them a new card that gives a discount at any station
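The restaurant logic above (frequency of visits to the same restaurant within 3 months) could look roughly like the Python sketch below. The 90-day window, the threshold of two visits, the ranking by number of repeat customers, and the sample names are assumptions for illustration; the card companies' actual rules are not given in the slides.

```python
from collections import defaultdict
from datetime import date, timedelta


def popular_restaurants(transactions, as_of, window_days=90, min_visits=2):
    """transactions: iterable of (customer_id, restaurant, visit_date).
    Ranks restaurants by how many customers visited them at least
    min_visits times in the window_days before as_of."""
    cutoff = as_of - timedelta(days=window_days)

    # Count visits per (customer, restaurant) inside the window.
    visits = defaultdict(int)
    for customer, restaurant, day in transactions:
        if cutoff <= day <= as_of:
            visits[(customer, restaurant)] += 1

    # A restaurant scores one point per repeat customer.
    repeat_customers = defaultdict(int)
    for (customer, restaurant), n in visits.items():
        if n >= min_visits:
            repeat_customers[restaurant] += 1

    return sorted(repeat_customers.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    txs = [
        ("c1", "Kimchi House", date(2013, 8, 1)),
        ("c1", "Kimchi House", date(2013, 9, 1)),
        ("c2", "Kimchi House", date(2013, 7, 15)),
        ("c2", "Kimchi House", date(2013, 8, 20)),
        ("c1", "Noodle Bar", date(2013, 9, 2)),
        ("c3", "Noodle Bar", date(2013, 1, 5)),  # outside the window
        ("c3", "Noodle Bar", date(2013, 9, 3)),
    ]
    print(popular_restaurants(txs, as_of=date(2013, 9, 12)))
```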
Hyundai Motors
Improve the present and future models
Collect drivers’ behavior and the status of the cars
Collect any errors in the car
Use Cases
Presidential Election
Amazon AWS
HuffPost | AOL
Netflix
Presidential Election
People Behavior Analysis
Collect people’s data: credit card usage, car models, newspapers read, Facebook, Twitter
For example, a pro-environmental campaign targeting
– Moms
• who send their kids to public school,
• who tweet about organic foods
HuffPost | AOL [10]
Two Machine Learning Use Cases
Comment Moderation
–Evaluate All New HuffPost User Comments
Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every
Day
Article Classification
–Tag Articles for Advertising
• E.g.: scary, salacious, …
HuffPost | AOL [10]
Parallelize on Hadoop
Good news:
– Mahout, a parallel machine learning tool, is
already available.
– There are Mallet, libsvm, Weka, … that support
necessary algorithms.
Bad news:
– Mahout doesn’t support necessary algorithms
yet.
– Other algorithms do not run natively on Hadoop.
Build a flexible ML platform running on Hadoop
Pig for Hadoop implementation.
Netflix
Biggest video streaming company
Dominates the movie video industry
Using Amazon AWS
Customer Behavior Analysis
Recommendation Systems
Event to find the fastest MapReduce (MR) algorithm for customer recommendation
Others
amazon.com
Recommend books to customers
Google
Detect influenza outbreaks much earlier
– by analyzing the areas affected by influenza
Translator
– by analyzing data from many people
Siri of Apple
Natural language processing over data from many people
Training Hadoop and Ecosystems
Self-study
Are you sure you know the details?
– Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
Training program
Cloudera, Hortonworks
– $2,500, hands-on exercises
– About Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
Educational Partnership with Cloudera
– Training people at Samsung using Cloudera’s material
Educational Partnership with Hortonworks
– Invited to train people at the Big Data center of Gyung-gi province using Hortonworks’ material
Hadoop 2.0: YARN
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other commercial platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Enabled by allowing the use of paradigm-specific application masters
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
Big Data Supporters
Amazon AWS
Facebook
Twitter
Craigslist
Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business not IT management
Pay as you go
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB
Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
HiPIC received research and teaching
grants from AWS
Facebook [7]
Using Apache HBase
 For Titan and Puma
– Message Services
– ETL
 HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce
Titan: Facebook
Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second
Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable
Solution
Clustered HBase
Puma: Facebook
ETL
Extract, Transform, Load
– Integrating data from many sources into a data warehouse
Data analytics
– Domain owners’ web analytics for ads and apps
• clicks, likes, shares, comments, etc.
ETL before Puma
8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
Puma
– Real-time MapReduce framework
2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase
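The drop from hours to seconds comes from aggregating events as they arrive instead of in a nightly batch. The Python sketch below shows the general idea with fixed time windows. It is not Puma's implementation or API: the function name, the event tuple layout, and the 10-second window are assumptions chosen for illustration.

```python
from collections import defaultdict


def windowed_counts(events, window_secs=10):
    """events: iterable of (timestamp_secs, domain, action).
    Counts events per (window_start, domain, action), where each window
    covers window_secs seconds starting at a multiple of window_secs."""
    counts = defaultdict(int)
    for ts, domain, action in events:
        window_start = int(ts // window_secs) * window_secs
        counts[(window_start, domain, action)] += 1
    return dict(counts)


if __name__ == "__main__":
    stream = [
        (0, "example.com", "like"),
        (3, "example.com", "like"),
        (12, "example.com", "share"),
        (19, "example.com", "like"),
    ]
    for key, n in sorted(windowed_counts(stream).items()):
        print(key, n)
```

In a streaming system each window's counts would be written to a store such as HBase as soon as the window closes, so a domain owner sees clicks and likes within seconds.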
Twitter [8]
Three Challenges
Collecting Data
– Scribe, as at Facebook
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig, compared with Java MapReduce
• 5% of the code
• 5% of the dev time
• Within 20% of the running time
Craigslist in MongoDB [9]
Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers
Migrate to MongoDB
Scalable, Fast, Proven, Friendly
Hadoop Streaming
Hadoop MapReduce for non-Java code: Python, Ruby
Requirements
Running Hadoop
Needs the Hadoop Streaming API
– hadoop-streaming.jar
Needs Mapper and Reducer code to be written
– Simple conversion from sequential code
STDIN > mapper > reducer > STDOUT
Hadoop Streaming
MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming
Syntax
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
  -input <path>                  DFS input file(s) for the Map step
  -output <path>                 DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>    The streaming command to run
  -reducer <cmd|JavaClassName>   The streaming command to run
  -file <file>                   File/dir to be shipped in the Job jar file
Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
    -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
    -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
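The slides do not show the contents of mapper.py and reducer.py, so the following is only a sketch of what a word-count pair could look like, written as two Python functions so they can be tried locally. The function names and the tab-separated "word, count" line format are assumptions; the one property relied on is that Hadoop Streaming sorts the mapper output by key before it reaches the reducer.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Reduce step: input lines arrive sorted by key, so equal words are
    adjacent and can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local dry run: sorted() stands in for Hadoop's shuffle and sort.
    text = ["To be or", "not to be"]
    for out in reducer(sorted(mapper(text))):
        print(out)
```

To use these with the command above, mapper.py would print each line produced by `mapper(sys.stdin)` and reducer.py each line produced by `reducer(sys.stdin)`. The same pipeline can be checked without a cluster by piping a text file through the mapper script, `sort`, and the reducer script.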
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions, but mainly Hadoop
Storage: NoSQL DB
Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing, SNS for ads, user behavior, patterns …
Emerging Technology
Hadoop 2.0
Training is important
Questions?

More Related Content

What's hot (20)

PPTX
Rating Prediction using Deep Learning and Spark
Jongwook Woo
 
PPTX
Revenue Earned From Students in USA
ApekshitBhingardive
 
PPTX
Introduction to Big Data and AI for Business Analytics and Prediction
Jongwook Woo
 
PDF
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
PPTX
Introduction to Big Data and its Trends
Jongwook Woo
 
PPTX
On Big Data
arttan2001
 
PPTX
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
PDF
Big data primer
Stacia Misner
 
PDF
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
Gigaom
 
PPT
Big Data = Big Decisions
InnoTech
 
PDF
Database revolution opening webcast 01 18-12
mark madsen
 
PDF
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
 
PPTX
Self Guiding User Experience
Sri Ambati
 
PPT
1630 mon lomond ashley
UKSG: connecting the knowledge community
 
PDF
Digital Measurement - How to Turn Data into Actionable Insights
Datalicious
 
PPTX
Introduction of Data Science
Jason Geng
 
PDF
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Edward Curry
 
PDF
Machine Learning on Big Data with HADOOP
EPAM Systems
 
PDF
Hadoop is Happening
Precisely
 
PDF
The State of Artificial Intelligence in 2018: A Good Old Fashioned Report
Nathan Benaich
 
Rating Prediction using Deep Learning and Spark
Jongwook Woo
 
Revenue Earned From Students in USA
ApekshitBhingardive
 
Introduction to Big Data and AI for Business Analytics and Prediction
Jongwook Woo
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera, Inc.
 
Introduction to Big Data and its Trends
Jongwook Woo
 
On Big Data
arttan2001
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
Big data primer
Stacia Misner
 
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
Gigaom
 
Big Data = Big Decisions
InnoTech
 
Database revolution opening webcast 01 18-12
mark madsen
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
 
Self Guiding User Experience
Sri Ambati
 
Digital Measurement - How to Turn Data into Actionable Insights
Datalicious
 
Introduction of Data Science
Jason Geng
 
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ...
Edward Curry
 
Machine Learning on Big Data with HADOOP
EPAM Systems
 
Hadoop is Happening
Precisely
 
The State of Artificial Intelligence in 2018: A Good Old Fashioned Report
Nathan Benaich
 

Similar to Big Data and Data Intensive Computing: Use Cases (20)

PPTX
Big Data and Data Intensive Computing on Networks
Jongwook Woo
 
PPTX
Big Data and Advanced Data Intensive Computing
Jongwook Woo
 
PPTX
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Jongwook Woo
 
PPTX
Big Data Platform adopting Spark and Use Cases with Open Data
Jongwook Woo
 
PPTX
Introduction To Big Data and Use Cases using Hadoop
Jongwook Woo
 
PPTX
Introduction To Big Data and Use Cases on Hadoop
Jongwook Woo
 
PPTX
Big Data and Data Intensive Computing: Education and Training
Jongwook Woo
 
PPTX
Big Data Analysis and Industrial Approach using Spark
Jongwook Woo
 
PPTX
Spark ukc2015v1.1
Nillohit Bhattacharya
 
PDF
Spark tutorial @ KCC 2015
Jongwook Woo
 
PPTX
Introduction to Spark: Data Analysis and Use Cases in Big Data
Jongwook Woo
 
PPTX
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Jongwook Woo
 
PPTX
hydrogenbigdataanalysis
Manvi Chandra
 
PPTX
Big Data Trend and Open Data
Jongwook Woo
 
PPTX
Big Data Trend with Open Platform
Jongwook Woo
 
PPTX
Recent IT Development and Women: Big Data and The Power of Women in Goryeo
Jongwook Woo
 
PPTX
Big Data Concepts
Ahmed Salman
 
PPTX
Data analysis using hive ql &amp; tableau
pkale1708
 
PPT
Big Data Fundamentals in the Emerging New Data World
Jongwook Woo
 
PPTX
History and Trend of Big Data and Deep Learning
Jongwook Woo
 
Big Data and Data Intensive Computing on Networks
Jongwook Woo
 
Big Data and Advanced Data Intensive Computing
Jongwook Woo
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Jongwook Woo
 
Big Data Platform adopting Spark and Use Cases with Open Data
Jongwook Woo
 
Introduction To Big Data and Use Cases using Hadoop
Jongwook Woo
 
Introduction To Big Data and Use Cases on Hadoop
Jongwook Woo
 
Big Data and Data Intensive Computing: Education and Training
Jongwook Woo
 
Big Data Analysis and Industrial Approach using Spark
Jongwook Woo
 
Spark ukc2015v1.1
Nillohit Bhattacharya
 
Spark tutorial @ KCC 2015
Jongwook Woo
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Jongwook Woo
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Jongwook Woo
 
hydrogenbigdataanalysis
Manvi Chandra
 
Big Data Trend and Open Data
Jongwook Woo
 
Big Data Trend with Open Platform
Jongwook Woo
 
Recent IT Development and Women: Big Data and The Power of Women in Goryeo
Jongwook Woo
 
Big Data Concepts
Ahmed Salman
 
Data analysis using hive ql &amp; tableau
pkale1708
 
Big Data Fundamentals in the Emerging New Data World
Jongwook Woo
 
History and Trend of Big Data and Deep Learning
Jongwook Woo
 
Ad

More from Jongwook Woo (10)

PPTX
History and Application of LLM Leveraging Big Data
Jongwook Woo
 
PDF
How To Use Artificial Intelligence (AI) in History
Jongwook Woo
 
PPTX
Machine Learning in Quantum Computing
Jongwook Woo
 
PPTX
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Jongwook Woo
 
PPTX
Scalable Predictive Analysis and The Trend with Big Data & AI
Jongwook Woo
 
PPTX
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Jongwook Woo
 
PDF
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Jongwook Woo
 
PDF
President Election of Korea in 2017
Jongwook Woo
 
PPTX
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
PPTX
Introduction to Hadoop, Big Data, Training, Use Cases
Jongwook Woo
 
History and Application of LLM Leveraging Big Data
Jongwook Woo
 
How To Use Artificial Intelligence (AI) in History
Jongwook Woo
 
Machine Learning in Quantum Computing
Jongwook Woo
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Jongwook Woo
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Jongwook Woo
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Jongwook Woo
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Jongwook Woo
 
President Election of Korea in 2017
Jongwook Woo
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
Introduction to Hadoop, Big Data, Training, Use Cases
Jongwook Woo
 
Ad

Recently uploaded (20)

PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PDF
Advancing WebDriver BiDi support in WebKit
Igalia
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PPTX
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PPTX
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PDF
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Advancing WebDriver BiDi support in WebKit
Igalia
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 

Big Data and Data Intensive Computing: Use Cases

  • 1. jwoo Woo HiPIC CSULA Big Data and Data Intensive Computing: Use Cases LG Woo-Myon-Dong, Korea Sept 12th 2013 Jongwook Woo (PhD) High-Performance Information Computing Center (HiPIC) Educational Partner with Cloudera and Grants Awardee of Amazon AWS Computer Information Systems Department California State University, Los Angeles
  • 2. High Performance Information Computing Center Jongwook Woo CSULA Contents 소개  Emerging Big Data Technology  Big Data Use Cases  Training in Big Data  Big Data Supporters  Hadoop 2.0
  • 3. High Performance Information Computing Center Jongwook Woo CSULA Me  이름: 우종욱  직업:  교수 (직책: 부교수), California State University Los Angeles – Capital City of Entertainment  경력:  2002년 부터 교수: Computer Information Systems Dept, College of Business and Economics – www.calstatela.edu/faculty/jwoo5  1998년부터 헐리우드등지의 많은 회사 컨설팅 – 주로 J2EE 미들웨어를 이용한 eBusiness applications 구축 – FAST, Lucene/Solr, Sphinx 검색엔진을 이용한 정보추출, 정보통합 – Warner Bros (Matrix online game), E!, citysearch.com, ARM 등  2009여년 부터 하둡 빅데이타에 관심
  • 4. High Performance Information Computing Center Jongwook Woo CSULA Me 경력 (계속): 2013년 여름 현재 IglooSecurity 자문중: – Hadoop 및 그 Ecosystems 교육 – 하루에 30GB – 100GB씩 생성되는 보안관련 로그 파일들을 빠르게 데이타 검색하는 시스템 R&D • Hadoop, Solr, Java, Cloudera 이용 2013년 9월 중순: 삼성 종합 기술원 – 3일간 Hadoop 및 그 Ecosystems 교육 예정 – Introducing Cloudera material to Samsung, Korea
  • 5. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Grants  Received Amazon AWS in Education Research Grant (July 2012 - July 2014)  Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011  Partnership  Received Academic Education Partnership with Cloudera since June 2012  Linked with Hortonworks since May 2013 – Positive to provide partnership
  • 6. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Certificate  Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8 2012  Certificate of 10gen Training Course, “M101: MongoDB Development”, (Dec 24 2012)  Blog and Github for Hadoop and its ecosystems  http://dal-cloudcomputing.blogspot.com/ – Hadoop, AWS, Cloudera  https://github.com/hipic – Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop  https://github.com/dalgual
  • 7. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Several publications regarding Hadoop and NoSQL  “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”. Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe. in Proceedings of the International Joint Conference on Neural Networks, 2013  “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)  “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”,Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011  “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)  Collaboration with Universities and companies  USC, Texas A&M, Yonsei, Sookmyung, KAIST, Korean Polytech Univ  Cloudera, Hortonworks, VanillaBreeze, IglooSecurity,
  • 8. High Performance Information Computing Center Jongwook Woo CSULA What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 9. High Performance Information Computing Center Jongwook Woo CSULA Data Google “We don’t have a better algorithm than others but we have more data than others”
  • 10. High Performance Information Computing Center Jongwook Woo CSULA Emerging Big Data Technology Giraph Spark and Shark Flume Use Cases experienced
  • 11. High Performance Information Computing Center Jongwook Woo CSULA New Data Trend Sparsity Unstructured Schema free data with sparse attributes – Semantic or social relations No relational property – nor complex join queries • Log data Immutable No need to update and delete data
  • 12. High Performance Information Computing Center Jongwook Woo CSULA Data Issues Large-Scale data Tera-Byte (1012), Peta-byte (1015) – Because of web – Sensor Data, Bioinformatics, Social Computing, smart phone, online game… Cannot handle with the legacy approach Too big Un-/Semi-structured data Too expensive Need new systems Non-expensive
  • 13. High Performance Information Computing Center Jongwook Woo CSULA Two Cores in Big Data How to store Big Data NoSQL DB How to compute Big Data Parallel Computing with multiple non- expensive computers –Own super computers
  • 14. High Performance Information Computing Center Jongwook Woo CSULA Hadoop 1.0 Hadoop MapReduce HDFS Restricted Parallel Programming – Not for iterative algorithms – Not for graph
  • 15. High Performance Information Computing Center Jongwook Woo CSULA Giraph BSP Facebook http://www.slideshare.net/aladagemre/a-talk- on-apache-giraph
  • 16. High Performance Information Computing Center Jongwook Woo CSULA Spark and Shark High Speed In-Memory Analytics over Hadoop and Hive data http://www.slideshare.net/Hadoop_Summit/s park-and-shark  Fast Data Sharing –Iterative Graph Algorithms –Interactive Query
  • 17. High Performance Information Computing Center Jongwook Woo CSULA Flume Flume  Real-time data migration to Hadoop  Cloudera material
  • 18. High Performance Information Computing Center Jongwook Woo CSULA Use Cases experienced Log Analysis at IglooSecurity Inc  Log files from IPS and IDS –1.5GB per day for each systems  Extracting unusual cases using Hadoop, Solr, Flume on Cloudera Customer Behavior Analysis Market Basket Analysis Algorithm  Machine Learning for Image Processing with Texas A&M Hadoop Streaming API
  • 19. High Performance Information Computing Center Jongwook Woo CSULA Use Cases in Korea SK Telecomm Seoul Credit Cards Hyundai Motors
  • 20. High Performance Information Computing Center Jongwook Woo CSULA SK Telecomm T Map  Collect GPS traffic data from Taxi, Bus, Rental Car – Every 5 mins. Traffic data from 50,000 cars  Tell the quickest directions to the destination
  • 21. High Performance Information Computing Center Jongwook Woo CSULA Seoul Night Bus  Collect GPS traffic data from Taxi  Find out the most frequent traffics –Build Bus lines in the night
  • 22. High Performance Information Computing Center Jongwook Woo CSULA Credit Cards Apps to find out popular restaurants Collect customers behavior, which occurred using the cards at the restaurants Based on Logic: Frequency to visit the same restaurants in 3 months Show the popular restaurants Credit Cards for Gas Station discount Using a card at a gas station that does not provide discounts Sell a new card that gives a discount at any station
  • 23. High Performance Information Computing Center Jongwook Woo CSULA Hyundai Motors Improve the present and future models Collect drivers’ behavior and the status of the cars Collect any errors in the car
  • 24. High Performance Information Computing Center Jongwook Woo CSULA Use Cases President Election Amazon AWS HuffPOst | AOL Netflix
  • 25. High Performance Information Computing Center Jongwook Woo CSULA President Election People Behavior Analysis Collect people’s data of Credit card usages, Car models, Newspapers to read, Facebook, Twitter For example, pro-environmental Campaign for – Mom • who sends the kids to the public school, • who twits about Organic foods,
  • 26. High Performance Information Computing Center Jongwook Woo CSULA HuffPost | AOL [10] Two Machine Learning Use Cases Comment Moderation –Evaluate All New HuffPost User Comments Every Day • Identify Abusive / Aggressive Comments • Auto Delete / Publish ~25% Comments Every Day Article Classification –Tag Articles for Advertising • E.g.: scary, salacious, …
  • 27. HuffPost | AOL [10]: Parallelize on Hadoop
      – Good news
          • Mahout, a parallel machine learning tool, is already available.
          • Mallet, libsvm, Weka, … support the necessary algorithms.
      – Bad news
          • Mahout doesn’t support the necessary algorithms yet.
          • The other tools do not run natively on Hadoop.
      – Build a flexible ML platform running on Hadoop
      – Use Pig for the Hadoop implementation
  • 28. Netflix
      – Biggest video streaming company
      – Dominates the movie video industry
      – Uses Amazon AWS
      – Customer behavior analysis
      – Recommendation systems
          • Held an event to find the fastest MapReduce (MR) algorithm for customer recommendations
  • 29. Others
      – amazon.com: recommends books to people
      – Google
          • Detects influenza much earlier by analyzing the areas affected by influenza
          • Translator: built by analyzing data from many people
      – Apple’s Siri: natural language processing built from many people’s data
  • 30. Training in Hadoop and Its Ecosystem
      – Self-study: are you sure you know the details?
          • Sqoop, Hive, Pig, Combiner, Partitioner, setting the number of Reducers, …
      – Training programs: Cloudera, Hortonworks
          • $2,500, with hands-on exercises
          • Cover Hadoop, HBase, Hive/Pig, data analysis, data mining, etc.
      – Educational partnership with Cloudera
          • Training people at Samsung using Cloudera’s material
      – Educational partnership with Hortonworks
          • Invited to train people at the Big Data center of Gyung-gi province using Hortonworks’ material
  • 31. Hadoop 2.0: YARN
      – Data processing applications and services
          • Online serving: HOYA (HBase on YARN)
          • Real-time event processing: Storm, S4, other commercial platforms
          • Tez: generic framework to run a complex DAG
          • MPI: OpenMPI, MPICH2
          • Master-Worker
          • Machine learning: Spark
          • Graph processing: Giraph
      – Enabled by allowing the use of paradigm-specific application masters
      – [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • 32. Big Data Supporters: Amazon AWS, Facebook, Twitter, Craigslist
  • 33. Amazon AWS
      – amazon.com: consumer and seller business
      – aws.amazon.com: IT infrastructure business
          • Focus on your business, not IT management
      – Pay as you go
      – Services with many APIs
          • S3: Simple Storage Service
          • EC2: Elastic Compute Cloud; provides many virtual Linux servers and can run on multiple nodes
          • Hadoop and HBase
          • MongoDB
  • 34. Amazon AWS (Cont’d)
      – Customers on aws.amazon.com
          • Samsung: Smart TV hub sites; TV applications are on AWS
          • Netflix: ~25% of US internet traffic; ~100% on AWS
          • NASA JPL: analyzes more than 200,000 images
          • NASDAQ: uses AWS S3
      – HiPIC received research and teaching grants from AWS
  • 35. Facebook [7]
      – Uses Apache HBase for Titan and Puma
          • Message services
          • ETL
      – HBase for FB
          • Provides excellent write performance and good reads
          • Nice features: scalable, fault tolerant, MapReduce
  • 36. Titan: Facebook
      – Message services in FB
          • Hundreds of millions of active users
          • 15+ billion messages a month
          • 50K instant messages a second
      – Challenges
          • High write throughput: every message, instant message, SMS, email
          • Massive clusters: must be easily scalable
      – Solution: clustered HBase
  • 37. Puma: Facebook
      – ETL: Extract, Transform, Load
          • Integrates data from many data sources into the data warehouse
      – Data analytics
          • Domain owners’ web analytics for ads and apps: clicks, likes, shares, comments, etc.
      – ETL before Puma
          • 8 – 24 hours
          • Procedure: Scribe, HDFS, Hive, MySQL
      – ETL after Puma
          • Puma: real-time MapReduce framework
          • 2 – 30 secs
          • Procedure: Scribe, HDFS, Puma, HBase
  • 38. Twitter [8]: Three Challenges
      – Collecting data
          • Scribe, as at FB
      – Large-scale storage and analysis
          • Cassandra: ColumnFamily key-value store
          • Hadoop
      – Rapid learning over Big Data
          • Pig: 5% of the Java code, 5% of the dev time, within 20% of the running time
  • 39. Craigslist in MongoDB [9]
      – Craigslist
          • ~700 cities, worldwide
          • ~1 billion hits/day
          • ~1.5 million posts/day
      – Servers
          • ~500 servers
          • ~100 MySQL servers
      – Migrated to MongoDB: scalable, fast, proven, friendly
  • 40. Hadoop Streaming
      – Hadoop MapReduce for non-Java code: Python, Ruby
      – Requirements
          • A running Hadoop installation
          • The Hadoop Streaming API: hadoop-streaming.jar
          • Mapper and Reducer code: a simple conversion from sequential code
      – Data flow: STDIN > mapper > reducer > STDOUT
  • 41. Hadoop Streaming: MapReduce Python Execution
      – Reference: http://wiki.apache.org/hadoop/HadoopStreaming
      – Syntax
          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
      – Options
          -input <path>                    DFS input file(s) for the Map step
          -output <path>                   DFS output directory for the Reduce step
          -mapper <cmd|JavaClassName>      The streaming command to run
          -reducer <cmd|JavaClassName>     The streaming command to run
          -file <file>                     File/dir to be shipped in the Job jar file
      – Example
          $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
              -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
              -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
              -input /user/jwoo/shakespeare/* \
              -output /user/jwoo/shakespeare-output
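The slides name `mapper.py` and `reducer.py` but do not show their contents. The following is a minimal sketch of what such a pair could look like, assuming a word count over the Shakespeare input; the word-count task and the function names are assumptions. Both steps are written as plain functions in one file so the STDIN > mapper > reducer > STDOUT flow can be tried locally; in an actual streaming job each would be a separate script that reads `sys.stdin` and prints its records.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Hadoop Streaming sorts mapper output by key before the reduce step,
    so records with equal keys arrive next to each other.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate mapper > sort > reducer on one machine, without a cluster."""
    return list(reducer(sorted(mapper(lines))))
```

For example, `run_locally(["To be or not to be"])` yields one `word<TAB>count` record per distinct word. The `sorted` call stands in for the shuffle-and-sort phase that Hadoop performs between the two steps.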
  • 42. Conclusion
      – Era of Big Data
      – Need to store and compute Big Data
          • Many solutions, but chiefly Hadoop
          • Storage: NoSQL DB
          • Computation: Hadoop MapReduce
      – Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
      – Emerging technology: Hadoop 2.0
      – Training is important
  • 43. Questions?