Big Data Analytics
Introduction
Theme of this Course
Large-Scale Data Management
Big Data Analytics
Data Science and Analytics
• How to manage very large amounts of data and extract value and
knowledge from them
Introduction to Big Data
What is Big Data?
What makes data, “Big” Data?
Big Data Definition
• No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series, social
media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types
of data
Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, and
what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body →
any abnormal measurement requires immediate reaction
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely and scalable manner
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
• Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
• Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time flavor
Value of Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
Challenges in Handling Big Data
• The Bottleneck is in technology
• New architecture, algorithms, techniques are needed
• Also in technical skills
• Experts in using the new technology and dealing with big data
What Technology Do We Have
For Big Data ??
Big Data Technology
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way
• Learn writing Hadoop jobs in different languages
• Programming Languages: Java, C, Python
• High-Level Languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
• RHadoop: Statistical tools for managing big data
• Mahout: Data mining and machine learning tools over big data
• Learn state-of-the-art technology from recent research papers
• Optimizations, indexing techniques, and other extensions to Hadoop
Course Logistics
Course Logistics
• Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/
• Electronic WPI system: blackboard.wpi.edu
• Lectures
• Tuesday, Thursday: (4:00pm - 5:20pm)
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)
• Reading List
• We will cover state-of-the-art technology from research papers at major
conferences
• Many Hadoop-related papers are available on the course website
• Related books:
• Hadoop, The Definitive Guide [pdf]
Requirements & Grading
• Seminar-Type Course
• Students will read research papers and present them (Reading List)
• Hands-on Course
• No written homework or exams
• Several coding projects covering the entire semester (done in teams of two)
Requirements & Grading (Cont’d)
• Reviews
• When a team is presenting (not the instructor), the other students should prepare a
review on the presented paper
• Course website gives guidelines on how to make good reviews
• Reviews are done individually
Late Submission Policy
• For Projects
• One day late → 10% off the max grade
• Two days late → 20% off the max grade
• Three days late → 30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via blackboard system by the due date
• Demonstrated to the instructor within the week after
• For Reviews
• No late submissions
• Student may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class
More about Projects
• A virtual machine is created including the needed platform for the projects
• Ubuntu OS (Version 12.10)
• Hadoop platform (Version 1.1.0)
• Apache Pig (Version 0.10.0)
• Mahout library (Version 0.7)
• Rhadoop
• In addition to other software packages
• Download it from the course website (link)
• Username and password will be sent to you
• Need Virtual Box (Vbox) [free]
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List), each team selects
its first paper to present (1st come 1st served)
• Send me your top 2-3 choices
3. You have until Jan 20th
• Otherwise, I’ll randomly form teams and assign papers
4. Use Blackboard “Discussion” forum for posts or for
searching for teammates
Course Output: What You
Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way
• Learn writing Hadoop jobs in different languages
• Programming Languages: Java, C, Python
• High-Level Languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
• RHadoop: Statistical tools for managing big data
• Mahout: Analytics and data mining tools over big data
• Learn state-of-the-art technology from recent research papers
• Optimizations, indexing techniques, and other extensions to Hadoop
Open Source World’s Solution
 Google File System – Hadoop Distributed FS
 Map-Reduce – Hadoop Map-Reduce
 Sawzall – Pig, Hive, JAQL
 Big Table – Hadoop HBase, Cassandra
 Chubby – Zookeeper
Simplified Search Engine Architecture
[Diagram: the Internet is crawled by a Spider Runtime; the SE Web Server writes to Search Log Storage; a Batch Processing System on top of Hadoop processes the logs.]
Simplified Data Warehouse Architecture
[Diagram: a Web Server writes View/Click/Events Log Storage; a Batch Processing System on top of Hadoop feeds the Database, which supports Business Intelligence together with Domain Knowledge.]
Hadoop History
 Jan 2006 – Doug Cutting joins Yahoo
 Feb 2006 – Hadoop splits out of Nutch and Yahoo starts using it
 Dec 2006 – Yahoo creating 100-node Webmap with Hadoop
 Apr 2007 – Yahoo on 1000-node cluster
 Dec 2007 – Yahoo creating 1000-node Webmap with Hadoop
 Jan 2008 – Hadoop made a top-level Apache project
 Sep 2008 – Hive added to Hadoop as a contrib project
Hadoop Introduction
 Open Source Apache Project
 http://hadoop.apache.org/
 Book: http://oreilly.com/catalog/9780596521998/index.html
 Written in Java
 Does work with other languages
 Runs on
 Linux, Windows and more
 Commodity hardware with high failure rate
Current Status of Hadoop
 Largest Cluster
 2000 nodes (8 cores, 4TB disk)
 Used by 40+ companies / universities over
the world
 Yahoo, Facebook, etc
 Cloud Computing Donation from Google and IBM
 Startups focusing on providing services for Hadoop
 Cloudera
Hadoop Components
 Hadoop Distributed File System (HDFS)
 Hadoop Map-Reduce
 Contrib projects
 Hadoop Streaming
 Pig / JAQL / Hive
 HBase
 Hama / Mahout
Hadoop Distributed File System
Goals of HDFS
 Very Large Distributed File System
 10K nodes, 100 million files, 10 PB
 Convenient Cluster Management
 Load balancing
 Node failures
 Cluster expansion
 Optimized for Batch Processing
 Allows moving computation to the data
 Maximize throughput
HDFS Architecture
HDFS Details
 Data Coherency
 Write-once-read-many access model
 Client can only append to existing files
 Files are broken up into blocks
 Typically 128 MB block size
 Each block replicated on multiple DataNodes
 Intelligent Client
 Client can find location of blocks
 Client accesses data directly from DataNode
HDFS User Interface
 Java API
 Command Line
 hadoop dfs -mkdir /foodir
 hadoop dfs -cat /foodir/myfile.txt
 hadoop dfs -rm /foodir/myfile.txt
 hadoop dfsadmin -report
 hadoop dfsadmin -decommission datanodename
 Web Interface
 http://host:port/dfshealth.jsp
More about HDFS
 http://hadoop.apache.org/core/docs/current/hdfs_design.html
 Hadoop FileSystem API
 HDFS
 Local File System
 Kosmos File System (KFS)
 Amazon S3 File System
Hadoop Map-Reduce and
Hadoop Streaming
Hadoop Map-Reduce Introduction
 Map/Reduce works like a parallel Unix pipeline:
 cat input | grep | sort | uniq -c | cat > output
 Input | Map | Shuffle & Sort | Reduce | Output
 Framework does inter-node communication
 Failure recovery, consistency etc
 Load balancing, scalability etc
 Fits a lot of batch processing applications
 Log processing
 Web index building
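The pipeline analogy above can be sketched in a few lines of plain Python (a toy, single-process simulation of the Input | Map | Shuffle & Sort | Reduce stages, not actual Hadoop code):

```python
from itertools import groupby

def map_phase(lines):
    # Map: emit a <word, 1> pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle & Sort: bring identical keys together (like `sort`)
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_phase(grouped):
    # Reduce: sum the 1s per key (like `uniq -c`)
    return {key: sum(v for _, v in group) for key, group in grouped}

counts = reduce_phase(shuffle_sort(map_phase(["web weed green", "web green sun"])))
print(counts)
```

In a real cluster the framework runs each stage on many nodes and does the inter-node communication, failure recovery, and load balancing itself.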
(Simplified) Map Reduce Review
[Diagram: Machine 1 holds <k1,v1>, <k2,v2>, <k3,v3>; Machine 2 holds <k4,v4>, <k5,v5>, <k6,v6>.]
Local Map: each machine maps its pairs to new pairs <nk1,nv1>, <nk2,nv2>, <nk3,nv3>, <nk2,nv4>, <nk2,nv5>, <nk1,nv6>.
Global Shuffle: pairs with the same new key are routed to the same machine (<nk2,nv4>, <nk2,nv5>, <nk2,nv2> together; <nk1,nv1>, <nk3,nv3>, <nk1,nv6> together).
Local Sort: each machine sorts its pairs by key (<nk1,nv1>, <nk1,nv6>, <nk3,nv3>; <nk2,nv4>, <nk2,nv5>, <nk2,nv2>).
Local Reduce: values are combined per key: <nk2,3>, <nk1,2>, <nk3,1>.
Physical Flow
Example Code
Hadoop Streaming
 Allows writing Map and Reduce functions in any language
 Hadoop Map/Reduce natively accepts only Java
 Example: Word Count
 hadoop streaming
-input /user/zshao/articles
-mapper 'tr " " "\n"'
-reducer 'uniq -c'
-output /user/zshao/
-numReduceTasks 32
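The two streaming programs are just filters over stdin/stdout. A minimal Python sketch of the same word-count mapper and reducer (returning lists here instead of printing, so the logic is easy to check; a real streaming script would read sys.stdin and print, with the sort done by the framework between the two steps):

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: like `tr " " "\n"`, emit one word per line
    return [word for line in lines for word in line.split()]

def reducer(sorted_words):
    # Streaming reducer: like `uniq -c`, count runs of identical,
    # already-sorted words
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted_words)]

result = reducer(sorted(mapper(["web weed", "web green"])))
print(result)
```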
Example: Log Processing
 Generate #pageview and #distinct users
for each page each day
 Input: timestamp url userid
 Generate the number of page views
 Map: emit < <date(timestamp), url>, 1>
 Reduce: add up the values for each row
 Generate the number of distinct users
 Map: emit < <date(timestamp), url, userid>, 1>
 Reduce: for the set of rows with the same <date(timestamp), url>, count
the number of distinct users ("uniq -c")
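Both jobs can be imitated in plain Python (a sketch; the `date()` helper is assumed to simply truncate the timestamp string, and real jobs would run the two map/reduce passes on the cluster):

```python
from collections import defaultdict

def date(ts):
    # assumed helper: keep only the day part of a "day hh:mm" timestamp
    return ts.split()[0]

def pageviews(log):
    # Job 1 -- Map: emit <<date, url>, 1>; Reduce: add up values per key
    counts = defaultdict(int)
    for ts, url, userid in log:
        counts[(date(ts), url)] += 1
    return dict(counts)

def distinct_users(log):
    # Job 2 -- Map: emit <<date, url, userid>, 1>; Reduce: count the
    # distinct userids within each <date, url> group
    users = defaultdict(set)
    for ts, url, userid in log:
        users[(date(ts), url)].add(userid)
    return {key: len(uids) for key, uids in users.items()}

log = [("d1 09:00", "/a", "u1"), ("d1 09:05", "/a", "u1"), ("d1 09:07", "/a", "u2")]
```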
Example: Page Rank
 In each Map/Reduce Job:
 Map: emit <link, eigenvalue(url)/#links>
for each input: <url, <eigenvalue, vector<link>> >
 Reduce: add all values up for each link, to generate the new
eigenvalue for that link.
 Run 50 map/reduce jobs till the eigenvalues are
stable.
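One iteration of that job is easy to express directly. The sketch below is plain Python with a uniform starting eigenvalue, no damping factor, and the assumption that every url has at least one outgoing link (all assumptions beyond the slide):

```python
def pagerank(links, iterations=50):
    # links: url -> list of outgoing links
    rank = {url: 1.0 / len(links) for url in links}  # uniform start
    for _ in range(iterations):
        # Map: each url emits eigenvalue(url)/#links for every outgoing link
        contrib = {url: 0.0 for url in links}
        for url, outs in links.items():
            for link in outs:
                contrib[link] += rank[url] / len(outs)
        # Reduce: the summed contributions become the new eigenvalues
        rank = contrib
    return rank

ranks = pagerank({"a": ["b"], "b": ["a"]})
```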
TODO: Split Job Scheduler and Map-
Reduce
 Allow easy plug-in of different scheduling
algorithms
 Scheduling based on job priority, size, etc
 Scheduling for CPU, disk, memory, network bandwidth
 Preemptive scheduling
 Allow running MPI or other jobs on the same cluster
 PageRank is best done with MPI
TODO: Faster Map-Reduce
[Diagram: map → Mapper → Sender → Receiver (merge + sort) → Merge → Reduce, with parallel flows R1, R2, R3, …]
Mapper calls user functions: Map and Partition.
Sender does flow control.
Receiver merges N flows into 1, calls the user function Compare to sort, dumps buffers to disk, and does checkpointing.
Reducer calls user functions: Compare and Reduce.
MapReduce and Hadoop
Distributed File System
B. Ramamurthy & K. Madurai, CCSCNE 2009, Plattsburgh, April 24, 2009
The Context: Big-data
 Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
 Google collects 270PB data in a month (2007), 20000PB a day (2008)
 2010 census data is expected to be a huge gold mine of information
 Data mining huge amounts of data collected in a wide range of domains
from astronomy to healthcare has become essential for planning and
performance.
 We are in a knowledge economy.
 Data is an important asset to any organization
 Discovery of knowledge; Enabling discovery; annotation of data
 We are looking at newer
 programming models, and
 Supporting algorithms and data structures.
 NSF refers to it as “data-intensive computing” and industry calls it “big data”
and “cloud computing”
Purpose of this talk
 To provide a simple introduction to:
 “The big-data computing” : An important
advancement that has a potential to impact
significantly the CS and undergraduate curriculum.
 A programming model called MapReduce for
processing “big-data”
 A supporting file system called Hadoop Distributed
File System (HDFS)
 To encourage educators to explore ways to infuse
relevant concepts of this emerging area into their
curriculum.
The Outline
 Introduction to MapReduce
 From CS Foundation to MapReduce
 MapReduce programming model
 Hadoop Distributed File System
 Relevance to Undergraduate Curriculum
 Demo (Internet access needed)
 Our experience with the framework
 Summary
 References
MapReduce
What is MapReduce?
 MapReduce is a programming model Google has used
successfully in processing its “big-data” sets (~20000
petabytes per day)
 Users specify the computation in terms of a map and a
reduce function,
 Underlying runtime system automatically parallelizes the
computation across large-scale clusters of machines, and
 Underlying system also handles machine failures,
efficient communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce:
simplified data processing on large clusters. Communications of
the ACM 51, 1 (Jan. 2008), 107-113.
From CS Foundations to MapReduce
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web,
green,…}
Problem: Count the occurrences of the different words
in the collection.
Let’s design a solution for this problem:
 We will start from scratch
 We will add and relax constraints
 We will do incremental design, improving the solution for
performance and scalability
Word Counter and Result Table
[Diagram: Main drives a single WordCounter (parse(), count()) over the data collection {web, weed, green, sun, moon, land, part, web, green, …}, producing the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Multiple Instances of Word Counter
[Diagram: Main spawns 1..* Threads, each a WordCounter (parse(), count()) sharing the DataCollection and the ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
Observe: multi-threaded; requires a lock on shared data.
Improve Word Counter for Performance
[Diagram: Parsers (1..*) read the DataCollection and fill a shared KEY/VALUE WordList (web, weed, green, sun, moon, land, part, web, green, …); Counters (1..*) consume it and write the ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
No need for a lock: separate counters.
Peta-scale Data
[Diagram: the same parser/counter design, now asked to handle a peta-scale data collection.]
Addressing the Scale Issue
 Single machine cannot serve all the data: you need a distributed
special (file) system
 Large number of commodity hardware disks: say, 1000 disks 1TB
each
 Issue: with an MTBF-derived failure rate of 1/1000 per disk, on average at
least 1 of those 1000 disks is down at any given time.
 Thus failure is the norm, not an exception.
 File system has to be fault-tolerant: replication, checksum
 Data transfer bandwidth is critical (location of data)
 Critical aspects: fault tolerance + replication + load balancing,
monitoring
 Exploit parallelism afforded by splitting parsing and counting
 Provision and locate computing at data locations
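The failure arithmetic above can be checked with two lines of back-of-the-envelope Python:

```python
# 1000 commodity disks, each down with probability 1/1000 at any instant
n, p = 1000, 1.0 / 1000

expected_down = n * p                # expected number of disks down at once
p_at_least_one = 1 - (1 - p) ** n    # chance that at least 1 disk is down

print(expected_down)                 # 1.0 disk down on average
print(round(p_at_least_one, 3))
```

So even with very reliable disks, some disk is almost always down, which is why replication and checksums are built into the file system.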
Peta Scale Data is Commonly Distributed
[Diagram: the same parser/counter design, but the data collection is now split across many distributed collections. Issue: managing the large scale data.]
Write Once Read Many (WORM) data
[Diagram: the same parser/counter design over the distributed data collections; the data is written once and read many times.]
WORM Data is Amenable to Parallelism
[Diagram: the parser/counter design over the distributed data collections.]
1. Data with WORM characteristics: yields to parallel processing
2. Data without dependencies: yields to out-of-order processing
Divide and Conquer: Provision Computing at Data Location
[Diagram: four copies of the Main/Parser/Counter pipeline, one per node, each provisioned at its own data collection.]
For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
This is a particular solution; let’s generalize it:
Our parse is a mapping operation: MAP: input → <key, value> pairs
Our count is a reduce operation: REDUCE: <key, value> pairs reduced
Map/Reduce originated from Lisp, but have a different meaning here.
The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
Mapper and Reducer
Remember: MapReduce is simplified processing for large data sets.
[Slide shows the MapReduce version of the WordCount source code.]
Map Operation
MAP: input data → <key, value> pair
[Diagram: the data collection is split (split 1 … split n) to supply multiple processors; each split feeds a Map task, and every Map task emits a KEY/VALUE list of <word, 1> pairs: web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …]
Reduce Operation
MAP: input data → <key, value> pair
REDUCE: <key, value> pair → <result>
[Diagram: the splits feed Map tasks as before; each Map output goes through a Parse-hash step that routes keys to the right reducer; the Reduce (here, Count) tasks produce output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3).]
MapReduce Example in my operating systems class
[Diagram: a terabyte-size input of words (cat, bat, dog, other words) goes through split → map → combine → reduce, producing output partitions part0, part1, part2.]
MapReduce Programming
Model
MapReduce programming model
 Determine if the problem is parallelizable and solvable using
MapReduce (ex: Is the data WORM?, large data set).
 Design and implement solution as Mapper classes and
Reducer class.
 Compile the source code with hadoop core.
 Package the code as jar executable.
 Configure the application (job) as to the number of mappers
and reducers (tasks), input and output streams
 Load the data (or use it on previously available data)
 Launch the job and monitor.
 Study the result.
 Detailed steps.
MapReduce Characteristics
 Very large scale data: peta, exa bytes
 Write once and read many data: allows for parallelism without
mutexes
 Map and Reduce are the main operations: simple code
 There are other supporting operations such as combine and
partition (out of the scope of this talk).
 All the map should be completed before reduce operation starts.
 Map and reduce operations are typically performed by the same
physical processor.
 Number of map tasks and reduce tasks are configurable.
 Operations are provisioned near the data.
 Commodity hardware and storage.
 Runtime takes care of splitting and moving data for operations.
 Special distributed file system. Example: Hadoop Distributed File
System and Hadoop Runtime.
Classes of problems “mapreducable”
 Benchmark for comparing: Jim Gray’s challenge on data-
intensive computing. Ex: “Sort”
 Google uses it (we think) for wordcount, adwords, pagerank,
indexing data.
 Simple algorithms such as grep, text-indexing, reverse
indexing
 Bayesian classification: data mining domain
 Facebook uses it for various operations: demographics
 Financial services use it for analytics
 Astronomy: Gaussian analysis for locating extra-terrestrial
objects.
 Expected to play a critical role in semantic web and web3.0
Scope of MapReduce
[Diagram, ordered from small to large data size: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level).]
Hadoop
What is Hadoop?
 At Google MapReduce operation are run on a special
file system called Google File System (GFS) that is
highly optimized for this purpose.
 GFS is not open source.
 Doug Cutting and Yahoo! reverse engineered the
GFS and called it Hadoop Distributed File System
(HDFS).
 The software framework that supports HDFS,
MapReduce and other related entities is called the
project Hadoop or simply Hadoop.
 This is open source and distributed by Apache.
Basic Features: HDFS
 Highly fault-tolerant
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware
Hadoop Distributed File System
[Diagram: an application talks to the HDFS Client; the local file system uses a small block size (2K), while the HDFS Server stores large (128M), replicated blocks; the master node runs the Name Node.]
More details: We discuss this in great detail in my Operating Systems course.
Hadoop Distributed File System (continued)
[Same diagram, adding the heartbeat and blockmap messages sent to the Name Node.]
Relevance and Impact on Undergraduate courses
 Data structures and algorithms: a new look at traditional
algorithms such as sort: Quicksort may not be your
choice! It is not easily parallelizable. Merge sort is better.
 You can identify mappers and reducers among your
algorithms. Mappers and reducers are simply place
holders for algorithms relevant for your applications.
 Large scale data and analytics are indeed concepts to
reckon with similar to how we addressed “programming
in the large” by OO concepts.
 While a full course on MR/HDFS may not be warranted,
the concepts perhaps can be woven into most courses in
our CS curriculum.
Demo
 VMware simulated Hadoop and MapReduce demo
 Remote access to NEXOS system at my Buffalo office
 5-node HDFS running HDFS on Ubuntu 8.04
 1 name node and 4 data nodes
 Each is an old commodity PC with 512 MB RAM,
120GB – 160GB external memory
 Zeus (namenode), datanodes: hermes, dionysus,
aphrodite, athena
Summary
 We introduced MapReduce programming model for
processing large scale data
 We discussed the supporting Hadoop Distributed
File System
 The concepts were illustrated using a simple example
 We reviewed some important parts of the source
code for the example.
 Relationship to Cloud Computing
References
1. Apache Hadoop Tutorial: http://hadoop.apache.org and
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing
on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball:
http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html
Hive - SQL on top of Hadoop
Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: Combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Green Plum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-bys on top of
map-reduce
• Allow users to access Hive data without using
Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Hive Architecture
[Diagram components: Hive CLI (DDL, Queries, Browsing) and a Web UI (Mgmt, etc.); Hive QL Parser, Planner, and Execution; a Thrift API; SerDe libraries (Thrift, Jute, JSON); the MetaStore; Map Reduce and HDFS underneath.]
Hive QL – Join
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:
| pageid | userid | time    |
| 1      | 111    | 9:08:01 |
| 2      | 111    | 9:08:13 |
| 1      | 222    | 9:08:14 |

user:
| userid | age | gender |
| 111    | 25  | female |
| 222    | 32  | male   |

page_view X user = pv_users:
| pageid | age |
| 1      | 25  |
| 2      | 25  |
| 1      | 32  |
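Hive compiles this join into a map/reduce job; the reduce-side join idea can be simulated in a few lines of Python (a toy sketch, not Hive code):

```python
from collections import defaultdict

def join_pv_users(page_view, user):
    # Map: tag each row with its source table and key it by userid
    groups = defaultdict(list)
    for pageid, userid, _time in page_view:
        groups[userid].append(("pv", pageid))
    for userid, age, _gender in user:
        groups[userid].append(("u", age))
    # Reduce: within each userid group, cross page_view rows with user rows
    out = []
    for userid, rows in sorted(groups.items()):
        pageids = [v for tag, v in rows if tag == "pv"]
        ages = [v for tag, v in rows if tag == "u"]
        out.extend((pageid, age) for pageid in pageids for age in ages)
    return out

pv_users = join_pv_users(
    [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")],
    [(111, 25, "female"), (222, 32, "male")],
)
```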
Hive QL – Join in Map Reduce
page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
user (userid, age, gender): (111, 25, female), (222, 32, male)
Map tags each row with its table (1 = page_view, 2 = user) and keys it by userid:
page_view → key 111: <1,1>; key 111: <1,2>; key 222: <1,1>
user → key 111: <2,25>; key 222: <2,32>
Shuffle and Sort group by userid:
key 111: <1,1>, <1,2>, <2,25>
key 222: <1,1>, <2,32>
Reduce crosses the tagged rows in each group → pv_users:
| pageid | age |
| 1      | 25  |
| 2      | 25  |
| 1      | 32  |
Hive QL – Group By
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:
| pageid | age |
| 1      | 25  |
| 2      | 25  |
| 1      | 32  |
| 2      | 25  |

pageid_age_sum:
| pageid | age | count |
| 1      | 25  | 1     |
| 2      | 25  | 2     |
| 1      | 32  | 1     |
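The same aggregation, simulated as map/sort/reduce in plain Python (a toy sketch, not Hive code):

```python
from itertools import groupby

def group_by_count(pv_users):
    # Map: emit <<pageid, age>, 1>; Shuffle & Sort: order pairs by key
    pairs = sorted(((pageid, age), 1) for pageid, age in pv_users)
    # Reduce: add up the 1s within each <pageid, age> group
    return [(key, sum(v for _, v in grp))
            for key, grp in groupby(pairs, key=lambda kv: kv[0])]

pageid_age_sum = group_by_count([(1, 25), (2, 25), (1, 32), (2, 25)])
```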
Hive QL – Group By in Map Reduce
pv_users splits:
split 1: (1, 25), (2, 25)
split 2: (1, 32), (2, 25)
Map emits <<pageid, age>, 1>:
split 1 → <1,25>: 1; <2,25>: 1
split 2 → <1,32>: 1; <2,25>: 1
Shuffle and Sort group equal keys:
reducer 1: <1,25>: 1; <1,32>: 1
reducer 2: <2,25>: 1; <2,25>: 1
Reduce sums per key → pageid_age_sum:
| pageid | age | count |
| 1      | 25  | 1     |
| 1      | 32  | 1     |
| 2      | 25  | 2     |
Hive QL – Group By with Distinct
• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;

page_view:
| pageid | userid | time    |
| 1      | 111    | 9:08:01 |
| 2      | 111    | 9:08:13 |
| 1      | 222    | 9:08:14 |
| 2      | 111    | 9:08:20 |

result:
| pageid | count_distinct_userid |
| 1      | 2                     |
| 2      | 1                     |
Hive QL – Group By with Distinct in Map Reduce
page_view splits:
split 1: (1, 111, 9:08:01), (2, 111, 9:08:13)
split 2: (1, 222, 9:08:14), (2, 111, 9:08:20)
Map emits the key <pageid, userid>; Shuffle and Sort route all keys with the same pageid prefix to the same reducer:
reducer 1 keys: <1,111>, <1,222>
reducer 2 keys: <2,111>, <2,111>
Reduce counts the distinct userids per pageid: pageid 1 → count 2; pageid 2 → count 1.
Shuffle key is a prefix of the sort key.
Hive QL: Order By
page_view splits:
split 1: (2, 111, 9:08:13), (1, 111, 9:08:01)
split 2: (2, 111, 9:08:20), (1, 222, 9:08:14)
Map emits <<pageid, userid>, time>, and rows are shuffled randomly across reducers:
reducer 1: <1,111> → 9:08:01; <2,111> → 9:08:13
reducer 2: <1,222> → 9:08:14; <2,111> → 9:08:20
Each reducer sorts locally and outputs its ordered partition:
partition 1: (1, 111, 9:08:01), (2, 111, 9:08:13)
partition 2: (1, 222, 9:08:14), (2, 111, 9:08:20)
Shuffle randomly.
Hive Optimizations
Efficient Execution of SQL on top of Map-Reduce
(Simplified) Map Reduce Revisit
[Diagram: Machine 1 holds <k1,v1>, <k2,v2>, <k3,v3>; Machine 2 holds <k4,v4>, <k5,v5>, <k6,v6>.]
Local Map: each machine maps its pairs to new pairs <nk1,nv1>, <nk2,nv2>, <nk3,nv3>, <nk2,nv4>, <nk2,nv5>, <nk1,nv6>.
Global Shuffle: pairs with the same new key are routed to the same machine (<nk2,nv4>, <nk2,nv5>, <nk2,nv2> together; <nk1,nv1>, <nk3,nv3>, <nk1,nv6> together).
Local Sort: each machine sorts its pairs by key.
Local Reduce: values are combined per key: <nk2,3>, <nk1,2>, <nk3,1>.
Merge Sequential Map Reduce Jobs
• SQL:
– FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …

A: | key | av  |   B: | key | bv  |   C: | key | cv  |
   | 1   | 111 |      | 1   | 222 |      | 1   | 333 |

Map Reduce job 1: A join B → AB: | key | av | bv | = (1, 111, 222)
Map Reduce job 2: AB join C → ABC: | key | av | bv | cv | = (1, 111, 222, 333)
Share Common Read Operations
• Extended SQL:
FROM pv_users
INSERT INTO TABLE pv_pageid_sum
SELECT pageid, count(1)
GROUP BY pageid
INSERT INTO TABLE pv_age_sum
SELECT age, count(1)
GROUP BY age;

pv_users:
| pageid | age |
| 1      | 25  |
| 2      | 32  |

Map Reduce job 1 → pv_pageid_sum: pageid 1 → count 1; pageid 2 → count 1
Map Reduce job 2 → pv_age_sum: age 25 → count 1; age 32 → count 1
Both jobs read the same pv_users input, so the scan is shared.
Load Balance Problem
pv_users:
| pageid | age |
| 1      | 25  |
| 1      | 25  |
| 1      | 25  |
| 2      | 32  |
| 1      | 25  |
A single Map-Reduce produces pageid_age_sum:
| pageid | age | count |
| 1      | 25  | 4     |
| 2      | 32  | 1     |
Alternatively, a first Map-Reduce produces pageid_age_partial_sum:
| pageid | age | count |
| 1      | 25  | 2     |
| 2      | 32  | 1     |
| 1      | 25  | 2     |
and a second Map-Reduce merges the partial sums.
Map-side Aggregation / Combiner
[Diagram: Machine 1 holds <k1,v1>, <k2,v2>, <k3,v3>; Machine 2 holds <k4,v4>, <k5,v5>, <k6,v6>.]
Local Map with aggregation: each machine emits one partial sum per key, e.g. <male, 343>, <female, 128> and <male, 123>, <female, 244>.
Global Shuffle: equal keys are routed to the same reducer (<female, 128>, <female, 244>; <male, 343>, <male, 123>).
Local Sort orders each reducer’s input.
Local Reduce: <female, 372>, <male, 466>.
Query Rewrite
• Predicate Push-down
– select * from (select * from t) where col1 = '2008';
• Column Pruning
– select col1, col3 from (select * from t);
TODO: Column-based Storage and Map-side Join
| url          | page quality | IP        |
| http://a.com/ | 90           | 65.1.2.3  |
| http://b.com/ | 20           | 68.9.0.81 |
| http://c.com/ | 68           | 11.3.85.1 |

| url          | clicked | viewed |
| http://a.com/ | 12      | 145    |
| http://b.com/ | 45      | 383    |
| http://c.com/ | 23      | 67     |
MetaStore
• Stores Table/Partition properties:
– Table schema and SerDe library
– Table Location on HDFS
– Logical Partitioning keys and types
– Other information
• Thrift API
– Current clients in PHP (Web Interface), Python (old CLI),
Java (Query Engine and CLI), Perl (Tests)
• Metadata can be stored as text files or even in a
SQL backend
Hive CLI
• DDL:
– create table/drop table/rename table
– alter table add column
• Browsing:
– show tables
– describe table
– cat table
• Loading Data
• Queries
Web UI for Hive
• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Support projection, filtering, group by and joining
– Also support
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)
• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By
Hive QL – Custom Map/Reduce Scripts
• Extended SQL:
  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script' AS (dt, uid)
    CLUSTER BY dt) map
  INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script' AS (date, count);
• Map-Reduce: similar to hadoop streaming
  • 2. Theme of this Course Large-Scale Data Management Big Data Analytics Data Science and Analytics • How to manage very large amounts of data and extract value and knowledge from them 2
  • 3. Introduction to Big Data What is Big Data? What makes data, “Big” Data? 3
  • 4. Big Data Definition • No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 4
  • 5. Characteristics of Big Data: 1-Scale (Volume) • Data Volume • 44x increase from 2009 to 2020 • From 0.8 zettabytes to 35 zettabytes (ZB) • Data volume is increasing exponentially 5 Exponential increase in collected/generated data
  • 6. Characteristics of Big Data: 2-Complexity (Variety) • Various formats, types, and structures • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc… • Static data vs. streaming data • A single application can be generating/collecting many types of data 6
  • 7. Characteristics of Big Data: 3-Speed (Velocity) • Data is being generated fast and needs to be processed fast • Online Data Analytics • Late decisions → missing opportunities • Examples • E-Promotions: Based on your current location, your purchase history, what you like → send promotions right now for the store next to you • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction 7
  • 9. Some Make it 4V’s 9
  • 10. Harnessing Big Data • OLTP: Online Transaction Processing (DBMSs) • OLAP: Online Analytical Processing (Data Warehousing) • RTAP: Real-Time Analytics Processing (Big Data Architecture & technology) 10
  • 11. Who’s Generating Big Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) • The progress and innovation is no longer hindered by the ability to collect data • But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion 11
  • 12. The Model Has Changed… • The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 12
  • 13. What’s driving Big Data - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time 13
  • 14. Value of Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps • Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps 14
  • 15. Challenges in Handling Big Data • The Bottleneck is in technology • New architecture, algorithms, techniques are needed • Also in technical skills • Experts in using the new technology and dealing with big data 15
  • 16. What Technology Do We Have For Big Data ?? 16
  • 19. What You Will Learn… • We focus on Hadoop/MapReduce technology • Learn the platform (how it is designed and works) • How big data are managed in a scalable, efficient way • Learn writing Hadoop jobs in different languages • Programming Languages: Java, C, Python • High-Level Languages: Apache Pig, Hive • Learn advanced analytics tools on top of Hadoop • RHadoop: Statistical tools for managing big data • Mahout: Data mining and machine learning tools over big data • Learn state-of-art technology from recent research papers • Optimizations, indexing techniques, and other extensions to Hadoop 19
  • 21. Course Logistics • Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/ • Electronic WPI system: blackboard.wpi.edu • Lectures • Tuesday, Thursday: (4:00pm - 5:20pm) 21
  • 22. Textbook & Reading List • No specific textbook • Big Data is a relatively new topic (so no fixed syllabus) • Reading List • We will cover the state-of-art technology from research papers in big conferences • Many Hadoop-related papers are available on the course website • Related books: • Hadoop, The Definitive Guide [pdf] 22
  • 23. Requirements & Grading • Seminar-Type Course • Students will read research papers and present them (Reading List) • Hands-on Course • No written homework or exams • Several coding projects covering the entire semester • Done in teams of two 23
  • 24. Requirements & Grading (Cont’d) • Reviews • When a team is presenting (not the instructor), the other students should prepare a review on the presented paper • Course website gives guidelines on how to make good reviews • Reviews are done individually 24
  • 25. Late Submission Policy • For Projects • One-day late → 10% off the max grade • Two-day late → 20% off the max grade • Three-day late → 30% off the max grade • Beyond that, no late submission is accepted • Submissions: • Submitted via blackboard system by the due date • Demonstrated to the instructor within the week after • For Reviews • No late submissions • Student may skip at most 4 reviews • Submissions: • Given to the instructor at the beginning of class 25
  • 26. More about Projects • A virtual machine is created including the needed platform for the projects • Ubuntu OS (Version 12.10) • Hadoop platform (Version 1.1.0) • Apache Pig (Version 0.10.0) • Mahout library (Version 0.7) • Rhadoop • In addition to other software packages • Download it from the course website (link) • Username and password will be sent to you • Need Virtual Box (Vbox) [free] 26
  • 27. Next Step from You… 1. Form teams of two 2. Visit the course website (Reading List), each team selects its first paper to present (1st come 1st served) • Send me your top 2/3 choices 3. You have until Jan 20th • Otherwise, I’ll randomly form teams and assign papers 4. Use Blackboard “Discussion” forum for posts or for searching for teammates 27
  • 28. Course Output: What You Will Learn… • We focus on Hadoop/MapReduce technology • Learn the platform (how it is designed and works) • How big data are managed in a scalable, efficient way • Learn writing Hadoop jobs in different languages • Programming Languages: Java, C, Python • High-Level Languages: Apache Pig, Hive • Learn advanced analytics tools on top of Hadoop • RHadoop: Statistical tools for managing big data • Mahout: Analytics and data mining tools over big data • Learn state-of-art technology from recent research papers • Optimizations, indexing techniques, and other extensions to Hadoop 28
  • 29. Open Source World’s Solution  Google File System – Hadoop Distributed FS  Map-Reduce – Hadoop Map-Reduce  Sawzall – Pig, Hive, JAQL  Big Table – Hadoop HBase, Cassandra  Chubby – Zookeeper
  • 30. Simplified Search Engine Architecture  Components: Internet, Spider, Search Log, Storage, Batch Processing System on top of Hadoop, Runtime, SE Web Server
  • 31. Simplified Data Warehouse Architecture  Components: Web Server, View/Click/Events Log, Storage, Batch Processing System on top of Hadoop, Database, Business Intelligence, Domain Knowledge
  • 32. Hadoop History  Jan 2006 – Doug Cutting joins Yahoo  Feb 2006 – Hadoop splits out of Nutch and Yahoo starts using it.  Dec 2006 – Yahoo creating 100-node Webmap with Hadoop  Apr 2007 – Yahoo on 1000-node cluster  Dec 2007 – Yahoo creating 1000-node Webmap with Hadoop  Jan 2008 – Hadoop made a top-level Apache project  Sep 2008 – Hive added to Hadoop as a contrib project
  • 33. Hadoop Introduction  Open Source Apache Project  http://hadoop.apache.org/  Book: http://oreilly.com/catalog/9780596521998/index.html  Written in Java  Does work with other languages  Runs on  Linux, Windows and more  Commodity hardware with high failure rate
  • 34. Current Status of Hadoop  Largest Cluster  2000 nodes (8 cores, 4TB disk)  Used by 40+ companies / universities around the world  Yahoo, Facebook, etc  Cloud Computing Donation from Google and IBM  Startup focusing on providing services for Hadoop  Cloudera
  • 35. Hadoop Components  Hadoop Distributed File System (HDFS)  Hadoop Map-Reduce  Contrib projects  Hadoop Streaming  Pig / JAQL / Hive  HBase  Hama / Mahout
  • 37. Goals of HDFS  Very Large Distributed File System  10K nodes, 100 million files, 10 PB  Convenient Cluster Management  Load balancing  Node failures  Cluster expansion  Optimized for Batch Processing  Allow moving computation to data  Maximize throughput
  • 39. HDFS Details  Data Coherency  Write-once-read-many access model  Client can only append to existing files  Files are broken up into blocks  Typically 128 MB block size  Each block replicated on multiple DataNodes  Intelligent Client  Client can find location of blocks  Client accesses data directly from DataNode
  • 41. HDFS User Interface  Java API  Command Line  hadoop dfs -mkdir /foodir  hadoop dfs -cat /foodir/myfile.txt  hadoop dfs -rm /foodir/myfile.txt  hadoop dfsadmin -report  hadoop dfsadmin -decommission datanodename  Web Interface  http://host:port/dfshealth.jsp
  • 42. More about HDFS  http://hadoop.apache.org/core/docs/current/hdfs_design.html  Hadoop FileSystem API  HDFS  Local File System  Kosmos File System (KFS)  Amazon S3 File System
  • 44. Hadoop Map-Reduce Introduction  Map/Reduce works like a parallel Unix pipeline:  cat input | grep | sort | uniq -c | cat > output  Input | Map | Shuffle & Sort | Reduce | Output  Framework does inter-node communication  Failure recovery, consistency etc  Load balancing, scalability etc  Fits a lot of batch processing applications  Log processing  Web index building
  • 46. (Simplified) Map Reduce Review  Input: Machine 1: <k1, v1> <k2, v2> <k3, v3>; Machine 2: <k4, v4> <k5, v5> <k6, v6>  Local Map: <nk1, nv1> <nk2, nv2> <nk3, nv3> and <nk2, nv4> <nk2, nv5> <nk1, nv6>  Global Shuffle: <nk2, nv4> <nk2, nv5> <nk2, nv2> and <nk1, nv1> <nk3, nv3> <nk1, nv6>  Local Sort: <nk1, nv1> <nk1, nv6> <nk3, nv3> and <nk2, nv4> <nk2, nv5> <nk2, nv2>  Local Reduce: <nk2, 3> <nk1, 2> <nk3, 1>
  • 49. Hadoop Streaming  Allow to write Map and Reduce functions in any languages  Hadoop Map/Reduce only accepts Java  Example: Word Count  hadoop streaming -input /user/zshao/articles -mapper 'tr " " "\n"' -reducer 'uniq -c' -output /user/zshao/ -numReduceTasks 32
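The streaming pipeline above (`tr` as the mapper, `uniq -c` as the reducer over sorted input) can be mimicked in Python to see why it computes a word count (a toy re-implementation; the function names are made up):

```python
def streaming_mapper(line):
    """Stands in for the `tr " " "\\n"` mapper: emit one word per line."""
    return line.split()

def streaming_reducer(sorted_words):
    """Stands in for `uniq -c`: count runs of identical adjacent words."""
    counts = []
    for word in sorted_words:
        if counts and counts[-1][0] == word:
            counts[-1][1] += 1
        else:
            counts.append([word, 1])
    return counts

lines = ["hello world", "hello hadoop"]
# The framework sorts the mapper output before the reducer sees it.
words = sorted(w for line in lines for w in streaming_mapper(line))
result = streaming_reducer(words)
```

The sort step is what makes the run-length counting reducer correct: equal words are guaranteed to be adjacent.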
  • 50. Example: Log Processing  Generate #pageview and #distinct users for each page each day  Input: timestamp url userid  Generate the number of page views  Map: emit < <date(timestamp), url>, 1>  Reduce: add up the values for each row  Generate the number of distinct users  Map: emit < <date(timestamp), url, userid>, 1>  Reduce: For the set of rows with the same <date(timestamp), url>, count the number of distinct users by "uniq -c"
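A Python sketch of the two jobs on a made-up log (illustrative data only): the first sums raw views per (date, url); the second dedups on (date, url, userid) before counting, which is what grouping the emitted keys achieves.

```python
from collections import defaultdict

# Made-up log rows: (date(timestamp), url, userid).
log = [("2009-01-01", "/a", "u1"), ("2009-01-01", "/a", "u1"),
       ("2009-01-01", "/a", "u2"), ("2009-01-01", "/b", "u1")]

# Job 1: map emits <<date, url>, 1>; reduce adds the values up.
views = defaultdict(int)
for date, url, user in log:
    views[(date, url)] += 1

# Job 2: map emits <<date, url, userid>, 1>; after the shuffle, duplicate
# keys group together, so counting the distinct keys per <date, url> is
# exactly the `uniq -c` step the slide describes.
distinct = defaultdict(int)
for date, url, user in {(d, u, uid) for d, u, uid in log}:
    distinct[(date, url)] += 1
```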
  • 51. Example: Page Rank  In each Map/Reduce Job:  Map: emit <link, eigenvalue(url)/#links> for each input: <url, <eigenvalue, vector<link>> >  Reduce: add all values up for each link, to generate the new eigenvalue for that link.  Run 50 map/reduce jobs till the eigenvalues are stable.
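One such Map/Reduce round can be sketched in Python (a toy power iteration on a three-page graph, without damping; `pagerank_step` is an illustrative name): the map phase emits each page's rank divided among its out-links, and the reduce phase sums the contributions per link.

```python
def pagerank_step(graph, rank):
    """One Map/Reduce job: map splits each page's rank over its out-links,
    reduce adds the incoming shares per link."""
    contributions = []                      # Map: emit <link, rank(url)/#links>
    for url, links in graph.items():
        for link in links:
            contributions.append((link, rank[url] / len(links)))
    new_rank = {url: 0.0 for url in graph}  # Reduce: add all values up per link
    for link, share in contributions:
        new_rank[link] += share
    return new_rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {url: 1.0 for url in graph}
for _ in range(50):                         # run jobs until the values are stable
    rank = pagerank_step(graph, rank)
```

Since every page distributes its full rank each round, the total mass is conserved, and on this small graph 50 rounds are more than enough for the values to settle.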
  • 52. TODO: Split Job Scheduler and Map-Reduce  Allow easy plug-in of different scheduling algorithms  Scheduling based on job priority, size, etc  Scheduling for CPU, disk, memory, network bandwidth  Preemptive scheduling  Allow to run MPI or other jobs on the same cluster  PageRank is best done with MPI
  • 53. TODO: Faster Map-Reduce  Mapper calls user functions: Map and Partition  Sender does flow control  Receiver merges N flows into 1, calls user function Compare to sort, dumps buffer to disk, and does checkpointing  Reducer calls user functions: Compare and Reduce
  • 54. MapReduce and Hadoop Distributed File System B.Ramamurthy & K.Madurai 54 CCSCNE 2009 Plattsburgh, April 24 2009
  • 55. The Context: Big-data  Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)  Google collects 270PB data in a month (2007), 20000PB a day (2008)  2010 census data is expected to be a huge gold mine of information  Data mining huge amounts of data collected in a wide range of domains from astronomy to healthcare has become essential for planning and performance.  We are in a knowledge economy.  Data is an important asset to any organization  Discovery of knowledge; Enabling discovery; annotation of data  We are looking at newer  programming models, and  Supporting algorithms and data structures.  NSF refers to it as “data-intensive computing” and industry calls it “big-data” and “cloud computing”
  • 56. Purpose of this talk  To provide a simple introduction to:  “The big-data computing”: An important advancement that has a potential to impact significantly the CS and undergraduate curriculum.  A programming model called MapReduce for processing “big-data”  A supporting file system called Hadoop Distributed File System (HDFS)  To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum.
  • 57. The Outline  Introduction to MapReduce  From CS Foundation to MapReduce  MapReduce programming model  Hadoop Distributed File System  Relevance to Undergraduate Curriculum  Demo (Internet access needed)  Our experience with the framework  Summary  References
  • 58. MapReduce
  • 59. What is MapReduce?  MapReduce is a programming model Google has used successfully in processing its “big-data” sets (~ 20000 peta bytes per day)  Users specify the computation in terms of a map and a reduce function,  Underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and  Underlying system also handles machine failures, efficient communications, and performance issues. -- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
  • 60. From CS Foundations to MapReduce Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green,…} Problem: Count the occurrences of the different words in the collection. Let’s design a solution for this problem;  We will start from scratch  We will add and relax constraints  We will do incremental design, improving the solution for performance and scalability
  • 61. Word Counter and Result Table  Data collection: {web, weed, green, sun, moon, land, part, web, green,…}  A single WordCounter with parse( ) and count( ) fills the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
  • 62. Multiple Instances of Word Counter  One WordCounter per thread over the shared data collection  Observe: Multi-thread; Lock on shared data  Result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
  • 63. Improve Word Counter for Performance  Split the work into Parser and Counter threads connected by a WordList of KEY/VALUE pairs  No need for lock  Separate counters  Result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
  • 64. Peta-scale Data  The same Parser/Counter design, now facing peta-scale input
  • 65. Addressing the Scale Issue  Single machine cannot serve all the data: you need a distributed special (file) system  Large number of commodity hardware disks: say, 1000 disks 1TB each  Issue: With Mean time between failures (MTBF) or failure rate of 1/1000, at least 1 of the above 1000 disks would be down at a given time.  Thus failure is the norm and not an exception.  File system has to be fault-tolerant: replication, checksum  Data transfer bandwidth is critical (location of data)  Critical aspects: fault tolerance + replication + load balancing, monitoring  Exploit parallelism afforded by splitting parsing and counting  Provision and locate computing at data locations
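The failure arithmetic on this slide is worth making explicit (a quick back-of-the-envelope check):

```python
# 1000 commodity disks, each with a 1/1000 chance of being down on a given day.
n_disks, failure_rate = 1000, 1.0 / 1000

# Expected number of failed disks at any time: 1000 * (1/1000) = 1.
expected_failures = n_disks * failure_rate

# Probability that at least one disk is down: 1 - (999/1000)^1000, about 63%.
p_at_least_one = 1 - (1 - failure_rate) ** n_disks
```

So on any given day a failure is more likely than not, which is why the slide calls failure the norm rather than the exception.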
  • 66. Peta-scale Data (diagram repeated)
  • 67. Peta Scale Data is Commonly Distributed  Multiple data collections at different sites  Issue: managing the large scale data
  • 68. Write Once Read Many (WORM) data  The distributed data collections are written once and read many times
  • 69. WORM Data is Amenable to Parallelism  1. Data with WORM characteristics: yields to parallel processing  2. Data without dependencies: yields to out of order processing
  • 70. Divide and Conquer: Provision Computing at Data Location  For our example: #1: Schedule parallel parse tasks  #2: Schedule parallel count tasks  This is a particular solution; let’s generalize it:  Our parse is a mapping operation: MAP: input → <key, value> pairs  Our count is a reduce operation: REDUCE: <key, value> pairs reduced  Map/Reduce originated from Lisp but have different meaning here  Runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
  • 71. Mapper and Reducer  Remember: MapReduce is simplified processing for larger data sets  MapReduce version of WordCount source code
  • 72. Map Operation  MAP: Input data → <key, value> pair  Split the data to supply multiple processors (split 1 … split n)  Each split’s Map emits <word, 1> pairs: web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …
  • 73. Reduce Operation  MAP: Input data → <key, value> pair  REDUCE: <key, value> pair → <result>  Each split’s Map output flows into the Reduce tasks
  • 74. Large scale data splits → Map <key, 1> → Parse-hash → Reducers (say, Count) → P-0000 count1, P-0001 count2, P-0002 count3
  • 76. MapReduce Programming Model
  • 77. MapReduce programming model  Determine if the problem is parallelizable and solvable using MapReduce (ex: Is the data WORM?, large data set).  Design and implement solution as Mapper classes and Reducer class.  Compile the source code with hadoop core.  Package the code as jar executable.  Configure the application (job) as to the number of mappers and reducers (tasks), input and output streams  Load the data (or use it on previously available data)  Launch the job and monitor.  Study the result.  Detailed steps.
  • 78. MapReduce Characteristics  Very large scale data: peta, exa bytes  Write once and read many data: allows for parallelism without mutexes  Map and Reduce are the main operations: simple code  There are other supporting operations such as combine and partition (out of the scope of this talk).  All the map should be completed before reduce operation starts.  Map and reduce operations are typically performed by the same physical processor.  Number of map tasks and reduce tasks are configurable.  Operations are provisioned near the data.  Commodity hardware and storage.  Runtime takes care of splitting and moving data for operations.  Special distributed file system. Example: Hadoop Distributed File System and Hadoop Runtime.
  • 79. Classes of problems “mapreducable”  Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”  Google uses it (we think) for wordcount, adwords, pagerank, indexing data.  Simple algorithms such as grep, text-indexing, reverse indexing  Bayesian classification: data mining domain  Facebook uses it for various operations: demographics  Financial services use it for analytics  Astronomy: Gaussian analysis for locating extra-terrestrial objects.  Expected to play a critical role in semantic web and web3.0
  • 80. Scope of MapReduce (data size: small → large)  Pipelined Instruction level  Concurrent Thread level  Service Object level  Indexed File level  Mega Block level  Virtual System Level
  • 81. Hadoop
  • 82. What is Hadoop?  At Google MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.  GFS is not open source.  Doug Cutting and Yahoo! reverse engineered the GFS and called it Hadoop Distributed File System (HDFS).  The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.  This is open source and distributed by Apache.
  • 83. Basic Features: HDFS  Highly fault-tolerant  High throughput  Suitable for applications with large data sets  Streaming access to file system data  Can be built out of commodity hardware
  • 84. Hadoop Distributed File System  Application talks to an HDFS Client alongside the local file system; the Master node runs the Name Nodes  Block size: 128M (vs. 2K in a local file system), replicated  More details: We discuss this in great detail in my Operating Systems course
  • 85. Hadoop Distributed File System  Same picture as before, with heartbeat and blockmap messages added between the HDFS Server and the Name Node
  • 86. Relevance and Impact on Undergraduate courses  Data structures and algorithms: a new look at traditional algorithms such as sort: Quicksort may not be your choice! It is not easily parallelizable. Merge sort is better.  You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant for your applications.  Large scale data and analytics are indeed concepts to reckon with similar to how we addressed “programming in the large” by OO concepts.  While a full course on MR/HDFS may not be warranted, the concepts perhaps can be woven into most courses in our CS curriculum.
  • 87. Demo  VMware simulated Hadoop and MapReduce demo  Remote access to NEXOS system at my Buffalo office  5-node HDFS running HDFS on Ubuntu 8.04  1 name-node and 4 data-nodes  Each is an old commodity PC with 512 MB RAM, 120GB – 160GB external memory  Zeus (namenode), datanodes: hermes, dionysus, aphrodite, athena
  • 88. Summary
    • We introduced the MapReduce programming model for processing large-scale data
    • We discussed the supporting Hadoop Distributed File System
    • The concepts were illustrated using a simple example
    • We reviewed some important parts of the source code for the example
    • Relationship to Cloud Computing
  • 89. References
    1. Apache Hadoop Tutorial: http://hadoop.apache.org and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
    2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
    3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
    4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html
  • 91. Hive - SQL on top of Hadoop
  • 92. Map-Reduce and SQL
    • Map-Reduce is scalable
    • SQL has a huge user base
    • SQL is easy to code
    • Solution: combine SQL and Map-Reduce
      – Hive on top of Hadoop (open source)
      – Aster Data (proprietary)
      – Greenplum (proprietary)
  • 93. Hive
    • A database/data warehouse on top of Hadoop
      – Rich data types (structs, lists, and maps)
      – Efficient implementations of SQL filters, joins, and group-bys on top of map reduce
    • Allows users to access Hive data without using Hive
    • Link: http://svn.apache.org/repos/asf/hadoop/hive/trunk/
  • 94. Hive Architecture
    [Architecture diagram: the Hive CLI handles DDL, queries, and browsing; Hive QL passes through a Parser, Planner, and Execution engine that runs Map Reduce jobs over HDFS; SerDe libraries (Thrift, Jute, JSON) handle serialization; a Thrift API, Web UI (management etc.), and the MetaStore sit alongside.]
  • 95. Hive QL – Join
    • SQL:
        INSERT INTO TABLE pv_users
        SELECT pv.pageid, u.age
        FROM page_view pv JOIN user u ON (pv.userid = u.userid);

        page_view                     user                      pv_users
        pageid userid time            userid age gender         pageid age
        1      111    9:08:01    X    111    25  female    =    1      25
        2      111    9:08:13         222    32  male           2      25
        1      222    9:08:14                                   1      32
  • 96. Hive QL – Join in Map Reduce
    Map: tag each row with its source table and emit userid as the key:
        page_view → key 111: <1,1>, key 111: <1,2>, key 222: <1,1>
        user      → key 111: <2,25>, key 222: <2,32>
    Shuffle/Sort: group values by key:
        key 111: <1,1>, <1,2>, <2,25>
        key 222: <1,1>, <2,32>
    Reduce: join the tagged rows sharing a key into pv_users:
        (pageid 1, age 25), (pageid 2, age 25), (pageid 1, age 32)
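  The reduce-side join above can be sketched in a few lines of Python. This is a toy single-process simulation, not Hive's actual implementation; the table contents are taken from the slide.

  ```python
  from collections import defaultdict

  # Toy reduce-side join: tag each row with its source table,
  # shuffle by userid, then pair page_view rows with user rows.
  page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
  user = [(111, 25, "female"), (222, 32, "male")]

  def map_phase():
      for pageid, userid, _time in page_view:
          yield userid, ("pv", pageid)      # table tag 1 in the slide
      for userid, age, _gender in user:
          yield userid, ("user", age)       # table tag 2 in the slide

  # Shuffle/sort: group all tagged values by key (userid)
  groups = defaultdict(list)
  for key, value in map_phase():
      groups[key].append(value)

  # Reduce: cross-product of pv rows and user rows sharing a userid
  pv_users = []
  for userid, values in groups.items():
      pageids = [v for tag, v in values if tag == "pv"]
      ages = [v for tag, v in values if tag == "user"]
      for pageid in pageids:
          for age in ages:
              pv_users.append((pageid, age))

  print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
  ```

  In a real job each key group is handled by whichever reducer the shuffle assigns it to; the cross-product inside the reducer is why join keys with many matching rows are expensive.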
  • 97. Hive QL – Group By
    • SQL:
        INSERT INTO TABLE pageid_age_sum
        SELECT pageid, age, count(1)
        FROM pv_users
        GROUP BY pageid, age;

        pv_users           pageid_age_sum
        pageid age         pageid age count
        1      25          1      25  1
        2      25          2      25  2
        1      32          1      32  1
        2      25
  • 98. Hive QL – Group By in Map Reduce
    Map: emit <(pageid, age), 1> for each pv_users row:
        mapper 1 (rows (1,25), (2,25)) → <1,25>: 1, <2,25>: 1
        mapper 2 (rows (1,32), (2,25)) → <1,32>: 1, <2,25>: 1
    Shuffle/Sort: group by the (pageid, age) key:
        reducer 1: <1,25>: 1, <1,32>: 1;  reducer 2: <2,25>: 1, <2,25>: 1
    Reduce: sum the counts into pageid_age_sum:
        (1, 25, 1), (1, 32, 1), (2, 25, 2)
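  The same group-by can be simulated in Python. Again this is a toy sketch of the map/shuffle/reduce pattern, not Hive internals; the rows come from the slide.

  ```python
  from collections import defaultdict

  # SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age
  pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]

  # Map: emit ((pageid, age), 1) for each row
  mapped = [((pageid, age), 1) for pageid, age in pv_users]

  # Shuffle/sort: group the 1s by the (pageid, age) key
  groups = defaultdict(list)
  for key, one in mapped:
      groups[key].append(one)

  # Reduce: sum the 1s per group
  pageid_age_sum = {key: sum(ones) for key, ones in groups.items()}
  print(pageid_age_sum)  # {(1, 25): 1, (2, 25): 2, (1, 32): 1}
  ```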
  • 99. Hive QL – Group By with Distinct
    • SQL:
        SELECT pageid, COUNT(DISTINCT userid)
        FROM page_view GROUP BY pageid;

        page_view                result
        pageid userid time       pageid count_distinct_userid
        1      111    9:08:01    1      2
        2      111    9:08:13    2      1
        1      222    9:08:14
        2      111    9:08:20
  • 100. Hive QL – Group By with Distinct in Map Reduce
    Map: emit the key <pageid, userid>:
        mapper 1 → <1,111>, <2,111>;  mapper 2 → <1,222>, <2,111>
    Shuffle/Sort on <pageid, userid>; the shuffle key (pageid) is a prefix of the sort key:
        reducer 1: <1,111>, <1,222>;  reducer 2: <2,111>, <2,111>
    Reduce: each reducer counts distinct userids from its sorted stream:
        (pageid 1, count 2), (pageid 2, count 1)
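  The prefix-key trick can be sketched in Python: because rows arrive at each reducer sorted by (pageid, userid), duplicates are adjacent and distinct userids can be counted in a single pass without buffering. A toy simulation with the slide's rows:

  ```python
  page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"),
               (1, 222, "9:08:14"), (2, 111, "9:08:20")]

  # Map: emit (pageid, userid); sorting stands in for the shuffle/sort,
  # where the shuffle key (pageid) is a prefix of the sort key.
  mapped = sorted((pageid, userid) for pageid, userid, _t in page_view)

  # Reduce: count distinct userids per pageid from the sorted stream
  result = {}
  prev = None
  for pageid, userid in mapped:
      if (pageid, userid) != prev:
          result[pageid] = result.get(pageid, 0) + 1
      prev = (pageid, userid)

  print(result)  # {1: 2, 2: 1}
  ```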
  • 101. Hive QL – Order By
    Input page_view partitions (unsorted):
        (2, 111, 9:08:13), (1, 111, 9:08:01)  and  (2, 111, 9:08:20), (1, 222, 9:08:14)
    Shuffle randomly, then sort on the key <pageid, userid> with time as the value:
        reducer 1: <1,111>: 9:08:01, <2,111>: 9:08:13
        reducer 2: <1,222>: 9:08:14, <2,111>: 9:08:20
    Reduce output (each partition sorted):
        (1, 111, 9:08:01), (2, 111, 9:08:13)  and  (1, 222, 9:08:14), (2, 111, 9:08:20)
  • 102. Hive Optimizations: Efficient Execution of SQL on top of Map-Reduce
  • 103. (Simplified) Map Reduce Revisit
    Machine 1 holds <k1,v1>, <k2,v2>, <k3,v3>; machine 2 holds <k4,v4>, <k5,v5>, <k6,v6>.
    Local Map: machine 1 emits <nk1,nv1>, <nk2,nv2>, <nk3,nv3>; machine 2 emits <nk2,nv4>, <nk2,nv5>, <nk1,nv6>.
    Global Shuffle: all <nk2, …> pairs go to one machine; the <nk1, …> and <nk3, …> pairs go to the other.
    Local Sort: each machine groups equal keys together.
    Local Reduce: one value per key, e.g. counts: <nk1, 2>, <nk2, 3>, <nk3, 1>.
  • 104. Merge Sequential Map Reduce Jobs
    • SQL:
        FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key SELECT …;

        A: (key 1, av 111)   B: (key 1, bv 222)   C: (key 1, cv 333)
        Job 1 (Map Reduce): A join B → AB: (key 1, av 111, bv 222)
        Job 2 (Map Reduce): AB join C → ABC: (key 1, av 111, bv 222, cv 333)
    Because both joins use the same key, the two jobs can be merged into one.
  • 105. Share Common Read Operations
    • Extended SQL (multi-insert):
        FROM pv_users
        INSERT INTO TABLE pv_pageid_sum
          SELECT pageid, count(1) GROUP BY pageid
        INSERT INTO TABLE pv_age_sum
          SELECT age, count(1) GROUP BY age;

        pv_users (pageid, age): (1, 25), (2, 32)
        Map Reduce 1 → pv_pageid_sum: (1, 1), (2, 1)
        Map Reduce 2 → pv_age_sum: (25, 1), (32, 1)
    Both jobs scan the same pv_users input, so the read can be shared.
  • 106. Load Balance Problem
        pv_users (pageid, age): (1, 25), (1, 25), (1, 25), (2, 32), (1, 25)
    A single Map-Reduce sends every (1, 25) row to one reducer, which produces pageid_age_sum: (1, 25, 4), (2, 32, 1); with a skewed key, that reducer becomes the bottleneck.
    A first Map-Reduce can instead compute pageid_age_partial_sum: (1, 25, 2), (2, 32, 1), (1, 25, 2), and a second Map-Reduce sums the partials, spreading the load.
  • 107. Map-side Aggregation / Combiner
    Machine 1 holds <k1,v1>, <k2,v2>, <k3,v3>; machine 2 holds <k4,v4>, <k5,v5>, <k6,v6>.
    Local Map (with combiner): machine 1 emits the partial counts <male, 343>, <female, 128>; machine 2 emits <male, 123>, <female, 244>.
    Global Shuffle: all <female, …> pairs go to one machine, all <male, …> pairs to the other.
    Local Sort and Local Reduce: <female, 372>, <male, 466>.
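  The combiner idea can be sketched in Python: each mapper pre-sums its own partition, so only one partial count per key crosses the network instead of one record per input row. The partial counts below are taken from the slide; this is a toy simulation, not Hadoop's combiner API.

  ```python
  from collections import Counter

  # Per-machine partial counts, as produced by local map + combine
  partials = [
      Counter({"male": 343, "female": 128}),  # machine 1
      Counter({"male": 123, "female": 244}),  # machine 2
  ]

  # Global shuffle + local reduce: merge the partial counts per key
  totals = Counter()
  for partial in partials:
      totals.update(partial)

  print(dict(totals))  # {'male': 466, 'female': 372}
  ```

  This is the same idea as the partial-sum fix for the load-balance problem on the previous slide, pushed down into the map phase.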
  • 108. Query Rewrite
    • Predicate Push-down:
        select * from (select * from t) where col1 = '2008';
    • Column Pruning:
        select col1, col3 from (select * from t);
  • 109. TODO: Column-based Storage and Map-side Join
        url           page quality  IP            url           clicked  viewed
        http://a.com/  90            65.1.2.3      http://a.com/  12       145
        http://b.com/  20            68.9.0.81     http://b.com/  45       383
        http://c.com/  68            11.3.85.1     http://c.com/  23       67
  • 110. MetaStore
    • Stores table/partition properties:
      – Table schema and SerDe library
      – Table location on HDFS
      – Logical partitioning keys and types
      – Other information
    • Thrift API
      – Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
    • Metadata can be stored as text files or even in a SQL backend
  • 111. Hive CLI
    • DDL: create table / drop table / rename table; alter table add column
    • Browsing: show tables; describe table; cat table
    • Loading Data
    • Queries
  • 112. Web UI for Hive
    • MetaStore UI:
      – Browse and navigate all tables in the system
      – Comment on each table and each column
      – Also captures data dependencies
    • HiPal:
      – Interactively construct SQL queries by mouse clicks
      – Supports projection, filtering, group by, and joining
      – Also support
  • 113. Hive Query Language
    • Philosophy: SQL; Map-Reduce with custom scripts (hadoop streaming)
    • Query Operators: projections, equi-joins, group by, sampling, order by
  • 114. Hive QL – Custom Map/Reduce Scripts
    • Extended SQL:
        FROM (
          FROM pv_users
          MAP pv_users.userid, pv_users.date
          USING 'map_script' AS (dt, uid)
          CLUSTER BY dt) map
        INSERT INTO TABLE pv_users_reduced
        REDUCE map.dt, map.uid
        USING 'reduce_script' AS (date, count);
    • Map-Reduce: similar to hadoop streaming
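  What such a script pair might do can be sketched in Python. The slide does not give the contents of 'map_script' or 'reduce_script', so the logic below (counting rows per date) is purely illustrative, and the in-process wiring stands in for the streaming pipes: MAP transforms rows, CLUSTER BY routes and sorts them, REDUCE consumes each sorted group.

  ```python
  # Hypothetical pv_users rows as (date, userid)
  rows = [("2009-04-24", 111), ("2009-04-24", 222), ("2009-04-25", 111)]

  def map_script(userid, date):
      # MAP pv_users.userid, pv_users.date ... AS (dt, uid)
      return date, userid

  mapped = [map_script(uid, dt) for dt, uid in rows]
  mapped.sort()  # CLUSTER BY dt: same-dt rows reach one reducer, sorted

  def reduce_script(records):
      # REDUCE map.dt, map.uid ... AS (date, count)
      out = []
      for dt, _uid in records:
          if out and out[-1][0] == dt:
              out[-1][1] += 1
          else:
              out.append([dt, 1])
      return [(dt, n) for dt, n in out]

  print(reduce_script(mapped))  # [('2009-04-24', 2), ('2009-04-25', 1)]
  ```

  In real Hadoop streaming the two functions would be standalone executables reading tab-separated rows from stdin and writing them to stdout.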