Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

TCloud Computing, Inc.
Hadoop Product Family
and Ecosystem

Agenda
• What is Big Data?
• Big Data Opportunities
• Hadoop
– Introduction to Hadoop
– Hadoop 2.0
– What’s next for Hadoop?
• Hadoop ecosystem
• Conclusion

What is Big Data?
• A set of files
• A database
• A single file

4 V’s of Big Data
http://www.datasciencecentral.com/profiles/blogs/data-veracity

Big Data Expands on 4 Fronts
• Volume: MB, GB, TB, PB
• Velocity: batch, periodic, near real-time, real-time
• Variety
• Veracity
http://whatis.techtarget.com/definition/3Vs

Big Data Opportunities
http://www.sap.com/corporate-en/news.epx?PressID=21316

Big Data Revenue by Market Segment 2012
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Big Data Market Forecast 2012-2017
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Hadoop Solutions
The most common problems Hadoop can solve

Threat Analysis/Trade Surveillance
• Challenge:
– Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved
• Like looking for a needle in a haystack
• Solution with Hadoop:
– Parallel processing over huge datasets
– Pattern recognition to identify anomalies, i.e., threats
• Typical Industry:
– Security, Financial Services

Recommendation Engine
• Challenge:
– Using user data to predict which products to recommend
• Solution with Hadoop:
– Batch processing framework
• Allows execution in parallel over large datasets
– Collaborative filtering
• Collecting ‘taste’ information from many users
• Utilizing information to predict what similar users like
• Typical Industry:
– ISP, Advertising
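
The collaborative filtering idea above can be illustrated on a single machine, without Hadoop. The following is a minimal sketch under stated assumptions: the class name `TasteRecommender`, the method `scoreItems`, and the sample users are all hypothetical, and similarity is simply the number of items two users share. A production recommender would use a more robust similarity measure and run in parallel.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, single-machine sketch of user-based collaborative filtering.
class TasteRecommender {

    // Scores every item the target user has not liked yet by summing, over all
    // other users, the number of liked items that user shares with the target.
    static Map<String, Integer> scoreItems(Map<String, Set<String>> likes, String target) {
        Set<String> mine = likes.getOrDefault(target, new HashSet<>());
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : likes.entrySet()) {
            if (other.getKey().equals(target)) {
                continue;
            }
            Set<String> shared = new HashSet<>(other.getValue());
            shared.retainAll(mine);                 // items both users like
            int similarity = shared.size();
            if (similarity == 0) {
                continue;                           // no shared taste, ignore this user
            }
            for (String item : other.getValue()) {
                if (!mine.contains(item)) {
                    scores.merge(item, similarity, Integer::sum);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> likes = new HashMap<>();
        likes.put("alice", new HashSet<>(Arrays.asList("beer", "diapers")));
        likes.put("bob", new HashSet<>(Arrays.asList("beer", "diapers", "chips")));
        likes.put("carol", new HashSet<>(Arrays.asList("beer", "wine")));
        likes.put("dave", new HashSet<>(Arrays.asList("tea")));
        System.out.println(scoreItems(likes, "alice")); // prints {chips=2, wine=1}
    }
}
```

Items liked by users who overlap more with the target get higher scores, which is the "predict what similar users like" step in miniature.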

Walmart Case
• Friday + Beer + Diapers = Revenue?

http://tech.naver.jp/blog/?p=2412

Introduction to Hadoop
• Apache Hadoop project
– Inspired by Google's MapReduce and Google File System papers
• Open-source, flexible, and available architecture for
large-scale computation and data processing on a
network of commodity hardware
• Open Source Software + Commodity Hardware
– IT Costs Reduction

Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Moving Computation is Cheaper than Moving Data
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
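
The "moving computation is cheaper than moving data" principle can be sketched as a placement decision. This is a minimal illustration, not Hadoop's scheduler: `LocalityPlanner`, `pickNode`, and the node and block names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of locality-aware task placement; not Hadoop's scheduler.
class LocalityPlanner {

    // Returns the node a task for one block should run on: a free node that
    // already stores the block if one exists, otherwise the first free node.
    static String pickNode(List<String> replicaNodes, List<String> freeNodes) {
        for (String node : freeNodes) {
            if (replicaNodes.contains(node)) {
                return node;                       // computation moves to the data
            }
        }
        return freeNodes.isEmpty() ? null : freeNodes.get(0); // data must move instead
    }

    public static void main(String[] args) {
        Map<String, List<String>> blockLocations = new LinkedHashMap<>();
        blockLocations.put("block-1", Arrays.asList("node-a", "node-b"));
        blockLocations.put("block-2", Arrays.asList("node-c"));
        List<String> free = Arrays.asList("node-d", "node-b");
        for (Map.Entry<String, List<String>> e : blockLocations.entrySet()) {
            System.out.println(e.getKey() + " -> " + pickNode(e.getValue(), free));
        }
    }
}
```

Here block-1 is processed on node-b, which already holds a replica, while block-2 falls back to a node that has to fetch the data over the network.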

Hadoop 2.0
• Hadoop 2.2.0 is expected to reach GA (general availability) in fall 2013
• HDFS Federation
• HDFS High Availability (HA)
• Hadoop YARN (MapReduce 2.0)

HDFS Federation - Limitation of Hadoop 1.0
• Scalability
– Storage scales horizontally - namespace doesn’t
• Performance
– File system operations throughput limited by a single node
• Poor isolation
– All the tenants share a single namespace

HDFS Federation
• Multiple independent NameNodes and Namespace
Volumes in a cluster
– Namespace Volume = Namespace + Block Pool
• Block Storage as generic storage service
– Set of blocks for a Namespace Volume is called a Block Pool
– DataNodes (DNs) store blocks for all the Namespace Volumes, with no
partitioning
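
The block pool idea can be modeled in a few lines. The sketch below is an in-memory illustration only, not HDFS code: the class `FederatedDataNode`, its methods, and the pool IDs are hypothetical. It shows one DataNode holding blocks for several Namespace Volumes while keeping each block pool separate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of federated block storage on one DataNode; not HDFS code.
class FederatedDataNode {
    // block pool ID -> IDs of the blocks stored on this DataNode for that pool
    private final Map<String, Set<Long>> blocksByPool = new HashMap<>();

    // Stores a block for whichever Namespace Volume owns the given block pool.
    void storeBlock(String blockPoolId, long blockId) {
        blocksByPool.computeIfAbsent(blockPoolId, k -> new TreeSet<>()).add(blockId);
    }

    // Blocks this DataNode would report to the NameNode that owns the block pool.
    Set<Long> blockReport(String blockPoolId) {
        return blocksByPool.getOrDefault(blockPoolId, new TreeSet<>());
    }

    public static void main(String[] args) {
        FederatedDataNode dn = new FederatedDataNode();
        dn.storeBlock("pool-home", 1L);
        dn.storeBlock("pool-hive", 2L);
        dn.storeBlock("pool-home", 3L);
        System.out.println(dn.blockReport("pool-home")); // prints [1, 3]
        System.out.println(dn.blockReport("pool-hive")); // prints [2]
    }
}
```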

HDFS Federation
• Figure: namespaces in Hadoop compared with Hadoop 2.0 (example paths: /home/, /app/Hive, /app/HBase)
http://hortonworks.com/blog/an-introduction-to-hdfs-federation/

HDFS High Availability (HA)
• The Secondary NameNode is not a standby NameNode; it cannot take over when the NameNode fails
• http://www.youtube.com/watch?v=hEqQMLSXQlY

HDFS High Availability (HA)
https://issues.apache.org/jira/browse/HDFS-1623

Why do we need YARN
• Scalability
– Maximum Cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
• Single point of failure
– Failure kills all queued and running jobs
• Lacks support for alternate paradigms
– Iterative applications implemented using MapReduce are 10x
slower
– Example: K-Means, PageRank

Hadoop YARN
http://hortonworks.com/hadoop/yarn/

Role of YARN
• Resource Manager
– Per-cluster
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
• The Job Tracker's responsibilities are split between the Resource Manager and the Application Master

Hadoop YARN Architecture
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
• Container
– Basic unit of allocation
– Ex. Container A = 2GB, 1CPU
– Fine-grained resource allocation
– Replaces the fixed map/reduce slots
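
To make the container idea concrete, here is a minimal sketch of fine-grained allocation on a single node. It is an illustration under stated assumptions, not a YARN API: the class `NodeResources`, its methods, and the 8 GB / 4 core node size are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of container-based allocation on one node; the class and
// method names are illustrative and are not YARN APIs.
class NodeResources {
    private int freeMemoryMb;
    private int freeCores;
    private final List<String> running = new ArrayList<>();

    NodeResources(int memoryMb, int cores) {
        this.freeMemoryMb = memoryMb;
        this.freeCores = cores;
    }

    // Grants a container only if the node still has enough memory and CPU left.
    boolean allocate(String name, int memoryMb, int cores) {
        if (memoryMb > freeMemoryMb || cores > freeCores) {
            return false;
        }
        freeMemoryMb -= memoryMb;
        freeCores -= cores;
        running.add(name);
        return true;
    }

    int freeMemoryMb() {
        return freeMemoryMb;
    }

    List<String> running() {
        return running;
    }

    public static void main(String[] args) {
        NodeResources node = new NodeResources(8192, 4);
        System.out.println(node.allocate("A", 2048, 1)); // true: Container A = 2GB, 1CPU
        System.out.println(node.allocate("B", 4096, 2)); // true
        System.out.println(node.allocate("C", 4096, 1)); // false: only 2048 MB left
        System.out.println(node.freeMemoryMb());         // 2048
    }
}
```

Unlike fixed map/reduce slots, each request states its own memory and CPU needs, so differently sized tasks can share the same node.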

What’s next for Hadoop?
• Real-time
– Apache Tez
• Part of Stinger
– Spark
• SQL in Hadoop
– Stinger
• An immediate aim of 100x performance increase for Hive is more
ambitious than any other effort.
• Based on industry standard SQL, the Stinger Initiative improves
HiveQL to deliver SQL compatibility.
– Shark

What’s next for Hadoop?
• Security: Data encryption
– HADOOP-9331: Hadoop crypto codec framework and crypto
codec implementations
• HADOOP-9332: Crypto codec implementations for AES
• HADOOP-9333: Hadoop crypto codec framework based on
compression codec
• MAPREDUCE-5025: Key Distribution and Management for supporting
crypto codec in Map Reduce
• 2013/09/28 Hadoop in Taiwan 2013
– Hadoop Security: Now and future
– Session B, 16:00~16:40

Growing Hadoop Ecosystem
• The term ‘Hadoop’ is taken to be the combination of
HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
– Typically referred to as the ‘Hadoop Ecosystem’
• ZooKeeper
• Hive and Pig
• HBase
• Flume
• Other Ecosystem Projects
– Sqoop
– Oozie
– Mahout

The Ecosystem is the System
• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache

Relation Map
• MapReduce Runtime (Dist. Programming Framework)
• Hadoop Distributed File System (HDFS)
• HBase (Column NoSQL DB)
• Sqoop/Flume (Data integration)
• Oozie (Job Workflow & Scheduling)
• Pig/Hive (Analytical Language)
• Mahout (Data Mining)
• YARN
• ZooKeeper (Coordination)
• Tez (near real-time processing)
• Spark (in-memory)
• Shark
ZooKeeper – Coordination Framework

What is ZooKeeper
• A centralized service for
– Maintaining configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information

Why use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution

ZooKeeper Architecture
– All servers store a copy of the data (in memory)
– A leader is elected at startup
– 2 roles – leader and follower
• Followers service clients, all updates go through leader
• Update responses are sent when a majority of servers have persisted the
change
– HA support
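
The majority rule above reduces to simple arithmetic. The sketch below is an illustration of that rule, not ZooKeeper code; the class `QuorumSketch` and its method names are hypothetical.

```java
// Hypothetical sketch of the majority rule described above; not ZooKeeper code.
class QuorumSketch {

    // Smallest number of servers that forms a majority of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // An update response is sent once a majority of servers have persisted the change.
    static boolean isCommitted(int ensembleSize, int serversPersisted) {
        return serversPersisted >= majority(ensembleSize);
    }

    // Number of server failures after which a majority can still be reached.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        System.out.println(majority(5));            // 3
        System.out.println(isCommitted(5, 2));      // false
        System.out.println(isCommitted(5, 3));      // true
        System.out.println(tolerableFailures(5));   // 2
        System.out.println(tolerableFailures(4));   // 1
    }
}
```

The last two lines show why ensembles usually have an odd number of servers: a fourth server adds no extra failure tolerance over three, and a sixth adds none over five.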

HBase – Column NoSQL DB

HBase – Inspired by Google Bigtable
• Apache open source project
• Inspired by Google Bigtable
• Non-relational, distributed database written in Java
• Coordinated by ZooKeeper

HBase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
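
The three properties above (versioned cells, rows sorted by key, regions as key ranges) can be modeled with nested sorted maps. This is a minimal in-memory sketch, not the HBase client API: the class `VersionedTable` and its methods are hypothetical, and the region is assumed to be the half-open range from the start key up to, but not including, the end key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the data model above; not the HBase client API.
class VersionedTable {
    // row key -> column -> (timestamp -> value); TreeMap keeps row keys sorted
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Writes one version of a cell; older versions of the same cell are kept.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Latest version of a cell, or null if the cell does not exist.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).lastEntry().getValue();
    }

    // Row keys in the range [startKey, endKey), in sorted order, like one region.
    List<String> region(String startKey, String endKey) {
        return new ArrayList<>(rows.subMap(startKey, true, endKey, false).keySet());
    }

    public static void main(String[] args) {
        VersionedTable table = new VersionedTable();
        table.put("row2", "info:name", 1L, "old");
        table.put("row2", "info:name", 2L, "new");
        table.put("row3", "info:name", 1L, "c");
        table.put("row1", "info:name", 1L, "a");
        System.out.println(table.get("row2", "info:name")); // prints new
        System.out.println(table.region("row1", "row3"));   // prints [row1, row2]
    }
}
```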

When to use HBase
• Need random, low latency access to the data
• Application has a flexible schema where each row is
slightly different
– Add columns on the fly
• Most columns are NULL in each row (sparse data)

Flume / Sqoop – Data Integration Framework

What’s the problem with data collection?
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path

What is Flume (and how can it help?)
• A distributed data collection service
• Efficiently collects, aggregates, and moves large
amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats

Sqoop
• Easy, parallel database import/export
• What do you want to do?
– Import data from an RDBMS into HDFS
– Export data from HDFS back into an RDBMS

What is Sqoop
• A suite of tools that connect Hadoop and database
systems
• Import tables from databases into HDFS for deep
analysis
• Export MapReduce results back to a database for
presentation to end-users
• Provides the ability to import from SQL databases
straight into your Hive data warehouse

How Sqoop helps
• The Problem
– Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
– Easy import of data from many databases to HDFS
– Generate code for use in MapReduce applications

Why Sqoop
• JDBC-based implementation
– Works with many popular database vendors
• Auto-generation of tedious user-side code
– Write MapReduce applications to work with your data, faster
• Integration with Hive
– Allows you to stay in a SQL-based environment

Pig / Hive – Analytical Language

Why Hive and Pig?
• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Hive – Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;

Hive/MR vs. Hive/Tez
http://www.slideshare.net/adammuise/2013-jul-23thughivetuningdeepdive

Pig
• Initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Makes MapReduce programs simple to write
• Easy to understand
• Easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

Hive vs. Pig
• Language
– Hive: HiveQL (SQL-like)
– Pig: Pig Latin, a scripting language
• Schema
– Hive: Table definitions that are stored in a metastore
– Pig: A schema is optionally defined at runtime
• Programmatic Access
– Hive: JDBC, ODBC
– Pig: PigServer

WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

WordCount Example In MapReduce
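import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Mapper: emits <word, 1> for every token in an input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}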

WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;

WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

Spark / Shark - Analytical Language

Why Spark
• MapReduce is too slow
• Aims to make data analytics fast, both fast to run and
fast to write
• Fits workloads that need iterative algorithms

What is Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targets problems that Hadoop MR is bad at
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~15 companies

BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/

What Is Different between Hadoop and Spark
• Spark (figure labels): Data Source, Map(), Data Source 2, Join(), Cache(), Transform
• Hadoop (figure labels): HDFS, Map, Reduce, Map, Reduce
http://spark.incubator.apache.org
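
The benefit of caching for iterative work can be shown with a toy example. The sketch below is plain Java, not the Spark API: `CachingSketch`, `readSource`, and `iterate` are hypothetical names, and `readSource` merely stands in for an expensive read from distributed storage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of why caching helps iterative jobs; not the Spark API.
class CachingSketch {
    static int sourceReads = 0;

    // Stands in for an expensive read from a distributed file system.
    static List<Integer> readSource() {
        sourceReads++;
        return Arrays.asList(1, 2, 3, 4);
    }

    // Runs the same aggregation several times, fetching its input through the supplier.
    static int iterate(Supplier<List<Integer>> input, int iterations) {
        int total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int x : input.get()) {
                total += x;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        iterate(CachingSketch::readSource, 3);          // re-reads the source on every pass
        System.out.println("reads without cache: " + sourceReads); // 3

        sourceReads = 0;
        List<Integer> cached = readSource();            // read once, keep in memory
        iterate(() -> cached, 3);
        System.out.println("reads with cache: " + sourceReads);    // 1
    }
}
```

Chained MapReduce jobs resemble the first call, going back to storage between passes, while a cached in-memory dataset resembles the second.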

What is Shark
• A data analytics (warehouse) system that
– Is a port of Apache Hive to run on Spark
– Is compatible with existing Hive data, metastores, and queries (HiveQL,
UDFs, etc.)
– Delivers speedups of up to 40x over Hive
– Scales out and is fault-tolerant
– Supports low-latency, interactive queries through in-memory
computing

Shark Architecture
• Clients: CLI, Thrift/JDBC
• Driver: SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.
• Hive Meta Store
• Spark
• HDFS/HBase

Oozie – Job Workflow & Scheduling

What is Oozie?
• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Figure: a workflow graph of Job 1, Job 2, Job 3, Job 4, and Job 5

Why Oozie
• Why use Oozie instead of just cascading jobs one
after another?
• Major flexibility
– Start, stop, suspend, and re-run jobs
• Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a specific node in the
graph or to skip specific failed nodes

How it is triggered
• Time
– Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00)
• Time and Data
– Materialize your workflow every hour (01:00, 02:00, 03:00, 04:00), but only run it when
the input data is ready, i.e., when the input data exists in Hadoop
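
The "time and data" trigger can be sketched as two conditions checked together. This is an illustration only, not the Oozie coordinator API: the class `TimeAndDataTrigger`, the method `runnableHours`, and the use of plain integers for hours are assumptions made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a time-and-data trigger; not the Oozie coordinator API.
class TimeAndDataTrigger {

    // Materializes one workflow instance per hour in [startHour, endHour] and
    // returns only the hours whose input data already exists.
    static List<Integer> runnableHours(int startHour, int endHour, Set<Integer> hoursWithInput) {
        List<Integer> runnable = new ArrayList<>();
        for (int hour = startHour; hour <= endHour; hour++) {   // time condition
            if (hoursWithInput.contains(hour)) {                // data condition
                runnable.add(hour);
            }
        }
        return runnable;
    }

    public static void main(String[] args) {
        Set<Integer> available = new HashSet<>(Arrays.asList(1, 2, 4));
        // The 03:00 instance is materialized but waits, because its input is missing.
        System.out.println(runnableHours(1, 4, available)); // prints [1, 2, 4]
    }
}
```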

Oozie use criteria
• Need to launch, control, and monitor jobs from your Java
apps
– Java Client API/Command Line Interface
• Need to control jobs from anywhere
– Web Service API
• Have jobs that you need to run every hour, day, or week
• Need to receive a notification when a job is done
– Email when a job is complete

Mahout – Data Mining

What is Mahout
• Machine-learning tool
• Distributed and scalable machine learning algorithms on
the Hadoop platform
• Makes building intelligent applications easier and faster

Why Mahout
• Current state of ML libraries
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Are Research oriented

Mahout – scale
• Scale to large datasets
– Hadoop MapReduce implementations that scale linearly with
data
• Scalable to support your business case
– Mahout is distributed under a commercially friendly Apache
Software license
• Scalable community
– Vibrant, responsive and diverse

Mahout – four use cases
• Mahout machine learning algorithms
– Recommendation mining : takes users’ behavior and finds items
a specified user might like
– Clustering : takes e.g. text documents and groups them based
on related document topics
– Classification : learns from existing categorized documents what
documents of a specific category look like and is able to assign
unlabeled documents to the appropriate category
– Frequent item set mining : takes a set of item groups (e.g. terms
in a query session, shopping cart content) and identifies which
individual items typically appear together
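
Frequent item set mining, the last use case above, can be illustrated for the simplest case of item pairs. The sketch below is a single-machine illustration and not Mahout's implementation: `PairCounter`, `frequentPairs`, the `minSupport` threshold, and the sample baskets are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of frequent pair mining; not Mahout's implementation.
class PairCounter {

    // Counts how many baskets contain each unordered pair of items and keeps
    // only the pairs that appear in at least minSupport baskets.
    static Map<String, Integer> frequentPairs(List<List<String>> baskets, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Deduplicate and sort so that each pair has one canonical name.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("beer", "diapers", "chips"),
                Arrays.asList("diapers", "beer"),
                Arrays.asList("beer", "chips"),
                Arrays.asList("milk"));
        System.out.println(frequentPairs(baskets, 2)); // prints {beer+chips=2, beer+diapers=2}
    }
}
```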

Use case Example
• Predict what the user likes based on
– His/her historical behavior
– Aggregate behavior of people similar to him/her

Conclusion
• Big Data Opportunities
– The market is still growing
• Hadoop 2.0
– Federation
– HA
– YARN
• What’s next for Hadoop
– Real-time query
– Data encryption
• What other projects are included in the Hadoop
ecosystem
– Different projects for different purposes
– Choose the right tools for your needs

Recap – Hadoop Ecosystem
• MapReduce Runtime (Dist. Programming Framework)
• HBase (Column NoSQL DB)
• Sqoop/Flume (Data integration)
• Oozie
• Pig/Hive
• Mahout (Data Mining)
• YARN
• ZooKeeper (Coordination)
• Tez (near real-time processing)
• Spark (in-memory)
• Shark
