Hadoop An Introduction

Mohanasundaram Ponnusamy

• Big Data is the term used to describe collections of data that are very large and that grow very quickly.
• More than 80% of this massive quantity of data is unstructured or semi-structured.
• Such data is difficult to manage with regular database or statistical tools.
• This is where Big Data solutions based on Hadoop and other analytics software become relevant.
What is Big Data?

• Open-source project of the Apache Software Foundation
• A framework written in Java, originally developed by Doug Cutting
• Hadoop was based on Google white papers
  ◦ 2003: Description of the Google File System (GFS), a method for storing data in a distributed, reliable fashion
  ◦ 2004: Description of distributed MapReduce, a method for processing data in a parallel fashion
• Optimized to handle
  ◦ Massive quantities of data through parallelism
  ◦ A variety of data, i.e., structured, unstructured, or semi-structured
  ◦ Relatively inexpensive commodity hardware
What is Hadoop?
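To make the MapReduce idea above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example job. It is an illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and are not part of the Hadoop API, and a real cluster would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or block of lines) could be mapped on a
    # different machine; this sketch simply loops over them in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each map call looks at only one line and each reduce call looks at only one key, the work can be split across machines without the tasks needing to coordinate with each other.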

• Massively parallel processing gives high performance on large data sets.
• Reliability through replication.
  ◦ Hadoop replicates its data across multiple computers. If one goes down, the data is processed on one of the others.
• The current Hadoop version is 2.6 (November 2014).
• Files can be appended to, but existing data cannot be updated in place.
• Find more info at http://hadoop.apache.org/
What is Hadoop? (Cont.)
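The replication point above can be sketched as a small simulation. This is a simplified model, not HDFS code: the helper names `place_replicas` and `readable_from` are hypothetical, the replication factor of 3 is an assumption (the slides do not state one), and the round-robin placement stands in for whatever placement policy a real cluster uses.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block in enumerate(blocks):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_from(placement, block, failed):
    """Return the first live node holding `block`, or None if every replica is lost."""
    for node in placement[block]:
        if node not in failed:
            return node
    return None


if __name__ == "__main__":
    placement = place_replicas(["b0", "b1"], ["n1", "n2", "n3", "n4"])
    # Node n1 has failed; block b0 is still served from another replica.
    print(readable_from(placement, "b0", failed={"n1"}))
```

A block becomes unreadable only when every node holding one of its replicas has failed at the same time, which is why adding replicas raises reliability.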

• Stores and processes large data sets (terabytes to petabytes)
• Fault tolerant: failure is treated as common
• Highly scalable horizontally with commodity hardware
• Moves computation to the storage instead of moving data to the processors
• Distributed computing (MapReduce)
• Not Only SQL
Why Hadoop?

• We can process data very quickly, but we can only read/write it very slowly
• Solution: parallel reads
  ◦ 1 HDD = 75 MB/sec
  ◦ 1,000 HDDs = 75 GB/sec
  ◦ Far more acceptable
Why Hadoop? (Cont.)
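The arithmetic behind the parallel-read figures above can be checked with a short calculation. The 75 MB/sec per-disk rate comes from the slide; the 1 TB data set size (taken as 1,000,000 MB, decimal units) and the function names are assumptions chosen for illustration, and the model ignores network and coordination overhead.

```python
PER_DISK_MB_S = 75        # per-disk read rate quoted on the slide
ONE_TB_IN_MB = 1_000_000  # assumed data set size, decimal units


def aggregate_throughput_mb_s(disks, per_disk_mb_s=PER_DISK_MB_S):
    """Total read rate when all disks are read in parallel."""
    return disks * per_disk_mb_s


def scan_time_seconds(data_mb, disks, per_disk_mb_s=PER_DISK_MB_S):
    """Time to read `data_mb` when it is spread evenly over `disks` disks."""
    return data_mb / aggregate_throughput_mb_s(disks, per_disk_mb_s)


if __name__ == "__main__":
    for disks in (1, 1000):
        seconds = scan_time_seconds(ONE_TB_IN_MB, disks)
        print(f"{disks} disk(s): {seconds:.1f} s to scan 1 TB")
```

Under these assumptions a single disk needs roughly 3.7 hours to scan 1 TB, while 1,000 disks reading in parallel need about 13 seconds.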

• HDFS
  ◦ A distributed file system that provides high-throughput access to application data.
• MapReduce (MR)
  ◦ A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• YARN
  ◦ A framework for job scheduling and cluster resource management.
• Common Utilities
  ◦ The set of utilities that supports the Hadoop subprojects.
  ◦ It includes the file system, RPC, serialization libraries, etc.
Hadoop Components

• Pig (Yahoo)
  ◦ A data flow scripting language (Pig Latin) and execution environment for exploring very large datasets.
  ◦ Pig runs on HDFS and MapReduce clusters.
• Hive (Facebook)
  ◦ A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (HiveQL) for querying the data.
• HBase
  ◦ A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper
  ◦ A distributed, highly available coordination service.
  ◦ ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Hadoop Ecosystem Projects

• Sqoop
  ◦ A tool for efficiently moving data between relational databases and HDFS.
• Flume
  ◦ A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• Oozie
  ◦ A workflow scheduler system to manage Apache Hadoop jobs.
• Cassandra
  ◦ A scalable multi-master database with no single point of failure.
• Mahout
  ◦ A scalable machine learning and data mining library.
Hadoop Ecosystem Projects (Cont.)

• Spark
  ◦ A fast analytics and stream processing engine for Hadoop data.
  ◦ Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez
  ◦ A generalized data-flow programming framework built on Hadoop YARN.
  ◦ It provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
  ◦ Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
Hadoop Ecosystem Projects (Cont.)
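The phrase "an arbitrary DAG of tasks" in the Tez description can be illustrated with a small sketch. This is not the Tez API (Tez is a Java framework); it only shows the general idea of running each task after the tasks it depends on, using the `graphlib` module from the Python standard library (Python 3.9+). The `run_dag` helper and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    tasks:        task name -> callable taking a dict of upstream results
    dependencies: task name -> set of upstream task names
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda up: [3, 1, 2],
        "filter": lambda up: [x for x in up["load"] if x > 1],
        "total": lambda up: sum(up["load"]),
        "report": lambda up: (up["filter"], up["total"]),
    }
    dependencies = {
        "filter": {"load"},
        "total": {"load"},
        "report": {"filter", "total"},
    }
    print(run_dag(tasks, dependencies)["report"])
```

In this sketch `filter` and `total` both depend only on `load`, so an engine that schedules the graph as a whole is free to run them in parallel, which a fixed map-then-reduce pipeline cannot express directly.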
