Getting Started With Big Data
Apache Hadoop
Apache Hadoop
• Apache Hadoop is a popular open-source framework for storing and processing large data sets across clusters of computers.
• HDP 2.2 on Sandbox system requirements:
– Runs on 32-bit and 64-bit OSes (Windows XP, Windows 7, Windows 8 and Mac OS X)
– Minimum 4 GB RAM; 8 GB required to run Ambari and HBase
– Virtualization enabled in the BIOS
– Browser: Chrome 25+, IE 9+ or Safari 6+ recommended (the Sandbox will not run on IE 10)
• The Sandbox is an ideal way to get started with Enterprise Hadoop: a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
• The Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
• It includes many of the most exciting developments from the latest HDP distribution, packaged in a virtual environment that you can get up and running in 15 minutes!
Hadoop… Getting Started
Terminologies
• Hadoop
• YARN – the Hadoop operating system
– Enables a user to interact with all data in multiple ways simultaneously, making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.
– A framework for job scheduling and cluster resource management.
– This means that many different processing engines can operate simultaneously across a Hadoop cluster, on the same data, at the same time.
• The Hadoop Distributed File System (HDFS)
– A distributed file system that provides high-throughput access to application data.
• MapReduce
– A YARN-based system for parallel processing of large data sets.
• Sqoop – a tool for bulk data transfer between Hadoop and structured datastores such as relational databases
• The Hive ODBC Driver
Hortonworks Data Platform (HDP)
• A 100% open-source distribution of Apache Hadoop that is truly enterprise grade, having been built, tested and hardened with enterprise rigor.
Introducing Apache Hadoop to
Developers
• Apache Hadoop is a community-driven open-source project governed by the Apache Software Foundation.
• It was originally implemented at Yahoo based on papers published by Google in 2003 and 2004.
• Since then Apache Hadoop has matured into a data platform for more than just batch processing of humongous amounts of data: with the advent of YARN it now supports many diverse workloads, such as interactive queries over large data with Hive on Tez, real-time data processing with Apache Storm, a super-scalable NoSQL datastore like HBase, an in-memory datastore like Spark, and the list goes on.
Apache Enterprise Hadoop
...
Core of Hadoop
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand. For most application scenarios Hadoop is linearly scalable, which means you can expect better performance by simply adding more nodes.
• The Hadoop Distributed File System (HDFS)
• MapReduce
MapReduce
• A method for distributing a task across multiple nodes. Each node processes the data stored on that node to the extent possible.
• A running MapReduce job consists of several phases: Map -> Sort -> Shuffle -> Reduce
• Advantages:
– Automatic parallelization and distribution of data in blocks across a distributed, scale-out infrastructure
– Fault tolerance against failure of storage, compute and network infrastructure
– Deployment, monitoring and security capability
– A clean abstraction for programmers
• Most MapReduce programs are written in Java, but they can also be written in any scripting language using Hadoop's Streaming API.
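The Map -> Sort -> Shuffle -> Reduce pipeline can be sketched in miniature with plain Python (a toy in-process model of the phases, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit (key, value) pairs; here (word, 1) for a word count
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def run_job(records):
    pairs = list(map_phase(records))
    pairs.sort(key=itemgetter(0))  # Sort: order intermediate pairs by key
    result = {}
    # Shuffle: bring all values for one key together; Reduce: aggregate them
    for key, group in groupby(pairs, key=itemgetter(0)):
        result[key] = sum(v for _, v in group)
    return result

counts = run_job(["the quick brown fox", "the lazy dog"])
```

In a real cluster the Map and Reduce calls run as separate tasks on different nodes, and the framework performs the sort and shuffle between them.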
MapReduce Concepts and Terminology
• MapReduce jobs are controlled by a software daemon known as the JobTracker. The JobTracker resides on a "master node". Clients submit MapReduce jobs to the JobTracker, which assigns Map and Reduce tasks to other nodes on the cluster.
• Each of those nodes runs a software daemon known as the TaskTracker. The TaskTracker is responsible for actually instantiating the Map or Reduce task and for reporting progress back to the JobTracker.
• A job is a complete program: the full execution of the Mappers and Reducers over a dataset. A task is the execution of a single Mapper or Reducer over a slice of the data.
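The job-versus-task distinction can be illustrated with a toy splitter (function names are invented for illustration; in Hadoop the JobTracker performs this assignment):

```python
def split_into_slices(dataset, slice_size):
    # The input is divided into slices; each slice becomes one map task.
    return [dataset[i:i + slice_size] for i in range(0, len(dataset), slice_size)]

def map_task(data_slice):
    # One task: run the Mapper over a single slice of the data.
    return [(word, 1) for line in data_slice for word in line.split()]

lines = ["a b", "a c", "b b", "c a"]
tasks = split_into_slices(lines, 2)      # the "job" is the sum of all these tasks
partials = [map_task(s) for s in tasks]  # each task would run on a different node
```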
Hadoop Distributed File System
• The foundation of the Hadoop cluster.
• Manages how datasets are stored in the Hadoop cluster.
• Responsible for distributing the data across the data nodes, managing replication for redundancy, and administrative tasks such as adding, removing and recovering data nodes.
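The distribution-and-replication idea can be sketched as follows (a simplified round-robin placement model; real HDFS uses rack-aware placement, with a default replication factor of 3):

```python
def place_blocks(num_blocks, datanodes, replication=3):
    # Assign each block to `replication` distinct datanodes, round-robin style.
    placement = {}
    n = len(datanodes)
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % n] for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(num_blocks=5, datanodes=nodes, replication=3)
```

Because every block lives on several nodes, losing one data node costs no data: the remaining replicas are re-copied to restore the replication factor.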
Apache Hive
• Provides a data-warehouse view of the data in HDFS.
• Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster.
• The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL.
• Since the data lives in HDFS, your operations can be scaled across all the datanodes and you can manipulate huge datasets.
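Projecting a table onto existing HDFS data and then querying it looks like this in HiveQL (the table name, path and columns are invented for illustration):

```sql
-- Project a table structure onto files already sitting in HDFS
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Ad-hoc summarization; Hive compiles this into jobs run on the cluster
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```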
Apache HCatalog
• Holds location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data's location and from metadata such as the schema.
• Since it supports many tools, like Hive and Pig, the location and metadata can be shared between tools. Through HCatalog's open APIs, other tools, such as Teradata Aster, can also use the location and metadata in HCatalog.
• In short, HCatalog is how we can reference data by name and inherit its location and metadata.
Apache Pig
• A language (Pig Latin) for expressing data analysis and infrastructure processes.
• Pig scripts are translated into a series of MapReduce jobs that are run by the Hadoop cluster.
• Pig is extensible through user-defined functions, which can be written in Java and other languages.
• Pig scripts provide a high-level language for creating the MapReduce jobs needed to process data in a Hadoop cluster.
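A small Pig Latin script of the kind described above (the input path and field names are invented for illustration):

```pig
-- Load tab-separated records from HDFS and count hits per URL
views = LOAD '/data/page_views' AS (user_id:chararray, url:chararray);
by_url = GROUP views BY url;
hits = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;
sorted_hits = ORDER hits BY n DESC;
top10 = LIMIT sorted_hits 10;
DUMP top10;
```

Each statement builds a relation; when DUMP (or STORE) is reached, Pig compiles the whole pipeline into MapReduce jobs.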
