BIG DATA ANALYTICS WITH HADOOP
BY:
SWAMIL SINGH
VIPLAV MANDAL
GUIDED BY: DR. S. SRIVASTAVA
AGENDA
• Design of website clickstream data, with an example.
• How to load data into the sandbox.
• Loading data using Flume: the process.
• About Flume.
• Flume’s working process.
• The process to refine data.
• MapReduce.
• HCatalog, and how HCatalog works.
• Hive.
• How Hive works, and its process.
• Queries.
DESIGN OF WEBSITE CLICKSTREAM DATA
• Clickstream data is an information trail a user leaves behind while visiting a website. It is
typically captured in semi-structured website log files.
• These website log files contain data elements such as a date and time stamp, the visitor’s IP
address, the destination URLs of the pages visited, and a user ID that uniquely identifies the
website visitor.
• One of the original uses of Hadoop at Yahoo was to store and process its massive volume of
clickstream data (a record-parsing sketch follows).
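To make the log structure concrete, here is a minimal Java sketch of one parsed clickstream
record. The tab-separated layout and the column order are assumptions for illustration, not the
exact Omniture log format.

import java.util.Objects;

// A minimal sketch of one parsed clickstream record (assumed tab-separated layout).
public class ClickEvent {
    final String timestamp; // date and time stamp
    final String ip;        // visitor's IP address
    final String url;       // destination URL of the visited page
    final String swid;      // user ID uniquely identifying the visitor

    ClickEvent(String timestamp, String ip, String url, String swid) {
        this.timestamp = Objects.requireNonNull(timestamp);
        this.ip = ip;
        this.url = url;
        this.swid = swid;
    }

    // One event per log line; column positions are assumed for this sketch.
    static ClickEvent parse(String logLine) {
        String[] f = logLine.split("\t");
        return new ClickEvent(f[0], f[1], f[2], f[3]);
    }
}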
EXAMPLE OF WEBSITE CLICKSTREAM DATA
HOW TO LOAD DATA INTO SANDBOX
• The sandbox is a fully contained Data Platform environment.
• The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all
the tools needed for data ingestion and processing.
• You can access and analyze sandbox data with many Business Intelligence (BI) applications.
• By combining web logs with more traditional customer data, we can better understand our
customers, and also understand how to optimize future promotions and advertising (see the
HDFS load sketch below).
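As a concrete illustration of landing a web log in the sandbox, the following sketch uses the
Hadoop FileSystem API. The NameNode address and the file paths are assumptions; adjust them to
your environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadToSandbox {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://sandbox:8020"); // assumed sandbox NameNode address
        FileSystem fs = FileSystem.get(conf);
        // Copy a local web-log file into HDFS for later refinement
        fs.copyFromLocalFile(new Path("/tmp/omniture_raw.log"),
                             new Path("/user/admin/clickstream/omniture_raw.log"));
        fs.close();
    }
}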
ABOUT FLUME
• Flume’s high-level architecture is built on a streamlined codebase that is easy to use and
extend.
• The project is highly reliable and designed to prevent data loss. Flume also supports dynamic
reconfiguration without a restart, which reduces downtime for its agents.
• Flume components interact in the following way:
• A flow in Flume starts from the Client.
• The Client transmits the Event to a Source operating within the Agent.
• The Source receiving this Event then delivers it to one or more Channels.
• One or more Sinks operating within the same Agent drain these Channels.
• Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer
model of data exchange (a minimal client sketch follows).
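A minimal client-side sketch of that flow, using Flume’s RPC client SDK: the Client builds an
Event and hands it to a Source. The host, port, and sample log line are assumptions, and an
Agent with an Avro Source is presumed to be listening.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumed: an Agent with an Avro Source listening on localhost:41414
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // One web-log line becomes one Event handed to the Source
            Event event = EventBuilder.withBody(
                    "2015-07-01 10:02:11\t203.0.113.9\t/products/cart",
                    StandardCharsets.UTF_8);
            client.append(event); // Source -> Channel -> Sink from here on
        } finally {
            client.close();
        }
    }
}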
LOAD DATA USING FLUME: THE PROCESS
SEQUENCE DIAGRAM OF FLUME
• Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput
streams into HDFS. These different types of data can be landed in Hadoop for future analysis
using interactive queries in Apache Hive.
• In one specific example, Flume is used to log manufacturing operations. When one run of
product comes off the line, it generates a log file about that run.
• The large volume of log-file data can stream through Flume into a tool for same-day analysis
with Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by
a quality assurance engineer using Apache Hive.
THE PROCESS TO REFINE DATA
• Omniture logs* – website log files containing information such as URL, timestamp, IP address,
geocoded IP address, and user ID (SWID).
• Users* – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.
• Products* – CMS data that maps product categories to website URLs.
MAPREDUCE
ABOUT MAPREDUCE
• A MapReduce job splits a large data set into independent chunks and organizes them into
key-value pairs for parallel processing.
• The Map function divides the input into ranges using the InputFormat and creates a map task
for each range in the input.
• The output of each map task is partitioned into a group of key-value pairs for each reducer.
• The Reduce function then collects the various results and combines them to answer the larger
problem that the master node needs to solve (a page-hit-count sketch follows).
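A short page-hit-count job illustrates the split/map/shuffle/reduce pattern described above.
The tab-separated log layout and the URL column position are assumptions carried over from the
earlier clickstream sketch.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitCount {
    // Map: one task per input split; emits (URL, 1) for every log line
    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t"); // assumed tab-separated log
            if (fields.length > 2) ctx.write(new Text(fields[2]), ONE); // assumed URL column
        }
    }

    // Reduce: collects all counts for one URL and sums them
    public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hit count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(HitReducer.class); // local pre-aggregation per map task
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}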
HCATALOG
• Apache HCatalog is a table management layer that exposes Hive metadata to other Hadoop
applications.
• HCatalog’s table abstraction presents users with a relational view of data in the Hadoop
Distributed File System (HDFS) and ensures that users need not worry about where or in what
format their data is stored.
• HCatalog displays data from RCFile format, text files, or sequence files in a tabular view.
HOW HCATALOG WORKS
• HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-
deserializer) can be written.
• By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom
format, you must provide the InputFormat, OutputFormat, and SerDe.
• HCatalog is built on top of the Hive metastore and incorporates components from the Hive
DDL.
• HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command-line
interface for issuing data definition and metadata exploration commands (a MapReduce read
sketch follows).
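A sketch of the MapReduce read interface: HCatInputFormat resolves the table through the Hive
metastore, so the mapper sees HCatRecord rows rather than raw files. The database name, table
name, and column position are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadSketch {
    // Rows arrive as HCatRecord objects regardless of the underlying file format
    public static class UrlMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord row, Context ctx)
                throws IOException, InterruptedException {
            // Column position 2 is assumed to hold the URL in this hypothetical table
            ctx.write(new Text(row.get(2).toString()), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcatalog read");
        job.setJarByClass(HCatReadSketch.class);
        // HCatalog resolves 'default.omniture_logs' through the Hive metastore
        HCatInputFormat.setInput(job, "default", "omniture_logs");
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(UrlMapper.class);
        job.setNumReduceTasks(0); // map-only sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}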
HIVE
• Hive is a component of the Data Platform. Hive provides a SQL-like interface to data stored in
the platform.
• Hive provides a database query interface to Apache Hadoop.
• Because of its SQL-like query language, Hive is often used as the interface to an Apache
Hadoop-based data warehouse (a JDBC query sketch follows this list).
• Pig fits in through its data-flow strengths: it takes on the tasks of bringing data into
Apache Hadoop and working with it to get it into shape for querying.
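A minimal sketch of that database-style interface, querying through HiveServer2’s JDBC driver.
The connection URL, credentials, and the omniture_logs table are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver on the classpath
        // Assumed: HiveServer2 on the sandbox at the default port 10000
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Top pages by hits from the hypothetical omniture_logs table
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM omniture_logs " +
                "GROUP BY url ORDER BY hits DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}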
HOW HIVE WORKS
• The tables in Hive are similar to tables in a relational database, and data units are organized in
a taxonomy from larger to more granular units.
• Databases are composed of tables, which are made up of partitions.
• Data can be accessed via a simple query language, and Hive supports overwriting or appending
data.
• Hive supports all the common primitive data types, such as BIGINT, BINARY, BOOLEAN,
CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT.
• In addition, analysts can combine primitive data types to form complex data types, such as
structs, maps, and arrays (see the DDL sketch below).
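The database-table-partition taxonomy and the type system can be seen in a small DDL sketch,
again through JDBC. The users and staging_users tables and their columns are hypothetical,
loosely modeled on the CRM data described earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDdlSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", ""); // assumed endpoint
        Statement stmt = con.createStatement();
        // Database -> table -> partition taxonomy; the column layout is an assumption
        stmt.execute(
            "CREATE TABLE IF NOT EXISTS users (" +
            " swid STRING," +                      // primitive types...
            " birth_dt STRING," +
            " gender CHAR(1)," +
            " prefs MAP<STRING,STRING>" +          // ...combined into a complex type
            ") PARTITIONED BY (signup_year INT)"); // partitions subdivide the table
        // Hive supports overwriting (or appending) data in a table or partition
        stmt.execute(
            "INSERT OVERWRITE TABLE users PARTITION (signup_year = 2015) " +
            "SELECT swid, birth_dt, gender, prefs FROM staging_users");
        stmt.close();
        con.close();
    }
}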
WORKING PROCESS OF HIVE
• Any queries?
• THANK YOU