The Hadoop Data Access Layer
Presented by Stephen Peter
Introduction
Stephen Peter
E-Mail: Stephen.peter@gmail.com
LinkedIn: https://in.linkedin.com/in/stephenepeter
Hortonworks Certified Trainer.
Hortonworks Certified Developer (Apache Pig & Hive)
Digital Badge: http://bcert.me/sxohnqiq
Professional Experience: Over 20 years of IT experience, specializing in Business Intelligence, Data Warehousing and Big Data.
Worked in organizations such as HCL Tech, Oracle and Cisco Systems.
Presently working as a Hadoop trainer at Spring People.
Area of interest: coexistence of Enterprise DW and Hadoop
Agenda
• The motivation for Hadoop
▫ The need for ingesting, storing and analyzing big data.
▫ Use cases on the value of Big Data.
• Hadoop as an integral part of Modern Data Architecture.
• The HDP (Hortonworks Data Platform) reference architecture.
▫ HDP Data Access Layer.
▪ The different components, their functions and applications.
• Use case – Data warehouse optimization using Hadoop.
▫ To achieve better insight and cost effectiveness.
Emerging Data landscape
• In the past the world’s data doubled every century; now it doubles every 2 years.
• The flood of data is driven by IoT, mobile devices, server logs, geolocation coordinates, social media and sensor data.
• Big data is characterized by:
▪ Velocity – 90% of the world’s data was created in the last two years.
▪ Volume – expected to grow from 8 ZB in 2015 to 40 ZB by 2020.
▪ Variety – 80% of enterprise data is unstructured, ranging from docs, emails and images to web logs, sensor data, geospatial coordinates and server logs.
Big Data Use Cases
Source: https://hortonworks.com
Hadoop – An integral part of modern Data Architecture
Source: https://hortonworks.com
Hortonworks Hadoop Platform - HDP
www.hortonworks.com
HDP - Data Access Layer
www.hortonworks.com
• Batch processing using the MapReduce framework.
• Interactive SQL queries using Hive on the Tez framework.
• Apache Pig scripts can run on MapReduce or Tez.
• Low-latency data access via the NoSQL database HBase.
• Apache Storm processes and analyzes streams of data in real time as they flow into HDFS.
• Apache Spark is a fast, in-memory data processing engine that enables batch, real-time, and advanced analytics on the Apache Hadoop platform.
Ingest Data into HDFS using Sqoop
Ingest Data into HDFS using Flume
▫ The primary use case:
▪ Stream log entries from multiple machines.
▪ Aggregate them to a centralized, persistent store such as the Hadoop Distributed File System (HDFS).
▪ Log entries can then be analyzed by other Hadoop tools.
▫ Flume is not limited to log entries.
▪ Flume is used to collect many types of streaming data.
▪ Examples include network traffic data, social media generated data, machine sensor data, and email messages.
▫ Flume is not the best choice where data is not regularly generated.
Importing Twitter data into HDFS
• Use the Twitter streaming API as the source.
• Create a Twitter application.
• Configure the Flume agent by modifying the Flume configuration.
▫ Configure the source, channel and sink.
▫ Source type: org.apache.flume.source.twitter.TwitterSource
▫ Channel type: memory (a memory channel, e.g. named MemChannel)
▫ Sink type: hdfs
• Run the flume-ng command to extract data from Twitter. The agent name must be passed with -n, for example:
$ flume-ng agent --conf ./conf/ -f conf/twitter.conf -n <agent_name>
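The source, channel and sink settings listed for the Twitter example could be written out roughly as follows. This is only a sketch: the agent name (TwitterAgent), the component names (Twitter, MemChannel, HDFS), the HDFS path, the capacity and roll values, and the credential placeholders are all assumptions, not values from the slides.

```shell
#!/bin/sh
# Sketch: generate a Flume agent configuration for the Twitter example.
# Agent/component names, paths and sizes are illustrative assumptions.
set -e
mkdir -p conf

cat > conf/twitter.conf <<'EOF'
# Name the source, channel and sink of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source: Twitter streaming API; the keys come from your Twitter application
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
TwitterAgent.sources.Twitter.channels = MemChannel

# Channel: in-memory buffer between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

# Sink: write the buffered events to HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
EOF

echo "Wrote conf/twitter.conf"
# To start the agent (needs a Flume installation and valid credentials):
#   flume-ng agent --conf ./conf/ -f conf/twitter.conf -n TwitterAgent
```

The value passed to -n has to match the prefix used in the configuration file (TwitterAgent in this sketch); otherwise the agent starts without any configured components.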
Query Data using Hive
Example HiveQL commands
▪ Create a Hive managed table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
▪ Create a Hive external table:
CREATE EXTERNAL TABLE salaries (gender STRING, age INT, salary DOUBLE, zip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';
▪ Load data from a file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv'
OVERWRITE INTO TABLE stockinfo;
▪ View everything in the table:
SELECT * FROM stockinfo;
Performance tuning in Hive
• Partition Hive tables
• Bucket Hive tables
• Use the Optimized Row Columnar (ORC) storage format
• Cost-based SQL optimization
• Use Hive on Tez for low-latency queries
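As a rough illustration of how these tuning options fit together, the script below writes a small HiveQL file. The table name stockinfo_orc, the partition column trade_date, the bucket count and the date literal are assumptions made for the example; only the stockinfo columns come from the slides.

```shell
#!/bin/sh
# Sketch: write a HiveQL script that combines the tuning options above.
# Table name, partition column, bucket count and date are hypothetical.
set -e

cat > hive_tuning.hql <<'EOF'
-- Run queries on Tez and enable the cost-based optimizer
SET hive.execution.engine=tez;
SET hive.cbo.enable=true;

-- A partitioned, bucketed table stored in ORC format
CREATE TABLE IF NOT EXISTS stockinfo_orc (
  symbol STRING,
  price FLOAT,
  change FLOAT
)
PARTITIONED BY (trade_date STRING)
CLUSTERED BY (symbol) INTO 8 BUCKETS
STORED AS ORC;

-- Table statistics are an input to the cost-based optimizer
ANALYZE TABLE stockinfo_orc PARTITION (trade_date) COMPUTE STATISTICS;

-- Filtering on the partition column lets Hive skip the other partitions
SELECT symbol, AVG(price) AS avg_price
FROM stockinfo_orc
WHERE trade_date = '2016-01-04'
GROUP BY symbol;
EOF

echo "Wrote hive_tuning.hql"
# To run it on a cluster with Hive installed:
#   hive -f hive_tuning.hql
```

Partitioning pays off when queries filter on the partition column, and bucketing on a column that is frequently joined or sampled; both choices depend on the actual query patterns.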
Use cases for Apache Pig
• Pig can extract data from multiple sources, transform it and store it in HDFS.
• Research raw data.
• Iterative data processing
[Diagram: database data, log data and sensor data are extracted, transformed and loaded into HDFS by Pig, then used by Hive, analysis tools and other tools]
Example Pig Statements
▪ Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',') AS
(symbol:chararray, price:float, change:float);
▪ Display the data in stockinfo:
DUMP stockinfo;
▪ Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';
▪ Load data from a file without applying a schema:
a = LOAD 'flightdelays' USING PigStorage(',');
▪ Apply a schema on read:
c = FOREACH a GENERATE $0 AS year:int, $1 AS month:int,
$4 AS name:chararray;
Create workflow using Apache Oozie
[Diagram: Oozie workflow example combining Sqoop, Pig, Hive, MapReduce, distcp and email actions]
Apache Oozie is a server-based workflow engine used to execute Hadoop jobs.
It is used to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache Pig, and Apache Sqoop jobs into a single, logical unit of work.
Oozie can also perform Java, Linux shell, distcp, SSH, email, and other operations.
Oozie runs as a Java web application in Apache Tomcat.
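To make the idea of a single logical unit of work more concrete, the sketch below writes a minimal Oozie workflow definition that runs a Pig action followed by a Hive action. The workflow name, action names and script file names (filter_stocks.pig, load_stocks.hql) are hypothetical, and the schema versions shown may need adjusting to the installed Oozie release.

```shell
#!/bin/sh
# Sketch: write a minimal Oozie workflow.xml chaining a Pig and a Hive action.
# Workflow, action and script names are illustrative assumptions.
set -e
mkdir -p oozie-app

# The quoted heredoc delimiter keeps ${...} literal for Oozie to resolve.
cat > oozie-app/workflow.xml <<'EOF'
<workflow-app name="stock-etl-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="transform-with-pig"/>

  <action name="transform-with-pig">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>filter_stocks.pig</script>
    </pig>
    <ok to="load-with-hive"/>
    <error to="fail"/>
  </action>

  <action name="load-with-hive">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load_stocks.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF

echo "Wrote oozie-app/workflow.xml"
# After copying oozie-app to HDFS and preparing a job.properties file:
#   oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run
```

Each action names the node to run next on success (ok) and on failure (error), which is how Oozie turns separate jobs into one workflow.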
Use Case - Data warehouse Optimization with Hadoop
Hadoop data access layer v4.0
