Big Data Ingestion Using Hadoop: 1038-A
1
Faculty Advisor and Sponsor: Dr. Rohit Aggarwal
Team Members and Roles
• Software Developer and Reporting Analyst
• Data Engineer
2
Agenda
• Data Scenario today
• Why this project and Objective
• Business Questions to be answered
• Project Infrastructure
• Project Workflow
• Learnings and Conclusion
3
Data Scenario today
• According to IBM, 80% of the data generated today is unstructured
• Unstructured data must be processed into structured data before it can be analyzed
• Examples: outsourcing data, video streaming data, social media data, log data
4
Why this Project?
• To learn and implement the big data technologies that are used to
process log data
• Clickstream log data as the data source
• What is clickstream data?
Clickstream data is a record of a user's navigation through a website
5
• The big data world comprises many technologies
• The focus of this project is to learn and implement the Apache Hadoop
ecosystem
• Hadoop is used primarily for data engineering by many companies,
notably Amazon, eBay, and Walmart
6
Business Questions to be answered
• Most popular browsing time
• Most popular product category
• Weekly Distribution of Clicks per page
• Customer Conversion Rate
7
Project Infrastructure
Major components of the infrastructure
• eCommerce Website Setup: Magento eCommerce platform,
GoDaddy cPanel hosting
• Apache Hadoop Setup: multi-node Cloudera Hadoop cluster
on AWS EC2
• Data Collection Server: Divolte.js set up on AWS EC2 to track
custom events from the website
8
Project Workflow
• Three primary stages
• We performed several iterations of the workflow below
9
Stage 1: Data Collection
10
Stage 2: Data Ingestion and Engineering
• Data ingestion is the process of accessing and importing data for
immediate use or storage in a database
• Build a data pipeline
• Transfer files from the local filesystem to HDFS (Hadoop Distributed
File System); a minimal transfer sketch follows this slide
• E.g., Apache Sqoop is a popular tool in the big data ecosystem for
transferring bulk data between relational databases and Hadoop
11
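As an illustration of the transfer step, the sketch below pushes local log files into HDFS by shelling out to the hdfs CLI. This is a minimal, hedged example rather than the project's actual pipeline: the directory paths are hypothetical, and it assumes the `hdfs` command is configured on the node running the script.

```python
import subprocess
from pathlib import Path

# Hypothetical locations: adjust to wherever Divolte writes its Avro files
# and wherever the web server keeps its access logs.
LOCAL_DIRS = ["/opt/divolte/logs", "/var/log/httpd"]
HDFS_TARGET = "/user/hadoop/clickstream/raw"

def push_to_hdfs(local_dir: str, hdfs_dir: str) -> None:
    """Copy every file in local_dir into HDFS using the hdfs CLI."""
    for path in Path(local_dir).glob("*"):
        if path.is_file():
            # `hdfs dfs -put -f` overwrites an existing file of the same name.
            subprocess.run(
                ["hdfs", "dfs", "-put", "-f", str(path), hdfs_dir],
                check=True,
            )

if __name__ == "__main__":
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_TARGET], check=True)
    for directory in LOCAL_DIRS:
        push_to_hdfs(directory, HDFS_TARGET)
```

In practice a scheduler (such as cron or Oozie) would run a step like this periodically so new log files keep landing in HDFS.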
• Data engineering is the process of converting unstructured data into
meaningful relational data using a set of tools and procedures
• ELT process (Extract, Load, Transform)
• Focus on data transformation using MapReduce (see the streaming
sketch after this slide)
Components used in this project:
• Apache Hive
• Apache Pig
• Python
12
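To make the MapReduce step concrete, here is a minimal Hadoop Streaming sketch in Python that counts clicks per hour of day, one of the business questions. It is illustrative only: the tab-separated record layout, the timestamp position, and the paths in the comment are assumptions, not the project's real schema or code.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming job in Python: clicks per hour of day.

Assumes tab-separated input whose first field is an ISO-8601 timestamp
(a hypothetical layout). The same file serves as mapper and reducer:
  hadoop jar hadoop-streaming.jar \
      -input /user/hadoop/clickstream/raw \
      -output /user/hadoop/clickstream/hourly \
      -mapper "clicks_per_hour.py map" -reducer "clicks_per_hour.py reduce" \
      -file clicks_per_hour.py
"""
import sys
from itertools import groupby


def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and len(fields[0]) >= 13:
            # "2017-03-21T14:05:09" -> hour "14"
            print(f"{fields[0][11:13]}\t1")


def reducer():
    # Streaming sorts mapper output by key, so identical hours arrive adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for hour, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{hour}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```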
Apache Hive
• Hive provides a SQL-like interface to query data stored in various
databases and file systems
Application in our project: Avro files (data source); an illustrative
HiveQL sketch follows this slide
[Screenshot: data before and after running the Hive script]
13
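For illustration, the snippet below maps Divolte's Avro output to a Hive external table and runs a query through the Hive CLI. It is a hedged sketch, not the project's script: the table name, HDFS locations, schema file, and the `location` column are assumptions.

```python
import subprocess

# Hypothetical table, paths, and column -- the real Avro schema may differ.
HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_avro
STORED AS AVRO
LOCATION '/user/hadoop/clickstream/raw'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/schemas/click.avsc');

SELECT location, COUNT(*) AS clicks
FROM clickstream_avro
GROUP BY location
ORDER BY clicks DESC
LIMIT 10;
"""

# Run the statements through the Hive CLI; `beeline -e` works similarly.
subprocess.run(["hive", "-e", HQL], check=True)
```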
Apache Pig and Python
• Pig is a high-level platform for creating programs that run on Hadoop;
its language is called Pig Latin
Application in our project: web server log file (data source)
[Screenshot: data before and after running the Pig script]
14
Sample of Python Script
15
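The deck shows the project's actual parsing script only as a screenshot. Below is a comparable, hedged sketch that parses a Combined Log Format web-server log into CSV; the category rule (first URL path segment) and the file names are hypothetical.

```python
import csv
import re

# Combined Log Format, e.g.:
# 127.0.0.1 - - [21/Mar/2017:10:05:03 -0600] "GET /women/tops.html HTTP/1.1" 200 5123 "ref" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line: str):
    """Return (ip, time, category, status) or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    # Hypothetical category rule: first non-empty path segment, e.g. "women".
    segments = [s for s in match["path"].split("/") if s]
    category = segments[0] if segments else ""
    return match["ip"], match["time"], category, match["status"]

if __name__ == "__main__":
    with open("access.log") as logs, open("clicks.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "time", "category", "status"])
        for line in logs:
            row = parse_line(line)
            if row:
                writer.writerow(row)
```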
Stage 3: Data Visualization
• Final stage of the process
• Structured data exported from Hadoop to CSV format (an export sketch
follows this slide)
• Use of business intelligence tools such as Tableau and Power
BI
16
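As a sketch of the export step, the snippet below runs a HiveQL aggregation for daily conversion rate through the Hive CLI and rewrites its tab-separated output as a CSV that Tableau or Power BI can read. The table, column names, and the Magento success-page URL pattern are assumptions rather than the project's actual definitions.

```python
import csv
import subprocess

# Hypothetical query: daily conversion rate = ordering sessions / all sessions.
HQL = """
SELECT to_date(click_time) AS day,
       COUNT(DISTINCT session_id) AS sessions,
       COUNT(DISTINCT IF(page LIKE '%checkout/onepage/success%', session_id, NULL)) AS orders
FROM clickstream_avro
GROUP BY to_date(click_time);
"""

# The Hive CLI prints result rows to stdout as tab-separated values.
result = subprocess.run(["hive", "-e", HQL], check=True,
                        capture_output=True, text=True)

with open("conversion_rate.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["day", "sessions", "orders", "conversion_rate"])
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        day, sessions, orders = line.split("\t")
        rate = int(orders) / int(sessions) if int(sessions) else 0.0
        writer.writerow([day, sessions, orders, f"{rate:.4f}"])
```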
Sample Reports
[Charts: Website Browsing Hour Distribution; Customer Conversion Rate]
17
18
Weekly Distribution of Clicks per page
19
Most Popular Product Category
Challenges Faced
• eCommerce Website Setup
• Multi-node cluster creation on Amazon EC2
• Cloudera Hadoop installation on the cluster
• Parsing links to get detailed information about product
category and sub-category
20
Learnings and Conclusion
• Implementation of an end-to-end project
• Technology Stack worked on:
21
Special Thanks
• Thanks to our entire Capstone Faculty team and Sponsor for
timely guidance
• Bi-weekly progress reports gave us regular reality checks on the
project
• Great learning experience
22
Any Questions?
23
Thank You!!!
24
Editor's Notes

  • #5: https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/
  • #6: http://searchcrm.techtarget.com/definition/clickstream-analysis
  • #12: http://whatis.techtarget.com/definition/data-ingestion and http://sqoop.apache.org/
  • #13: https://blog.insightdatascience.com/data-science-vs-data-engineering-62da7678adaa
  • #14: https://hive.apache.org/
  • #15: https://pig.apache.org/
  • #16: Python used for parsing text data.
  • #21: Multi-node: Elastic IP, VPC, Security Groups. Parsing links: extracting information from hyperlinks.