Big Data Ingestion Using Hadoop: 1038-A
1
Faculty Advisor and Sponsor: Dr. Rohit Aggarwal
Team Members and Roles
• Software Developer and Reporting Analyst
• Data Engineer
2
Agenda
• Data Scenario today
• Why this project and Objective
• Business Questions to be answered
• Project Infrastructure
• Project Workflow
• Learnings and Conclusion
3
Data Scenario today
• According to IBM, 80% of the data generated today is unstructured
• Unstructured data must be processed into structured data before it can be analyzed
• Examples: outsourcing data, video streaming data, social media data, log data
4
Why this Project?
• To learn and implement the big data technologies that are used to
process log data
• Clickstream log data as the data source
• What is clickstream data?
Clickstream data is a record of a user's navigation through a website
5
• The big data world comprises many technologies
• The focus of this project is to learn and implement the Apache Hadoop
ecosystem
• Hadoop is used primarily for data engineering by many companies,
notably Amazon, eBay, and Walmart
6
Business Questions to be answered
• Most popular browsing time
• Most popular product category
• Weekly Distribution of Clicks per page
• Customer Conversion Rate
7
Project Infrastructure
Major components of the infrastructure
• eCommerce Website Setup: Magento eCommerce platform,
GoDaddy cPanel hosting
• Apache Hadoop Setup: multi-node Cloudera Hadoop cluster
on AWS EC2
• Data Collection Server: Divolte.js set up on AWS EC2 to track
custom events from the website
8
Project Workflow
• Three primary stages
• We performed several iterations of the workflow below
9
Stage 1: Data Collection
10
Stage 2: Data Ingestion and Engineering
• Data ingestion is the process of accessing and importing data for
immediate use or storage in a database
• Build a data pipeline
• Transfer files from the local filesystem to HDFS (Hadoop Distributed
File System); a minimal transfer sketch follows this slide
• E.g., Apache Sqoop is a popular tool in the big data ecosystem for
transferring bulk data between relational databases and Hadoop
11
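As an illustration of the transfer step, the sketch below pushes local log files into HDFS by shelling out to the hdfs CLI. This is a minimal, hedged example rather than the project's actual pipeline: the directory paths are hypothetical, and it assumes the `hdfs` command is configured on the node running the script.

```python
import subprocess
from pathlib import Path

# Hypothetical locations: adjust to wherever Divolte writes its Avro files
# and wherever the web server keeps its access logs.
LOCAL_DIRS = ["/opt/divolte/logs", "/var/log/httpd"]
HDFS_TARGET = "/user/hadoop/clickstream/raw"

def push_to_hdfs(local_dir: str, hdfs_dir: str) -> None:
    """Copy every file in local_dir into HDFS using the hdfs CLI."""
    for path in Path(local_dir).glob("*"):
        if path.is_file():
            # `hdfs dfs -put -f` overwrites an existing file of the same name.
            subprocess.run(
                ["hdfs", "dfs", "-put", "-f", str(path), hdfs_dir],
                check=True,
            )

if __name__ == "__main__":
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_TARGET], check=True)
    for directory in LOCAL_DIRS:
        push_to_hdfs(directory, HDFS_TARGET)
```

In practice a scheduler (such as cron or Oozie) would run a step like this periodically so new log files keep landing in HDFS.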
• Data engineering is the process of converting unstructured data into
meaningful relational data using a set of tools and procedures
• ELT process (Extract, Load, Transform)
• Focus on data transformation using MapReduce (see the streaming
sketch after this slide)
Components used in this project:
• Apache Hive
• Apache Pig
• Python
12
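To make the MapReduce step concrete, here is a minimal Hadoop Streaming sketch in Python that counts clicks per hour of day, one of the business questions. It is illustrative only: the tab-separated record layout, the timestamp position, and the paths in the comment are assumptions, not the project's real schema or code.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming job in Python: clicks per hour of day.

Assumes tab-separated input whose first field is an ISO-8601 timestamp
(a hypothetical layout). The same file serves as mapper and reducer:
  hadoop jar hadoop-streaming.jar \
      -input /user/hadoop/clickstream/raw \
      -output /user/hadoop/clickstream/hourly \
      -mapper "clicks_per_hour.py map" -reducer "clicks_per_hour.py reduce" \
      -file clicks_per_hour.py
"""
import sys
from itertools import groupby


def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and len(fields[0]) >= 13:
            # "2017-03-21T14:05:09" -> hour "14"
            print(f"{fields[0][11:13]}\t1")


def reducer():
    # Streaming sorts mapper output by key, so identical hours arrive adjacent.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for hour, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{hour}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```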
Apache Hive
• Hive provides a SQL-like interface to query data stored in various
databases and file systems
Application in our project: Avro files (data source); an illustrative
HiveQL sketch follows this slide
[Screenshot: data before and after running the Hive script]
13
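For illustration, the snippet below maps Divolte's Avro output to a Hive external table and runs a query through the Hive CLI. It is a hedged sketch, not the project's script: the table name, HDFS locations, schema file, and the `location` column are assumptions.

```python
import subprocess

# Hypothetical table, paths, and column -- the real Avro schema may differ.
HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_avro
STORED AS AVRO
LOCATION '/user/hadoop/clickstream/raw'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/schemas/click.avsc');

SELECT location, COUNT(*) AS clicks
FROM clickstream_avro
GROUP BY location
ORDER BY clicks DESC
LIMIT 10;
"""

# Run the statements through the Hive CLI; `beeline -e` works similarly.
subprocess.run(["hive", "-e", HQL], check=True)
```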
Apache Pig and Python
• Pig is a high-level platform for creating programs that run on Hadoop;
its language is called Pig Latin
Application in our project: web server log file (data source)
[Screenshot: data before and after running the Pig script]
14
Sample of Python Script
15
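The deck shows the project's actual parsing script only as a screenshot. Below is a comparable, hedged sketch that parses a Combined Log Format web-server log into CSV; the category rule (first URL path segment) and the file names are hypothetical.

```python
import csv
import re

# Combined Log Format, e.g.:
# 127.0.0.1 - - [21/Mar/2017:10:05:03 -0600] "GET /women/tops.html HTTP/1.1" 200 5123 "ref" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line: str):
    """Return (ip, time, category, status) or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    # Hypothetical category rule: first non-empty path segment, e.g. "women".
    segments = [s for s in match["path"].split("/") if s]
    category = segments[0] if segments else ""
    return match["ip"], match["time"], category, match["status"]

if __name__ == "__main__":
    with open("access.log") as logs, open("clicks.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "time", "category", "status"])
        for line in logs:
            row = parse_line(line)
            if row:
                writer.writerow(row)
```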
Stage 3: Data Visualization
• Final stage of the process
• Structured data exported from Hadoop to CSV format (an export sketch
follows this slide)
• Use of business intelligence tools such as Tableau and Power
BI
16
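As a sketch of the export step, the snippet below runs a HiveQL aggregation for daily conversion rate through the Hive CLI and rewrites its tab-separated output as a CSV that Tableau or Power BI can read. The table, column names, and the Magento success-page URL pattern are assumptions rather than the project's actual definitions.

```python
import csv
import subprocess

# Hypothetical query: daily conversion rate = ordering sessions / all sessions.
HQL = """
SELECT to_date(click_time) AS day,
       COUNT(DISTINCT session_id) AS sessions,
       COUNT(DISTINCT IF(page LIKE '%checkout/onepage/success%', session_id, NULL)) AS orders
FROM clickstream_avro
GROUP BY to_date(click_time);
"""

# The Hive CLI prints result rows to stdout as tab-separated values.
result = subprocess.run(["hive", "-e", HQL], check=True,
                        capture_output=True, text=True)

with open("conversion_rate.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["day", "sessions", "orders", "conversion_rate"])
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        day, sessions, orders = line.split("\t")
        rate = int(orders) / int(sessions) if int(sessions) else 0.0
        writer.writerow([day, sessions, orders, f"{rate:.4f}"])
```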
Sample Reports
[Charts: Website Browsing Hour Distribution; Customer Conversion Rate]
17
18
Weekly Distribution of Clicks per page
19
Most Popular Product Category
Challenges Faced
• eCommerce Website Setup
• Multi-node cluster creation on Amazon EC2
• Cloudera Hadoop installation on the cluster
• Parsing links to get detailed information about product
category and sub-category
20
Learnings and Conclusion
• Implementation of an end-to-end project
• Technology Stack worked on:
21
Special Thanks
• Thanks to our entire Capstone Faculty team and Sponsor for
timely guidance
• Bi-weekly progress reports gave us regular reality checks on the
project
• Great learning experience
22
Any Questions?
23
Thank You!!!
24
Editor's Notes

  • #5: https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/
  • #6: http://searchcrm.techtarget.com/definition/clickstream-analysis
  • #12: http://whatis.techtarget.com/definition/data-ingestion and http://sqoop.apache.org/
  • #13: https://blog.insightdatascience.com/data-science-vs-data-engineering-62da7678adaa
  • #14: https://hive.apache.org/
  • #15: https://pig.apache.org/
  • #16: Python used for parsing text data.
  • #21: Multi-node: Elastic IP, VPC, Security Groups. Parsing links: extracting information from hyperlinks.