Dipanjan Mukherjee
Bigdata and Hadoop
Bootcamp
What Is Bigdata
Big data is a collection of datasets so large that they cannot be
processed using traditional computing techniques. Big data is not
merely data; it has become a complete subject that involves various
tools, techniques, and frameworks.
Bigdata Market Size
Big Data Perspective And Volume
▪ The big data growth we've been witnessing is only natural. We constantly generate
data. On Google alone, we submit 40,000 search queries per second. That amounts
to roughly 1.2 trillion searches yearly!
▪ Each minute, 300 new hours of video show up on YouTube. That's why there's more
than 1 billion gigabytes (1 exabyte) of data on its servers!
▪ People share more than 100 terabytes of data on Facebook daily. Every minute,
users send 31 million messages and view 2.7 million videos.
▪ Big data usage statistics indicate that people take about 80% of photos on their
smartphones. Considering that over 1.4 billion devices will be shipped worldwide
this year alone, we can only expect this percentage to grow.
▪ Smart devices (for example, fitness trackers, sensors, Amazon Echo) produce 5
quintillion bytes of data daily. In 5 years, we can expect the number of these
gadgets to exceed 50 billion!
▪ Big data stats indicate that more than 30% of data will be uploaded to the cloud
by next year.
▪ Large companies like Google use distributed computing to satisfy their customers'
needs. About 1,000 computers are involved in answering every query.
▪ In fact, the market for Hadoop, the most popular open-source framework for
distributed computing, has a compound annual growth rate of 58% and is expected
to surpass $1 billion by 2020.
What Is Hadoop
Hadoop is a free, open-source, Java-based programming
framework that supports the processing of large
data sets in a distributed computing environment.
It is a project of the Apache Software Foundation.
Hadoop Architecture
Hadoop Components
❖ MapReduce
❖ HDFS
❖ Hadoop Common
❖ YARN (Yet Another Resource Negotiator)
MapReduce Architecture
MapReduce Wordcount Example
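The word count slide can be sketched in plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's Java API; the function names and the sample sentences are hypothetical and chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
    print(word_count(sample))
    # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster, each map call would run on the node that holds the input split, and the shuffle would move the grouped pairs across the network to the reducers.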
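HDFS Architecture
[Figure: HDFS architecture. The NameNode holds the metadata (name, replicas, …), for example /home/foo/data, 3, …. Clients send metadata ops to the NameNode and read from and write to the DataNodes, which are spread across Rack 1 and Rack 2. The NameNode sends block ops to the DataNodes, and blocks are replicated between DataNodes.]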
HDFS Benefits
HDFS is distributed across hundreds or even thousands of servers, with each node storing a part of the file system.
Because the storage runs on commodity hardware, node failures are more likely and, with them, data loss.
HDFS overcomes this problem by storing multiple replicas of the same data on different nodes.

HDFS works well for data loads that arrive in a streaming format, so it is better suited to batch processing
applications than to interactive use. HDFS is designed for high throughput rather than low latency.

HDFS works especially well for large datasets; a typical file ranges from gigabytes to terabytes in size.
It provides high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster,
so a single instance can support millions of files.

Data coherency matters. HDFS files typically follow a write-once, read-many-times access model: once written,
the data stays the same and can be read multiple times without any data coherency issues.

HDFS works on the assumption that moving computation is much easier, faster, and cheaper than moving
very large data, which can create network congestion and lead to longer overall turnaround times.
HDFS therefore lets applications run their computation close to where the data is located.

HDFS is cost-effective in the sense that it runs on commodity hardware of different types
without compatibility issues. Hence, it is well suited to taking advantage of cheap and readily
available commodity hardware components.
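The replication idea can be illustrated with a small toy model. This is a sketch under stated assumptions, not HDFS's actual placement policy: it assumes a 128 MiB block size and a replication factor of 3, uses a simple round-robin placement that ignores racks, and all function and node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB) for this sketch
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block gets `replication` distinct nodes.

    Real HDFS placement is rack-aware; this sketch is not.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def file_readable(placement, failed_nodes):
    """A file stays readable while every block has at least one live replica."""
    return all(
        any(node not in failed_nodes for node in nodes)
        for nodes in placement.values()
    )


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    placement = place_replicas(split_into_blocks(1024 ** 3), nodes)  # 1 GiB file
    print(len(placement))                                     # 8
    print(file_readable(placement, {"dn1", "dn2"}))           # True
    print(file_readable(placement, {"dn1", "dn2", "dn3"}))    # False
```

With three replicas per block, the file survives the loss of any two nodes in this model; it becomes unreadable only when every replica of some block is lost.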
HDFS Limitations
❑ Issue with small files
❑ Slow processing speed
❑ Latency
❑ Security
❑ No real-time data processing
❑ Support for batch processing only
❑ Uncertainty
❑ Lengthy code
❑ No caching
❑ No ease of use
❑ No delta iteration
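The first limitation, the issue with small files, comes from the NameNode keeping metadata for every file and block in memory. The estimate below is a rough sketch: it assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object and a 128 MiB block size. Both figures, and the function names, are assumptions for illustration rather than values taken from the slides.

```python
BYTES_PER_OBJECT = 150          # assumed rule-of-thumb NameNode cost per object
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB)


def namenode_objects(num_files, file_size, block_size=BLOCK_SIZE):
    """Count metadata objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, (file_size + block_size - 1) // block_size)
    return num_files * (1 + blocks_per_file)


def namenode_memory(num_files, file_size):
    """Estimate NameNode memory in bytes for num_files files of file_size bytes."""
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    MIB, GIB, TIB = 1024 ** 2, 1024 ** 3, 1024 ** 4
    # The same 1 TiB of data, stored as many small files or as few large ones.
    small = namenode_memory(TIB // MIB, MIB)   # 1,048,576 files of 1 MiB
    large = namenode_memory(TIB // GIB, GIB)   # 1,024 files of 1 GiB
    print(small)  # 314572800 bytes, roughly 300 MiB
    print(large)  # 1382400 bytes, roughly 1.3 MiB
```

Under these assumptions, the same terabyte costs the NameNode over 200 times more memory when it is stored as 1 MiB files instead of 1 GiB files.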
Apache Spark Overview
Spark is a cluster computing framework for large-scale data processing.
Spark offers a set of libraries in several languages (including Java, Scala,
and Python) for its unified computing engine. What does this definition actually mean?
▪ Unified: with Spark, there is no need to piece together an application
out of multiple APIs or systems. Spark provides you with enough built-in
APIs to get the job done.
▪ Computing Engine: Spark handles the loading of data from various file
systems and runs computations on it, but does not store any data itself
permanently. Spark keeps working data in memory wherever possible, which
is the main source of its performance and speed.
▪ Libraries: Spark includes a series of libraries built for data
science tasks: SQL (Spark SQL), Machine Learning (MLlib), Stream
Processing (Spark Streaming and Structured Streaming), and Graph
Analytics (GraphX).
Apache Spark architecture
Apache Hadoop MR VS Apache Spark
Spark vs Hadoop MapReduce
Factors | Spark | Hadoop MapReduce
Speed | Up to 100x faster than MapReduce | Faster than traditional systems
Written in | Scala | Java
Data Processing | Batch / real-time / iterative / interactive / graph | Batch processing
Ease of Use | Compact and easier than Hadoop | Complex and lengthy
Caching | Caches data in memory, which improves system performance | Does not support caching of data
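The caching row of the comparison can be illustrated with a toy model. This is plain Python, not Spark's API: `ToyDataset`, `run_iterations`, and `load_numbers` are hypothetical names, and the `cache()` method only loosely mirrors the idea of keeping a dataset in memory between passes.

```python
class ToyDataset:
    """Toy stand-in for a dataset that is expensive to load (not Spark's API)."""

    def __init__(self, loader):
        self._loader = loader
        self._cache_enabled = False
        self._cached = None
        self.loads = 0  # how many times the underlying source was read

    def cache(self):
        """Ask for the data to be kept in memory after the first load."""
        self._cache_enabled = True
        return self

    def collect(self):
        """Return the data, reading the source only when no cached copy exists."""
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._cache_enabled:
            self._cached = data
        return data


def run_iterations(dataset, iterations):
    """An iterative job that reads the whole dataset once per iteration."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.collect())
    return total


def load_numbers():
    """Pretend this reads a large file from distributed storage."""
    return list(range(1, 101))


if __name__ == "__main__":
    uncached = ToyDataset(load_numbers)
    cached = ToyDataset(load_numbers).cache()
    print(run_iterations(uncached, 5), uncached.loads)  # 25250 5
    print(run_iterations(cached, 5), cached.loads)      # 25250 1
```

Both runs produce the same result, but the cached dataset reads its source once instead of once per iteration, which is why caching matters most for iterative and interactive workloads.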
Bigdata on cloud
Aspect | Cloud computing | Bigdata
Definition | Provides resources (storage, computing, databases, monitoring tools, etc.) on demand | Provides a way to handle huge volumes of data and generate insights
Reference | It refers to internet services ranging from SaaS and PaaS to IaaS | It refers to data, which can be structured, semi-structured, or unstructured
How they are used | It uses a wide network of cloud servers over the internet to analyze data and information | It can be deployed either on-premise or in the cloud to discover hidden patterns and generate actionable insights
Formats | Cloud computing is a new paradigm for delivering computing resources | It consists of all kinds of data, in many different formats
Used for | Used to store data and information on remote servers | Used to describe huge volumes of data and information
