www.edureka.co/big-data-and-hadoop
Introduction to Big Data and Hadoop
What will you learn today?
• Big Data – An Introduction
• Use Cases of Big Data in Multiple Industry Verticals
• Hadoop and its Eco-System
• Hadoop Architecture
• Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists
Unstructured Data is Exploding
IBM’s Definition of Big Data
IBM’s Definition – Big Data Characteristics
Annie’s Introduction
Hello There!! My name is Annie.
I love quizzes and puzzles and I am here to make you guys think and answer my questions.
Annie’s Question
Map the following to corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
Annie’s Answer
Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) → Structured data
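The three data types in Annie's answer can be illustrated with a minimal Python sketch. The sample records below (a CRM export, an order document, a few raw bytes) are hypothetical and exist only to show how each type is typically handled; they are not taken from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields
# (hypothetical CRM export).
crm_export = "customer_id,name,plan\n101,Asha,gold\n102,Ravi,silver\n"
rows = list(csv.DictReader(io.StringIO(crm_export)))

# Semi-structured: self-describing tags, but elements may vary per record
# (hypothetical order document).
order_xml = "<order id='7'><item sku='A1' qty='2'/><note>gift wrap</note></order>"
order = ET.fromstring(order_xml)
skus = [item.get("sku") for item in order.iter("item")]

# Unstructured: opaque bytes with no schema the program can query directly
# (stand-in for an audio, video or image payload).
media_payload = bytes([0x52, 0x49, 0x46, 0x46, 0x00, 0x01])

print(rows[0]["plan"], skus, len(media_payload))
```

Structured data can be queried by column name, semi-structured data by walking its tags, while unstructured data needs a format-specific decoder before any analysis is possible.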
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Common Big Data Customer Scenarios
• Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
• Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Call Detail Record (CDR) Analysis
» Network Analysis to Predict Failures
Common Big Data Customer Scenarios (Contd.)
• Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
• Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
Common Big Data Customer Scenarios (Contd.)
• Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
• Retail
» Point-of-Sale Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
Why DFS?
Task: Read 1 TB Data
» 1 Machine (4 I/O Channels, Each Channel – 100 MB/s): 43 Minutes
» 10 Machines (each with 4 I/O Channels, Each Channel – 100 MB/s): 4.3 Minutes
Spreading the data across 10 machines lets all of them read in parallel, so the read time drops by a factor of 10.
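The read times on the "Why DFS?" slides follow from simple throughput arithmetic, sketched below. The function name and the choice of 1 TB = 1024 × 1024 MB are assumptions made for illustration; with a decimal terabyte the single-machine figure comes out slightly lower (just under 42 minutes).

```python
def read_time_minutes(data_mb, machines, channels_per_machine=4,
                      mb_per_sec_per_channel=100):
    """Time to read data_mb when every channel on every machine reads in parallel."""
    throughput_mb_per_sec = machines * channels_per_machine * mb_per_sec_per_channel
    return data_mb / throughput_mb_per_sec / 60

# Assumption: 1 TB counted as 1024 * 1024 MB.
ONE_TB_IN_MB = 1024 * 1024

single = read_time_minutes(ONE_TB_IN_MB, machines=1)
cluster = read_time_minutes(ONE_TB_IN_MB, machines=10)

print(f"1 machine:   {single:.1f} minutes")
print(f"10 machines: {cluster:.1f} minutes")
```

The sketch assumes the data is already split evenly across the machines and ignores network and coordination overhead, which is the idealized case the slide describes.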
Hadoop!
Hadoop Cluster: A Typical Use Case
Active NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 Cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant Power Supply
StandBy NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 Cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant Power Supply
Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 Cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant Power Supply
DataNode (two shown, identical configuration)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 Cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
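As a rough sizing sketch for the DataNode specification above (6 x 2 TB disks per node), the snippet below estimates raw and usable HDFS capacity. The node count of 10 is hypothetical, and the replication factor of 3 is the HDFS default for `dfs.replication`; space reserved for the OS and intermediate data is ignored.

```python
def hdfs_capacity_tb(datanodes, disks_per_node=6, tb_per_disk=2, replication=3):
    """Return (raw_tb, usable_tb) for a cluster of identical DataNodes."""
    raw_tb = datanodes * disks_per_node * tb_per_disk
    # Every block is stored `replication` times, so usable space shrinks accordingly.
    usable_tb = raw_tb / replication
    return raw_tb, usable_tb

# Assumption: a hypothetical cluster of 10 DataNodes built to the slide's spec.
raw, usable = hdfs_capacity_tb(datanodes=10)
print(f"raw: {raw} TB, usable at 3x replication: {usable:.0f} TB")
```

This is why DataNodes carry many large disks while the NameNodes, which hold only metadata in memory, are specified with more RAM and a single 1 TB disk.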
Hidden Treasure
Case Study: Sears Holdings Corporation
• Insight into data can provide Business Advantage.
• Some key early indicators can mean Fortunes to Business.
• More Precise Analysis with more data.
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.
Limitations of Existing Data Analytics Architecture
Architecture layers (top to bottom):
» BI Reports + Interactive Apps
» RDBMS (Aggregated Data)
» Processing: ETL Compute Grid
» Storage: Storage only Grid (Original Raw Data, Mostly Append)
» Collection
» Instrumentation
Data availability:
» A meagre 10% of the ~2PB data is available for BI
» 90% of the ~2PB is archived
Limitations:
1. Can’t explore original high fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death
Solution: A Combined Storage Compute Layer
Architecture layers (top to bottom):
» BI Reports + Interactive Apps
» RDBMS (Aggregated Data)
» Hadoop: Storage + Compute Grid (Both Storage And Processing, Mostly Append)
» Collection
» Instrumentation
Data availability:
» Entire ~2PB Data is available for processing
» No Data Archiving
Benefits:
1. Data Exploration & Advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing Non-Hadoop solutions.
Annie’s Question
Hadoop is a framework that allows for the distributed
processing of:
» Small Data Sets
» Large Data Sets
Annie’s Answer
Ans. Large Data Sets.
Hadoop is also capable of processing small data sets. However, to experience the true power of Hadoop, one needs data in terabytes, because that is the scale at which an RDBMS can take hours or fail, whereas Hadoop can do the same work in a couple of minutes.
Hadoop Ecosystem (Hadoop 2.0)
» Flume (Unstructured or Semi-structured Data)
» Sqoop (Structured Data)
» HDFS (Hadoop Distributed File System)
» YARN (Cluster Resource Management)
» MapReduce Framework
» Other YARN Frameworks (MPI, GRAPH)
» HBase
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Apache Oozie (Workflow)
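To give a feel for what the MapReduce Framework in the ecosystem does, here is a minimal single-process sketch of the map, shuffle and reduce steps applied to a word count. It is plain Python, not the Hadoop API: the function names and the two sample lines are illustrative assumptions, and a real job would run the map and reduce steps on many machines over data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

# Hypothetical input; in Hadoop each line would come from a block of an HDFS file.
lines = ["big data and hadoop", "hadoop stores big data"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across a cluster, which is what makes the model suitable for large data sets.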
Hadoop Cluster: Facebook
Facebook
• We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
• Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
• We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
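The per-node figures quoted for the Facebook clusters can be cross-checked against the cluster totals with a few lines of arithmetic. This is only a sanity-check sketch: it assumes 1 PB = 1000 TB, and since the slide gives storage as "about" figures, only approximate agreement with the quoted 12 TB per node should be expected.

```python
# Cluster totals as quoted on the slide.
clusters = {
    "large": {"machines": 1100, "cores": 8800, "raw_pb": 12},
    "small": {"machines": 300, "cores": 2400, "raw_pb": 3},
}

per_node = {}
for name, c in clusters.items():
    cores_per_node = c["cores"] / c["machines"]
    # Assumption: decimal units, 1 PB = 1000 TB.
    tb_per_node = c["raw_pb"] * 1000 / c["machines"]
    per_node[name] = (cores_per_node, tb_per_node)
    print(f"{name}: {cores_per_node:.0f} cores/node, {tb_per_node:.1f} TB/node")
```

Both clusters work out to exactly 8 cores per node, and to roughly 10 to 11 TB of raw storage per node, which is in the same range as the rounded per-node figure on the slide.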
YARN – Moving beyond MapReduce
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search, Weave, …)
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Fully-Distributed Mode
• Hadoop daemons run on a cluster of machines.
Pseudo-Distributed Mode
• Hadoop daemons run on the local machine.
Standalone (or Local) Mode
• No daemons, everything runs in a single JVM.
• Suitable for running MapReduce programs during development.
• Has no DFS.
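In practice the three modes differ mainly in a few configuration properties. The sketch below renders those properties as Hadoop-style `<configuration>` XML using Python's standard library. The helper `hadoop_site_xml` is a hypothetical name introduced here; the pseudo-distributed values (`hdfs://localhost:9000`, replication 1) follow the Apache single-node setup guide, `file:///` is the default local file system, and the fully-distributed host `namenode.example.com:8020` is a placeholder, not a value from the slides.

```python
import xml.etree.ElementTree as ET

def hadoop_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Properties per mode, grouped by the file they belong in.
MODES = {
    # Standalone: local file system, no HDFS daemons.
    "standalone": {
        "core-site.xml": {"fs.defaultFS": "file:///"},
    },
    # Pseudo-distributed: all daemons on localhost, one copy of each block.
    "pseudo-distributed": {
        "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
        "hdfs-site.xml": {"dfs.replication": 1},
    },
    # Fully distributed: placeholder NameNode address, default replication of 3.
    "fully-distributed": {
        "core-site.xml": {"fs.defaultFS": "hdfs://namenode.example.com:8020"},
        "hdfs-site.xml": {"dfs.replication": 3},
    },
}

for filename, props in MODES["pseudo-distributed"].items():
    print(filename)
    print(hadoop_site_xml(props))
```

Switching `MODES["pseudo-distributed"]` for another key prints the corresponding settings; a real fully-distributed cluster needs further configuration (for example YARN and worker lists) that this sketch leaves out.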
Big Data Learning Path
Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark
Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization
Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R
Courses
• Big Data and Hadoop
• MapReduce Design Patterns
• Apache Spark & Scala
• Apache Cassandra
• Linux Administration
• Hadoop Administration
• Data Science
• Business Analytics Using R
• Advanced Predictive Modelling in R
• Talend for Big Data
• Data Visualization Using Tableau
Learning Path to Certification
Course
• LIVE Online Class
• Class Recording in LMS
• 24/7 Post Class Support
• Module Wise Quiz and Assignment
• Project Work
• Verifiable Certificate
1. Assistance from Peers and Support team
2. Review for Certification
DEMO
Further Reading
• Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
• Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Thank You
Questions/Queries/Feedback
Recording and presentation will be made available to you within 24 hours
