Introduction to Big Data
Prepared By: Marwan A. Al-Wajeeh

Outline
• Big Data: An Overview
• Big Data Sources
• What Is Big Data?
• Big Data Challenges
• Big Data Analytics

Big Data: An Overview
• More than 2.5 quintillion bytes of data are created every day.
• IBM: 90 percent of the world's data today was produced in the last two years.
• 80% of the world's data is unstructured.
• Facebook processes 500 TB of data per day.
• Lots and lots of web pages (20 billion web pages indexed by Google).
• A billion Facebook users.
• Billions of Facebook pages.
• Hundreds of millions of Twitter accounts.
• Hundreds of millions of tweets per day.
• Billions of Google queries per day.
• Millions of servers, petabytes of data.

Big Data

Internet of Events: 4 sources of event data

Big Data Sources

What Is Big Data?
• Big data is a collection of data sets that are large and complex in nature.
• Big data is any data that is expensive to manage and hard to extract value from.
• It comprises both structured and unstructured data, and it grows so large and so fast that it cannot be managed by traditional relational database systems or conventional statistical tools.

Big Data: Four Challenges (4 V's)
• Volume: the size of the data.
  • Google example:
    • 10 billion web pages
    • Average size of a web page = 20 KB
    • 10 billion * 20 KB = 200 TB
    • Disk read bandwidth = 50 MB/s
    • Time to read = 4 million seconds = 46+ days
  • Airbus A380 example:
    • The four engines of an A380 generate 1 PB of data on a single flight, for example from London (LHR) to Singapore (SIN).

Big Data: Four Challenges (4 V's)
• Velocity (speed of change).
  • We are not only generating a large amount of data; data is also being added continuously, and things change very rapidly.
• Variety (different types of data sources).
  • The diversity of sources, formats, quality, and structure.
• Veracity (uncertainty of data).
  • We cannot be completely sure that what we have recorded is accurate or complete.
Traditional vs Big Data

Big Data Analytics
Big data analytics is the process of collecting, organizing, and analyzing large sets of data ("big data") to discover patterns and other useful information.

Traditional vs Big Data Analytics

Traditional Analytics                 | Big Data Analytics
--------------------------------------+--------------------------------------
Works on known data that is well      | Works on data whose format is not well
understood.                           | understood, because it is largely
                                      | unstructured or semi-structured.
--------------------------------------+--------------------------------------
Built on relational data models.      | Data comes in various forms and formats
                                      | from multiple disconnected systems; it
                                      | is mostly flat, with no relationships.

Analytical Challenges with Big Data
• A traditional RDBMS fails to handle big data:
  • Big data (terabytes) cannot fit in the memory of a single computer.
  • Processing big data on a single computer takes a long time.
  • Scaling a traditional RDBMS is expensive.

Single Node Architecture
• Diagram: a single node with CPU, memory, and disk, used for machine learning and statistics.
• The algorithm runs on the CPU and accesses data that is in memory.
• The data must first be brought from disk into memory.
• What happens if the data is so big that it cannot all fit in memory at the same time?
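
When the data does not fit in memory, one common single-node workaround is to stream it from disk in fixed-size chunks and keep only a small running state. The following is a minimal sketch, not part of the original slides; the names `streaming_mean` and `demo`, and the one-number-per-line file format, are assumptions made for illustration.

```python
import os
import tempfile
from itertools import islice


def streaming_mean(path, chunk_lines=1000):
    """Compute the mean of one-number-per-line data without loading the whole file."""
    total = 0.0
    count = 0
    with open(path) as handle:
        while True:
            # Only `chunk_lines` values are held in memory at any time.
            chunk = [float(line) for line in islice(handle, chunk_lines)]
            if not chunk:
                break
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else 0.0


def demo():
    # Hypothetical sample file; in practice the file would be far larger than RAM.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("\n".join(str(i) for i in range(1, 10001)))
        path = tmp.name
    try:
        return streaming_mean(path, chunk_lines=256)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Streaming removes the memory limit but not the disk-bandwidth limit: the whole data set still has to pass through one disk.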

Google Example
• 10 billion web pages
• Average size of a web page = 20 KB
• 10 billion * 20 KB = 200 TB
• Disk read bandwidth = 50 MB/s
• Time to read = 4 million seconds = 46+ days
• This is unacceptable, so we need a better solution.
• Cluster computing emerged as a new solution.
  • The fundamental idea is to split the data into chunks: with 1,000 disks and CPUs working in parallel, the same read takes about an hour.
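
The arithmetic in this example can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes) and perfectly even splitting across disks with no coordination overhead; the function name `read_time_seconds` is illustrative only.

```python
KB = 10**3  # decimal units, matching the round numbers on the slide
MB = 10**6


def read_time_seconds(num_pages, page_bytes, disk_bytes_per_sec, num_disks=1):
    """Time to scan the whole corpus when it is split evenly across `num_disks`."""
    total_bytes = num_pages * page_bytes
    return total_bytes / (disk_bytes_per_sec * num_disks)


single_disk = read_time_seconds(10 * 10**9, 20 * KB, 50 * MB)
cluster = read_time_seconds(10 * 10**9, 20 * KB, 50 * MB, num_disks=1000)

print(f"single disk: {single_disk:,.0f} s = {single_disk / 86400:.1f} days")
print(f"1,000 disks: {cluster:,.0f} s = {cluster / 60:.0f} minutes")
```

Under these assumptions the single-disk scan takes about 46 days, while 1,000 disks reading in parallel finish in roughly 67 minutes.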

Cluster Architecture
• Each node has its own CPU, memory, and disk.
• Each rack contains 16-64 nodes connected by a switch.
• 1 Gbps between any pair of nodes in a rack.
• 2-10 Gbps backbone between racks.
• Multiple racks together make up a data center.
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Once we have this kind of cluster, the problem is still not completely solved.

Cluster Computing Challenges
• Node failure
  • A single server can stay up for about 3 years (1,000 days).
  • 1,000 servers in a cluster => about 1 failure per day.
  • 1 million servers in a cluster => about 1,000 failures per day (Google has approximately a million servers).
• How do we store data persistently and keep it available if nodes can fail?
• How do we deal with node failure during a long-running computation?
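
The failure figures follow from dividing the number of servers by the mean uptime of one server. The sketch below assumes that servers fail independently and that each one fails on average once every 1,000 days; both function names are hypothetical.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Average number of servers that fail on a given day."""
    return num_servers / mean_uptime_days


def prob_any_failure_today(num_servers, mean_uptime_days=1000):
    """Chance that at least one server fails today, assuming independent failures."""
    p_single = 1 / mean_uptime_days
    return 1 - (1 - p_single) ** num_servers


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(n, expected_failures_per_day(n), round(prob_any_failure_today(n), 3))
```

Even at 1,000 servers, the chance of seeing at least one failure on any given day is already above 60 percent under these assumptions, so failure has to be treated as the normal case.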

Cluster Computing Challenges
• Network bottleneck
  • Network bandwidth = 1 Gbps
  • Moving 10 TB takes approximately 1 day.
  • A complex computation might need to move a lot of data, and that can slow the computation down.
  • We need a framework that does not move data around so much while it is computing.
• Distributed programming is hard!
  • It is hard to write distributed programs correctly.
  • We need a simple model that hides most of the complexity of distributed programming.
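
The "approximately 1 day" figure can be reproduced directly. This sketch assumes decimal units (1 TB = 10^12 bytes), a fully utilized 1 Gbps link, and no protocol overhead; `transfer_time_hours` is an illustrative name.

```python
TB = 10**12          # bytes, decimal units
GBPS = 10**9         # bits per second


def transfer_time_hours(data_bytes, link_bits_per_sec):
    """Hours needed to push `data_bytes` through a link of the given speed."""
    seconds = data_bytes * 8 / link_bits_per_sec
    return seconds / 3600


if __name__ == "__main__":
    print(f"{transfer_time_hours(10 * TB, 1 * GBPS):.1f} hours")
```

The result is a little over 22 hours, which is why moving the computation to the data is usually cheaper than moving the data to the computation.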

Map-Reduce
MapReduce addresses the challenges of cluster computing:
• Store data redundantly on multiple nodes for persistence and availability.
• Move computation close to the data to minimize data movement.
• Provide a simple programming model that hides the complexity of all this magic.
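
To make the programming model concrete, here is a single-process word-count sketch in plain Python. It only imitates the map, shuffle, and reduce steps; it is not the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are chosen for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    mapped = []
    for chunk in chunks:
        # On a real cluster, each chunk would be mapped on the node that stores it.
        mapped.extend(map_phase(chunk))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data moves slowly"]))
```

The programmer writes only the map and reduce functions; in a real framework, splitting the input, scheduling work near the data, and recovering from node failures are handled by the runtime.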

Big Data Analytics Tools and Technologies
• Hadoop = MapReduce + HDFS
• Ecosystem tools: Pig, Hive, HBase, Flume, RHadoop, Sqoop, Oozie, Avro, ZooKeeper
Thank You

4 Types of Analytics
• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What will happen?
• Prescriptive: What is the best that can happen?
Analytics Tools:
• SAS
• IBM SPSS
• Stata
• R
• MATLAB

• The key aspects of the big data platform are integration, analytics, visualization, development, workload optimization, security, and governance.

The 5 High-Value Big Data Use Cases
Thank You