Big data and computing grid
Starting our complex problem
Awesome
Distributed file system
Lightning-Fast Cluster Computing
About the goal
This workshop helps you understand the compute grid, its model, and its
implementation. We have tried our best to explain the concepts in detail. The
programming language used for the demos is Scala. We then apply this model
to some problems to see where it is simple and useful.
• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications. The challenges include capture, curation,
storage, search, sharing, transfer, analysis, and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes to even
petabytes of information.
NYSE generates about one terabyte of new trade
data per day, which is used for stock trading analytics to
determine trends for optimal trades.
• 2,500 exabytes of new information in 2013,
with the internet as the primary driver
• The digital universe grew by 62% last year to
800K petabytes and will grow to 1.2
zettabytes this year
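The two figures above use different units, so a small worked conversion may help compare them. The sketch below is illustrative only: it assumes decimal (SI) units, where 1 zettabyte = 1,000,000 petabytes, and the object and function names are made up for this example.

```scala
object DataUnits {
  // Assumption: decimal (SI) units, 1 ZB = 1,000 EB = 1,000,000 PB.
  val PetabytesPerZettabyte: BigDecimal = BigDecimal(1000000)

  // Convert a size in petabytes to zettabytes.
  def petabytesToZettabytes(pb: BigDecimal): BigDecimal =
    pb / PetabytesPerZettabyte

  // Relative growth from `from` to `to`, expressed as a percentage.
  def growthPercent(from: BigDecimal, to: BigDecimal): BigDecimal =
    (to - from) / from * 100

  def main(args: Array[String]): Unit = {
    val current = petabytesToZettabytes(BigDecimal(800000)) // 800K PB
    val projected = BigDecimal("1.2")                       // in ZB
    println(s"800K PB in ZB: $current")
    println(s"Growth to 1.2 ZB, in percent: ${growthPercent(current, projected)}")
  }
}
```

Under these assumptions, 800K petabytes is 0.8 zettabytes, so growing to 1.2 zettabytes corresponds to a 50% increase.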
Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even
petabytes—of information.
Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big
data must be used as it streams into your enterprise in order to maximize its value.
Big data is any type of data - structured and unstructured data such as text, sensor data,
audio, video, click streams, log files and more. New insights are found when analyzing these
data types together.
Recommendation engines
Ad targeting
Search quality
Abuse and click fraud detection

Customer churn prevention
Network performance optimization
Calling data record analysis
Analyzing network to predict failure

Health information exchange
Gene sequencing
Healthcare service quality improvements
Drug safety

Modeling true risk
Threat analysis
Fraud detection
Credit scoring and analysis
Apache Hadoop is a framework that allows for
distributed processing of large data sets across
clusters of commodity computers using a simple
programming model.
It is an open-source data management platform with
scale-out storage and distributed processing.
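The "simple programming model" referred to here is MapReduce. As a rough illustration, the sketch below runs the three phases (map, shuffle, reduce) of a word count on one machine using only Scala's standard collections. It does not use the Hadoop API, and the names `mapPhase`, `shuffle`, and `reducePhase` are chosen for this example; on a real cluster each phase would be spread across many nodes.

```scala
object WordCount {
  // Map phase: each line is turned into (word, 1) pairs independently of
  // every other line, which is what allows the work to be split up.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group the pairs by key so that all counts for one
  // word end up in the same place.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2) }

  // Reduce phase: combine the values for each key into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => word -> counts.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val lines = Seq("big data and computing grid", "big data on a grid")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w: $n") }
  }
}
```

Because the map step looks at one line at a time and the reduce step looks at one key at a time, both can run in parallel on different machines; only the shuffle needs to move data between them.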
MapReduce (processing):
• Splits a task across processors “near” the data and assembles the results
• JobTracker manages the TaskTrackers
HDFS (storage):
• Self-healing, high-bandwidth clustered storage
• Distributed across nodes
• Natively redundant
• NameNode tracks block locations
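To make "natively redundant" and "NameNode tracks locations" more concrete, here is a toy model in plain Scala. It is a hypothetical sketch, not the HDFS API: the file is just a size in megabytes, the block size and replication factor are parameters, and replicas are placed round-robin, whereas real HDFS placement is rack-aware and more involved.

```scala
object ToyNameNode {
  // Hypothetical toy model (not the HDFS API). Splits a file into blocks and
  // returns the "NameNode" view: block index -> nodes holding a replica.
  def placeBlocks(fileSizeMb: Int, blockSizeMb: Int, nodes: Vector[String],
                  replication: Int): Map[Int, Vector[String]] = {
    require(replication <= nodes.size, "need at least `replication` nodes")
    val blockCount = (fileSizeMb + blockSizeMb - 1) / blockSizeMb // ceiling division
    (0 until blockCount).map { block =>
      // Round-robin placement (a simplification): consecutive nodes, so the
      // replicas of one block always land on different nodes.
      block -> Vector.tabulate(replication)(r => nodes((block + r) % nodes.size))
    }.toMap
  }

  // Replicas of each block that remain reachable after some nodes fail.
  def survivingReplicas(locations: Map[Int, Vector[String]],
                        failed: Set[String]): Map[Int, Vector[String]] =
    locations.map { case (block, holders) => block -> holders.filterNot(failed) }

  def main(args: Array[String]): Unit = {
    val locations = placeBlocks(300, 128, Vector("n1", "n2", "n3", "n4"), 3)
    val afterFailure = survivingReplicas(locations, Set("n1"))
    afterFailure.toSeq.sortBy(_._1).foreach { case (block, holders) =>
      println(s"block $block -> ${holders.mkString(", ")}")
    }
  }
}
```

With three replicas per block on four nodes, losing any single node still leaves at least two copies of every block, which is the property that lets the storage layer heal itself by re-replicating.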
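Apache Spark is an open-source cluster-computing
framework for large-scale data processing, including near-real-time
stream processing. It was originally developed at UC Berkeley's AMPLab
and is now maintained by the Apache Software Foundation.
Spark provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance.
Spark is not built on top of Hadoop MapReduce, although it can run on
Hadoop clusters and read from HDFS. It extends the MapReduce model to
efficiently support more types of computations.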
Resilient Distributed Datasets (RDD): a collection that can be operated on in parallel.
Transformations: map, flatMap, filter, …
Actions: collect, take, count, …
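The operations listed for RDDs fall into two kinds: transformations such as map, flatMap, and filter, which Spark evaluates lazily, and actions such as collect, take, and count, which trigger the computation. The sketch below imitates that behaviour with a lazy view from Scala's standard library, so it runs without Spark. It is only a local analogy and not the Spark API; the `run` function and its call counter are made up for this illustration.

```scala
object LazyPipeline {
  // Returns (map calls before the action, map calls after it, the result).
  def run(lines: Seq[String]): (Int, Int, List[Int]) = {
    var mapCalls = 0
    // "Transformations": these only describe the pipeline. A view is lazy,
    // so none of the functions below has run yet.
    val lengths = lines.view
      .flatMap(_.split(" ").toList)
      .filter(_.nonEmpty)
      .map { word => mapCalls += 1; word.length }
    val callsBefore = mapCalls
    // "Action": asking for concrete results forces the pipeline to execute.
    val firstTwo = lengths.take(2).toList
    (callsBefore, mapCalls, firstTwo)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run(Seq("big data and", "computing grid"))
    println(s"map calls before action: $before, after action: $after")
    println(s"lengths of the first two words: $result")
  }
}
```

Deferring work until an action is requested is what lets an engine see the whole pipeline before running it and avoid computing elements that are never needed.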
[Figure: time to sort 100 TB of data, Hadoop MapReduce vs. Spark]
Most active open source community in big data
200+ developers, 50+ companies contributing
