Big Data Architecture on Cloud
Computing Infrastructure
Reza Bakhshayeshi
About me
• Reza Bakhshayeshi
• MSc. Information Technology – Computer Networks
• 7 years of experience in Cloud Computing research
• 3 years of experience in industry
• Email: bakhshayeshi.reza@gmail.com
Agenda
• Cloud Computing
• Introduction to OpenStack
• Why OpenStack
• What is Sahara?
• Sahara Architecture
• Lab Session
Cloud Computing
Five Essential Characteristics
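• Based on NIST: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.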
Service Offering Models
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
Introduction to OpenStack
• OpenStack began in 2010 as a joint project of Rackspace Hosting
and NASA.
• OpenStack is a free and open-source software platform for cloud
computing, mostly deployed as infrastructure-as-a-service
(IaaS).
Why OpenStack?
• OpenStack elevates your business to the cloud.
OpenStack is a scalable, open-source cloud computing
platform.
• Composed of a modular, scalable, and flexible set of
utilities; provides clients with value, efficiency, and
agility.
Why OpenStack?
• Open-source; the technology is supported by a large
community of developers.
• Tried and tested by large businesses.
• Interoperability and open-source APIs allow admins
to manage hybrid IT environments without an
additional overhead layer.
OpenStack By Numbers
What size organizations use OpenStack?
Increase Maturity in Deployments
OpenStack Architecture
What is Sahara?
• The basic idea comes from Amazon Elastic MapReduce (EMR).
• Sahara’s mission is to provide a scalable data processing
stack and associated management interfaces.
• Provision and operate data processing clusters
• Schedule and operate data processing jobs
• Data processing here means Hadoop, Spark, Storm, etc.
What is Sahara?
• Sahara aims to provide users with a simple means to
provision Hadoop, Spark, and Storm clusters by
specifying several parameters, such as:
o Version
o Cluster topology
o Hardware node details, and more.
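The three parameters above can be pictured as a small data structure. This is only a sketch: the class and field names (`NodeGroup`, `ClusterSpec`, `flavor`, `processes`) are hypothetical and are not Sahara's actual schema, and the flavor names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeGroup:
    """One group of identical VMs (hypothetical structure, not Sahara's schema)."""
    name: str
    flavor: str            # hardware node details, e.g. an OpenStack flavor name
    count: int             # how many VMs of this kind
    processes: List[str]   # cluster topology: which services run on these VMs


@dataclass
class ClusterSpec:
    plugin: str            # which framework/distribution to deploy
    version: str           # framework version
    node_groups: List[NodeGroup] = field(default_factory=list)

    def total_nodes(self) -> int:
        """Total number of VMs the cluster would need."""
        return sum(group.count for group in self.node_groups)


# Example values are illustrative assumptions.
spec = ClusterSpec(
    plugin="vanilla",
    version="2.7.1",
    node_groups=[
        NodeGroup("master", "m1.large", 1, ["namenode", "resourcemanager"]),
        NodeGroup("worker", "m1.medium", 3, ["datanode", "nodemanager"]),
    ],
)
print(spec.total_nodes())
```

A user would fill in a structure like this and hand it to the provisioning service, rather than installing each node by hand.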
Use Cases
• Fast provisioning of data processing clusters on
OpenStack for development and quality assurance (QA).
• Utilization of unused compute power from a general-purpose
OpenStack IaaS cloud.
• “Analytics as a Service” for ad hoc or bursty analytic
workloads (similar to AWS EMR).
Key Features
• Designed as an OpenStack component.
• Managed through a REST API, with a user interface (UI)
available as part of the OpenStack Dashboard.
• Predefined configuration templates with the ability to
modify parameters.
Key Features
• Support for a variety of data processing frameworks:
o Multiple Hadoop vendor distributions.
o Apache Spark and Storm.
o Pluggable system of Hadoop installation engines.
o Integration with vendor-specific management tools,
such as Apache Ambari and Cloudera Management
Console.
Key Features - Provision Cluster
• Create/Terminate Cluster
o Heat API/Nova Direct API
o Neutron/Nova Network
o Floating IP Management
o Anti-affinity
• Cluster Scaling
o Add Node/Remove Node
• Supported Plugins
o Vanilla/Hortonworks Data Platform/Cloudera/Spark/MapR
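Anti-affinity means that instances of the same kind (for example, several data nodes) should not share a physical host, so one host failure does not take out all of them. A minimal sketch of that rule follows, using a hypothetical helper `place_with_anti_affinity`; it is not how Sahara or the Nova scheduler is implemented, and the host and instance names are made up.

```python
from typing import Dict, List


def place_with_anti_affinity(instances: List[str], hosts: List[str]) -> Dict[str, str]:
    """Map each instance to its own host (hypothetical helper, not Sahara code)."""
    if len(instances) > len(hosts):
        # Strict anti-affinity cannot be satisfied without one host per instance.
        raise ValueError("need at least one host per instance")
    # zip pairs the i-th instance with the i-th host, so no host is reused.
    return dict(zip(instances, hosts))


placement = place_with_anti_affinity(
    ["datanode-1", "datanode-2", "datanode-3"],
    ["host-a", "host-b", "host-c", "host-d"],
)
print(placement)
```

A real scheduler would also weigh free capacity on each host; the sketch only shows the "never two on one host" constraint.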
Key Features - Elastic Data Processing
• Supported Job Types
o Hive/Pig/MapReduce/MapReduce Streaming/Java/Spark/Shell/HBase
• Data Locality Support
o Rack/Hypervisor/Swift
• Data Sources
o Internal: Ephemeral Disk/Cinder
o External: Swift
• Run Jobs in a Transient Cluster
Sahara and OpenStack
Distros
• Vanilla Apache Hadoop: 2.6.0, 2.7.1
• Hortonworks Data Platform (HDP): 2.2, 2.3
• Cloudera (CDH): 5.3.x, 5.4.x
• MapR: 4.0.x, 5.0.x
• Vanilla Apache Spark: 1.0.0, 1.3.1
• Vanilla Apache Storm: 0.9.2
Work Flow
Fast Cluster Provisioning
Select Hadoop Version → Select Base Image w/ Hadoop → Define Cluster Configuration → Provision Cluster → Operate Cluster → Terminate Cluster
• Provide the Hadoop configuration details, such as size, topology, and others
• Sahara will provision VMs, then install and configure Hadoop
• Supports scaling the cluster to add/remove nodes
Analytics as a Service using Elastic Data Processing
Select Hadoop Version → Configure Jobs → Set Limit for Cluster → Execute Jobs → Get the Result
• Choose the type of job: Pig, Hive, JAR file, etc.
• Select input and output data locations (Swift support)
• The cluster will be removed automatically after job completion
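The "Analytics as a Service" flow can be sketched as a function that brings a cluster up, runs one job, and always tears the cluster down. This is an illustration only: `run_on_transient_cluster` is a hypothetical name, the steps are recorded in a list instead of calling any real OpenStack API, and removing the cluster on failure as well as on success is an assumption of this sketch.

```python
from typing import Callable, List


def run_on_transient_cluster(job: Callable[[], str], log: List[str]) -> str:
    """Run one job on a cluster that exists only for that job (hypothetical sketch)."""
    log.append("provision cluster")
    try:
        log.append("execute job")
        return job()
    finally:
        # Assumption: the cluster is removed whether the job succeeds or fails.
        log.append("terminate cluster")


steps: List[str] = []
result = run_on_transient_cluster(lambda: "job output", steps)
print(result)
print(steps)
```

The point of the design is that the user never manages the cluster lifetime; it is tied to the job.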
[Diagram labels: Collector Agent, Collecting Data, Data Stream, Swift, HDFS, Cinder, Ephemeral Disk, MapReduce, OpenStack Virtual Clusters]
Pattern 1: Internal - HDFS Only
OpenStack supports creating HDFS on Cinder or
Ephemeral Disk. Ephemeral Disk provides better
data processing performance, while Cinder persists
the data at lower performance.
Pattern 2: External - Swift
OpenStack uses Swift as a data source to store input
and output data. The benefit is that the data can be
processed directly and persisted via Swift.
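The trade-off between the two patterns can be written as a small decision rule. The function name `choose_storage` and its two flags are assumptions made for this sketch; it only encodes the guidance on this slide and is not part of Sahara.

```python
def choose_storage(data_lives_outside_cluster: bool, must_persist: bool) -> str:
    """Pick a storage option following the slide's guidance (hypothetical helper)."""
    if data_lives_outside_cluster:
        # Pattern 2: input and output are read from and written to Swift.
        return "swift"
    if must_persist:
        # Pattern 1 with Cinder: HDFS data is persisted, at lower performance.
        return "hdfs-on-cinder"
    # Pattern 1 with ephemeral disk: best performance, data is not persisted.
    return "hdfs-on-ephemeral-disk"


print(choose_storage(data_lives_outside_cluster=False, must_persist=False))
print(choose_storage(data_lives_outside_cluster=False, must_persist=True))
print(choose_storage(data_lives_outside_cluster=True, must_persist=True))
```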
Architecture
OpenStack + Sahara notes
• CPU:
o Estimated virtualization overhead (KVM): < 10%
• Isolated networks on OpenStack nodes
• Scheduler hints passed by Sahara: place VMs on the same hosts
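As a rough worked example of the overhead figure, the sketch below treats 10% as an upper bound and computes the CPU capacity left for data processing. The function name and the 16-core host are illustrative assumptions, not measurements.

```python
def effective_cores(physical_cores: int, overhead: float = 0.10) -> float:
    """Cores left for workloads after virtualization overhead (illustrative only)."""
    if not 0.0 <= overhead < 1.0:
        raise ValueError("overhead must be a fraction in [0, 1)")
    return physical_cores * (1.0 - overhead)


# With overhead below 10%, a 16-core host keeps at least this many cores' worth
# of compute for the virtual cluster.
print(effective_cores(16))
```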
Lab Session
Questions?
