Dr. Anita Goel
Big Data
and
Cloud Computing
Seminar on Recent Trends in Data Analytics, RKGITM, India
9th September 2017
Dyal Singh College
Department of Computer Science
University of Delhi, India
Presentation Outline
•Introduction to Big Data
•Introduction to Cloud Computing
•Cloud Storage
•Software Defined Storage
•Concept of Virtualization
Big Data..
Sources of Data
• Earth sciences
• Internet of Things
• Social sciences
• Astronomy
• Business
• Industry
From where data is collected
• Web Browsers
• Search Engines
• Tablets and Apps
• Mobile devices, tracking systems, RFID
• Sensor networks, social networks, automated record keeping, video archives, e-commerce
Who is collecting data
• Hospitals & Other Medical Systems
• Banking & Phone Systems
• Credit Card Companies
Why
• Target Marketing
• Targeted Information
Attributes of Big Data
Big Data Features
• Volume
• Velocity
• Veracity
• Value
• Variety
Volume of Data
•Data Sizes
• Exabyte - 1,024 petabytes - 2^60 = 1,152,921,504,606,846,976 bytes
• Zettabyte - 1,024 exabytes - 2^70 = 1,180,591,620,717,411,303,424 bytes
• Yottabyte - 1,024 zettabytes - 2^80 = 1,208,925,819,614,629,174,706,176 bytes
•Volume
• 44x increase from 2009 to 2020
• Data volume increasing exponentially
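The byte counts above follow from the binary convention, where each unit is 1,024 times the previous one. A minimal sketch that reproduces them (the `UNITS` list and `bytes_in` helper are illustrative names, not from the slides):

```python
# Binary (power-of-two) storage units: each step is a factor of 1,024.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one binary unit, e.g. 'EB' -> 2**60."""
    exponent = 10 * (UNITS.index(unit) + 1)
    return 2 ** exponent


for unit in ("EB", "ZB", "YB"):
    print(f"1 {unit} = {bytes_in(unit):,} bytes")
```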
Variety of Data
•Formats, Types and Structures
• Unstructured - text on web, audio, videos, images, PDF files, text docs (about 85%)
• Semi-structured - XML
• Structured - relational databases and spreadsheets
•Platforms - Enterprise, Social media, Sensors
•Flow of data - Static data vs. streaming data
•Single application generating/collecting many types of
data
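The difference between the three structure types shows up in how much work a program must do to read a field. A small sketch using only the Python standard library; the sample records are made up for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table or spreadsheet.
structured = io.StringIO("id,name,year\n1,HeraPheri,2000\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, but no fixed schema.
semi = ET.fromstring("<film><title>HeraPheri</title><year>2000</year></film>")
fields = {child.tag: child.text for child in semi}

# Unstructured: free text; any structure has to be inferred by the program.
unstructured = "HeraPheri is a comedy that was released in 2000."
words = unstructured.rstrip(".").split()

print(rows[0]["name"], fields["year"], len(words))
```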
Velocity, Veracity and Value of Data
•Velocity
• Data is generated fast and needs to be processed fast. Examples: E-Promotions, healthcare monitoring
•Veracity
• Diversity of quality, accuracy and trustworthiness of data
•Value
• All four Vs are important for value-specific research and decision-support applications
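Velocity means each reading has to be processed as it arrives rather than in a later batch. A minimal sketch of that idea for the healthcare-monitoring example, assuming a hypothetical stream of heart-rate readings and a window of three samples:

```python
from collections import deque


def rolling_mean(stream, window=3):
    """Yield the mean of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)


heart_rates = [72, 75, 78, 120, 118]  # hypothetical sensor readings
print(list(rolling_mean(heart_rates)))
```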
Big Data
•Relevant technologies and expertise needed to -
Generate → Collect → Store → Manage → Process → Analyze → Present & utilize data, information & knowledge derived
Attributes of Big Data
Big Data Features
• Volume
• Velocity
• Veracity
• Value
• Variety
Hardware and Software Requirement
Real Time Processing Ability
Technology Enablers for BDA
Sharma, S. K., & Wang, X. (2017). Live Data Analytics With Collaborative Edge and Cloud Processing in Wireless IoT Networks. IEEE Access, 5, 4621-4635.
Infrastructure Challenge for BDA
AWS data science chief Matt Wood
• Analytics is addictive
• Positive addiction sours if infrastructure can't keep up
• Need a platform to move from one scale to the next, not a data center frozen in time
• By the time companies answer the original question, the business has moved on
InfoWorld, Matt Asay
• Picking Spark or Hadoop isn't key to success. Picking the right infrastructure is
• On-premise solutions are complex, costly, inflexible
• Difficult to keep up with exploding demand for real-time actionable information
• Storing massive amounts of data and laying infrastructure to perform analytics on it is both time- and resource-intensive
Big Data and Cloud
Cloud Computing
• Is infrastructure
• Offers scalability
• Elastic, on-demand, self-service model
• Provides elastic on-demand compute
Big Data
• Represents content
• Is big
• About extracting VALUE
• Needs large on-demand compute power and distributed storage
Gigaom Research
• 53% of large enterprises use cloud resources for BDA
Hortonworks, Connolly
• Data gravity: weather, census, machine & sensor data originate outside the enterprise; use cloud
• Bulk of data created on premises; analytics on premises
• Stream processing of machine and sensor data; use cloud
Computing Service
Computing as a Resource
Computing as a Service
(NIST) What is a Cloud?
•Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Source: National Institute of Standards and Technology (NIST)
What it means to Users is…
• Accessible everywhere via Internet
• Global access - from thin client, mobile devices, desktops
• No worry - Storage capacity, Upgrading applications
• Local machine has all software without local hosting
• Cost savings in IT investments - Infrastructure, Software, Personnel
• Pay per use utility
• Improved utilisation of compute resources
Cloud Model
•5 Essential characteristics
•3 Service models
•4 Deployment models
Cloud Characteristics
•On-demand Self Service
• Automatic provisioning - no human intervention
•Broad Network Access
• Access cloud from anywhere
•Resource Pooling
• Sharing of resources, location transparency
•Rapid Elasticity
• Can scale from 10 to 100 servers and vice versa
• Resources allocated and released on demand
•Measured Service
• Pay-per-use
Source: NIST Working Definition of Cloud Computing
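Rapid elasticity can be sketched as a rule that sizes the server pool to the current load. The snippet below is an illustration only: the function name, the per-server capacity and the load figures are assumptions, and the 10 to 100 range is taken from the slide.

```python
import math


def servers_needed(load_rps: float, capacity_rps: float,
                   min_servers: int = 10, max_servers: int = 100) -> int:
    """Return how many servers to run for the current load, clamped to limits."""
    wanted = math.ceil(load_rps / capacity_rps)
    return max(min_servers, min(max_servers, wanted))


# Hypothetical loads (requests per second) with 100 rps handled per server.
for load in (500, 4500, 50000):
    print(load, "->", servers_needed(load, capacity_rps=100))
```

Scaling back down uses the same rule: when load drops, the function returns a smaller number and the surplus servers are released.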
Cloud Service Models
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
Infrastructure as a Service (IaaS)
Storage
• Provides web-based, scalable storage
• Allows hiring of storage space on cloud storage servers
• Manages data availability and security
Network
• Provides resource sharing across geographically separate locations
• Allows connection to desired resources
• Manages network - VPN, firewalls etc.
Compute
• Provides processing power as a resource
• Allows provisioning of machines
• Manages multi-tenancy issues
Infrastructure as a Service (IaaS)
• Google has 450,000 systems running across 20 datacenters
• Microsoft's Windows Live team is doubling the number of servers it uses every 14 months
• Why buy machines when you can rent cycles?
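A fleet that doubles every 14 months grows as initial × 2^(months / 14), which is why buying hardware ahead of demand is hard to plan. A short sketch of that arithmetic; the starting size of 1,000 servers is a hypothetical figure, not from the slide:

```python
def servers_after(months: float, initial: int, doubling_months: float = 14) -> float:
    """Projected fleet size when the server count doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


for months in (14, 28, 42):
    print(months, "months ->", servers_after(months, initial=1000))
```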
Cloud Deployment Models
Public cloud
• Maintained by 3rd party
• Available on subscription basis (pay as you go)
Private cloud
• Runs within a company's own data center
• For internal and/or partners' use
Hybrid cloud
• Mixed usage of private and public
• Leasing public cloud services when private capacity is insufficient
Community cloud
• Created to meet needs of a community
• Integrates services of different clouds for the community
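The hybrid model, where public cloud is leased only when private capacity runs out, can be sketched as a simple placement rule. The function name and the VM counts are illustrative assumptions:

```python
def place_workload(requested_vms: int, private_free_vms: int) -> dict:
    """Split a request between private capacity and public-cloud overflow."""
    on_private = min(requested_vms, private_free_vms)
    return {"private": on_private, "public": requested_vms - on_private}


print(place_workload(8, private_free_vms=20))   # fits in the private cloud
print(place_workload(30, private_free_vms=20))  # overflow goes to the public cloud
```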
Storage Service
Cloud Storage - Necessary Conditions
Illusion of infinite storage capacity
• Eliminates the need to plan far ahead for provisioning
Elimination of up-front commitment by cloud users
• Allows users to start small and increase capacity as needed
Ability to pay for use as needed
• Pay for use on a short-term basis and release as needed
- From a report by the University of California, Berkeley
Cloud Storage Solutions
Storage Types
• File Storage
• Object Storage
• Block Storage
Cloud Storage Solutions..
Block Storage
• Raw physical storage via a dedicated network
• Access protocols: Fibre Channel, iSCSI
• Examples: OpenStack Cinder, Amazon Elastic Block Store (EBS), Ceph RADOS Block Device (RBD)
File Storage
• Data is stored as files
• Access protocols: NFS, CIFS
• Examples: GlusterFS, Dropbox, Google Drive
Object Storage
• Data is stored as objects
• Access protocols: REST API, SOAP API
• Examples: OpenStack Swift, Amazon S3, Rackspace, Ceph
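Access over a REST API means an object is written with a plain HTTP PUT to a URL that names it. The sketch below only builds a description of such a request and sends nothing. It follows the OpenStack Swift convention of carrying custom metadata in `X-Object-Meta-*` headers; the account name `AUTH_demo`, the container and the helper name are hypothetical.

```python
def build_put_request(container: str, name: str, body: bytes, metadata: dict) -> dict:
    """Describe an HTTP PUT that stores `body` as an object with custom metadata."""
    headers = {"Content-Length": str(len(body))}
    for key, value in metadata.items():
        # Swift-style custom metadata travels as X-Object-Meta-* request headers.
        headers[f"X-Object-Meta-{key}"] = str(value)
    return {
        "method": "PUT",
        "path": f"/v1/AUTH_demo/{container}/{name}",  # hypothetical account
        "headers": headers,
    }


request = build_put_request("films", "HeraPheri", b"...", {"Type": "Comedy"})
print(request["method"], request["path"])
```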
Variety of Data
•Formats, Types and Structures
• Unstructured - text on web, audio, videos, images, PDF files, text docs (about 85%)
• Semi-structured - XML, sensors
• Structured - relational databases and spreadsheets
•Platforms - Enterprise, Social media, Sensors
•Flow of data - Static data vs. streaming data
•Single application generating/collecting many types of data
Unstructured Data
•Information that either –
• Does not have a pre-defined data model, or
• Is not organized in a pre-defined manner
•Hard to maintain context and difficult to know
content
•Requirements of unstructured data
–durable
–accessible
–low cost
–manageable
File
File System Metadata
• Filename: HeraPheri
• Created: 16/12/2013
• Last Modified: 17/12/2013
Metadata
•Object Description
• Describe the object
• Specifications
• Usage Description
• Access Permissions
• Identify the one needed
• Describes the object
Object
Custom Metadata
• Director: abc
• Producer: dfgh
• Music Director: wert
• Playback singers: ghjl
• Actor: a1, a2, a3
• Release year: 2000
• Type: Comedy
Object
Object = File + Metadata
File Storage
System Metadata
• Filename: HeraPheri
• Created: 16/12/2013
• Last Modified: 17/12/2013
Object Storage
Custom Metadata
• Director: abc
• Producer: dfgh
• Music Director: wert
• Playback singers: ghjl
• Actor: a1, a2, a3
• Release year: 2000
• Type: Comedy
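"Object = File + Metadata" can be sketched as a record that bundles the bytes with both kinds of metadata, stored under a flat ID so that objects can be found by what they describe. The class, the function and the object ID below are illustrative names; the metadata values are the placeholders from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object bundles the file's bytes with system and custom metadata."""
    data: bytes
    system_metadata: dict
    custom_metadata: dict = field(default_factory=dict)


def find_by_metadata(store: dict, key: str, value) -> list:
    """Return the IDs of objects whose custom metadata has key == value."""
    return [oid for oid, obj in store.items()
            if obj.custom_metadata.get(key) == value]


# Flat namespace: objects are addressed by ID, not by a directory path.
store = {
    "obj-001": StoredObject(
        data=b"...video bytes...",
        system_metadata={"Filename": "HeraPheri", "Created": "16/12/2013"},
        custom_metadata={"Director": "abc", "Release year": 2000, "Type": "Comedy"},
    )
}
print(find_by_metadata(store, "Type", "Comedy"))
```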
Object Storage - Advantages
• Support unstructured data
• Descriptive metadata
• Variable sized data containers
• High performance
• High security
• Location independence
• Distributed
Storage Devices
•Direct Attached Storage - DAS
•Network Attached Storage - NAS
• Expensive, Scaling Issues, NAS Islands
•Storage Area Network - SAN
• Expensive, Scaling Issues, Redundant Array of Independent Disks (RAID) recovery time
•Software Defined Storage - SDS
SDS Technology
•Does not use NAS/SAN but Commodity Hardware
•Storage Node - Processor + Storage
• Each node has its own computation power
•Control Plane separate from Data Plane
•Uses Commodity Storage for Data Plane
•Uses Server for Control Plane
•Enabling Technology
• Hadoop Distributed File System (HDFS)
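In HDFS a file is cut into fixed-size blocks and each block is copied to several storage nodes, with the block-to-node map held separately on the control plane. A minimal sketch using the HDFS defaults of 128 MB blocks and 3 replicas; the round-robin placement and the node names are simplifying assumptions, since real HDFS placement is rack-aware:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128 MB)
REPLICATION = 3                 # HDFS default replication factor


def place_blocks(file_size: int, nodes: list) -> dict:
    """Map each block index to the nodes holding its replicas (round-robin)."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes give distinct replicas when len(nodes) >= REPLICATION.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(REPLICATION)]
    return placement


layout = place_blocks(300 * 1024 * 1024, ["n1", "n2", "n3", "n4"])
print(layout)
```

The returned dictionary plays the role of the control plane's metadata, while the nodes themselves would hold the block data.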
HDFS Architecture
HDFS
Popular Object Storage Cloud Providers
•Commercial Providers
• Amazon Simple Storage Service (S3)
• Windows Azure Blob Storage
• EMC Atmos
•Open Source Providers
• OpenStack
• Ceph
• Riak
What is Ceph?
•Open-Source Software
•Software Defined Storage System
•Unified Storage Solution
• Block storage, File storage, Object storage
•Cost effective - Runs on Commodity Hardware
• Provides enterprise-grade, highly reliable storage
•Easy to consume - in Linux Kernel
•Integrated with OpenStack, Cinder, Ubuntu
Ceph: Architectural Philosophy
•Distributed Storage System
•High Performance System
•Reliable System - No single point of failure
•Massively Scalable - Exabyte levels
• 1 EB ~ 1,000 PB ~ 1 million TB ~ 1 billion GB
•Fault tolerant - Data Replication
•Self-manageable, wherever possible
Key Features
•Decoupled data and metadata – Uses CRUSH
• Files striped onto predictably named objects
• CRUSH maps objects to storage devices
•Dynamic Distributed Metadata Management
• Dynamic subtree partitioning - Distributes metadata among MDSs
•Object-based storage
• OSDs handle migration, replication, failure detection and
recovery
Source: Weil OSDI
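The point of decoupling data from metadata is that any client can compute where an object lives instead of looking it up in a table. The sketch below is not the CRUSH algorithm: it substitutes rendezvous (highest-random-weight) hashing as a much simpler stand-in with the same "compute, don't look up" property. The object-name format and the OSD names are hypothetical.

```python
import hashlib


def object_name(inode: int, stripe_index: int) -> str:
    """Predictable object name, derived only from the inode and stripe number."""
    return f"{inode:x}.{stripe_index:08x}"


def place(obj: str, osds: list, replicas: int = 2) -> list:
    """Pick `replicas` OSDs for an object by ranking hash(obj, osd) scores."""
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
name = object_name(inode=1, stripe_index=0)
print(name, "->", place(name, osds))
```

Because the placement depends only on the object name and the list of devices, removing a device that does not hold an object leaves that object's placement unchanged.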
Ceph Architecture Overview
Ceph Storage Cluster
Underlying Commodity Hardware
Linux OS
Ceph Client Storage Services
File Block Object
Concept of Virtualization
•Decoupling of hardware and software
•Abstract and create a layer of resources
•Uses Hypervisor for abstraction
•Abstracted resources
• Can be used, demanded
• Cannot be owned or configured
• Can be sliced, resized, combined, and distributed
Traditional Picture
Virtualization Architecture
•OS assumes complete control of underlying hardware
•Virtualization provides this illusion through the VMM
•Hypervisor or VMM is a software layer
• Allows multiple VMs to run on a single physical host
• Provides hardware abstraction to guest OS
• Efficiently multiplexes hardware resources
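Multiplexing means the VMM hands the one physical CPU to each VM in turn, so every guest OS sees what looks like its own processor. A toy sketch with a plain round-robin policy, which is an assumption made for illustration; production hypervisors use more sophisticated schedulers:

```python
from itertools import cycle, islice


def schedule(vms: list, time_slices: int) -> list:
    """Return which VM owns the physical CPU in each time slice (round-robin)."""
    return list(islice(cycle(vms), time_slices))


print(schedule(["Linux", "NetBSD", "Windows"], 5))
```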
Virtualization
[Figure: layered stack. Hardware at the bottom; the Virtual Machine Monitor (VMM) / Hypervisor above it; three VMs on top running guest OSes (Linux, NetBSD, Windows), each hosting applications.]
Benefits of Virtualization
•Instant Provisioning - Fast Scalability
•Live Migration possible
•Load Balancing and Consolidation in Data Center possible
•Virtual hardware supports legacy OS efficiently
•Security and Fault Isolation
Traditional OS
VMM and Guest OS
Pre VT-x and Post VT-x
Pre VT-x
• VMM ring de-privileging of guest OS
• Guest OS aware it's not at Ring 0
Post VT-x
• VMM executes in VMX root-mode
• Guest OS de-privileging eliminated
Source: Intel Virtualization Technology Processor Virtualization Extensions and Intel Trusted Execution Technology
Publications
1. An Overview of Data Storage on the Cloud, P. Jain, A. Goel, S. Gupta
In Proceedings of IEEE International Conference on Advanced Research in Engineering
and Technology, India, pp. 318-322, 2013.
2. Object Storage as a Service, P. Jain, A. Goel, S. Gupta
In Proceedings of International Journal of Innovations & Advancement in Computer
Science, Vol. 4, pp. 605-614, 2015.
3. Monitoring Checklist for Ceph Object Storage Infrastructure, P. Jain, A. Goel, S.
Gupta
In Proceedings of 5th IFIP International Conference on Computer Science and Its
Application, Saida, Algeria, pp. 611-623, 2015.
4. Monitoring the Infrastructure of Riak CS, P. Jain, A. Goel, S. Gupta
In Proceedings of 11th International Multi Conference on Information Processing,
Bangalore, India, pp.137-146, 2015.
5. Requirement Checklist for Infrastructure Monitoring of Swift , P. Jain, A. Goel, S.
Gupta
The 2015 International Conference On High Performance Computing & Simulation,
HPCS, Amsterdam, Netherlands
Publications..
6. IaaS as a Service, A. Datt, A. Goel, SC Gupta
In Proceedings of SARC-IRAJ International Conference, New Delhi, India, June 2013,
ISBN: 978-81-927147-6-9, pp. 18-23
7. Comparing Infrastructure Monitoring with CloudStack Compute Services for
Cloud Computing Systems, A. Datt, A. Goel, SC Gupta
In Proceedings of 10th International Workshop - Databases in Networked International
Systems, DNIS (2015) , Japan, LNCS 8999, Springer, 2015, pp. 195-212.
8. Analysis of Infrastructure Monitoring Requirements for OpenStack Nova, A.
Datt, A. Goel, SC Gupta
In Proceedings of Eleventh International Multi Conference on Communication
Networks, ICCN 2015, August 21-23, 2015, Bangalore, India, Volume 54, ISBN: 1877-
0509, pp. 127-136
9. Monitoring list for Compute Infrastructure in Eucalyptus Cloud, A. Datt, A.
Goel, SC Gupta
In Proceedings of The 24th IEEE International Conference on Enabling Technologies:
Infrastructure for Collaborative Enterprise, Cyprus, 2015, Pages: 69 - 71, WETICE
Publications..
10. Infrastructure Monitoring of Compute Cloud, A. Datt, A. Goel, SC Gupta
Published in Journal of Advances in Economics and Business Management (AEBM), ISSN:
2394-1545, vol. 2, issue 5, pp. 439- 444
11. Cloud Service Orchestration Based Architecture of OpenStack Nova and Swift, P.
Jain, A. Datt, A. Goel, S. Gupta
5th International Conference on Advances in Computing, Communications and
Informatics, Jaipur, India September 21-24, 2016
12. Role of Hadoop in Big Data Analytics, A. Goel et al.
In CSI Communications, Vol. 41, Issue 1, April 2017
13. Session on OpenStack, P. Jain, A. Goel
3 hour Session in “Recent Trends in Big Data and Cloud Computing”, Indira Gandhi Delhi
Technical University for Women (IGDTUW), India, 19th December 2013.
14. Software Defined Storage, S.C. Gupta, A. Goel
Half day Tutorial in Asia Pacific Software Engineering Conference (APSEC), 1st December
2015, India.
Thank You
Contact: goel.anita@gmail.com

More Related Content

What's hot (20)

PPTX
Relationship between cloud computing and big data
Jazan University
 
PPTX
The rise of “Big Data” on cloud computing
Minhazul Arefin
 
PPTX
big data and cloud computing
Mohamed Sharique Vellikan
 
PPTX
Cloud computing and big data analytics
hanish93
 
PDF
Big Data & the Cloud
DATAVERSITY
 
PPTX
Big Data in Action : Operations, Analytics and more
Softweb Solutions
 
PDF
Big data storage
Vikram Nandini
 
PPTX
IoT and Big Data - Iot Asia 2014
John Berns
 
PDF
Evolving From Monolithic to Distributed Architecture Patterns in the Cloud
Denodo
 
PDF
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Denodo
 
PDF
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
Jochem van Grondelle
 
PDF
A Journey to the Cloud with Data Virtualization
Denodo
 
PPTX
Introduction to big data
Sitaram Kotnis
 
PDF
NextGen Infrastructure for Big Data
Ed Dodds
 
PDF
Big data trends challenges opportunities
Mohammed Guller
 
PDF
Why Data Virtualization? An Introduction.
Denodo
 
PDF
The Future Of Big Data
Matthew Dennis
 
PPTX
Structuring Big Data
Fujitsu UK
 
PDF
KEYNOTE: Edge optimized architecture for fabric defect detection in real-time
Shuquan Huang
 
PDF
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
DataStax
 
Relationship between cloud computing and big data
Jazan University
 
The rise of “Big Data” on cloud computing
Minhazul Arefin
 
big data and cloud computing
Mohamed Sharique Vellikan
 
Cloud computing and big data analytics
hanish93
 
Big Data & the Cloud
DATAVERSITY
 
Big Data in Action : Operations, Analytics and more
Softweb Solutions
 
Big data storage
Vikram Nandini
 
IoT and Big Data - Iot Asia 2014
John Berns
 
Evolving From Monolithic to Distributed Architecture Patterns in the Cloud
Denodo
 
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)
Denodo
 
To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX ...
Jochem van Grondelle
 
A Journey to the Cloud with Data Virtualization
Denodo
 
Introduction to big data
Sitaram Kotnis
 
NextGen Infrastructure for Big Data
Ed Dodds
 
Big data trends challenges opportunities
Mohammed Guller
 
Why Data Virtualization? An Introduction.
Denodo
 
The Future Of Big Data
Matthew Dennis
 
Structuring Big Data
Fujitsu UK
 
KEYNOTE: Edge optimized architecture for fabric defect detection in real-time
Shuquan Huang
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
DataStax
 

Similar to Big data and cloud computing 9 sep-2017 (20)

PDF
Cloud computing infrastructure
Dr. Anita Goel
 
PDF
Cloud - NDT - Presentation
Éric Dusablon
 
PPTX
Data Tactics dhs introduction to cloud technologies wtc
DataTactics
 
PPTX
Lecture 5- Data Collection and Storage.pptx
Brianc34
 
PDF
201306 ICEO-SI Keynote speech by Kiwon LEE
kilee011
 
PPTX
Big data analytics and machine intelligence v5.0
Amr Kamel Deklel
 
PDF
Bertenthal
Jesse Lingeman
 
PDF
Equinix Big Data Platform and Cassandra - A view into the journey
Praveen Kumar
 
PPTX
Virtualization and cloud computing
Deep Gupta
 
PPTX
Information Systems
Pasquale Pagano
 
PDF
Data Lake and the rise of the microservices
Bigstep
 
PPTX
What is Cloud computing?
Richard Harvey
 
PPTX
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Avere Systems
 
PDF
The Hadoop Ecosystem for Developers
Zohar Elkayam
 
PPTX
start_your_datacenter_sds_v3
David Byte
 
PPTX
Scality SDS Day, London, 20 SEP 2017
Chris Evans
 
PPTX
Storage_Technologdawdadsies_Detailed.pptx
2k22it54
 
PPTX
Se training storage grid webscale technical overview
solarisyougood
 
PPTX
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
PPT
Big data.ppt
IdontKnow66967
 
Cloud computing infrastructure
Dr. Anita Goel
 
Cloud - NDT - Presentation
Éric Dusablon
 
Data Tactics dhs introduction to cloud technologies wtc
DataTactics
 
Lecture 5- Data Collection and Storage.pptx
Brianc34
 
201306 ICEO-SI Keynote speech by Kiwon LEE
kilee011
 
Big data analytics and machine intelligence v5.0
Amr Kamel Deklel
 
Bertenthal
Jesse Lingeman
 
Equinix Big Data Platform and Cassandra - A view into the journey
Praveen Kumar
 
Virtualization and cloud computing
Deep Gupta
 
Information Systems
Pasquale Pagano
 
Data Lake and the rise of the microservices
Bigstep
 
What is Cloud computing?
Richard Harvey
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Avere Systems
 
The Hadoop Ecosystem for Developers
Zohar Elkayam
 
start_your_datacenter_sds_v3
David Byte
 
Scality SDS Day, London, 20 SEP 2017
Chris Evans
 
Storage_Technologdawdadsies_Detailed.pptx
2k22it54
 
Se training storage grid webscale technical overview
solarisyougood
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Big data.ppt
IdontKnow66967
 
Ad

Recently uploaded (20)

PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PPTX
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PDF
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
PDF
apidays Munich 2025 - The Double Life of the API Product Manager, Emmanuel Pa...
apidays
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
PDF
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PPT
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
PPTX
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
PPTX
Probability systematic sampling methods.pptx
PrakashRajput19
 
PPTX
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
PPTX
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
PPTX
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
PPTX
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
PPTX
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
PDF
Top Civil Engineer Canada Services111111
nengineeringfirms
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
MR and reffffffvvvvvvvfversal_083605.pptx
manjeshjain
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
apidays Munich 2025 - The Double Life of the API Product Manager, Emmanuel Pa...
apidays
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
Data-Users-in-Database-Management-Systems (1).pptx
dharmik832021
 
Probability systematic sampling methods.pptx
PrakashRajput19
 
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
Top Civil Engineer Canada Services111111
nengineeringfirms
 
Ad

Big data and cloud computing 9 sep-2017

  • 1. Dr. Anita Goel Big Data and Cloud Computing Seminar on Recent Trends in Data Analytics,RKGITM, India 9th September 2017 Dyal Singh College Department of Computer Science University of Delhi, India
  • 2. Presentation Outline •Introduction to Big Data •Introduction to Cloud Computing •Cloud Storage •Software Defined Storage •Concept of Virtualization 2
  • 3. 3 Sources of Data • Earth sciences • Internet of Things • Social sciences • Astronomy • Business • Industry From where data is collected • Web Browsers • Search Engines • Tablets and App • Mobile devices, tracking systems, RFID • Sensor networks, social networks, automated record keeping, video archives, e-commerce Who is collecting data • Hospitals & Other Medical Systems • Banking & Phone Systems • Credit Card Companies Why • Target Marketing • Targeted Information Big Data..
  • 4. Attributes of Big Data 4 Big Data Features Volume Velocity VeracityValue Variety
  • 5. Volume of Data •Data Sizes • Exabyte - 1,024 petabytes - 1,152,921,504,606,846,976 • Zettabyte - 1,180,591,620,717,411,303,424 • Yottabyte - 1,208,925,819,614,629,174,706,176 •Volume • 44x increase from 2009 to 2020 • Data volume increasing exponentially 5
  • 6. Variety of Data •Formats, Types and Structures • Unstructured - text on web, audio, videos, images, pdf file, text doc (About 85%) • Semi structured -XML • Structured - relational databases and spreadsheets •Platforms - Enterprise, Social media, Sensors •Flow of data - Static data vs. streaming data •Single application generating/collecting many types of data 6
  • 7. Velocity Veracity and Value of Data •Velocity • Data generated fast, Need to be processed fast. Examples, E- Promotions, Healthcare monitoring •Veracity • Diversity of quality, accuracy, Trustworthiness of data •Value • All four Vs are important for value specific research and decision-support applications 7
  • 8. Big Data •Relevant technologies and expertise needed to- 8 Generate Collect Store Manage Process Analyze Present & utilize data, information & knowledge derived
  • 9. Attributes of Big Data 9 Big Data Features Volume Velocity VeracityValue Variety Hardware and Software Requirement Real Time Processing Ability
  • 10. Technology Enablers for BDA 10 Sharma, S. K., & Wang, X. (2017). Live Data Analytics With Collaborative Edge and Cloud Processing in Wireless IoT Networks. IEEE Access, 5, 4621-4635.
  • 11. AWS data science chief Matt Wood • Analytics is addictive • Positive addiction sours if infrastructure can't keep up • Need platform to move from one scale to next .. Not in data center frozen in time • Companies answer original question, business has moved on Inforworld Matt Assay • Picking Spark or Hadoop isn’t key to success. Picking right infrastructure is • On-premise solutions are complex, costly, inflexible • Difficult to keep up with exploding demand for real time actionable information • Store massive amounts of data and lay infrastructure to perform analytics on it. Task is both time & resource-intensive. 11 Infrastructure Challenge for BDA
  • 12. Big Data and Cloud 12 Cloud Computing Is infrastructure Offers scalability Elastic, on-demand, self-service model Provides elastic on-demand computer Big Data Represents content Is big About extracting VALUE Needs large on-demand compute power and distributed storage Gigaom Research • 53% of large enterprises use cloud resources for BDA Hortonworks Connolly • Data gravity Weather, census, machine & sensor data - originate outside enterprise, use cloud • Bulk of data created on premises, analytics on premises • Stream processing of machine, sensor data; use cloud
  • 13. Computing Service 13 Computing as a Resource Computing as a Service
  • 14. (NIST) What is a Cloud? •Cloud computing is a model for • enabling ubiquitous, convenient, on-demand network access to a shared pool of • configurable computing resources (e.g., networks, servers, storage, applications and services) • that can be rapidly provisioned and released with minimal management effort or service provider interaction. National Institute of Standards and Technology's (NIST)
  • 15. What it means to Users is… • Accessible everywhere via Internet • Global access - from thin client, mobile devices, desktops • No worry - Storage capacity, Upgrading applications • Local machine has all software without local hosting • Cost savings in IT investments - Infrastructure, Software, Personnel • Pay per use utility • Improved utilisation of compute resources
  • 16. Cloud Model •5Essential characteristics •3Service models •4Deployment models 16
  • 17. Cloud Characteristics • Automatic provisioning – no human intervention On-demand Self Service • Access cloud from anywhere Broad Network Access • Sharing of resources , Location transparencyResource Pooling • Can scale from 10 to 100 servers and vice versa • Resources allocated and released on demandRapid Elasticity • Pay-as-per-use Measured Service 17 Source: NIST Working Definition of Cloud Computing
  • 18. Cloud Service Models Infrastructure as a Service (IaaS) Platform as a Service (PaaS) Software as a Service (SaaS)
  • 19. Infrastructure as a Service (IaaS) 20 Storage Provides web based, scalable storage Allow hiring of storage space on cloud storage servers Manages data availability and security Network Provides resource sharing - geographically separate locations Allows connection to desired resources Manages network - VPN firewalls etc. Compute Provides processing power as a resource Allows provisioning of machines Manages multi- tenancy issues
  • 20. Infrastructure as a Service (IaaS) Google has 450,000 systems running across 20 datacenters Microsoft's Windows Live team is doubling the number of servers it uses every 14 months Why buy machines when you can rent cycles?
  • 21. Cloud Deployment Models Public cloud Maintained by 3rd party Available on subscription basis (pay as you go) Private cloud Runs within a company’s own Data Center For internal and/or partners use Hybrid cloud Mixed usage of private and public Leasing public cloud services when private capacity is insufficient Community cloud Created to meet needs of a community Integrates services of different cloud for community 22
  • 23. Cloud Storage - Necessary Conditions Illusion of infinite storage capacity • Eliminate need to plan far ahead for provisioning Elimination of up-front commitment by cloud users • Allow to start small and increase capacity as needed Ability to pay for use as needed • Pay for use on a short term basis and release as needed -By a report from University of California Berkeley
  • 25. Cloud Storage Solutions.. Block Storage Raw physical storage via a dedicated network Access protocols • fiber channel • iSCSI OpenStack Cinder, Amazon Elastic Block Store (EBS), Ceph RADOS Block Device(RBD) File Storage Data is stored as files Access protocols • NFS • CIFS GlusterFS, Dropbox Google Drive Object Storage Data is stored as objects Access protocols • REST API • SOAP API OpenStack Swift, Amazon S3, Rackspace, Ceph
  • 26. Variety of Data •Formats, Types and Structures • Unstructured - text on web, audio, videos, images, pdf file, text doc (About 85%) • Semi structured –XML, Sensors • Structured - relational databases and spreadsheets •Platforms - Enterprise, Social media, Sensors •Flow of data - Static data vs. streaming data •Single application generating/collecting many types of data 27
  • 27. Unstructured Data •Information that either – • Does not have a pre-defined data model, or • Is not organized in a pre-defined manner •Hard to maintain context and difficult to know content •Requirements of unstructured data –durable –accessible –low cost –manageable
  • 28. File File System Metadata • Filename: HeraPheri • Created: 16/12/2013 • Last Modifed: 17/12/2013
  • 29. Metadata •Object Description • Describe the object • Specifications • Usage Description • Access Permissions • Identify the one needed • Describes the object Object Custom Metadata • Director: abc • Producer: dfgh • Music Director: wert • Playback singers: ghjl • Actor: a1, a2, a3 • Release year: 2000 • Type: Comedy
  • 30. Object Object = File + Metadata File Storage System Metadata • Filename: HeraPheri • Created: 16/12/2013 • Last Modifed: 17/12/2013 Object Storage Custom Metadata • Director: abc • Producer: dfgh • Music Director: wert • Playback singers: ghjl • Actor: a1, a2, a3 • Release year: 2000 • Type: Comedy
  • 31. Object Storage - Advantages Support unstructured data  Descriptive metadata  Variable sized data containers  High performance  High security  Location independence  Distributed -32-
  • 32. Storage Devices •Direct Attached Storage - DAS •Network Attached Storage - NAS • Expensive, Scaling Issues, NAS Islands •Storage Area Network - SAN • Expensive, Scaling Issues, Redundant Array of Independent Disks (RAID) Recovery time •Software Defined Storage - SDS 33
  • 33. SDS Technology •Does not use NAS/SAN but Commodity Hardware •Storage Node - Processor + Storage • Each Node has a Computation power •Control Plane separate from Data Plane •Uses Commodity Storage for Data Plane •Uses Server for Control Plane •Enabling Technology • Hadoop Distributed File System (HDFS) 34
  • 36. Popular Object Storage Cloud Providers •Commercial Providers • Amazon Simple Storage Service (S3) • Window Azure Blob Storage • EMC Atmos •Open Source Providers • OpenStack • Ceph • Riak 37
  • 38. What is Ceph? •Open-Source Software •Software Defined Storage System •Unified Storage Solution • Block storage, File storage, Object storage •Cost effective: runs on Commodity Hardware • Provides enterprise-grade, highly reliable storage •Easy to consume: client support is in the Linux kernel •Integrated with OpenStack (Cinder) and Ubuntu
  • 39. Ceph: Architectural Philosophy •Distributed Storage System •High Performance System •Reliable System - No single point of failure •Massively Scalable - Exabyte levels • 1 EB ≈ 1,000 PB ≈ 1 million TB ≈ 1 billion GB •Fault tolerant - Data Replication •Self-manageable, wherever possible
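The unit arithmetic on this slide, and what replication implies for raw capacity, can be checked with a short Python sketch. It uses decimal units (factors of 1,000), as the slide does. The replica count of 3 is an assumed example value, not a figure from the talk, and the function names are illustrative.

```python
PB_PER_EB = 1000
TB_PER_PB = 1000
GB_PER_TB = 1000


def eb_to_gb(exabytes: int) -> int:
    """Convert exabytes to gigabytes using decimal (power-of-1000) units."""
    return exabytes * PB_PER_EB * TB_PER_PB * GB_PER_TB


def raw_capacity_needed(usable_eb: int, replicas: int) -> int:
    """Raw capacity (in EB) needed when every object is stored `replicas` times."""
    return usable_eb * replicas


print(eb_to_gb(1))                # 1 EB expressed in GB
print(raw_capacity_needed(1, 3))  # raw EB for 1 usable EB with 3 copies (assumed)
```

Fault tolerance through replication therefore multiplies the hardware needed, which is one reason the deck stresses commodity hardware.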
  • 40. Key Features •Decoupled data and metadata: uses CRUSH • Files striped onto predictably named objects • CRUSH maps objects to storage devices •Dynamic Distributed Metadata Management • Dynamic subtree partitioning: distributes metadata among metadata servers (MDSs) •Object-based storage • Object storage devices (OSDs) handle migration, replication, failure detection and recovery •Source: Weil, OSDI
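The two steps named on this slide, striping a file onto predictably named objects and then mapping each object to storage devices by computation rather than by table lookup, can be illustrated with a toy Python sketch. This is not the CRUSH algorithm: placement here uses rendezvous (highest-random-weight) hashing as a simple stand-in, and the object naming scheme, device names, and function names are all hypothetical.

```python
import hashlib
from typing import List


def stripe_object_names(file_id: str, file_size: int, object_size: int) -> List[str]:
    """Derive object names from the file id and stripe index (hypothetical scheme)."""
    count = max(1, -(-file_size // object_size))  # ceiling division, at least one object
    return [f"{file_id}.{index:08x}" for index in range(count)]


def place(object_name: str, devices: List[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` devices for an object with rendezvous hashing (a stand-in for CRUSH)."""
    def score(device: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{device}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring devices win; any client computes the same answer.
    return sorted(devices, key=score, reverse=True)[:replicas]


devices = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical device names
names = stripe_object_names("file42", 10 * 2**20, 4 * 2**20)  # 10 MiB file, 4 MiB objects
for name in names:
    print(name, "->", place(name, devices))
```

The point the sketch shares with the slide is that no central allocation table is consulted: a client that knows the file id and the device list can compute where every piece of the file lives.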
  • 41. Ceph Architecture Overview [Figure labels: Ceph Storage Cluster; Underlying Commodity Hardware; Linux OS; Ceph Client; Storage Services: File, Block, Object]
  • 42. Concept of Virtualization •Decoupling of hardware and software •Abstract and create a layer of resources •Uses Hypervisor for abstraction •Abstracted resources • Can be used and demanded • Cannot be owned or configured • Can be sliced, resized, combined, and distributed
  • 44. Virtualization Architecture •An OS assumes complete control of the underlying hardware •Virtualization provides this illusion through the Virtual Machine Monitor (VMM) •Hypervisor or VMM is a software layer • Allows multiple VMs to run on a single physical host • Provides hardware abstraction to guest OS • Efficiently multiplexes hardware resources
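"Multiplexes hardware resources" can be pictured with a toy Python model in which one physical CPU is shared among guest VMs in round-robin time slices. Real hypervisors use far more elaborate schedulers; the function name and VM labels below are illustrative only.

```python
from collections import deque
from typing import List


def round_robin(vms: List[str], total_slices: int) -> List[str]:
    """Return which VM runs in each time slice on a single shared physical CPU."""
    if not vms:
        return []
    queue = deque(vms)
    schedule = []
    for _ in range(total_slices):
        vm = queue.popleft()   # next VM in line gets the CPU
        schedule.append(vm)
        queue.append(vm)       # then goes to the back of the queue
    return schedule


print(round_robin(["linux", "netbsd", "windows"], 5))
```

Each guest OS behaves as if it owns the CPU, while the layer underneath decides when it actually runs.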
  • 45. Virtualization [Figure labels: Hardware; Virtual Machine Monitor (VMM) / Hypervisor; three VMs running Guest OS (Linux), Guest OS (NetBSD), Guest OS (Windows); Apps on each guest OS]
  • 46. Benefits of Virtualization •Instant Provisioning: Fast Scalability •Live Migration possible •Load Balancing and Consolidation in Data Center possible •Virtual hardware supports legacy OS efficiently •Security and Fault Isolation
  • 49. Pre VT-x and Post VT-x •Intel Virtualization Technology: Processor Virtualization Extensions and Intel Trusted Execution Technology •Pre VT-x • VMM ring de-privileging of guest OS • Guest OS aware it is not at Ring 0 •Post VT-x • VMM executes in VMX root-mode • Guest OS de-privileging eliminated
  • 50. Publications 1. An Overview of Data Storage on the Cloud, P. Jain, A. Goel, S. Gupta. In Proceedings of IEEE International Conference on Advanced Research in Engineering and Technology, India, pp. 318-322, 2013. 2. Object Storage as a Service, P. Jain, A. Goel, S. Gupta. In International Journal of Innovations & Advancement in Computer Science, Vol. 4, pp. 605-614, 2015. 3. Monitoring Checklist for Ceph Object Storage Infrastructure, P. Jain, A. Goel, S. Gupta. In Proceedings of 5th IFIP International Conference on Computer Science and Its Application, Saida, Algeria, pp. 611-623, 2015. 4. Monitoring the Infrastructure of Riak CS, P. Jain, A. Goel, S. Gupta. In Proceedings of 11th International Multi Conference on Information Processing, Bangalore, India, pp. 137-146, 2015. 5. Requirement Checklist for Infrastructure Monitoring of Swift, P. Jain, A. Goel, S. Gupta. The 2015 International Conference on High Performance Computing & Simulation (HPCS), Amsterdam, Netherlands.
  • 51. Publications (continued) 6. IaaS as a Service, A. Datt, A. Goel, S.C. Gupta. In Proceedings of SARC-IRAJ International Conference, New Delhi, India, June 2013, ISBN: 978-81-927147-6-9, pp. 18-23. 7. Comparing Infrastructure Monitoring with CloudStack Compute Services for Cloud Computing Systems, A. Datt, A. Goel, S.C. Gupta. In Proceedings of 10th International Workshop on Databases in Networked Information Systems (DNIS 2015), Japan, LNCS 8999, Springer, 2015, pp. 195-212. 8. Analysis of Infrastructure Monitoring Requirements for OpenStack Nova, A. Datt, A. Goel, S.C. Gupta. In Proceedings of Eleventh International Multi Conference on Communication Networks (ICCN 2015), August 21-23, 2015, Bangalore, India, Volume 54, ISBN: 1877-0509, pp. 127-136. 9. Monitoring list for Compute Infrastructure in Eucalyptus Cloud, A. Datt, A. Goel, S.C. Gupta. In Proceedings of the 24th IEEE International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprise (WETICE), Cyprus, 2015, pp. 69-71.
  • 52. Publications (continued) 10. Infrastructure Monitoring of Compute Cloud, A. Datt, A. Goel, S.C. Gupta. In Journal of Advances in Economics and Business Management (AEBM), ISSN: 2394-1545, Vol. 2, Issue 5, pp. 439-444. 11. Cloud Service Orchestration Based Architecture of OpenStack Nova and Swift, P. Jain, A. Datt, A. Goel, S. Gupta. 5th International Conference on Advances in Computing, Communications and Informatics, Jaipur, India, September 21-24, 2016. 12. Role of Hadoop in Big Data Analytics, A. Goel et al. In CSI Communications, Vol. 41, Issue 1, April 2017. 13. Session on OpenStack, P. Jain, A. Goel. 3-hour session in “Recent Trends in Big Data and Cloud Computing”, Indira Gandhi Delhi Technical University for Women (IGDTUW), India, 19th December 2013. 14. Software Defined Storage, S.C. Gupta, A. Goel. Half-day tutorial at Asia Pacific Software Engineering Conference (APSEC), India, 1st December 2015.